Key takeaways
After reading this, the reader will understand:
- E2E testing
- Deterministic quality assurance
- Structured reproducibility
- Controlled infrastructure
Motivation
Running thousands of tests (3,100+) for a SaaS sequentially is not operationally viable at scale. Parallel execution is a deliberate architectural choice driven by the following factors:
- Maintain fast and actionable feedback loops: Keep CI execution time within a practical window to support continuous integration and frequent releases.
- Validate correctness under concurrency: Ensure the system behaves correctly when multiple servers, database operations, and workflows execute simultaneously.
- Simulate production-like load conditions: Exercise realistic parallel activity to uncover race conditions, resource contention, and coordination issues early.
- Enforce strict test isolation guarantees: Design tests and infrastructure so they remain deterministic and interference-free even when executed concurrently.
- Maximize infrastructure efficiency: Fully utilize available CPU and memory resources to improve throughput and reduce cost per build.
- Stress external integrations realistically: Verify rate limiting, retry logic, and resilience mechanisms against third-party services under concurrent load.
- Preserve delivery velocity at scale: Ensure that test growth does not degrade engineering productivity or release cadence.
Key architectural design decisions for production-representative testing
Shared infrastructure with deterministic scoping
To guarantee interference-free execution, where no test relies on pre-existing state or side effects, we enforce the following principles:
- Clean-slate setup: Every test case initializes its full scenario from zero: users, accounts, purchases, and all required domain entities are created explicitly within the test.
- Scoped database isolation: Each code branch operates within its own logical scope in the database, preventing unintended data visibility across parallel runs.
- Query-level result restriction (when required): For stricter isolation cases, tests restrict queries to return only the entities they created, ensuring deterministic behavior even in shared infrastructure.
Ensure interference-free testing via full environment isolation
To keep tests interference-free, each test case starts from a true zero state at the infrastructure level:
- Dedicated runtime per test case: Each case boots its own application servers and required services from scratch (fresh configs, fresh process state).
- No shared in-memory state: Isolation is enforced by process boundaries (separate forks), preventing cache, singleton, or actor system leakage.
- Deterministic startup and wiring: Servers discover and connect only to the resources assigned to that test (hosts/ports/config), so parallel runs can’t interfere.
In a JVM environment, for example, each suite runs in its own forked process with explicit JVM options:
Tests.SubProcess(ForkOptions()
  .withRunJVMOptions(scala.Vector(
    "-XX:ActiveProcessorCount=3",
    "-Dscalate.allowCaching=false",
    "-Dscalate.allowReload=false",
    "-Dlog4j.configurationFile=scala-build-tool-resources/log4j2.properties",
    "-Xmx12g"
  )))
This ensures every test validates a complete user flow in a self-contained environment, independent of execution order or other tests.
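The fork options above can be wired into the build so that every suite automatically gets its own JVM. A minimal build.sbt sketch, assuming sbt as the build tool (which the ForkOptions API suggests); the exact grouping policy is illustrative:

```scala
// Each defined test suite becomes its own group running in a dedicated
// forked JVM, so caches, singletons, and actor systems cannot leak
// between parallel suites.
Test / testGrouping := (Test / definedTests).value.map { suite =>
  Tests.Group(
    name = suite.name,
    tests = Seq(suite),
    runPolicy = Tests.SubProcess(
      ForkOptions().withRunJVMOptions(Vector("-Xmx12g"))
    )
  )
}
```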
Use self-hosted runners to reduce costs and resource bottlenecks
We run our tests on self-hosted runners to bypass the CPU and memory limits of standard CI provider tiers. While this requires us to manage the hardware ourselves, it eliminates expensive subscription fees and prevents vendor lock-in.
How we built a custom CI/CD for flexibility and vendor independence
Classify and retry flaky tests
In large-scale, integration-heavy environments, occasional flaky tests are inevitable. The key is managing them without undermining confidence in the pipeline.
- Non-critical classification: Tests that depend on external systems or network conditions are explicitly flagged as non-critical.
- Automatic retries: These tests may be retried to absorb transient failures (e.g., network glitches, sandbox instability).
- Signal versus noise monitoring: Isolated failures are treated as noise; sudden spikes indicate real systemic issues, such as third-party outages.
- Operational flexibility: In emergency hotfix scenarios, we retain a controlled mechanism to temporarily bypass non-critical tests.
This approach keeps CI stable and actionable while preventing transient infrastructure issues from blocking delivery.
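The retry policy for non-critical tests can be sketched as a small helper; the real pipeline hooks an equivalent into the test framework, and the attempt count here is illustrative:

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object NonCriticalRetry {
  // Re-run the body up to `attemptsLeft` times, absorbing transient failures.
  @tailrec
  def withRetries[A](attemptsLeft: Int)(body: => A): A =
    Try(body) match {
      case Success(result) => result
      case Failure(_) if attemptsLeft > 1 =>
        // Transient failure (network glitch, sandbox instability): retry.
        withRetries(attemptsLeft - 1)(body)
      case Failure(error) =>
        // Out of attempts: surface the real failure to CI.
        throw error
    }
}
```

Only tests flagged as non-critical go through this path; critical tests fail immediately so genuine regressions are never absorbed.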
Centralize host allocation for parallel end-to-end tests
In our end-to-end tests, each suite boots one or more application servers. Every server must bind to a unique host. If two suites attempt to bind the same address concurrently, the result is a port bind failure and flaky CI behavior.
Originally, host conflicts were detected manually by searching for duplicate string literals across the codebase, a brittle, error-prone approach.
We replaced this with a centralized host registry:
object Hosts {
  val `127_0_0_1` = "127.0.0.1"
  val `127_0_0_2` = "127.0.0.2"
  // ...
}
All test suites must allocate hosts from this table. Why does this matter?
- Eliminates string duplication errors
- Makes host usage visible via IDE reference counts
- Enables automated duplicate detection in CI
- Simplifies safe parallel execution
This is particularly important for tests simulating production-like, clustered setups (for example, 3–5-node actor system clusters). Each node must bind to a distinct IP to form a proper cluster. Centralized allocation ensures deterministic configuration and prevents accidental overlap across parallel runs.
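The automated duplicate detection can be a one-line CI guard, assuming the registry exposes an explicit list of all allocated hosts (the `all` field and `HostRegistryCheck` are our additions for illustration):

```scala
object Hosts {
  val `127_0_0_1` = "127.0.0.1"
  val `127_0_0_2` = "127.0.0.2"

  // Single source of truth for the CI guard below.
  val all: Seq[String] = Seq(`127_0_0_1`, `127_0_0_2`)
}

// Run in CI: fails fast if two suites were ever assigned the same host.
object HostRegistryCheck {
  def verify(): Unit = {
    val duplicates = Hosts.all.diff(Hosts.all.distinct)
    require(duplicates.isEmpty, s"Duplicate host allocation: $duplicates")
  }
}
```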
Automate server and cluster coordination
Once a test defines the number of server instances and their assigned hosts, the rest of the cluster setup is fully automated.
Our testing infrastructure dynamically generates the required configuration so that:
- Each server instance starts with the correct host and port bindings.
- Nodes are provided with the appropriate seed nodes.
- All nodes can discover each other deterministically.
- The cluster forms without manual wiring.
Tests only declare the topology (for example, single-node, 3-node, or 5-node). Cluster discovery, seed configuration, and inter-node connectivity are derived automatically. This removes configuration duplication, prevents misconfigured clusters, and ensures reproducible cluster formation under parallel execution.
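Deriving cluster wiring from a declared topology can be sketched as follows. The names (`NodeConfig`, `ClusterTopology`) are illustrative, and the `akka://` address scheme assumes an Akka-style actor cluster; the real generator also handles ports and per-node config files:

```scala
// One node's computed wiring: where it binds and which seeds it joins.
final case class NodeConfig(host: String, port: Int, seedNodes: Seq[String])

object ClusterTopology {
  def derive(hosts: Seq[String], port: Int = 2551): Seq[NodeConfig] = {
    // All nodes share the same deterministic seed list (the first two hosts),
    // so cluster formation is reproducible across parallel runs.
    val seeds = hosts.take(2).map(h => s"akka://test@$h:$port")
    hosts.map(h => NodeConfig(h, port, seeds))
  }
}
```

A test declaring a 3-node topology just passes three hosts from the central registry; everything downstream is computed.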
Prefer end-to-end tests over unit tests
Unit tests validate isolated logic. They do not expose failures in:
- Serialization and deserialization boundaries
- Middleware chains and request filters
- Real database queries and indexing
- Third-party integrations and network behavior
- Cluster coordination and service discovery
End-to-end tests exercise the system as it actually runs. They validate real processes, real I/O, and real integration points. In distributed systems, correctness emerges from interactions, not from isolated functions.
Build your DSL for end-to-end testing
Writing raw E2E tests quickly leads to duplication and unreadable orchestration code. Instead, we built a domain-specific language (DSL) composed of reusable building blocks.
Low-level operations (HTTP calls, authentication, state transitions) are abstracted into higher-level domain actions. For example, a multi-step user registration flow becomes a single reusable operation that can be invoked across tests.
Client
  .plainUnconfirmed() // Registers a client from scratch.
  .flow.progresses.initialize()
  .stateCheck(_.metadata.name.isEmpty shouldBe true)
  .notes(notes)
  .stateCheck(_.metadata.notes.size shouldEqual notes.size)
This approach:
- Keeps tests compact and intention-focused
- Eliminates repetitive setup logic
- Improves readability and maintainability
- Encourages consistent interaction patterns
Tests describe what is being validated, not how the system is driven. The DSL turns complex integration flows into composable, expressive scenarios.
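The builder pattern behind such a DSL can be sketched in a few lines. The real `Client` DSL drives HTTP calls against running servers; here the state is held in memory purely to show the shape (`ClientState` and these simplified method signatures are our own):

```scala
// Simplified snapshot of the state a chained scenario accumulates.
final case class ClientState(name: Option[String], notes: List[String])

final class ClientDsl(state: ClientState) {
  // Higher-level domain action: attach notes in one step.
  def notes(ns: List[String]): ClientDsl =
    new ClientDsl(state.copy(notes = state.notes ++ ns))

  // Assertion hook: run a check against the current state, keep chaining.
  def stateCheck(check: ClientState => Unit): ClientDsl = {
    check(state)
    this
  }
}

object Client {
  // Registers a client from scratch (reduced to a zero state here).
  def plainUnconfirmed(): ClientDsl = new ClientDsl(ClientState(None, Nil))
}
```

Because every step returns the builder, scenarios read top to bottom as a single chained expression, exactly as in the example above.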
Use real integrations, never mock
Tests that depend on third-party services inevitably encounter rate limits and occasional sandbox instability. Rather than heavily mocking these integrations, we prefer exercising real external systems.
Our approach:
- Minimize mocking to preserve true end-to-end behavior
- Use dedicated test data to avoid cross-test interference
- Actively manage rate limits across parallel executions
- Control concurrency to simulate a realistic production load
This allows us to validate retry logic, failure handling, and resilience under real-world conditions, not idealized mocks.
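Within a single runner process, managing rate limits across parallel suites can be as simple as a shared permit pool; the limit of 5 is illustrative, and across forked JVMs the same idea needs an external coordinator rather than an in-process semaphore:

```scala
import java.util.concurrent.Semaphore

// Caps concurrent calls to a rate-limited third-party sandbox so parallel
// suites collectively stay under its quota.
object ThirdPartyGate {
  private val permits = new Semaphore(5)

  def withPermit[A](call: => A): A = {
    permits.acquire() // blocks until a slot is free
    try call
    finally permits.release()
  }
}
```

Tests wrap every outbound third-party call in `withPermit`, so concurrency pressure on the sandbox is bounded regardless of how many suites run at once.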
Strict control of logging
As test volume increases, aggregated output can exceed CI log size limits, leading to truncated logs and lost diagnostic detail. To prevent this, we deliberately tune logging for CI environments:
- Elevated log thresholds to capture only high-signal events
- Reduced verbosity to stay within platform limits
- Structured logging to preserve context in fewer lines
Parallel execution adds another layer of complexity. Logs from multiple suites interleave, breaking the natural chronological flow of a single test. This makes failure analysis harder, especially for timing-sensitive or concurrency-related defects that are difficult to reproduce locally.
To mitigate this, we complement logs with structured tracing. Traces preserve execution context across services and threads, enabling control-flow reconstruction even when logs are sparse or interleaved.
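Since the fork options above point at a log4j2.properties file, the CI tuning can be sketched as a fragment along these lines (values are illustrative, not our exact config):

```properties
# Raise the root threshold so only high-signal events reach CI logs.
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = STDOUT

# Single-line, timestamped pattern preserves context in fewer lines.
appender.console.type = Console
appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{ISO8601} %-5p [%t] %c{1} - %m%n
```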
Key takeaways for decision makers
When these architectural decisions are consistently applied, we typically observe the following results after production deployment:
| Area | Typical vendor projects | With production-representative testing |
| --- | --- | --- |
| Critical production defects (1 year) | Multiple high-severity incidents are common | 0 critical bugs |
| Minor defects (1 year) | Ongoing usability-impacting issues | ~2 trivial bugs, no usability impact |
| Mean time to root cause (MTTR analysis) | Hours to days | ~10 minutes via tracing and reproducibility |
| Time to reproduce an issue | Often unclear or non-deterministic | < 30 minutes in a controlled test environment |
| Fix delivery time | Delayed by unstable pipelines | Immediate commit to CI after reproduction |
| CI infrastructure cost | High cloud build-hour expenses | Avoids thousands of build hours/month via controlled infrastructure |
| Release frequency | 1 release every 2–4 weeks | 1–5 releases per week (on-demand possible) |
| Change failure rate | 15–30% of releases require a hotfix or rollback | < 2–5% change failure rate |
| Onboarding and knowledge-transfer efficiency | 2–3 months to reach productive autonomy | 2–4 weeks to productive autonomy |
| Incident stress reduction | 2–6 high-severity production incidents/year; frequent after-hours escalation; 2–5 engineers involved per major incident | 0–1 high-severity incidents/year; rare after-hours escalation; typically 1–2 engineers resolve within 30–60 minutes |
Conclusion
Investing in testing architecture reduces operational risk, accelerates delivery, and lowers long-term cost. Reliability is not overhead. It is a competitive advantage.