Key takeaways
After reading this, the reader will understand:
- E2E testing
- Deterministic quality assurance
- Structured reproducibility
- Controlled infrastructure
Motivation
Running thousands of tests (3,100+) for a SaaS sequentially is not operationally viable at scale. Parallel execution is a deliberate architectural choice driven by the following factors:
- Maintain fast and actionable feedback loops: Keep CI execution time within a practical window to support continuous integration and frequent releases.
- Validate correctness under concurrency: Ensure the system behaves correctly when multiple servers, database operations, and workflows execute simultaneously.
- Simulate production-like load conditions: Exercise realistic parallel activity to uncover race conditions, resource contention, and coordination issues early.
- Enforce strict test isolation guarantees: Design tests and infrastructure so they remain deterministic and interference-free even when executed concurrently.
- Maximize infrastructure efficiency: Fully utilize available CPU and memory resources to improve throughput and reduce cost per build.
- Stress external integrations realistically: Verify rate limiting, retry logic, and resilience mechanisms against third-party services under concurrent load.
- Preserve delivery velocity at scale: Ensure that test growth does not degrade engineering productivity or release cadence.
Key architectural design decisions for production-representative testing
Shared infrastructure with deterministic scoping
To guarantee interference-free execution, where no test relies on pre-existing state or side effects, we enforce the following principles:
- Clean-slate setup: Every test case initializes its full scenario from zero: users, accounts, purchases, and all required domain entities are created explicitly within the test.
- Scoped database isolation: Each code branch operates within its own logical scope in the database, preventing unintended data visibility across parallel runs.
- Query-level result restriction (when required): For stricter isolation cases, tests restrict queries to return only the entities they created, ensuring deterministic behavior even in shared infrastructure.
Ensure interference-free testing via full environment isolation
To keep tests interference-free, each test case starts from a true zero state at the infrastructure level:
- Dedicated runtime per test case: Each case boots its own application servers and required services from scratch (fresh configs, fresh process state).
- No shared in-memory state: Isolation is enforced by process boundaries (separate forks), preventing cache, singleton, or actor system leakage.
- Deterministic startup and wiring: Servers discover and connect only to the resources assigned to that test (hosts/ports/config), so parallel runs can’t interfere.
In a JVM environment, for example, each suite runs in its own forked process with explicit JVM options:
Tests.SubProcess(ForkOptions()
  .withRunJVMOptions(scala.Vector(
    "-XX:ActiveProcessorCount=3",
    "-Dscalate.allowCaching=false",
    "-Dscalate.allowReload=false",
    "-Dlog4j.configurationFile=scala-build-tool-resources/log4j2.properties",
    "-Xmx12g"
  )))
This ensures every test validates a complete user flow in a self-contained environment, independent of execution order or other tests.
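The fork options above can be wired into the build so that every suite automatically gets its own JVM. A minimal build.sbt sketch, assuming sbt as the build tool (which the ForkOptions API suggests); the exact grouping policy is illustrative:

```scala
// Each defined test suite becomes its own group running in a dedicated
// forked JVM, so caches, singletons, and actor systems cannot leak
// between parallel suites.
Test / testGrouping := (Test / definedTests).value.map { suite =>
  Tests.Group(
    name = suite.name,
    tests = Seq(suite),
    runPolicy = Tests.SubProcess(
      ForkOptions().withRunJVMOptions(Vector("-Xmx12g"))
    )
  )
}
```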
Use self-hosted runners to reduce costs and resource bottlenecks
We run our tests on self-hosted runners to bypass the CPU and memory limits of standard CI provider tiers. While this requires us to manage the hardware ourselves, it eliminates expensive subscription fees and prevents vendor lock-in.
How we built a custom CI/CD for flexibility and vendor independence
Classify and retry flaky tests
In large-scale, integration-heavy environments, occasional flaky tests are inevitable. The key is managing them without undermining confidence in the pipeline.
- Non-critical classification: Tests that depend on external systems or network conditions are explicitly flagged as non-critical.
- Automatic retries: These tests may be retried to absorb transient failures (e.g., network glitches, sandbox instability).
- Signal versus noise monitoring: Isolated failures are treated as noise; sudden spikes indicate real systemic issues, such as third-party outages.
- Operational flexibility: In emergency hotfix scenarios, we retain a controlled mechanism to temporarily bypass non-critical tests.
This approach keeps CI stable and actionable while preventing transient infrastructure issues from blocking delivery.
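The retry policy for non-critical tests can be sketched as a small helper; the real pipeline hooks an equivalent into the test framework, and the attempt count here is illustrative:

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object NonCriticalRetry {
  // Re-run the body up to `attemptsLeft` times, absorbing transient failures.
  @tailrec
  def withRetries[A](attemptsLeft: Int)(body: => A): A =
    Try(body) match {
      case Success(result) => result
      case Failure(_) if attemptsLeft > 1 =>
        // Transient failure (network glitch, sandbox instability): retry.
        withRetries(attemptsLeft - 1)(body)
      case Failure(error) =>
        // Out of attempts: surface the real failure to CI.
        throw error
    }
}
```

Only tests flagged as non-critical go through this path; critical tests fail immediately so genuine regressions are never absorbed.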
Centralize host allocation for parallel end-to-end tests
In our end-to-end tests, each suite boots one or more application servers. Every server must bind to a unique host. If two suites attempt to bind the same address concurrently, the result is a port bind failure and flaky CI behavior.
Originally, host conflicts were detected manually by searching for duplicate string literals across the codebase, a brittle, error-prone approach.
We replaced this with a centralized host registry:
object Hosts {
  val `127_0_0_1` = "127.0.0.1"
  val `127_0_0_2` = "127.0.0.2"
  // ...
}
All test suites must allocate hosts from this table. Why does this matter?
- Eliminates string duplication errors
- Makes host usage visible via IDE reference counts
- Enables automated duplicate detection in CI
- Simplifies safe parallel execution
This is particularly important for tests simulating production-like, clustered setups (for example, 3–5-node actor system clusters). Each node must bind to a distinct IP to form a proper cluster. Centralized allocation ensures deterministic configuration and prevents accidental overlap across parallel runs.
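The automated duplicate detection can be a one-line CI guard, assuming the registry exposes an explicit list of all allocated hosts (the `all` field and `HostRegistryCheck` are our additions for illustration):

```scala
object Hosts {
  val `127_0_0_1` = "127.0.0.1"
  val `127_0_0_2` = "127.0.0.2"

  // Single source of truth for the CI guard below.
  val all: Seq[String] = Seq(`127_0_0_1`, `127_0_0_2`)
}

// Run in CI: fails fast if two suites were ever assigned the same host.
object HostRegistryCheck {
  def verify(): Unit = {
    val duplicates = Hosts.all.diff(Hosts.all.distinct)
    require(duplicates.isEmpty, s"Duplicate host allocation: $duplicates")
  }
}
```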
Automate server and cluster coordination
Once a test defines the number of server instances and their assigned hosts, the rest of the cluster setup is fully automated.
Our testing infrastructure dynamically generates the required configuration so that:
- Each server instance starts with the correct host and port bindings.
- Nodes are provided with the appropriate seed nodes.
- All nodes can discover each other deterministically.
- The cluster forms without manual wiring.
Tests only declare the topology (for example, single-node, 3-node, or 5-node). Cluster discovery, seed configuration, and inter-node connectivity are derived automatically. This removes configuration duplication, prevents misconfigured clusters, and ensures reproducible cluster formation under parallel execution.
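Deriving cluster wiring from a declared topology can be sketched as follows. The names (`NodeConfig`, `ClusterTopology`) are illustrative, and the `akka://` address scheme assumes an Akka-style actor cluster; the real generator also handles ports and per-node config files:

```scala
// One node's computed wiring: where it binds and which seeds it joins.
final case class NodeConfig(host: String, port: Int, seedNodes: Seq[String])

object ClusterTopology {
  def derive(hosts: Seq[String], port: Int = 2551): Seq[NodeConfig] = {
    // All nodes share the same deterministic seed list (the first two hosts),
    // so cluster formation is reproducible across parallel runs.
    val seeds = hosts.take(2).map(h => s"akka://test@$h:$port")
    hosts.map(h => NodeConfig(h, port, seeds))
  }
}
```

A test declaring a 3-node topology just passes three hosts from the central registry; everything downstream is computed.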
Prefer end-to-end tests over unit tests
Unit tests validate isolated logic. They do not expose failures in:
- Serialization and deserialization boundaries
- Middleware chains and request filters
- Real database queries and indexing
- Third-party integrations and network behavior
- Cluster coordination and service discovery
End-to-end tests exercise the system as it actually runs. They validate real processes, real I/O, and real integration points. In distributed systems, correctness emerges from interactions, not from isolated functions.
Build your DSL for end-to-end testing
Writing raw E2E tests quickly leads to duplication and unreadable orchestration code. Instead, we built a domain-specific language (DSL) composed of reusable building blocks.
Low-level operations (HTTP calls, authentication, state transitions) are abstracted into higher-level domain actions. For example, a multi-step user registration flow becomes a single reusable operation that can be invoked across tests.
Client
  .plainUnconfirmed() // Registers a client from scratch.
  .flow.progresses.initialize()
  .stateCheck(_.metadata.name.isEmpty shouldBe true)
  .notes(notes)
  .stateCheck(_.metadata.notes.size shouldEqual notes.size)
This approach:
- Keeps tests compact and intention-focused
- Eliminates repetitive setup logic
- Improves readability and maintainability
- Encourages consistent interaction patterns
Tests describe what is being validated, not how the system is driven. The DSL turns complex integration flows into composable, expressive scenarios.
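The builder pattern behind such a DSL can be sketched in a few lines. The real `Client` DSL drives HTTP calls against running servers; here the state is held in memory purely to show the shape (`ClientState` and these simplified method signatures are our own):

```scala
// Simplified snapshot of the state a chained scenario accumulates.
final case class ClientState(name: Option[String], notes: List[String])

final class ClientDsl(state: ClientState) {
  // Higher-level domain action: attach notes in one step.
  def notes(ns: List[String]): ClientDsl =
    new ClientDsl(state.copy(notes = state.notes ++ ns))

  // Assertion hook: run a check against the current state, keep chaining.
  def stateCheck(check: ClientState => Unit): ClientDsl = {
    check(state)
    this
  }
}

object Client {
  // Registers a client from scratch (reduced to a zero state here).
  def plainUnconfirmed(): ClientDsl = new ClientDsl(ClientState(None, Nil))
}
```

Because every step returns the builder, scenarios read top to bottom as a single chained expression, exactly as in the example above.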
Use real integrations, never mock
Tests that depend on third-party services inevitably encounter rate limits and occasional sandbox instability. Rather than heavily mocking these integrations, we prefer exercising real external systems.
Our approach:
- Minimize mocking to preserve true end-to-end behavior
- Use dedicated test data to avoid cross-test interference
- Actively manage rate limits across parallel executions
- Control concurrency to simulate a realistic production load
This allows us to validate retry logic, failure handling, and resilience under real-world conditions, not idealized mocks.
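Within a single runner process, managing rate limits across parallel suites can be as simple as a shared permit pool; the limit of 5 is illustrative, and across forked JVMs the same idea needs an external coordinator rather than an in-process semaphore:

```scala
import java.util.concurrent.Semaphore

// Caps concurrent calls to a rate-limited third-party sandbox so parallel
// suites collectively stay under its quota.
object ThirdPartyGate {
  private val permits = new Semaphore(5)

  def withPermit[A](call: => A): A = {
    permits.acquire() // blocks until a slot is free
    try call
    finally permits.release()
  }
}
```

Tests wrap every outbound third-party call in `withPermit`, so concurrency pressure on the sandbox is bounded regardless of how many suites run at once.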
Strict control of logging
As test volume increases, aggregated output can exceed CI log size limits, leading to truncated logs and lost diagnostic detail. To prevent this, we deliberately tune logging for CI environments:
- Elevated log thresholds to capture only high-signal events
- Reduced verbosity to stay within platform limits
- Structured logging to preserve context in fewer lines
Parallel execution adds another layer of complexity. Logs from multiple suites interleave, breaking the natural chronological flow of a single test. This makes failure analysis harder, especially for timing-sensitive or concurrency-related defects that are difficult to reproduce locally.
To mitigate this, we complement logs with structured tracing. Traces preserve execution context across services and threads, enabling control-flow reconstruction even when logs are sparse or interleaved.
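Since the fork options above point at a log4j2.properties file, the CI tuning can be sketched as a fragment along these lines (values are illustrative, not our exact config):

```properties
# Raise the root threshold so only high-signal events reach CI logs.
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = STDOUT

# Single-line, timestamped pattern preserves context in fewer lines.
appender.console.type = Console
appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{ISO8601} %-5p [%t] %c{1} - %m%n
```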
Key takeaways for decision makers
When these architectural decisions are consistently applied, we typically observe the following results after production deployment:
| Area | Typical vendor projects | With production-representative testing |
| --- | --- | --- |
| Critical production defects (1 year) | Multiple high-severity incidents are common | 0 critical bugs |
| Minor defects (1 year) | Ongoing usability-impacting issues | ~2 trivial bugs, no usability impact |
| Mean time to root cause (MTTR analysis) | Hours to days | ~10 minutes via tracing and reproducibility |
| Time to reproduce an issue | Often unclear or non-deterministic | < 30 minutes in a controlled test environment |
| Fix delivery time | Delayed by unstable pipelines | Immediate commit to CI after reproduction |
| CI infrastructure cost | High cloud build-hour expenses | Avoids thousands of build hours/month via controlled infrastructure |
| Release frequency | 1 release every 2–4 weeks | 1–5 releases per week (on-demand possible) |
| Change failure rate | 15–30% of releases require a hotfix or rollback | < 2–5% change failure rate |
| Onboarding and knowledge-transfer efficiency | 2–3 months to reach productive autonomy | 2–4 weeks to productive autonomy |
| Incident stress reduction | 2–6 high-severity production incidents/year; frequent after-hours escalation; 2–5 engineers involved per major incident | 0–1 high-severity incidents/year; rare after-hours escalation; typically 1–2 engineers resolve within 30–60 minutes |
Conclusion
Investing in testing architecture reduces operational risk, accelerates delivery, and lowers long-term cost. Reliability is not overhead. It is a competitive advantage.