Load simulation & latency percentiles
Skeema models how your architecture behaves as traffic grows — from 100 to 100,000 concurrent users — and reports the numbers engineers actually use to reason about performance. This page explains those numbers from first principles.
The two questions performance answers
Every performance discussion comes down to two measurements:
- Latency: How long a single request takes, end to end. Measured in milliseconds (ms). Lower is better.
- Throughput: How many requests the system handles per unit time, in requests per second (RPS) or queries per second (QPS). Higher is better.
They’re related but distinct: a system can have low latency at low load and still collapse under high throughput. Skeema simulates both — it ramps throughput (the user tier) and reports the resulting latency.
Why averages lie — and percentiles don’t
The average (mean) latency hides the experience of your slowest users. If 99 requests take 50 ms and one takes 5,000 ms, the average is ~100 ms — which no single user experienced. Percentiles describe the distribution instead.
| Percentile | Reads as | Meaning |
|---|---|---|
| P50 | Median | Half of requests are faster than this. The “typical” experience. |
| P90 | 90th percentile | 9 in 10 requests are faster than this. |
| P95 | 95th percentile | Common SLO target; only 1 in 20 requests is slower. |
| P99 | 99th percentile | Tail latency. 1 in 100 requests is slower: your worst real experiences. |
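To make the gap between the mean and the percentiles concrete, here is a minimal sketch that computes both for the 99-fast, 1-slow example above. The `percentile` helper uses the nearest-rank definition, which is one common convention and an assumption here (the page does not say which definition Skeema uses). The `two_slow` sample is a hypothetical variation added for comparison.

```python
from statistics import mean


def percentile(samples_ms, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample with at least p% of samples at or below it.
    """
    ordered = sorted(samples_ms)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


# The example from the text: 99 requests at 50 ms, one at 5,000 ms.
one_slow = [50] * 99 + [5000]

# Hypothetical variation: two slow requests out of 100.
two_slow = [50] * 98 + [5000] * 2

for name, sample in (("one_slow", one_slow), ("two_slow", two_slow)):
    print(
        name,
        "mean:", mean(sample),
        "P50:", percentile(sample, 50),
        "P99:", percentile(sample, 99),
        "max:", max(sample),
    )
```

In the first sample the mean is 99.5 ms while P50 is 50 ms, so the mean describes nobody. With only one slow request in 100, nearest-rank P99 still reads 50 ms and only the maximum shows the outlier; once two requests in 100 are slow, P99 jumps to 5,000 ms. This is why tail percentiles need enough samples to be meaningful.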
Why the tail (P99) matters more than you’d think
A single user action often fans out into many backend calls. If one page makes 20 service calls and each has a 1% chance of being slow, the probability that at least one is slow is about 1 − 0.99²⁰ ≈ 18%. So a “1% tail” at the service level becomes a ~1-in-5 slow page at the user level. This is why teams set SLOs on P95/P99, not averages.
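The fan-out arithmetic above can be checked with a few lines of Python. This sketch assumes the backend calls are independent and that each has the same chance of being slow; the function name `p_any_slow` is illustrative.

```python
def p_any_slow(calls: int, p_slow: float) -> float:
    """Probability that at least one of `calls` independent calls is slow."""
    return 1 - (1 - p_slow) ** calls


# How a 1% per-call tail compounds as one page fans out to more calls.
for calls in (1, 5, 20, 100):
    print(f"{calls:>3} calls -> {p_any_slow(calls, 0.01):.1%} chance of a slow page")
```

For 20 calls this gives roughly 18%, matching the figure in the text. Real services are rarely fully independent, so treat the result as an approximation.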
How a bottleneck forms
Every component has a capacity — a ceiling on requests per second. As load climbs toward that ceiling, requests start to queue, and queueing time dominates latency (this is the practical lesson of queueing theory: wait time rises sharply as utilization approaches 100%). The first component to saturate is the bottleneck — and the weakest link sets the throughput of the whole path. Skeema names it explicitly.
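A rough way to see both effects is the textbook M/M/1 queue, where mean time in the system is the service time divided by (1 − utilization). The sketch below uses that model purely as an illustration; it is an assumption, not a description of Skeema's internals, and the capacities are hypothetical numbers.

```python
def time_in_system_ms(service_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1 - utilization)


def weakest_link(capacities_rps: dict[str, float]) -> tuple[str, float]:
    """Return (name, capacity) of the lowest-capacity component on a path."""
    name = min(capacities_rps, key=capacities_rps.get)
    return name, capacities_rps[name]


# A component with a 10 ms service time at increasing utilization.
for rho in (0.5, 0.8, 0.9, 0.99):
    print(f"utilization {rho:.0%}: ~{time_in_system_ms(10, rho):.0f} ms")

# Hypothetical capacities for one synchronous path.
path = {"load_balancer": 20_000, "api": 5_000, "postgres": 1_200}
print("bottleneck:", weakest_link(path))
```

Under this model, going from 50% to 90% utilization raises latency about fivefold, and going from 90% to 99% raises it another tenfold. The path as a whole cannot sustain more than its lowest-capacity component.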
How Skeema simulates
The simulation is a transparent heuristic, not a load test:
- Each node has a base latency by type (e.g. cache ≈ 2 ms, Postgres ≈ 18 ms, external API ≈ 250 ms).
- A load multiplier scales latency as the user tier rises and utilization climbs.
- Skeema sums latency along each critical (synchronous) path and adds P95/P99 variance.
- The node carrying the largest share of load is flagged as the bottleneck with a root cause.
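The four steps above could be sketched roughly as follows. Only the three base latencies come from this page. The multiplier formula, the tail factors, the node capacities, and every name in the code (`Node`, `load_multiplier`, `simulate`) are hypothetical choices made for illustration, not Skeema's actual implementation.

```python
from dataclasses import dataclass

# Base latencies quoted in the text; every other number below is an assumption.
BASE_LATENCY_MS = {"cache": 2.0, "postgres": 18.0, "external_api": 250.0}

# Hypothetical tail factors applied to the summed path latency.
P95_FACTOR = 1.5
P99_FACTOR = 2.5
MAX_UTILIZATION = 0.95  # cap so the multiplier stays finite


@dataclass
class Node:
    name: str
    kind: str            # key into BASE_LATENCY_MS
    capacity_rps: float  # hypothetical capacity


def load_multiplier(utilization: float) -> float:
    """Hypothetical multiplier: grows as 1 / (1 - utilization), capped."""
    return 1.0 / (1.0 - min(utilization, MAX_UTILIZATION))


def simulate(path: list[Node], rps: float) -> dict:
    """Estimate P50/P95/P99 for one synchronous path and name the most loaded node."""
    total_ms = 0.0
    worst, worst_util = None, -1.0
    for node in path:
        util = rps / node.capacity_rps
        total_ms += BASE_LATENCY_MS[node.kind] * load_multiplier(util)
        if util > worst_util:
            worst, worst_util = node.name, util
    return {
        "p50_ms": round(total_ms, 1),
        "p95_ms": round(total_ms * P95_FACTOR, 1),
        "p99_ms": round(total_ms * P99_FACTOR, 1),
        "bottleneck": worst,
        "utilization": round(worst_util, 2),
    }


path = [
    Node("redis", "cache", capacity_rps=50_000),
    Node("orders-db", "postgres", capacity_rps=2_000),
    Node("payments", "external_api", capacity_rps=10_000),
]
for rps in (100, 1_000, 1_900):
    print(rps, simulate(path, rps))
```

Even with made-up parameters, the shape of the output matches the behavior described above: latency stays near the sum of base latencies at low load, climbs as the most utilized node approaches its capacity, and that node is the one reported as the bottleneck.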
The architecture score (A–F)
Alongside latency, Skeema grades the design across four dimensions and rolls them into a single A–F score:
| Dimension | What it rewards |
|---|---|
| Reliability | Redundancy, replication, no single points of failure |
| Scalability | Load balancing, caching, async decoupling, horizontal scaling |
| Observability | Monitoring, logging, and tracing components |
| Security | Auth, gateways, and isolation boundaries |
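As an illustration of how four dimension scores might roll up into one letter, here is a minimal sketch. The 0-100 scale, the equal weighting, and the grade cutoffs are all assumptions; this page does not state how Skeema weights or thresholds the dimensions.

```python
DIMENSIONS = ("reliability", "scalability", "observability", "security")

# Hypothetical cutoffs on a 0-100 scale; the text does not state Skeema's thresholds.
GRADE_CUTOFFS = ((90, "A"), (80, "B"), (70, "C"), (60, "D"))


def architecture_grade(scores: dict[str, float]) -> str:
    """Average the four dimension scores (equal weights assumed) and map to a letter."""
    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    for cutoff, letter in GRADE_CUTOFFS:
        if overall >= cutoff:
            return letter
    return "F"


# Hypothetical design: strong on security, weak on observability.
print(architecture_grade(
    {"reliability": 85, "scalability": 70, "observability": 60, "security": 90}
))
```

A single letter is easy to compare across designs, but it hides which dimension pulled the score down, so the per-dimension breakdown is still worth reading.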
For each issue, Skeema proposes a concrete fix — “add a load balancer”, “add a read replica”, “move email to a queue” — and can apply it to the diagram, anchoring the new nodes next to the ones they relate to.
- Latency = how long one request takes; throughput = how many you handle per second.
- Use percentiles, not averages. P95/P99 (the tail) is what users actually feel at scale.
- Bottlenecks form when load nears a component’s capacity and requests queue; the weakest link caps the path.
- Skeema’s simulation is directional: great for comparing designs and finding weak links, not a substitute for a real load test.