// work · engineering
fintech · NDA

Capital-markets event ingest at sub-millisecond latency.

Re-architected a brittle pipeline to handle 4× the previous peak with zero degradation in P99.

peak throughput: 4×
P99 latency: ≤ baseline
prod incidents: 0

Problem

The platform's market-data ingest had been built over four years by three different teams. It worked — most days. On the days the market didn't behave like the test fixtures, it didn't. The most expensive failure modes were silent: a partition would lag by hundreds of milliseconds, a downstream consumer would miss a tick, and a trader would see a stale book without anyone seeing a red light on a dashboard.

Two structural issues were driving the pain. The pipeline blended hot-path ingest with slow analytical writes on the same consumers, so any analytical workload — a backfill, an index rebuild, a noisy customer — created head-of-line blocking on the market-data path. And the latency budget had no contract: every team had a different definition of "acceptable" P99, and none of them were enforced in code.

Approach

We rebuilt the ingest path around three commitments. The hot path does one thing: receive, normalize, fan out. The analytical path runs on its own consumers, against its own broker topics, with its own back-pressure policy. And every component publishes a latency histogram on a known channel, so the budget becomes observable instead of aspirational.
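A minimal sketch of that hot-path shape, in Go. The names here (Tick, normalize, fanOut) are illustrative stand-ins and the venue-specific decoding is elided; this is the shape of the loop, not the production service.

```go
// Illustrative hot-path loop: receive, normalize, fan out. Nothing else
// runs here; analytical work lives on separate consumers entirely.
package main

import (
	"fmt"
	"time"
)

// Tick is a stand-in for a normalized market-data event.
type Tick struct {
	Instrument string
	Raw        []byte
	Received   time.Time
}

// normalize turns a raw feed frame into a Tick. This is the only
// transformation the hot path performs; venue decoding is elided.
func normalize(raw []byte) Tick {
	return Tick{Raw: raw, Received: time.Now()}
}

// fanOut hands the tick to each downstream channel. The analytical
// side's back-pressure policy lives on its own consumers, not here.
func fanOut(t Tick, outs []chan Tick) {
	for _, out := range outs {
		out <- t
	}
}

func main() {
	raws := make(chan []byte, 1024) // frames from the feed handler
	out := make(chan Tick, 1024)    // one downstream consumer for the demo

	go func() {
		for raw := range raws {
			fanOut(normalize(raw), []chan Tick{out})
		}
	}()

	raws <- []byte("raw venue frame") // stand-in for a real frame
	t := <-out
	fmt.Printf("tick received at %s\n", t.Received.Format(time.RFC3339Nano))
}
```

The shape is the point: the slow path is not a slower branch of this loop, it is a different consumer group on different topics.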

The transport stayed on Kafka — replacing it would have been a year of work for a single-digit-percent latency win. What changed was how it was used: partitioning by instrument family rather than by customer, idempotent producer semantics on every write, and a strict separation between the topics consumed by the trading path and the topics consumed by analytics. ClickHouse handled the analytical sink because its column-store profile fit the access pattern; the trading path never touched it.
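On the producer side, those choices are mostly configuration. A hedged sketch with confluent-kafka-go follows; the topic name and instrument-family key are hypothetical stand-ins, since the real ones are under NDA.

```go
// Sketch of the producer-side choices: idempotent writes, and the message
// key set to the instrument family so per-family ordering holds. Topic
// and key names are hypothetical.
package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092",
		"enable.idempotence": true, // no duplicate writes on producer retry
		"acks":               "all",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	// Trading and analytics consume disjoint topics; this is the trading one.
	topic := "md.trading.equities" // hypothetical name
	err = p.Produce(&kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
		// Keying by instrument family, not customer: the default partitioner
		// hashes the key, so one family's ticks stay on one partition, in order.
		Key:   []byte("equities.us.large-cap"), // hypothetical family key
		Value: []byte("normalized tick payload"),
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	p.Flush(5000) // wait up to 5s for delivery reports
}
```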

Decisions & trade-offs

  • Kept Kafka, replaced the consumer topology. The cost of moving to a different log was high and the win was speculative. The cost of redesigning consumer responsibilities was a few weeks and the win was deterministic.
  • Pushed normalization into a single, boring Go service. A previous team had introduced a streaming framework to handle this. It added 1.4ms of P99 and three new failure modes. We replaced it with a 600-line service that did exactly one thing.
  • Made the latency budget a first-class object. Every service emitted its own histogram against a published budget. Breaches paged. This was the change that made the others stick — once latency was contractual, the team stopped treating it as a feeling. The emitting side is sketched after this list.
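
Here is roughly what first-class means on the emitting side, sketched with prometheus/client_golang. The metric name, bucket edges, and budget value are assumptions; the write-up mentions a published budget but not its number.

```go
// Sketch of a per-service latency histogram scored against a published
// budget. Names and values here are assumptions, not the real contract.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var ingestLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name: "md_ingest_latency_seconds",
	Help: "Receive-to-fan-out latency, scored against the published budget.",
	// Buckets bracket a hypothetical 0.5ms budget so a breach shows up
	// as mass above it.
	Buckets: []float64{0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005},
})

// observe records one event's hot-path latency.
func observe(received time.Time) {
	ingestLatency.Observe(time.Since(received).Seconds())
}

func main() {
	// Breaches page via an alert on the histogram's P99, e.g. (PromQL):
	//   histogram_quantile(0.99,
	//     sum(rate(md_ingest_latency_seconds_bucket[5m])) by (le)) > 0.0005
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```

Once the alert threshold and the histogram come from the same published number, "acceptable P99" stops being a per-team opinion.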

Outcome

The pipeline now handles four times the previous peak load with P99 inside the published budget. No production incidents in the six months following handoff. The analytical path runs on its own infrastructure, which means the trading path is no longer competing for IOPS with a quarterly backfill job. The team that took over the system from us has been able to extend it without re-introducing the original failure shape — which, in the end, is the metric that mattered.

// work · next step

Recognize the shape of this one?

If your team is staring at a problem with this silhouette, that's usually a good signal an engagement would be useful. The first conversation is free and short.