Invent Labs
// work · advisory + engineering
infra

Observability cost rescue for a post-Series-B SaaS.

Cut observability spend by two-thirds without losing a single trace; rebuilt the alerting pipeline as a side benefit.

observability spend: −68%
P99 trace query: 4.2s → 280ms
annual savings: $1.2M


Problem

The company had grown into a substantial observability bill. The vendor contract was up for renewal, the proposed renewal was a meaningful fraction of an engineer's salary per month, and the engineering org had quietly split into two camps. The SRE team wanted to renew — the tooling was familiar and the runbooks were written against it. The platform team wanted to migrate — the bill had become a board-level conversation, and they suspected most of the cost was driven by data nobody actually queried.

Both sides were partly right. Renewing on the existing pricing model meant accepting a cost line that would scale super-linearly with the next round of growth. Migrating without a plan meant a six-month project to recreate dashboards that nobody had ever looked at, on infrastructure that nobody on the team had operated before.

Approach

We started with a usage audit. Two weeks of tagging every query against every dashboard, every alert, every saved view. The output was unsurprising in shape and surprising in degree: 91% of traces had never been queried after ingest. 70% of the alerts had not fired in twelve months. The cost was being driven by retention policies that were, on inspection, defaults that nobody had ever revisited.
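The mechanics of the audit are simple enough to sketch. This is an illustration only — it assumes a hypothetical export of ingested trace IDs, a query log keyed by trace ID, and per-alert last-fired timestamps; all field shapes here are invented:

```python
from datetime import datetime, timedelta

def audit_usage(traces, query_log, alerts, now, window_days=365):
    """Summarize which telemetry is actually read.

    `traces`: iterable of trace IDs ever ingested (hypothetical export).
    `query_log`: dict mapping trace ID -> last-read timestamp.
    `alerts`: dict mapping alert name -> last-fired timestamp (None if never).
    """
    queried = set(query_log)
    total = unread = 0
    for trace_id in traces:
        total += 1
        if trace_id not in queried:
            unread += 1

    # An alert that has not fired inside the window is a candidate for deletion.
    cutoff = now - timedelta(days=window_days)
    stale_alerts = [name for name, fired in alerts.items()
                    if fired is None or fired < cutoff]

    return {
        "traces_never_queried_pct": round(100 * unread / max(total, 1), 1),
        "stale_alert_pct": round(100 * len(stale_alerts) / max(len(alerts), 1), 1),
        "stale_alerts": stale_alerts,
    }
```

The useful property is that the output is a number, not an opinion — "91% never queried" ends a renewal debate faster than any architecture argument.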

With the audit in hand, the migration was no longer a six-month project. It was three buckets. Metrics, where the existing vendor's pricing was actually competitive — we kept them. Logs, where the cost was 4× a managed ClickHouse footprint at the same retention — we moved them. Traces, where the cost was dominated by data nobody read — we moved them too, with aggressive sampling on the hot paths and full fidelity on the long tail.

The toolchain landed on OpenTelemetry as the wire format, ClickHouse as the store, and Grafana as the query surface. None of these were novel choices. The novelty was in the sampling and retention policies, which were derived from the audit and re-checked monthly against actual query patterns.
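The monthly re-check is worth making concrete. A minimal sketch of the drift check, with invented field names — the idea is to compare each route's configured sample rate against how often its traces are actually read, and flag routes where the two have diverged:

```python
def flag_policy_drift(sample_rates, read_rates, low=0.001, high=0.05):
    """Flag routes whose sampling policy no longer matches query behavior.

    `sample_rates`: dict mapping route -> configured sample fraction.
    `read_rates`: dict mapping route -> observed fraction of traces read.
    Both are hypothetical exports; thresholds are illustrative.
    """
    over, under = [], []
    for route, rate in sample_rates.items():
        reads = read_rates.get(route, 0.0)
        if rate > high and reads < low:
            over.append(route)    # sampled heavily, read rarely: wasted spend
        elif rate < low and reads > high:
            under.append(route)   # sampled thinly, read often: missing data
    return {"over_sampled": over, "under_sampled": under}
```

Run against last month's query patterns, the output is a short list of routes to re-tune rather than a standing argument about policy.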

Decisions & trade-offs

  • Audited before migrating. The audit took two weeks. The migration it enabled took three months; without the audit, the plan was six. Two weeks of measurement is almost always cheaper than the project it shapes.
  • Kept metrics on the existing vendor. Migrating the entire stack would have been ideologically satisfying. It would not have been economic. We moved the cost drivers, not the cost line items.
  • Sampled aggressively on hot paths, kept the long tail full-fidelity. A naive sampling policy loses the data you most need. A per-route, per-error-class sampling policy preserves the rare failures and discards the common successes. This is where most of the trace-query latency win came from.
  • Made the policies inspectable. Sampling rates, retention policies, and ingest costs live in the same Grafana folder as the dashboards they support. Anyone on the team can see, today, why a given query is fast or slow.
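The per-route, per-error-class sampling policy above can be sketched as a single decision function. This is an illustration of the idea, not the production code — route names, rates, and thresholds are invented:

```python
import hashlib

# Hypothetical per-route base rates: hot, high-volume routes are sampled
# thinly; anything not listed (the long tail) keeps full fidelity.
BASE_RATES = {
    "/healthz": 0.0001,
    "/api/feed": 0.01,
}

def keep_trace(route: str, status: int, trace_id: str) -> bool:
    """Per-route, per-error-class sampling decision.

    Server errors are always kept — rare failures are the data you most
    need. Common successes are sampled at the route's base rate,
    deterministically by trace ID so a trace is kept or dropped whole.
    """
    if status >= 500:                    # never drop server errors
        return True
    rate = BASE_RATES.get(route, 1.0)    # long tail: full fidelity
    if rate >= 1.0:
        return True
    # Hash the trace ID into [0, 1) so every span of a trace gets the
    # same verdict, regardless of which collector sees it.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing the trace ID rather than rolling a random number is the load-bearing choice: it makes the decision reproducible across collectors, so sampled traces stay complete.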

Outcome

Observability spend down 68% versus the proposed renewal — roughly $1.2M a year back on the business. Trace query P99 down from 4.2 seconds (vendor) to 280ms (ClickHouse on the new footprint), which has measurably improved the SRE team's debugging cadence. The alerting pipeline got rebuilt as a side effect of the migration — most of the dead alerts simply weren't ported over — and false-positive page volume is down roughly an order of magnitude as a result.

// work · next step

Recognize the shape of this one?

If your team is staring at a problem with this silhouette, that's usually a good signal an engagement would be useful. The first conversation is free and short.