Tags: kubernetes, observability, grafana, victoria-metrics

Building a Production Observability Stack on Kubernetes

February 28, 2025 · 12 min read

A year ago, our observability was a patchwork: Datadog for some services, CloudWatch for AWS infra, a neglected Prometheus installation nobody trusted, and Loki logs that nobody queried because the dashboards were broken. Engineers copy-pasted metrics queries from Slack threads. SLOs were defined in a spreadsheet.

We fixed it. This is the story of how — and what it cost us (and saved us).

The problem with "just use Datadog"

Datadog is excellent software. It's also extraordinarily expensive at scale. When you have hundreds of services emitting thousands of metrics at 15-second intervals, the cardinality bill arrives like a surprise houseguest and refuses to leave.

More importantly: paying a vendor to own your observability creates a subtle dependency that compounds over time. Dashboards live in their UI. Alerts are configured through their system. When you want to query raw data or build a custom exporter, you're working against the grain.

We wanted an observability stack we owned, understood, and could evolve cheaply.

The stack we landed on

Metrics:    VictoriaMetrics (single-node for now, cluster-ready)
Logs:       Loki + Grafana Alloy (collector)
Traces:     Tempo (lightweight, S3-backed)
Dashboards: Grafana (single pane of glass)
Alerting:   Grafana Alerting → incident.io → PagerDuty

All running on Kubernetes. All deployed via Helm with GitOps (ArgoCD).

Why VictoriaMetrics over Prometheus

Prometheus is the standard, but it has a dirty secret: it's not designed for long-term storage or high cardinality. We were hitting memory pressure and retention limits constantly.

VictoriaMetrics gives you:

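  • Better on-disk compression (typically 5-10x vs Prometheus in our experience)
  • MetricsQL, a query language that is largely backwards-compatible with PromQL and adds extra functions on top
  • Much lower memory footprint
  • Downsampling for long-term retention (check the edition you run: this is an Enterprise feature as far as we know)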

The migration was smoother than expected because it's Prometheus-compatible — same scrape configs, same exporters, same alerting rules. We pointed our existing Prometheus remote_write at VictoriaMetrics and ran both in parallel for two weeks before cutting over.
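During a parallel run like this, it helps to check that both backends return the same answers before cutting over. Below is a minimal sketch of such a parity check, assuming both systems expose the Prometheus-compatible `/api/v1/query` endpoint. The helper names (`instant_query`, `parse_vector`, `diff_series`) and the 1% tolerance are illustrative assumptions, not part of either project.

```python
import json
import urllib.parse
import urllib.request


def parse_vector(payload: dict) -> dict:
    """Convert an instant-vector API response into {sorted label pairs: value}."""
    series = {}
    for item in payload["data"]["result"]:
        key = tuple(sorted(item["metric"].items()))
        # Each sample is [timestamp, "value-as-string"].
        series[key] = float(item["value"][1])
    return series


def instant_query(base_url: str, promql: str) -> dict:
    """Run an instant query against a Prometheus-compatible HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_vector(json.load(resp))


def diff_series(old: dict, new: dict, rel_tol: float = 0.01) -> dict:
    """Report series missing on either side or differing by more than rel_tol."""
    report = {"missing_in_new": [], "missing_in_old": [], "mismatched": []}
    for key, old_val in old.items():
        if key not in new:
            report["missing_in_new"].append(key)
        elif abs(new[key] - old_val) > rel_tol * max(abs(old_val), 1e-9):
            report["mismatched"].append(key)
    report["missing_in_old"] = [k for k in new if k not in old]
    return report
```

In practice you would call `instant_query` once per backend with the same query (the base URLs depend on your own deployment) and feed both results to `diff_series`. Small mismatches are expected because the two systems scrape and ingest at slightly different moments, which is why the comparison uses a tolerance.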

Grafana Alloy as the universal collector

We used to run a zoo of collectors: Prometheus node exporter, Fluent Bit for logs, OTEL collector for traces. Each had its own config format. Each had its own upgrade cycle.

Alloy (formerly Grafana Agent Flow) replaced all of them with a single, composable config language. One agent per node. One config pipeline to maintain. It talks to everything: Prometheus endpoints, OTLP, Loki, Tempo.

The config is declarative and version-controlled, which means our entire collection pipeline is auditable in git.

Structuring logs for queryability

The most common mistake I see with Loki: treating it like Elasticsearch. Loki is optimized for label-indexed log streams, not full-text search over arbitrary fields. If you push JSON logs and filter on a field inside the JSON, Loki has to read every line in the selected streams, which is the equivalent of a table scan.

Structure your labels deliberately:

Good labels: environment, cluster, namespace, pod, container
Bad labels: request_id, user_id, trace_id (high cardinality!)

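Use structured log lines with a consistent format (JSON or logfmt) so LogQL's parser stages, such as json, logfmt, and pattern, can extract fields at query time. The line_format and label_format stages can then reshape the output, but they do not do the extraction themselves.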
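To make the label/field split concrete, here is a small sketch of how an application-side helper might separate the two before shipping a log event. The function name `to_loki_entry` and the `STREAM_LABELS` allow-list are hypothetical; the allow-list simply mirrors the "good labels" named above.

```python
import json

# Assumption: an allow-list mirroring the "good labels" above; adjust to taste.
STREAM_LABELS = ("environment", "cluster", "namespace", "pod", "container")


def to_loki_entry(fields: dict) -> tuple:
    """Split a log event into low-cardinality stream labels and a JSON line.

    Anything outside STREAM_LABELS (request_id, user_id, trace_id, ...) stays
    in the log line, where it can be parsed at query time instead of indexed.
    """
    labels = {k: str(v) for k, v in fields.items() if k in STREAM_LABELS}
    body = {k: v for k, v in fields.items() if k not in STREAM_LABELS}
    # sort_keys gives every line the same field order, keeping lines consistent.
    return labels, json.dumps(body, sort_keys=True)
```

With lines shaped like this, a query along the lines of `{namespace="checkout"} | json | request_id="r-1"` narrows by label first and only then parses the JSON (the namespace and request ID here are made up for illustration).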

SLO dashboards that engineers actually use

We used to define SLOs in a Google Sheet. They were aspirational. Nobody looked at them during incidents.

Now every service has a Grafana SLO dashboard with three panels:

  1. Error budget burn rate (hourly and daily)
  2. SLI over the rolling 28-day window
  3. Alert status (is the fast-burn alert currently firing?)

These are generated from a template using Grafonnet (Grafana's Jsonnet library). Adding a new SLO is a 3-line PR.
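For readers new to the terms in the first two panels, the underlying arithmetic is short. This is a minimal sketch with hypothetical function names (`burn_rate`, `budget_remaining`); it assumes the SLI is a simple good-events / total-events ratio, which may not match every service.

```python
def _allowed_error_ratio(slo_target: float) -> float:
    """The error budget as a ratio, e.g. 0.001 for a 99.9% target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window; higher values mean it runs out sooner.
    """
    return error_ratio / _allowed_error_ratio(slo_target)


def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left in the window (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = _allowed_error_ratio(slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad
```

For a 99.9% target, an observed error ratio of 0.2% is a burn rate of roughly 2, meaning the budget would last about half of the 28-day window if that rate held.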

The alerting pipeline

Our alert routing:

Grafana Alert → PagerDuty API → incident.io (auto-create incident) → Slack #incidents

incident.io handles the ceremony: creating the incident Slack channel, assigning roles, tracking timeline, and generating the post-mortem template. Engineers focus on fixing, not on coordination overhead.

What it cost to build

Honest numbers for a mid-size deployment (~200 services, ~50M active time series):

  • Engineering time to migrate: ~3 months, 2 engineers part-time
  • Infrastructure cost: ~$400/mo (vs ~$8,000/mo Datadog bill)
  • Ongoing maintenance: ~4 hours/week per SRE

The ROI was obvious. The less obvious benefit was ownership: we can now add custom metrics, build custom dashboards, and extend alerting logic without filing a support ticket or worrying about custom metric pricing.
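The arithmetic behind that ROI claim is simple enough to write down. The sketch below uses only the approximate monthly figures listed above. The migration cost is left as an input because engineering time is priced differently on every team, and `payback_months` is a hypothetical helper rather than anything from our tooling.

```python
DATADOG_MONTHLY_USD = 8_000      # approximate prior bill, from the list above
SELF_HOSTED_MONTHLY_USD = 400    # approximate infrastructure cost, from the list above


def monthly_savings(old: float, new: float) -> float:
    """Difference in monthly spend between the old and new stack."""
    return old - new


def cost_reduction(old: float, new: float) -> float:
    """Fraction of the old bill that disappears (0.0 to 1.0)."""
    return 1.0 - new / old


def payback_months(migration_cost_usd: float, old: float, new: float) -> float:
    """Months until one-off migration cost is recovered from monthly savings."""
    savings = monthly_savings(old, new)
    if savings <= 0:
        raise ValueError("new stack must be cheaper for a payback period to exist")
    return migration_cost_usd / savings


savings = monthly_savings(DATADOG_MONTHLY_USD, SELF_HOSTED_MONTHLY_USD)
print(f"monthly savings: ${savings:,.0f}")
print(f"annual savings:  ${savings * 12:,.0f}")
print(f"reduction:       {cost_reduction(DATADOG_MONTHLY_USD, SELF_HOSTED_MONTHLY_USD):.0%}")
```

Note that this compares infrastructure spend only. The ongoing maintenance hours are a real cost too and would need to be added to the self-hosted side for a full comparison.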

Lessons learned

Start with Grafana as the front end. Even if you keep Datadog or CloudWatch as backends, Grafana as a single pane of glass immediately reduces the "which tool do I check" friction.

Don't migrate everything at once. We ran parallel stacks for three months, which felt slow but meant we never had an observability gap during migration.

Cardinality is the enemy of performance. Every time you add a label, ask: "how many unique values can this have?" If the answer is "unbounded," it's a log field, not a metric label.
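The "how many unique values" question can be turned into a quick back-of-the-envelope check. This is a sketch under a worst-case assumption (every label combination actually occurs), with hypothetical helper names and an arbitrary example limit of 1,000 values per label.

```python
from math import prod


def worst_case_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of unique values per label.

    A metric with no labels still produces exactly one series.
    """
    return prod(label_cardinalities.values())


def risky_labels(label_cardinalities: dict, limit: int = 1_000) -> list:
    """Names of labels whose unique-value count exceeds the chosen limit."""
    return sorted(name for name, n in label_cardinalities.items() if n > limit)
```

For example, a metric labeled by method (5 values), status (6), and namespace (20) tops out at 600 series. Add a user_id label with 100,000 values and the same metric can reach 60 million.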

Alert on symptoms, not causes. "Error rate > 1%" is a symptom: users are seeing failures. "CPU > 80%" is at most a possible cause, and is usually not worth paging for. Build your alert tree starting from SLIs.
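One common way to keep paging tied to symptoms is a multi-window burn-rate check: page only when the error budget is burning fast over a long window and is still burning fast right now. The sketch below is illustrative. The function name `should_page` is hypothetical, and the 14.4 threshold with a long/short window pair (for example one hour and five minutes) is an assumption borrowed from widely cited SRE guidance, not a value taken from this stack.

```python
def should_page(
    error_ratio_long: float,
    error_ratio_short: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed fast-burn threshold, tune for your window
) -> bool:
    """Page only if both windows burn the error budget faster than threshold.

    The long window shows the problem is significant; the short window shows
    it is still happening, so a recovered incident stops paging quickly.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = 1.0 - slo_target
    long_burn = error_ratio_long / allowed
    short_burn = error_ratio_short / allowed
    return long_burn > threshold and short_burn > threshold
```

Note that nothing like CPU or memory appears in the inputs. Resource metrics stay on dashboards for diagnosis, while the page itself fires on what users experience.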

The stack runs well. When something breaks, we can see it. When we're on-call, we have the tools to understand it. That's the whole job.