10 Hard-Won Lessons from 5 Years On-Call
Being on-call teaches you things no classroom or certification ever could. Here are the lessons I keep coming back to after incidents that ranged from embarrassing typos to full multi-region outages.
I build and operate reliable infrastructure at scale: multi-cloud (AWS and GCP), Kubernetes, and deep observability stacks. I write about what I learn.
How we moved from scattered Datadog dashboards to a unified, cost-efficient observability stack built on Grafana, VictoriaMetrics, Loki, and Alloy, and what we learned along the way.
Generates Grafana dashboards from a simple YAML SLO definition. Supports burn-rate alerts and error budget visualization.
A Kubernetes operator that attaches runbooks to PodDisruptionBudgets and automatically links them in PagerDuty alerts.
CLI tool for migrating Prometheus recording and alerting rules to VictoriaMetrics, with cardinality analysis.
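For readers new to the burn-rate alerts and error budgets mentioned in the SLO dashboard generator above, here is a minimal sketch of the arithmetic behind them. It is not taken from that project: the function names and the 30-day SLO period are assumptions made purely for illustration.

```python
# Illustrative only: these helpers are hypothetical and not part of any
# project listed above. Assumes a request-based SLO over a 30-day period.

def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 uses up exactly the whole budget over the SLO period.
    """
    return error_ratio / error_budget(slo_target)


def budget_consumed(
    error_ratio: float,
    slo_target: float,
    window_hours: float,
    period_hours: float = 30 * 24,  # assumed 30-day SLO period
) -> float:
    """Fraction of the period's error budget consumed during the window."""
    return burn_rate(error_ratio, slo_target) * window_hours / period_hours


if __name__ == "__main__":
    # 1.44% of requests failing against a 99.9% SLO, sustained for one hour.
    rate = burn_rate(0.0144, 0.999)
    spent = budget_consumed(0.0144, 0.999, window_hours=1)
    print(f"burn rate: {rate:.1f}x, budget spent in 1h: {spent:.1%}")
```

A fast-burn alert typically fires when the burn rate over a short window crosses a threshold chosen so that a fixed share of the budget would be gone within that window; the threshold itself is a policy decision, not something this sketch prescribes.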