online: AI infra reliability / staff SRE

Reliability for AI systems
at GPU scale.

profile signal

I'm Sheshank Dudaboina, an SRE for AI infrastructure. I build production-grade infrastructure for model serving, Kubernetes, and multi-cloud platforms, along with the observability loops that keep them calm under load.

Read my writing | About me

years SRE

GPU model serving

AWS/GCP multi-cloud

SLO-driven ops

inference-fleet/us-central1

healthy

gpu_utilization: 86%

p99_latency: 118ms

error_budget: 96.4%

sheshank@gpu-control-plane ~

❯ kubectl get nodes -l accelerator=nvidia -o wide

NAME         GPU  STATUS  REGION       VERSION
a100-pool-1  8    Ready   us-central1  v1.29.2
h100-pool-2  8    Ready   us-east1     v1.29.2
l4-burst-3   4    Ready   us-west1     v1.29.2

❯ cat reliability.yaml # production priorities

role: SRE - AI Infrastructure

focus: ai platforms, observability, automation

currently: building at Baseten Labs

❯

01. Skills & Tools

Kubernetes, GPU serving, AWS, GCP, Grafana, Victoria Metrics, Loki, Alloy, Prometheus, PagerDuty, Linux, Terraform, ArgoCD, AI platforms

02. Latest Writing

All posts

Mar 20, 2026 | grafana, gitops

How I Built a GitOps Pipeline for Grafana Dashboard Lifecycle Management

A GitOps workflow for Grafana dashboards that keeps the UI as the authoring surface while adding version control, CI validation, peer review, and an audit trail.

6 min read | Read on Medium

Aug 16, 2025 | grafana, mimir

How We Set Up Grafana Mimir (Single-Node) with Prometheus on EC2: A Step-by-Step Guide

A practical guide to running single-node Grafana Mimir with Prometheus on EC2, tuned for observability workloads and local NVMe-backed storage.

4 min read | Read on Medium

Aug 13, 2025 | kubernetes, observability

From “metrics-only” to “event-driven” observability

How Kubernetes events can close the gap between metrics, logs, and root cause analysis by streaming cluster events into Loki for richer incident context.

3 min read | Read on Medium

03. Projects

All projects

Grafana Dashboard Generator

Generates Grafana dashboards from a simple YAML SLO definition. Supports burn-rate alerts and error budget visualization.

Go, Grafonnet, Kubernetes, Helm

View on GitHub
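
For readers new to the terms, here is a minimal sketch of the arithmetic behind burn-rate alerts and error budget panels of the kind the Grafana Dashboard Generator targets. It is not the generator's code: the function names `burn_rate` and `budget_remaining` and the sample figures are hypothetical, and the definitions follow the common SLO convention (burn rate is the observed error ratio divided by the allowed error ratio).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent.

    1.0 means the budget would be used up exactly at the end of the
    SLO window; values above 1.0 mean it runs out early.
    """
    allowed = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed


def budget_remaining(bad_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent in the window."""
    allowed_bad = total_events * (1.0 - slo_target)
    if allowed_bad <= 0:
        raise ValueError("no error budget for these inputs")
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # Illustrative inputs only: a 99.9% SLO with 0.04% of requests failing.
    print(f"burn rate: {burn_rate(0.0004, 0.999):.2f}x")
    print(f"budget remaining: {budget_remaining(36, 1_000_000, 0.999):.1%}")
```

A dashboard would normally evaluate these over several windows at once (a short and a long one) so that alerts fire on fast burns without paging on brief blips.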

k8s-runbook-operator

A Kubernetes operator that attaches runbooks to PodDisruptionBudgets and automatically links them in PagerDuty alerts.

Go, controller-runtime, PagerDuty API

View on GitHub

victoria-metrics-migrator

CLI tool for migrating Prometheus recording rules and alerting rules to VictoriaMetrics, with cardinality analysis.

Python, PromQL, MetricsQL

View on GitHub
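
As a rough illustration of what "cardinality analysis" means in the victoria-metrics-migrator description, the sketch below counts distinct time series per metric name. It is an assumption about the general idea, not the tool's implementation: `series_per_metric` is a hypothetical helper, and the series are represented as plain label dictionaries rather than anything fetched from a real backend.

```python
from collections import Counter
from typing import Iterable, Mapping


def series_per_metric(series: Iterable[Mapping[str, str]]) -> Counter:
    """Count unique label sets per metric name.

    Each series is a mapping of label name to value, with the metric
    name stored under "__name__" as in the Prometheus data model.
    """
    # A series is identified by its full label set, so duplicates collapse.
    unique = {frozenset(labels.items()) for labels in series}
    counts: Counter = Counter()
    for label_set in unique:
        name = dict(label_set).get("__name__", "<unnamed>")
        counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"__name__": "http_requests_total", "pod": "a"},
        {"__name__": "http_requests_total", "pod": "b"},
        {"__name__": "http_requests_total", "pod": "a"},  # duplicate series
        {"__name__": "up", "job": "node"},
    ]
    # Highest-cardinality metrics first.
    for name, count in series_per_metric(sample).most_common():
        print(name, count)
```

Sorting by count surfaces the metrics whose label combinations multiply fastest, which are usually the ones worth reviewing before a migration.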
