Sheshank Dudaboina
Staff Site Reliability Engineer
I've spent 8+ years making distributed systems more reliable — from bare-metal Linux to multi-cloud Kubernetes at scale. I care about observability, reducing on-call toil, and writing tools that make engineers' lives easier.
01.Experience
Leading reliability efforts for an ML model-serving platform. Architecting multi-cloud infrastructure across AWS and GCP, building observability stacks with Grafana and VictoriaMetrics, and driving SLO adoption across engineering.
Built and operated Kubernetes clusters at scale. Led the migration from a monolithic monitoring setup to a Prometheus/Grafana stack. Reduced MTTR by 40% through improved runbooks and automated incident triage.
Managed Linux infrastructure, drove container adoption with Docker, and built CI/CD pipelines. Served on-call for production services and led improvements to the incident response process.
02.Skills
- AWS (EC2, EKS, RDS, S3, IAM, VPC)
- GCP (GKE, Cloud Run, BigQuery)
- Terraform
- Pulumi
- Ansible
- Kubernetes (EKS, GKE, self-managed)
- Helm
- ArgoCD
- Docker
- Kustomize
- Grafana
- VictoriaMetrics
- Prometheus
- Loki
- Grafana Alloy
- Tempo
- Datadog
- New Relic
- incident.io
- PagerDuty
- SLO/SLI/Error budgets
- Post-mortem facilitation
- Go
- Python
- Bash
- TypeScript
- SQL
- PromQL/MetricsQL
- LogQL
03.Certifications
I write about what I learn on the job.
SRE · Kubernetes · Observability · Incident Response
Read my writing →