whoami

Sheshank Dudaboina

Staff Site Reliability Engineer

Currently @ Baseten

I've spent 8+ years making distributed systems more reliable — from bare-metal Linux to multi-cloud Kubernetes at scale. I care about observability, reducing on-call toil, and writing tools that make engineers' lives easier.

01. Experience

2022 – Present
Staff Site Reliability Engineer
Baseten

Leading reliability efforts for an ML model-serving platform. Architecting multi-cloud infrastructure across AWS and GCP, building observability stacks with Grafana + VictoriaMetrics, and driving SLO adoption across engineering.

AWS · GCP · Kubernetes · Grafana · VictoriaMetrics · Terraform · incident.io

2019 – 2022
Senior Site Reliability Engineer
Previous Company

Built and operated Kubernetes clusters at scale. Led the migration from a monolithic monitoring setup to a Prometheus/Grafana stack. Reduced MTTR by 40% through improved runbooks and automated incident triage.

Kubernetes · Prometheus · GKE · Datadog · PagerDuty · Terraform

2016 – 2019
Infrastructure Engineer
Previous Company

Managed Linux infrastructure, drove container adoption with Docker, and built CI/CD pipelines. Served on-call for production services and led improvements to the incident response process.

Linux · Docker · AWS EC2 · Ansible · New Relic · Jenkins

02. Skills

Cloud & Infrastructure
  • AWS (EC2, EKS, RDS, S3, IAM, VPC)
  • GCP (GKE, Cloud Run, BigQuery)
  • Terraform
  • Pulumi
  • Ansible
Containers & Orchestration
  • Kubernetes (EKS, GKE, self-managed)
  • Helm
  • ArgoCD
  • Docker
  • Kustomize
Observability
  • Grafana
  • VictoriaMetrics
  • Prometheus
  • Loki
  • Grafana Alloy
  • Tempo
  • Datadog
  • New Relic
Incident Management
  • incident.io
  • PagerDuty
  • SLO/SLI/Error budgets
  • Post-mortem facilitation
Languages & Tools
  • Go
  • Python
  • Bash
  • TypeScript
  • SQL
  • PromQL/MetricsQL
  • LogQL

03. Certifications

Certified Kubernetes Administrator (CKA) · CNCF
AWS Solutions Architect – Professional · AWS
Google Professional Cloud Architect · GCP

I write about what I learn on the job.

SRE · Kubernetes · Observability · Incident Response

Read my writing →