Senior Manager, Site Reliability Engineering (SRE)

at SolarWinds

Bangalore, India

Req ID: 201930

At SolarWinds, we’re a people-first company. Our purpose is to enrich the lives of the people we serve—including our employees, customers, shareholders, partners, and communities. Join us in our mission to help customers accelerate business transformation with simple, powerful, and secure solutions.

The ideal candidate thrives in an innovative, fast-paced environment and is collaborative, accountable, ready, and empathetic. We’re looking for individuals who believe they can accomplish more as a team and create lasting growth for themselves and others. We hire based on attitude, competency, and commitment. Solarians are ready to advance our world-class solutions in a fast-paced environment and accept the challenge to lead with purpose. If you’re looking to build your career with an exceptional team, you’ve come to the right place. Join SolarWinds and grow with us!

Role Overview:

SolarWinds is looking for a Senior Manager, Site Reliability Engineering (SRE) to lead reliability, scalability, and operational excellence for large-scale, cloud-native, data-intensive SaaS platforms.

This role combines people leadership, technical depth, and operational ownership. You will manage and grow SRE teams responsible for production systems while remaining close to platform architecture, reliability engineering, incident response, and automation strategy.

The ideal candidate has operated distributed systems in production environments and is comfortable guiding teams through complex troubleshooting, reliability improvements, and architectural decisions. This role requires balancing availability, performance, operational efficiency, and engineering velocity across large-scale SaaS services.

Responsibilities:

  • Lead and mentor SRE teams responsible for the reliability, availability, and performance of production SaaS platforms

  • Own and drive production reliability outcomes, including uptime, latency, scalability, capacity planning, and operational readiness

  • Oversee data-intensive distributed systems, including technologies such as ClickHouse, Kafka, ZooKeeper, MySQL, Redis, and Flink

  • Guide and review Kubernetes platform operations at scale, including cluster lifecycle management, upgrades, troubleshooting, and capacity planning

  • Establish and evolve SRE practices, including SLIs/SLOs, alerting strategies, incident management, and post-incident reviews

  • Lead and participate in production incident response, guiding teams through debugging, root cause analysis, and long-term remediation

  • Promote and enforce an automation-first approach, reducing manual operational work through scripting, tooling, and platform improvements

  • Partner with Engineering, Platform, Product, and Security teams to embed reliability into system design and delivery

  • Drive adoption of GitOps, service mesh, and observability practices across teams

  • Lead cloud infrastructure operations across AWS and Azure, ensuring secure, resilient, and cost-effective platform operations

  • Provide technical mentorship and guidance, helping engineers diagnose complex production issues and improve system reliability

Must-Have Qualifications:

  • Proven experience leading SRE, Platform, or Infrastructure teams supporting production, customer-facing SaaS systems

  • Strong hands-on experience operating Kubernetes clusters in production environments, including:

    • Cluster lifecycle management and upgrades

    • Troubleshooting platform and workload issues

    • Autoscaling and resilience mechanisms (HPA, VPA, KEDA, Cluster Autoscaler, Pod Disruption Budgets)

    • Observability and monitoring (Prometheus, Grafana)

  • Experience operating distributed data platforms in production environments, such as ClickHouse, Kafka, ZooKeeper, MySQL, Redis, or Flink

  • Practical experience with GitOps and service mesh technologies (e.g., Flux, Kustomize, Istio)

  • Strong automation mindset with hands-on experience using Python and/or Go to reduce operational overhead and improve reliability

  • Extensive experience working with AWS and Azure managed services, including EKS/AKS, Aurora, ElastiCache, storage services, load balancers, VPC, and KMS

  • Demonstrated ownership of incident response, root cause analysis, and long-term reliability improvements

  • Ability to collaborate effectively with engineering leadership and cross-functional teams

SolarWinds is an Equal Employment Opportunity Employer. SolarWinds will consider all qualified applicants for employment without regard to race, color, religion, sex, age, national origin, sexual orientation, gender identity, marital status, disability, veteran status or any other characteristic protected by law.

All applications are treated in accordance with the SolarWinds Privacy Notice: https://www.solarwinds.com/applicant-privacy-notice