Director, Infrastructure and SRE

Austin, Texas

Req ID: 201956

At SolarWinds, we’re a people-first company. Our purpose is to enrich the lives of the people we serve—including our employees, customers, shareholders, partners, and communities. Join us in our mission to help customers accelerate business transformation with simple, powerful, and secure solutions.

The ideal candidate thrives in an innovative, fast-paced environment and is collaborative, accountable, ready, and empathetic. We’re looking for individuals who believe they can accomplish more as a team and create lasting growth for themselves and others. We hire based on attitude, competency, and commitment. Solarians are ready to advance our world-class solutions in a fast-paced environment and accept the challenge to lead with purpose. If you’re looking to build your career with an exceptional team, you’ve come to the right place. Join SolarWinds and grow with us!

We seek a Director (Infrastructure & Site Reliability Engineering) with extensive experience in AWS, Azure, and Kubernetes to lead our Site Reliability Engineering (SRE) team. The ideal candidate will manage a team of SREs and ensure the availability, reliability, and scalability of our cloud infrastructure and services. The successful candidate will have a deep understanding of SRE practices and a track record of implementing high-quality site reliability engineering practices (SLAs, SLOs, Proactive Alert Management, Incident Response/Review, Post Mortems, Capacity Planning, Cost Management, etc.).

Responsibilities:

Manage and lead a team of SREs responsible for ensuring the reliability and availability of our cloud infrastructure and services
Develop and implement site reliability engineering practices to improve service availability, performance, and scalability
Collaborate with cross-functional teams to design and implement new features and services that meet customer needs and business requirements
Develop and implement incident response plans and post-incident reviews to identify root causes and prevent future incidents
Monitor and analyze system performance metrics to identify and resolve performance bottlenecks
Build and maintain relationships with key stakeholders, including internal customers and vendors
Stay up to date with industry trends and emerging technologies related to site reliability engineering and cloud infrastructure

Qualifications:

Bachelor’s degree in Computer Science or related field, or equivalent work experience
10+ years of experience in site reliability engineering, infrastructure engineering, or a related field
4+ years of experience managing and leading a global team of SREs or infrastructure engineers
Must have extensive experience with AWS and Kubernetes in a SaaS product with >$100M ARR
Experience with developing and implementing site reliability engineering practices and incident response plans
Programming skills in a high-level language such as Python or Go, and familiarity with IaC tools such as Terraform or Pulumi
Expertise with distributed systems for large-scale data processing and stream processing
Strong analytical and problem-solving skills
Excellent communication and interpersonal skills
Demonstrated ability to build and maintain relationships with internal and external stakeholders

SolarWinds is an Equal Employment Opportunity Employer. SolarWinds will consider all qualified applicants for employment without regard to race, color, religion, sex, age, national origin, sexual orientation, gender identity, marital status, disability, veteran status or any other characteristic protected by law.

All applications are treated in accordance with the SolarWinds Privacy Notice: https://www.solarwinds.com/applicant-privacy-notice