Principal Site Reliability Engineering Manager
Location: Redmond, Washington
Job Type: Full time
Sr. Site Reliability Engineering Manager
Digital Security and Resilience (DSR)
The mission of Microsoft Digital Security & Resilience (DSR) is to enable Microsoft to build the most trusted devices and services, while keeping our company safe and our data protected. As part of Microsoft’s Security, Compliance, Identity, and Management organization, and a steward of Microsoft’s and our customers’ data, a core function of Microsoft DSR is ensuring the security of every aspect of the business. Microsoft DSR is responsible for company-wide information security and compliance, with a strategic focus on information protection, assessment, awareness, governance, and enterprise business continuity. As customer zero, we deploy and secure these services inside Microsoft and then share best practices with enterprise customers at scale across the globe. We have exciting opportunities for you to innovate, influence, transform, inspire, and grow within our organization, and we encourage you to apply to learn more!
The Site Reliability Engineering (SRE) team provides leadership, direction, and accountability for application architecture, system design, and end-to-end implementation. As a Site Reliability Engineering Manager, you will lead a team to identify and deliver service improvements using your expertise in services engineering, systems, networks, complexity analysis, incident management, customer engagement, software know-how, reliability and dependency analysis, and scalable system design principles. Strong collaboration skills will be required to work closely with other engineering teams, service owners, and support teams to ensure services and systems are highly stable and performant, meeting the expectations of our user base across the company.
SREs are people who take engineering-based approaches to solving operations problems; we like infrastructure, we like seeing how the big, complicated thing works, and most importantly, we gain great satisfaction from making it better. Our Site Reliability Engineers are persistent problem solvers, always focused on mitigating issues and owning a problem until a resolution is in place. To accomplish this, they work in close collaboration with various engineering teams. They are also involved in automation, developing tools to support the DevOps model, and analyzing vast amounts of data to find trends and suggest improvements. Creativity and data-driven decision making are heavily valued in this emerging role. In order to make our services reliable, we need you -- someone who already is, or is interested in becoming, a Site Reliability Engineer (also known as SRE), within our SAS Site Reliability Engineering team.
The Site Reliability Engineering team builds, monitors, and maintains the systems and infrastructure that ensure our customers can quickly access their data and run workloads whenever they need to. We identify service problems and areas for improvement, and we help implement solutions. Our work is key to the security of many Microsoft services and to Microsoft’s credibility. Secure Admin Services (SAS) provide access to Microsoft’s entire infrastructure and ecosystem in a secure manner.
At Microsoft, we can offer you a strong team, exciting challenges, and a fun place to work. The work environment empowers you to have a positive impact for thousands of users.
With your experience in leading engineering/SRE functions, mentoring and nurturing teams, and your knowledge of Windows OS, enterprise infrastructure, incident handling, customer experience, and overall service health, you will deliver software improvements with a security mindset and a passion for quality. You will champion our own digital transformation and support thousands of employees to keep our identity and environment secure.
The right candidate for this job (is):
- Enjoys new technological challenges and is motivated to solve them
- Excited about making better software and continuously improving the development, integration, and deployment processes
- Smart, highly motivated self-starter who thrives in a bottom-up, fast-paced, highly technical environment
- Effective collaborator, experienced in creating technical partnerships across teams
- Unwaveringly passionate about meeting customer demands and delivering a dial-tone service
Responsibilities:
- Provide deep technical leadership to a cross-functional, highly visible operations team of passionate and skilled engineers supporting the secure access services platform for Microsoft’s corporate network.
- Identify opportunities and drive the implementation of automation to improve service health, manageability, reliability, telemetry, and technical documentation.
- Communicate on a deeply technical level with product engineering, project management and operations teams to improve and optimize products, improve infrastructure, and evolve services.
- Remain current on new technologies, methods and procedures including, but not limited to, coding practices such as Test-Driven Development, Continuous Integration, and Continuous Deployment.
- Design, write, and deliver software and infrastructure to solve problems relating to mission critical services, and create solutions to prevent problems from recurring, with the goal being to automate response to all non-exceptional service conditions.
- Influence and collaborate across orgs to bring best practices, architectures, standards, and methods for large-scale distributed systems.
- Coordinate planning and execution with internal engineering teams, business partners, and technical leaders across the division.
- Own deployment, availability, reliability, performance, and customer escalation targets for these environments.
- Proactively identify and reduce issues through the design, testing, and implementation of software.
- Uphold a high organizational standard of employee and team satisfaction.
Knowledge, experience, and skills required:
- 8+ years of Software, Site Reliability, Systems, technical services/infrastructure, or Service Engineering experience.
- 5+ years of managerial experience
- 3+ years of operational experience improving service reliability, availability, and performance, using strong problem-solving skills to drive results.
Preferred, not required:
- Bachelor’s or Master’s degree in computer science
- Track record of people management focus: building healthy, diverse, high-functioning teams and demonstrable readiness to begin
- Proven track record of improving reliability, availability, and performance of services
- Strong problem-solving skills and drive for results.
- Passionate about continuous improvement in process and services.
- A customer-first and growth mindset to uphold our north star and culture.
- Experience with full-stack troubleshooting skills across network, application, hardware, management fabric, or distributed services layers.
- The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
- Exposure to and familiarity with Agile and SRE principles, automated deployments, and build pipelines
- Experience with budgetary, resource, and capacity planning and forecasting
- Background in delivering security services at scale.
- Excellence in written and verbal communication and presentation, and the ability to partner for success across all levels of the organization and technical depths.
- Ability to drive large, complex programs and solutions
- The ability to analyze problems and make appropriate decisions quickly
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.
Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.