In everything we do, we believe in challenging the status quo. We challenge it by thinking differently, stretching ourselves to get to the root of the problem, and keeping data front and center in all our decisions. We just happen to work on the Cosmos DB team, with a strong belief that the Supportability of a product is a key differentiator in today's world and that our customers deserve a world-class support experience.

If you share the same purpose, cause, and belief, and have the passion to pursue them, please read the rest of the job description to learn what we do. We would love to have you join us!

Azure Cosmos DB is one of the fastest-growing Azure services: a globally distributed, low-latency, massively scalable, multi-model cloud database service. It is designed to enable developers to build planet-scale applications.

We know that the SRE discipline is evolving; we learn from our peers in industry and aim to contribute to this evolution by innovating on SRE within our group and sharing those innovations in public.

We are looking for a self-driven Site Reliability Engineer (SRE) who likes taking engineering-based approaches to solving Supportability problems, with a history of engineering excellence and experience in supporting cloud services. You will be responsible for optimizing and operating supportability improvements in a data-driven manner, working closely with Software Engineers to design and deliver an experience that adheres to service best practices, is highly available, reliable, and scalable, provides a great user experience, and meets our compliance policies and requirements.

You’ll be focused on driving continuous improvements across the lifecycle of our services with automation in mind. You’ll also demonstrate a history of managing multiple priorities, deep technical and online services skills, a focus on using metrics and data, and a strong supportability-first mindset.

Our team values diversity of all kinds in candidates for our roles, and we strive to hire people with different experiences and perspectives. To that end, we know that no candidate has every desired skill and experience, but all of us together make our team strong.

Responsibilities

Responsibilities may include but are not limited to:

Collaborating closely with several engineering teams on building and enhancing tooling and automation solutions for faster resolution of customer issues and avoiding them altogether when possible.
Partnering with external platform teams that build the support tooling, and extending that tooling to meet any special requirements.
Designing and implementing changes to service telemetry so that automation can consume data that is not already available.
Enhancing the customer-facing experience through proactive alerting based on utilization, trends, resource health, etc.
Analyzing data and providing operational insights into customer experience to Design and Product teams, so that we can design features with Supportability in mind.
Engaging in and fostering opportunities to improve existing planning, processes, and automation.

Qualifications

Required Qualifications:

Bachelor’s degree in Computer Science, Engineering, or related technical field.
3+ years of SRE or SWE experience running large scale online/hybrid services in cloud environments (Azure/AWS/GCP), applying site reliability principles and/or demonstrating sensitivity to operational concerns. Automation-related experience valued.
Experience with any of C#/Java/Python as a primary language.

Preferred Qualifications and Experience:

Fluency in one or more automation languages, such as PowerShell or Python.
Deep understanding of and familiarity with Observability and MELT (Metrics, Events, Logs, and Traces) design and implementation patterns for large-scale distributed services is specifically desired.
Experience in hypothesis-driven development and in test-driven or behavior-driven development is desirable.
Familiarity with Agile/Scrum/Lean methodologies.

Skills Needed:

Strong problem-solving, troubleshooting, and analytical skills.
Ability to deal with the ambiguity of working in a fast-paced, changing environment, and willingness to change things to make them better.
Intellectual curiosity and high EQ (emotional intelligence) will serve the successful candidate well.
Great communicator with the ability to analyze and clearly articulate complex issues.
Ability to influence the product architecture and roadmap so that supportability, as customers experience it, is always a key consideration when evolving the product.

#AZDAT #ENGGJOBS

Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

Site Reliability Engineer - Azure Cosmos DB