Lead Site Reliability Engineer/Cloud Engineer


Location: Springfield, Illinois

Be brave, not perfect.
- Reshma Saujani

Discover. A brighter future.

With Discover, you’ll have the chance to make a difference at one of the world’s leading digital banking and payments companies. From Day 1, you’ll do meaningful work you’re passionate about, with the support and resources you need for success. We value what makes each employee unique and provide a collaborative, team-based culture that gives everyone an opportunity to shine. Be the reason millions of people find a brighter financial future, while building the future you want, here at Discover.

Job Description

Responsible for the technical design, deployment, monitoring and ongoing support and maintenance of a diverse set of cloud technologies. The role is a technical, hands-on opportunity with a heavy focus on automation, resilient design and deployment of cloud-ready systems and services. This role collaborates with Product teams internal and external to IS to provide world-class products and services in support of our application development community, and our business as a whole. This is a 'DevOps' position, responsible for the full-stack engineering and support of products that support our hybrid cloud capabilities.

A Site Reliability Engineer at Discover is someone who likes to take responsibility for new applications going into production to ensure operational excellence (availability, latency, performance, efficiency, problem management, monitoring, emergency response and capacity planning). You will participate in anything that prevents a system or app from serving its intended purpose, whether slowness or an outage, to understand how we can improve Time to Detect, Time to Fix, and Time to Mitigate issues. You will improve our monitoring solutions and define SLIs/SLOs. You will develop automated solutions using a variety of coding languages. We are organized as a Chapter organization, so you will be expected to lead the SRE mindset across the organization.

In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Site reliability engineers create a bridge between development and operations by applying a software engineering mindset to system administration topics.


  • Leads the design, build and maintenance of modern cloud platforms that support agile teams.
  • Partners with key stakeholders as a platform champion for cloud-native systems, and coaches on how to use platform capabilities effectively through appropriate venues.
  • Drives continuous improvement of cloud products & capabilities through internal user groups and external market research.
  • Drives innovation and platform evolution, scaling cloud infrastructure to support our growing ecosystem
  • Provides reliable, predictable deployment and maintenance of distributed systems while adhering to security best practices
  • Writing and designing automation, monitoring, diagnostics and debug tooling to improve troubleshooting and recovery
  • Participating in production support and on-call rotations
  • Conducting incident management and contributing to associated retrospectives/post-mortems as needed
  • Responsible for the stability and performance of critical business services
  • Participating in Agile Sprints and associated ceremonies

Minimum Qualifications

  • Bachelor’s Degree in Information Technology
  • 6+ years of experience in application or platform development, consulting, or a related field
  • In lieu of education, 8+ years of experience in application or platform development, consulting, or a related field

Desired Skills

  • 3+ years in an SRE role
  • Well versed with the entire software development lifecycle, DevOps, and SRE practices
  • Experience with operational monitoring tools with a mindset towards predictive analysis
  • Working knowledge of automation tools such as Ansible, Terraform, or Chef
  • Familiar with Pivotal Cloud Foundry (PCF), OpenShift (OCP), Amazon Web Services (AWS), and Google Cloud Platform (GCP)
  • A solid understanding of working with git
  • Experience with troubleshooting and debugging issues at any level
  • Strong knowledge and understanding of microservices-based architectures, APIs, etc.
  • Good understanding of networking, including L2 and L3 concepts, firewalls, load balancing, routing and switching.
  • A working knowledge of Linux-based systems and virtual machine (VM) technology
  • Strong scripting skills, including the ability to write scripts from scratch using Python and/or Bash
  • Can identify and mitigate reliability risks
  • Excellent communication and troubleshooting skills
  • Strong analytical and problem-solving skills
  • Basic knowledge and understanding of Security (CIA Model and PCI compliance) is a plus
  • Experience with Continuous Integration and Continuous Delivery models including Blue/Green and Canary release models is a plus
  • Experience with continuous integration/deployment frameworks such as Jenkins

What are you waiting for? Apply today!

The same way we treat our employees is how we treat all applicants – with respect. Discover Financial Services is an equal opportunity employer (EEO is the law). We thrive on diversity & inclusion. You will be treated fairly throughout our recruiting process and without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status in consideration for a career at Discover.