Site Reliability Engineer
Location: Greater London
Job Type: Full time
As a Site Reliability Engineer (SRE), you’ll help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems. Much of our support and software development focuses on optimizing existing systems, building infrastructure, and reducing manual work through automation. You’ll join a team of curious problem solvers with a diverse set of perspectives who are thinking big and taking risks. In this environment, you’ll take the lead on relevant projects, backed by an organization that provides the support and mentorship you need to learn and grow. As an SRE, you’ll be focused on running better production applications and systems.
- Design, code, test, and deliver software to automate manual operational work
- Troubleshoot priority incidents, facilitate blameless post-mortems, and ensure permanent closure of incidents
- Engage with development teams throughout the software life cycle to help build software for reliability and scale, minimizing later refactoring or changes
- Identify application patterns and analytics in support of better service level objectives
- Design self-healing and resiliency patterns
- Design automated software and product upgrades, change management, and release management solutions in private cloud and public cloud (AWS)
- Coach or manage teams as applicable
- Build monitoring and observability tools for AWS application migration and onboarding
- Participate in 24x7 support coverage as needed
- Bachelor’s degree or equivalent experience in a software engineering discipline
- Expertise in designing, coding, testing, and delivering software in at least one technology stack
- Proficiency in one or more technology domains; may be a cross-domain expert able to solve complex, mission-critical problems within a business or across the firm
- Working knowledge of infrastructure components (e.g. routers, load balancers, cloud products, container systems, compute, storage, and networks)
- Working knowledge of AWS cloud deployment, monitoring, and ops analysis tools such as Kubernetes, Prometheus, CloudWatch, Elasticsearch, Grafana, Kibana, Splunk, and Dynatrace
- Working knowledge of automating the deployment of key services such as EKS, ECS, Fargate, S3, EC2, and Route 53. Terraform skills are preferred.
- Excellent debugging and troubleshooting skills