Fusion by J.P. Morgan is an exciting new client-facing product, built on Public Cloud, that delivers data management solutions around post-trade services, enabling institutional investors to leverage the power of clean, interoperable data to maximize operational efficiencies. The product gives our clients seamless access to normalized and interoperable data through modern distribution methods, including APIs and solutions available on Public Cloud.
As a Site Reliability Engineer for Fusion by J.P. Morgan, you will be responsible for the overall health of the platform, with a focus on reliability, resiliency and availability. You will define and implement key service level metrics (e.g., SLAs and SLOs), measured through appropriate tooling. The team will rely on your expertise to build out observability for their services. You will lead Root Cause Analysis/Post Mortems all the way through to implementing outcomes.
You do not need financial services experience to apply to this role.
As a Site Reliability Engineer for Fusion by J.P. Morgan, you will:
- Have a strong passion for ensuring systems are observable, healthy, available and resilient.
- Understand how to define and implement SLAs based on appropriate SLOs and SLIs, in collaboration with business users.
- Lead a culture of operational excellence, focusing on appropriate observability tooling (e.g. monitoring, logging, tracing, alerting).
- Partner with application engineering teams to prioritize reliability-focused changes, whether to observability tooling or to application resiliency.
- Care deeply about blameless Root Cause Analysis/Post Mortems as well as following up to implement required changes.
We are looking for someone who has:
- Experience operating, implementing and improving distributed & highly concurrent service-based architectures, including microservices, containerized services, and/or serverless architectures.
- Hands-on experience with container management and orchestration (using tools such as Docker and Kubernetes).
- Proven experience in maintaining scalability and resiliency of a complex environment.
- Proven experience in implementing advanced observability practices and techniques at scale.
- Experience in building out observability to continuously understand the health of systems, using OpenTelemetry and tools such as Grafana, Prometheus, Datadog, CloudWatch, Splunk, Jaeger and AWS X-Ray.
- Excellent understanding of managing a production incident, through to Root Cause Analysis/Post Mortem and implementation of RCA outcomes.
- A mindset geared towards a fantastic end-to-end engineering experience, supported by excellent tooling, automation and testing.
- Hands-on experience managing platforms/systems on Public Cloud (AWS preferred) using tooling such as Terraform.
