
Site Reliability Engineer - Ops & Automation
Compensation
Salary undisclosed
Description
About the Role
We are building a high-performance SRE function to support one of the world’s fastest-growing AI inference services, powered by the Wafer-Scale Engine (WSE), helping deliver infrastructure for frontier-class models from leading model builders such as OpenAI.
This role offers immediate ownership of real production systems at a growing scale, direct mentorship from seasoned engineers, and close collaboration with incoming Staff SREs who will focus on long-term automation. After ~1 month of shared hands-on operations with the Staff engineers, you’ll primarily operate the current setup, bring up new capacity in high-stakes environments and help bring new continuous delivery pipelines into production use.
If you thrive in high-ownership SRE roles at scale and want to help shape a team from the ground up in cutting-edge AI inference infrastructure, this is your chance.
This role does not require 24/7 on-call rotations.
Key Responsibilities
- Remain hands-on with operational execution (releases, capacity changes, cluster upgrades) over the next year as we build robust continuous delivery pipelines and self-service capabilities
- Contribute to the development of self-service CD pipelines for key workflows using our stack: Kubernetes, Bazel, Prometheus/Grafana/InfluxDB, Python, and Go
- Build reusable automation and internal developer tools that minimize operational toil and cross-team friction
- Develop and extend telemetry, observability and alerting solutions to ensure operational reliability at scale
- Collaborate with Cluster Ops and development teams to identify high-impact automation opportunities and iterate quickly
- Contribute to reliability practices (SLOs, post-mortems, capacity planning)
Required Experience & Skills
- 2-4+ years in SRE with a strong operations or automation focus
- Production Kubernetes experience
- Solid Python or Go for building tools and automation
- Proficiency with Prometheus, Grafana, and observability-driven workflows
- Ability to measure and communicate impact – reliability metrics, operational toil, velocity gains
Nice-to-Have
- Hands-on GitOps expertise (Argo CD, Flux, or equivalent) is a plus
- Experience with building continuous delivery pipelines is a strong plus
- Experience with Bazel or similar build systems is a strong plus
- Familiarity with capacity planning and with on-prem or multi-datacenter environments
Location
- SF Bay Area
- Toronto
Posted: Oct 14, 2025
Last seen: Jun 25, 2026
First seen: Jun 25, 2026
Status: active