Kairos

Site Reliability Engineer - Ops & Automation

On-site
Cerebras
Sunnyvale, CA, US / Toronto, ON, CA
8 months ago
Fresh
AI Cloud

Compensation

Salary undisclosed

Description

About the Role 

We are building a high-performance SRE function to support one of the world’s fastest-growing AI inference services, powered by the Wafer-Scale Engine (WSE), helping deliver infrastructure for frontier-class models from leading model builders such as OpenAI. 

This role offers immediate ownership of real production systems at growing scale, direct mentorship from seasoned engineers, and close collaboration with incoming Staff SREs who will focus on long-term automation. After about one month of shared hands-on operations with the Staff engineers, you’ll primarily operate the current setup, bring up new capacity in high-stakes environments, and help bring new continuous delivery pipelines into production use.

If you thrive in high-ownership SRE roles at scale and want to help shape a team from the ground up in cutting-edge AI inference infrastructure, this is your chance.

This role does not require 24/7 on-call rotations. 

Key Responsibilities 

  • Remain hands-on with operational execution (releases, capacity changes, cluster upgrades) over the next year as we build robust continuous delivery pipelines and self-service capabilities 
  • Contribute to the development of self-service CD pipelines for key workflows using our stack: Kubernetes, Bazel, Prometheus/Grafana/InfluxDB, Python, and Go
  • Build reusable automation and internal developer tools that minimize operational toil and cross-team friction 
  • Develop and extend telemetry, observability, and alerting solutions to ensure operational reliability at scale
  • Collaborate with Cluster Ops and development teams to identify high-impact automation opportunities and iterate quickly 
  • Contribute to reliability practices (SLOs, post-mortems, capacity planning) 

Required Experience & Skills 

  • 2-4+ years in SRE with a strong operations or automation focus 
  • Production Kubernetes experience 
  • Solid Python or Go for building tools and automation 
  • Proficiency with Prometheus, Grafana, and observability-driven workflows 
  • Ability to measure and communicate impact – reliability metrics, operational toil, velocity gains 

Nice-to-Have 

  • Hands-on GitOps expertise (Argo CD, Flux, or equivalent) is a plus
  • Experience building continuous delivery pipelines is a strong plus
  • Experience with Bazel or similar build systems is a strong plus
  • Familiarity with capacity planning and with on-prem or multi-datacenter environments

Location   

  • SF Bay Area 
  • Toronto 

 

 

Stack

Python, Kubernetes
Posted: Oct 14, 2025
Last seen: Jun 25, 2026
First seen: Jun 25, 2026
Status: active