AI Infrastructure Operations Engineer

On-site
Cerebras | Sunnyvale, CA, US / Toronto, ON, CA | 2 years ago
Deployment

Compensation

Salary undisclosed

Description

About The Role

We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. In this role, you will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power.

You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. This role requires a deep understanding of Linux-based systems and containerization technologies, as well as experience monitoring and troubleshooting complex distributed systems. The ideal candidate is a dependable, proactive problem-solver with expertise in large-scale compute infrastructure and a strong commitment to customer success.

Responsibilities

  • Manage and operate multiple advanced AI compute infrastructure clusters. 
  • Monitor cluster health, proactively identifying and resolving potential issues.
  • Maximize compute capacity through optimization and efficient resource allocation. 
  • Deploy, configure, and debug container-based services using Docker. 
  • Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed. 
  • Handle engineering escalations and collaborate with other teams to resolve complex technical challenges. 
  • Contribute to the development and improvement of our monitoring and support processes. 
  • Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies. 

Skills And Requirements

  • 6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing. 
  • Strong proficiency in Python scripting for automation and system administration. 
  • Deep understanding of Linux-based compute systems and command-line tools. 
  • Extensive knowledge of Docker containers, container orchestration platforms such as Kubernetes (k8s), and workload managers such as Slurm.
  • Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner. 
  • Experience with monitoring and alerting systems. 
  • Proven track record of owning challenges and driving them to completion.
  • Excellent communication and collaboration skills. 
  • Ability to work effectively in a fast-paced environment. 
  • Willingness to participate in a 24/7 on-call rotation. 

Preferred Skills And Requirements

  • Experience operating large-scale GPU clusters.
  • Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP.
  • Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).
  • Familiarity with machine learning frameworks and tools.
  • Experience with cross-functional team projects. 

Location 

  • SF Bay Area.
  • Toronto, Canada.
  • Bangalore, India.

Stack

Python, GPU, Distributed Systems, AWS, GCP, Azure, Machine Learning, Kubernetes, Docker

Posted: Mar 6, 2024
Last seen: Jun 25, 2026
First seen: Jun 25, 2026
Status: active