
ML Software Tool Development Engineer

On-site
Cerebras
Sunnyvale, CA, US / Toronto, ON, CA
4 months ago
Software

Compensation

Salary undisclosed

Description

Responsibilities:

  • Lead the design and implementation of system-level debugging, validation, and observability platforms.
  • Develop automated systems for collecting and analyzing numerical and execution anomalies.
  • Create visualization and analysis tools to enable efficient root-cause investigation.
  • Build frameworks for failure classification, regression detection, and anomaly monitoring.
  • Extend compilers, runtimes, and programming interfaces to support advanced profiling and instrumentation.
  • Improve system bring-up, low-level debug, and validation workflows.
  • Partner cross-functionally with compiler, hardware, firmware, runtime, and infrastructure teams.
  • Establish best practices for debuggability, reliability, and operational excellence.
  • Lead high-impact initiatives.
  • Support incident response and drive long-term corrective actions.

 

Qualifications: 

  • Strong proficiency in C++ and Python, with a track record of building reliable, high-performance systems and tooling.
  • Demonstrated experience debugging complex hardware/software systems and driving issues to root cause.
  • Experience analyzing system-level data structures, execution graphs, or dependency networks for diagnostics and validation.
  • Proven ability to design and build intuitive visualization and analysis tools for complex technical data.
  • Experience with compiler internals, custom hardware interfaces, or low-level protocol design.
  • Strong written and verbal communication skills, with the ability to explain technical concepts to diverse stakeholders.
  • Ability to work independently and lead complex technical projects end-to-end.

 

Preferred Skills & Qualifications:

  • Familiarity with machine learning training and inference pipelines, especially distributed training and large-model scaling.
  • Prior work on high-performance clusters, HPC systems, or custom hardware/software co-design.

 

Stack

Python, C++, Machine Learning
Posted: Feb 17, 2026
Last seen: Jun 25, 2026
First seen: Jun 25, 2026
Status: active