
ML Software Tool Development Engineer
On-site
Software
Compensation
Salary undisclosed
Description
Responsibilities:
- Lead the design and implementation of system-level debugging, validation, and observability platforms.
- Develop automated systems for collecting and analyzing numerical and execution anomalies.
- Create visualization and analysis tools to enable efficient root-cause investigation.
- Build frameworks for failure classification, regression detection, and anomaly monitoring.
- Extend compilers, runtimes, and programming interfaces to support advanced profiling and instrumentation.
- Improve system bring-up, low-level debug, and validation workflows.
- Partner cross-functionally with compiler, hardware, firmware, runtime, and infrastructure teams.
- Establish best practices for debuggability, reliability, and operational excellence.
- Lead high-impact initiatives.
- Support incident response and drive long-term corrective actions.
Qualifications:
- Strong proficiency in C++ and Python, with a track record of building reliable, high-performance systems and tooling.
- Demonstrated experience debugging complex hardware/software systems and driving issues to root cause.
- Experience analyzing system-level data structures, execution graphs, or dependency networks for diagnostics and validation.
- Proven ability to design and build intuitive visualization and analysis tools for complex technical data.
- Experience with compiler internals, custom hardware interfaces, or low-level protocol design.
- Strong written and verbal communication skills, with the ability to explain technical concepts to diverse stakeholders.
- Ability to work independently and lead complex technical projects end-to-end.
Preferred Skills & Qualifications:
- Familiarity with machine learning training and inference pipelines, especially distributed training and large-model scaling.
- Prior work on high-performance clusters, HPC systems, or custom hardware/software co-design.
Stack
Python, C++, Machine Learning
- Posted: Feb 17, 2026
- Last seen: Jun 25, 2026
- First seen: Jun 25, 2026
- Status: active