
Performance & Reliability Engineer
On-site
Performance
Compensation
Salary undisclosed
Description
About The Role
Join Cerebras as a Performance & Reliability Engineer within our innovative Co-Design and Next Generation Team. Our groundbreaking CS-3 system has set new benchmarks in high-performance ML training and inference solutions. It leverages a dinner-plate-sized chip with 44GB of on-chip memory to surpass traditional hardware capabilities. This role focuses on characterizing and optimizing the performance and reliability of state-of-the-art AI models running on Cerebras' breakthrough hardware.
Responsibilities
- Characterize and enhance the performance and reliability of advanced ML hardware/software systems, with emphasis on reducing power and thermal fluctuations.
- Analyze ML workloads, software kernels, and hardware architecture for power and performance impacts, and synthesize high-level insights across these layers.
- Develop creative software solutions to improve reliability and performance, collaborating cross-functionally to deploy these solutions in production.
- Influence the design of Cerebras' next-generation AI architecture and software stack through rigorous workload analysis and computational efficiency optimization.
- Partner with ML engineers, researchers, and reliability specialists to understand model behavior and drive system-level improvements from a software perspective.
- Collaborate with teams in architecture, silicon, and research to advance our computational platforms and influence future system designs.
Skills & Qualifications
- BS, MS, or PhD in Computer Science, Electrical Engineering, or a related field.
- 3+ years of relevant experience in performance engineering, reliability, computer architecture, and/or software design.
- Proficiency in Python or other scripting languages.
- Experience with C/C++ and assembly programming.
- Demonstrated expertise with system-level performance and reliability optimization.
- Strong verbal and written communication skills.
- Nice to have: Hands-on experience with ML models, ML frameworks, and collective communication.
- Nice to have: Understanding of thermal management principles and power delivery for advanced semiconductors.
Stack
Python, C++, Machine Learning
- Posted: Nov 25, 2025
- Last seen: Jun 25, 2026
- First seen: Jun 25, 2026
- Status: active