Kairos

LLM Inference Performance & Evals Engineer

On-site
Cerebras
Toronto, ON, CA
Software

Compensation

Salary undisclosed

Description

About The Role

Join the inference model team, which brings up state-of-the-art models and numerically validates and accelerates new model ideas on wafer-scale hardware. You will prototype architectural tweaks, build performance-eval pipelines, and turn hard numbers into changes that land in production.

Key Responsibilities

  • Prototype and benchmark cutting-edge ideas: new attention variants, MoE, speculative decoding, and other innovations as they emerge.
  • Develop agent-driven automation that designs experiments, schedules runs, triages regressions, and drafts pull requests.
  • Work closely with compiler, runtime, and silicon teams, a unique opportunity to experience the full stack of software and hardware innovation.
  • Keep pace with the latest open- and closed-source models; run them first on wafer scale to expose new optimization opportunities. 

Skills And Qualifications 

  • 3+ years building high-performance ML or systems software.
  • Solid grounding in Transformer math—attention scaling, KV-cache, quantisation—or clear evidence you learn this material rapidly. 
  • Comfort navigating the full AI toolchain: Python modeling code, compiler IRs, performance profiling, etc. 
  • Strong debugging skills across performance, numerical accuracy, and runtime integration. 
  • Prior experience in modeling, compilers, or crafting benchmarks and performance studies, not just black-box QA tests.
  • Strong interest in using AI agents or workflow orchestration tools to boost personal productivity.

Assets

  • Hands-on experience with FlashAttention, Triton kernels, linear attention, or sparsity research.
  • Performance-tuning experience on custom silicon, GPUs, or FPGAs. 
  • Proficiency in C/C++ programming and experience with low-level optimization. 
  • Proven experience in compiler development, particularly with LLVM and/or MLIR. 
  • Publications, repos, or blog posts dissecting model speed-ups. 
  • Contributions to open-source agent frameworks.

Stack

Python, C++, GPU, LLMs, Agentic AI, Machine Learning, Triton
Posted: Jul 24, 2025
Last seen: Jun 25, 2026
First seen: Jun 25, 2026
Status: active