
Member of Technical Staff, AI Training Infrastructure

On-site
Fireworks AI
San Mateo, CA, US
Engineering

Compensation

$175,000-$220,000

Description

The Role: 

As a Training Infrastructure Engineer, you'll design, build, and optimize the high-performance infrastructure that powers our large-scale model training operations. You'll collaborate with AI researchers and engineers to build robust training pipelines, optimize distributed training workloads, and keep model development reliable.

Key Responsibilities:

  • Design and implement scalable infrastructure for large-scale model training workloads
  • Develop and maintain distributed training pipelines for LLMs and multimodal models
  • Optimize training performance across multiple GPUs, nodes, and data centers
  • Implement monitoring, logging, and debugging tools for training operations
  • Architect and maintain data storage solutions for large-scale training datasets
  • Automate infrastructure provisioning, scaling, and orchestration for model training
  • Collaborate with researchers to implement and optimize training methodologies
  • Analyze and improve efficiency, scalability, and cost-effectiveness of training systems
  • Troubleshoot complex performance issues in distributed training environments

Minimum Qualifications:

  • Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience
  • 3+ years of experience with distributed systems and ML infrastructure
  • Experience with PyTorch
  • Proficiency in cloud platforms (AWS, GCP, Azure)
  • Experience with containerization and orchestration (Docker, Kubernetes)
  • Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)

Preferred Qualifications:

  • Master's or PhD in Computer Science or related field
  • Experience training large language models or multimodal AI systems
  • Experience with ML workflow orchestration tools
  • Background in optimizing high-performance distributed computing systems
  • Familiarity with ML DevOps practices
  • Contributions to open-source ML infrastructure or related projects

Stack

PyTorch, GPU, LLMs, Distributed Systems, AWS, GCP, Azure, Machine Learning, Kubernetes, Docker

Posted: Apr 21, 2025
Last seen: Jun 25, 2026
First seen: Jun 25, 2026
Status: active