
Member of Technical Staff - Research Engineer
About Black Forest Labs
We're the team behind Latent Diffusion, Stable Diffusion, and FLUX, foundational technologies that changed how the world creates images and video. We build generative models used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we're just getting started.
Headquartered in Freiburg, Germany, with a growing presence in San Francisco, we’re scaling fast while staying true to what makes us different: research excellence, open science, and building technology that expands human creativity.
Why This Role
Large-scale training is where research ideas become real, and where many of the hardest problems are no longer cleanly separated into “research” or “engineering.” A promising architecture only matters if we can train it stably, efficiently, and correctly across large GPU fleets.
In this role, you will be embedded in production training and help where the hardest systems and performance problems arise: attention performance, custom kernels, low-precision training, profiling, memory behavior, data movement, distributed training stability, and throughput regressions. You will work directly with researchers, but your output will often be code, measurements, kernels, debugging tools, and training-system changes that make better research possible.
We are open to a range of seniority for this role. The common thread is deep technical ownership: you should be able to make progress in ambiguous training-system problems, verify your results, and own the outcome.
What You’ll Work On
- Improve the performance, reliability, and numerical stability of production training runs for large multimodal generative models
- Profile full training steps across model code, attention, kernels, data loading, encoders, communication, optimizer steps, checkpointing, and memory pressure
- Implement and validate GPU-level optimizations: fused kernels, attention paths, low-precision matmuls, quantization kernels, CUDA/Triton/CuTe/CUTLASS experiments, and no-compile alternatives where they make sense
- Push lower-precision training forward, including FP8 / MXFP8 / FP4-style paths, weight and activation quantization, accumulation choices, convergence risk, and quality tradeoffs against baseline training runs
- Work with researchers to translate architecture changes into efficient training implementations, and help distinguish real model-quality progress from changes that only look good in a microbenchmark
- Debug distributed training failures: NaNs, loss spikes, silent numerical drift, memory leaks, stragglers, bad nodes, NCCL issues, and throughput cliffs
- Build benchmarking and profiling harnesses that make performance claims trustworthy across hardware, shapes, sequence lengths, and training configurations
- Help the training team move quickly when an urgent bottleneck appears, while turning repeated failures into better abstractions and tools
What We’re Looking For
- Experience working deeply on large-scale training systems, ideally as part of a training group working closely with researchers
- Strong PyTorch fluency, including comfort reading and modifying low-level training code rather than only using high-level APIs
- Experience with distributed training concepts such as FSDP, tensor/model/context/sequence parallelism, activation checkpointing, NCCL, and overlapping compute and communication
- Hands-on experience improving training throughput, memory footprint, or stability in real training runs
- Experience profiling GPU workloads with tools like Nsight Systems, Nsight Compute, torch profiler, trace viewers, or custom telemetry
- Practical GPU performance judgment: you may use modern coding agents and tools as much as you want, but you need the understanding to verify correctness, numerical behavior, and performance, and to own the result
- Understanding of low-precision training and quantization tradeoffs: FP8, MXFP8, FP4/NVFP4-style formats, scaling, accumulation, numerical validation, and convergence risk
- Good research judgment: you can partner with researchers on ablations, understand what the measurements do and do not prove, and keep optimization work tied to model-quality outcomes
- Comfortable operating in ambiguity: sometimes the task is a clean implementation, sometimes it is a production fire, and sometimes it is figuring out which of three plausible explanations is actually true
We'd be especially excited if you:
- Have supported or co-owned training for a frontier foundation model that shipped or reached a major release
- Have written or substantially improved forward/backward GPU kernels, or have shown you can make progress on kernel-level work with strong measurement and validation discipline
- Have worked on attention performance, variable-sequence-length training, or non-standard attention patterns
- Have experience on Hopper or Blackwell-class GPUs
- Have worked on low-precision training
- Have experience with diffusion, flow matching, DiT, and multimodal generative model training, or come from autoregressive or LLM training systems and are excited to learn the diffusion and multimodal modeling stack quickly
- Can move naturally between profiler traces, kernel code, distributed systems failures, and research discussions
How We Work Together
We’re a distributed team with real offices that people actually use. Depending on your role, you’ll either join us in Freiburg or SF at least 2 days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected. We’ll cover reasonable travel costs to make this possible. We think in-person time matters, and we’ve structured things to make it accessible to all. We’ll discuss what this will look like for the role during our interview process.
Everything we do is grounded in four values:
- Obsessed. We are a frontier research lab. The science has to be right, the understanding deep, the product beautiful.
- Low Ego. The work speaks. The best idea wins, no matter who said it. Credit is shared. Nobody is above any task.
- Bold. We take the ambitious bet. We ship; we do not wait for conditions to be perfect.
- Kind. People over politics. We treat each other with genuine warmth. Agency without empathy creates chaos.
If this sounds like work you’d enjoy, we’d love to hear from you.
Base Annual Salary:
US $180,000 - $290,000 + equity
- Posted: Jun 29, 2026
- Last seen: Jun 29, 2026
- First seen: Jun 29, 2026


