Human Baseliner for Open-Ended ML Research Tasks
Undisclosed employer
Compensation
$75-$90/hr
Description
Overview
We are hiring experienced machine learning engineers and researchers to serve as human baseliners for evaluations of open-ended machine learning research tasks. These evaluations measure how well AI agents perform on realistic AI R&D problems. To interpret agent performance, we also need strong human reference points: skilled practitioners attempting the same tasks under the same time and compute constraints. As a baseliner, you will complete self-contained ML research tasks in a sandboxed environment, working independently with your preferred tools and workflow. Your performance will be used as a benchmark against which frontier-model agents are evaluated.
What You’ll Do
- Attempt open-ended machine learning research tasks under a fixed time and compute budget (work trial)
- Work independently in a sandboxed Linux environment with internet access
- Use your preferred tooling, including IDEs and AI coding assistants such as Cursor, Claude Code, and ChatGPT
- Record your full working session via screen recording
- Complete a short pre-task and post-task questionnaire
- Submit your final work product, screen recording, and completed questionnaires. After this, you will be hired for a longer commitment
Commitment
- Minimum 20 hours per week if selected
- More availability is strongly preferred
Requirements
Candidates must meet all of the following:
- 3+ years of machine learning experience
  - Time spent in a PhD program counts toward this requirement
  - Undergraduate and master’s experience does not count
- Attended a top-100 university or worked at FAANG or a comparable company
- Experience with at least one major ML framework such as PyTorch, JAX, or TensorFlow
- Deep, hands-on expertise in at least one of the focus areas below:
- Pretraining under tight data and compute budgets
- PPO, reward shaping, custom gym/gymnasium environments, and throughput tuning
- Full fine-tuning, LoRA, QLoRA, DPO, RLHF, RLAIF, and distillation
- Large-scale corpus filtering, deduplication, subsampling, and benchmark contamination avoidance
- Architecture design under strict parameter-count or size constraints
- Modifying pretrained architectures, including attention patterns, pooling heads, or training objectives
- Contrastive training for embedding or retrieval models
- Generative vision or video modeling
- Multilingual or low-resource language experience
- Image or video data pipelines at scale
- Experience balancing competing model objectives such as safety and capability
- Prior work as an ML evaluator, red-teamer, or baseliner
Required Domain Expertise
Candidates must have strong practical experience in at least one of the following:
- Pretraining: training transformer language models from scratch
- Reinforcement learning: training agents in custom or existing environments
- Post-training: fine-tuning and aligning LLMs
- Dataset curation: building and cleaning large text corpora for LLM training
- Model architecture: designing and modifying neural network architectures
Logistics (work trial requirements)
- One baseline attempt per contractor per task
- Each task may only be attempted once by a given contractor
- All work is confidential and covered by NDA
- Compute and environment are provided; no personal GPU is required
- Commitment: Hourly
- Posted: Apr 16, 2026
- Slots remaining: 5
- First seen: Jun 30, 2026
- Last seen: Jun 30, 2026