Kairos
Human Baseliner for Open-Ended ML Research Tasks

Remote

Undisclosed employer

Gig
Mercor

Compensation

$75-$90/hr

Description

Overview

We are hiring experienced machine learning engineers and researchers to serve as human baseliners for evaluations of open-ended machine learning research tasks. These evaluations measure how well AI agents perform on realistic AI R&D problems. To interpret agent performance, we also need strong human reference points: skilled practitioners attempting the same tasks under the same time and compute constraints. As a baseliner, you will complete self-contained ML research tasks in a sandboxed environment, working independently with your preferred tools and workflow. Your performance will be used as a benchmark against which frontier-model agents are evaluated.

What You’ll Do

  • Attempt open-ended machine learning research tasks under a fixed time and compute budget, initially as a work trial

  • Work independently in a sandboxed Linux environment with internet access

  • Use your preferred tooling, including IDEs and AI coding assistants such as Cursor, Claude Code, and ChatGPT

  • Record your full working session via screen recording

  • Complete short pre-task and post-task questionnaires

  • Submit your final work product, screen recording, and completed questionnaires; after this, you will be hired for a longer commitment

Commitment

  • Minimum 20 hours per week if selected

  • More availability is strongly preferred

Requirements

Candidates must meet all of the following:

  • 3+ years of machine learning experience

    • Time spent in a PhD program counts toward this requirement

    • Undergraduate and master’s experience does not count

  • Attended a top-100 university or worked at FAANG or a comparable company

  • Experience with at least one major ML framework such as PyTorch, JAX, or TensorFlow

  • Deep, hands-on expertise in at least one of the focus areas below:

    • Pretraining under tight data and compute budgets

    • PPO, reward shaping, custom gym / gymnasium environments, and throughput tuning

    • Full fine-tuning, LoRA, QLoRA, DPO, RLHF, RLAIF, and distillation

    • Large-scale corpus filtering, deduplication, subsampling, and benchmark contamination avoidance

    • Architecture design under strict parameter-count or size constraints

    • Modifying pretrained architectures, including attention patterns, pooling heads, or training objectives

    • Contrastive training for embedding or retrieval models

    • Generative vision or video modeling

    • Multilingual or low-resource language experience

    • Image or video data pipelines at scale

    • Experience balancing competing model objectives such as safety and capability

    • Prior work as an ML evaluator, red-teamer, or baseliner

Required Domain Expertise

Candidates must have strong practical experience in at least one of the following:

  • Pretraining: training transformer language models from scratch

  • Reinforcement learning: training agents in custom or existing environments

  • Post-training: fine-tuning and aligning LLMs

  • Dataset curation: building and cleaning large text corpora for LLM training

  • Model architecture: designing and modifying neural network architectures

Logistics (work trial requirements)

  • One baseline attempt per contractor per task; a task cannot be reattempted

  • All work is confidential and covered by NDA

  • Compute and environment are provided; no personal GPU is required

Commitment
Hourly

Skills & categories

PyTorch, Data Analysis, GPU, Embeddings, LLMs, Agentic AI, Fine-tuning, Machine Learning, Data Engineering, TensorFlow, JAX, Deep Learning, Reinforcement Learning
Posted
Apr 16, 2026
Slots remaining
5
First seen
Jun 30, 2026
Last seen
Jun 30, 2026