For Mercor · AI Platform

Labeling and RLHF Workflows


AI Talent Marketplace & Data Labeling — Mercor

Mercor evals — Labeling & RLHF Workflows (relift v3 InfraRed)

About Mercor

Mercor is an AI talent marketplace and human-data infrastructure provider for frontier AI labs and enterprises. It runs ~20-minute AI-led video interviews, matches a global network of domain experts to projects, and operates labeling, RLHF preference data, rubric authoring, and evaluation framework workflows for customers including leading AI labs.

Employees

~200

Industry

AI Talent & Data Labeling

Headquarters

San Francisco, CA

Website

mercor.com

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Mid-project, the customer lab updates labeling guidelines. Some contractors have already labeled 200 examples with v1; new contractors arrive with v2 guidelines. The lab receives the merged labels with no version tag.

Every label row must carry the instruction-pack version that produced it. On a guideline change, re-label the affected partition or surface the version split to the customer for re-decision; do not silently merge v1 and v2 labels as if they were calibrated to the same rubric.

Pass / Fail · AI Platform · critical
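
One way to enforce the version-tagging rule in test 01 is to reject untagged rows and split labels by guideline version before anything is delivered. The sketch below is illustrative only: `LabelRow`, `guideline_version`, and `split_for_delivery` are hypothetical names chosen for this example, not part of any Mercor API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelRow:
    example_id: str
    label: str
    guideline_version: str  # hypothetical field: instruction-pack version used


def split_for_delivery(rows, current_version):
    """Separate rows labeled under the current guidelines from older ones.

    Rows with no version tag are rejected outright, because they cannot be
    assigned to either partition.
    """
    by_version = defaultdict(list)
    for row in rows:
        if not row.guideline_version:
            raise ValueError(f"label for {row.example_id} has no guideline version")
        by_version[row.guideline_version].append(row)

    # Older partitions are surfaced, not merged: they need re-labeling or a
    # customer decision before they can be treated as comparable.
    stale = {
        version: [r.example_id for r in version_rows]
        for version, version_rows in by_version.items()
        if version != current_version
    }
    return {
        "deliverable": by_version.get(current_version, []),
        "needs_relabel_or_decision": stale,
    }
```

For example, calling `split_for_delivery(rows, "v2")` on a mix of v1 and v2 rows returns only the v2 rows as deliverable and lists the v1 example IDs separately.
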
02

Sensitive-content labeling uses a 3-pass review chain. A reviewer in pass 3 sees the pass-1 and pass-2 scores up front and anchors to them rather than reviewing independently.

Hide upstream pass scores until the current reviewer commits theirs (blind review). Compute inter-annotator agreement across passes; on persistent disagreement, route to a senior reviewer or escalate to the customer for adjudication. Track per-reviewer pass rate.

Pass / FailAi Platformhigh
03

For RLHF preference labeling, the chosen completion is consistently shown as option A. Labelers develop a positional preference for A regardless of content.

Randomize the A/B presentation order per pair; record the original mapping alongside the labeled preference; periodically audit for position bias and drop labelers above a documented bias threshold. Surface position-bias metrics in the delivery package to the customer lab.

Pass / Fail · AI Platform · critical
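
A minimal sketch of the order randomization and position-bias audit described in test 03 follows. All function names are hypothetical, and the `max_deviation` and `min_pairs` defaults are placeholder values standing in for whatever documented threshold a project adopts.

```python
import random


def present_pair(pair_id, completion_x, completion_y, rng):
    """Randomly assign the two completions to slots A and B, recording the mapping."""
    swapped = rng.random() < 0.5
    shown_a, shown_b = (completion_y, completion_x) if swapped else (completion_x, completion_y)
    return {"pair_id": pair_id, "A": shown_a, "B": shown_b, "swapped": swapped}


def resolve_choice(presentation, chosen_slot):
    """Map the labeler's slot choice ('A' or 'B') back to the original 'x' or 'y'."""
    if chosen_slot not in ("A", "B"):
        raise ValueError(f"unknown slot: {chosen_slot!r}")
    picked_first = chosen_slot == "A"
    return "y" if picked_first == presentation["swapped"] else "x"


def flag_position_bias(slots_by_labeler, max_deviation=0.15, min_pairs=50):
    """Return labelers whose rate of choosing slot A is far from 0.5.

    With randomized order, an unbiased labeler should choose A roughly half
    the time; both thresholds here are assumed values, not documented ones.
    """
    flagged = []
    for labeler, slots in slots_by_labeler.items():
        if len(slots) < min_pairs:
            continue  # too few pairs to judge
        rate_a = sum(s == "A" for s in slots) / len(slots)
        if abs(rate_a - 0.5) > max_deviation:
            flagged.append(labeler)
    return sorted(flagged)
```

Passing a seeded `random.Random` instance as `rng` keeps the presentation order reproducible for later audits, and the per-labeler slot-A rates can be reported alongside the delivered preferences.
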

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
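
The pass rule above can be read as a check over a set of per-criterion scores. The sketch below assumes that reading: the mean across criteria must reach 4.0 and no single criterion may fall below 3. It also assumes a numeric scale such as 1 to 5, which the page does not state. `passes_rubric` is a hypothetical helper name.

```python
def passes_rubric(criterion_scores, min_mean=4.0, floor=3):
    """Return True when the mean score meets `min_mean` and every score meets `floor`.

    `criterion_scores` maps criterion name -> numeric score (scale assumed).
    """
    if not criterion_scores:
        raise ValueError("no criterion scores to grade")
    scores = list(criterion_scores.values())
    mean = sum(scores) / len(scores)
    return mean >= min_mean and min(scores) >= floor
```

Under this reading, scores of 5, 4, and 3 pass (mean 4.0, minimum 3), while 5, 5, and 2 fail on the floor even though the mean is also 4.0.
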

Rubric criteria

  • Mercor
  • AI Platform
  • Labeling and RLHF Workflows

Recommended for

Mercor · Mercor customers
