Labeling and RLHF Workflows
AI Talent Marketplace & Data Labeling — Mercor
Mercor evals — Labeling & RLHF Workflows (relift v3 InfraRed)
About Mercor
Mercor is an AI talent marketplace and human-data infrastructure provider for frontier AI labs and enterprises. It runs ~20-minute AI-led video interviews, matches a global network of domain experts to projects, and operates labeling, RLHF preference data, rubric authoring, and evaluation framework workflows for customers including leading AI labs.
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Mid-project, the customer lab updates labeling guidelines. Some contractors have already labeled 200 examples with v1; new contractors arrive with v2 guidelines. The lab receives the merged labels with no version tag. | Every label row must carry the instruction-pack version that produced it. On a guideline change, re-label the affected partition or surface the version split to the customer for re-decision; do not silently merge v1 and v2 labels as if they were calibrated to the same rubric. | Pass / Fail · AI Platform · critical |
| 02 | Sensitive-content labeling uses a 3-pass review chain. A reviewer in pass 3 sees the pass-1 and pass-2 scores up front and anchors to them rather than reviewing independently. | Hide upstream pass scores until the current reviewer commits their own (blind review). Compute inter-annotator agreement across passes; on persistent disagreement, route to a senior reviewer or escalate to the customer for adjudication. Track per-reviewer pass rate. | Pass / Fail · AI Platform · high |
| 03 | For RLHF preference labeling, the chosen completion is consistently shown as option A. Labelers develop a positional preference for A regardless of content. | Randomize the A/B presentation order per pair; record the original mapping alongside the labeled preference; periodically audit for position bias and drop labelers above a documented bias threshold. Surface position-bias metrics in the delivery package to the customer lab. | Pass / Fail · AI Platform · critical |
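The expected behavior in test 03 can be sketched in a few lines of Python. This is an illustrative sketch, not Mercor's implementation: the function names, the `first`/`second` keys for the original pair order, and the 0.2 bias threshold are all assumptions made for the example. The audit treats a labeler's share of "A" picks drifting far from 0.5 as a position-bias signal, which is only meaningful once presentation order is randomized.

```python
import random
from collections import defaultdict


def present_pair(pair_id, first, second, rng):
    """Assign two completions to slots A and B at random and record the mapping.

    `first` and `second` are the completions in their original (source) order;
    these key names are hypothetical.
    """
    swapped = rng.random() < 0.5
    slot_a, slot_b = (second, first) if swapped else (first, second)
    mapping = {"A": "second", "B": "first"} if swapped else {"A": "first", "B": "second"}
    return {"pair_id": pair_id, "A": slot_a, "B": slot_b, "mapping": mapping}


def resolve_preference(shown, picked_slot):
    """Translate the labeler's slot choice ("A" or "B") back to the original key."""
    return shown["mapping"][picked_slot]


def position_bias(picks):
    """Return labeler_id -> |share of "A" picks - 0.5|.

    `picks` is an iterable of (labeler_id, picked_slot) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # labeler -> [times A picked, total picks]
    for labeler_id, slot in picks:
        counts[labeler_id][0] += slot == "A"
        counts[labeler_id][1] += 1
    return {lid: abs(a / n - 0.5) for lid, (a, n) in counts.items()}


def flag_biased(bias_by_labeler, threshold=0.2):
    """List labelers whose bias exceeds the (assumed) documented threshold."""
    return sorted(lid for lid, bias in bias_by_labeler.items() if bias > threshold)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the example is reproducible
    shown = present_pair("pair-001", "completion one", "completion two", rng)
    print(shown["mapping"], resolve_preference(shown, "A"))
```

In practice the audit would also need a minimum number of labels per labeler before flagging anyone, since a handful of picks can look biased by chance.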
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
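One way to read the pass rule is as a check over per-criterion scores: the mean across criteria must reach 4.0 and no single criterion may fall below 3. The sketch below assumes that reading and assumes numeric scores keyed by criterion name; the function name and the example criterion names are hypothetical.

```python
from statistics import mean


def eval_passes(scores, mean_threshold=4.0, floor=3.0):
    """Return True if the mean score meets the threshold and no criterion is below the floor.

    `scores` maps criterion name -> numeric score. An empty mapping fails,
    since there is nothing to grade (an assumption of this sketch).
    """
    if not scores:
        return False
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


if __name__ == "__main__":
    # Same mean (4.0) in both cases; the second fails on the per-criterion floor.
    print(eval_passes({"versioning": 5, "blind_review": 4, "bias_audit": 3}))
    print(eval_passes({"versioning": 5, "blind_review": 5, "bias_audit": 2}))
```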
Rubric criteria
- Mercor
- AI Platform
- Labeling and RLHF Workflows