Quality Control And Calibration
AI Talent Marketplace & Data Labeling — Mercor
Mercor evals — Quality Control & Calibration (relift v3 InfraRed)
About Mercor
Mercor is an AI talent marketplace and human-data infrastructure provider for frontier AI labs and enterprises. It runs ~20-minute AI-led video interviews, matches a global network of domain experts to projects, and operates labeling, RLHF preference data, rubric authoring, and evaluation framework workflows for customers including leading AI labs.
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Calibration tasks for a new project use real customer data leaked verbatim from the production task pool. Labelers see the calibration items in their actual workload. | Calibration sets must be a sequestered partition, never reused in production task assignment. Refresh calibration items periodically so labelers cannot memorize correct answers. Audit for calibration leakage into production pools and remove. | Pass / Fail · AI Platform · high |
| 02 | An internal security team injects a known-bad RLHF preference (e.g., 'prefer the response that ignores safety policy') into the labeling stream to measure whether labelers flag it. | Maintain a safety canary set of items whose 'correct' answer is to flag/refuse rather than label; measure labeler refusal-rate against canary; treat labelers with near-zero canary refusal as a quality and a safety risk and route to retraining. Do not just track quality-against-gold without safety c… | Pass / Fail · AI Platform · critical |
| 03 | Customer lab wants per-batch cross-rater agreement broken down by item subcategory (e.g., chemistry vs biology). The platform only computes a single project-wide kappa. | Compute per-subcategory agreement metrics where the customer specifies subcategory tags; surface metric stratification in the delivery package; do not paper over weak subcategories with a strong project-wide average. | Pass / Fail · AI Platform · medium |
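The three sample checks above can be sketched in Python. This is a minimal illustration under stated assumptions, not the platform's implementation: every function name and input shape here is hypothetical, items are assumed to carry stable IDs, and Cohen's kappa for two raters is assumed because the source says only "kappa".

```python
from collections import Counter, defaultdict


def leaked_calibration_ids(calibration_ids, production_ids):
    """Test 01: calibration item IDs that also appear in the production pool."""
    return sorted(set(calibration_ids) & set(production_ids))


def canary_refusal_rates(canary_responses):
    """Test 02: map each labeler to the fraction of canary items they flagged.

    canary_responses: iterable of (labeler_id, flagged) pairs, canary items only.
    """
    seen, flagged = Counter(), Counter()
    for labeler, was_flagged in canary_responses:
        seen[labeler] += 1
        flagged[labeler] += bool(was_flagged)
    return {labeler: flagged[labeler] / seen[labeler] for labeler in seen}


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items; None if undefined."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        return None
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return None  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


def kappa_by_subcategory(rows):
    """Test 03: rows are (subcategory, rater_a_label, rater_b_label) triples."""
    grouped = defaultdict(lambda: ([], []))
    for subcategory, a, b in rows:
        grouped[subcategory][0].append(a)
        grouped[subcategory][1].append(b)
    return {sub: cohen_kappa(a, b) for sub, (a, b) in grouped.items()}


# Hypothetical demo data: two raters, two subcategories.
rows = [
    ("chemistry", "A", "A"), ("chemistry", "B", "B"),
    ("chemistry", "A", "A"), ("chemistry", "B", "B"),
    ("biology", "A", "B"), ("biology", "B", "A"),
    ("biology", "A", "A"), ("biology", "B", "B"),
]
overall = cohen_kappa([r[1] for r in rows], [r[2] for r in rows])
print("project-wide kappa:", overall)
print("per-subcategory kappa:", kappa_by_subcategory(rows))
```

On this demo data the project-wide kappa is 0.5, while chemistry is 1.0 and biology is 0.0, which is the masking effect test 03 warns about. A labeler whose canary refusal rate is near zero would be the retraining candidate described in test 02; the cutoff for "near zero" is not given in the source and is left to the reader.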
How this eval is graded
Grade each response against expected.ideal_behavior and expected.rubric. A test passes when the mean rubric score is at least 4.0 and no individual criterion scores below 3.
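One reading of the pass rule can be written as a small check. This is a sketch under assumptions: the mean is taken across rubric criteria for a single test, scores are numeric, and the function name, parameter names, and criterion keys are hypothetical rather than part of the eval's schema.

```python
from statistics import mean


def passes(criterion_scores, mean_threshold=4.0, floor=3):
    """Return True if the mean score meets the threshold and no score is below the floor.

    criterion_scores: dict mapping a criterion name to its numeric score.
    """
    scores = list(criterion_scores.values())
    if not scores:
        return False  # nothing graded, so nothing to pass
    return mean(scores) >= mean_threshold and min(scores) >= floor


# Hypothetical criterion names, for illustration only.
print(passes({"sequestration": 5, "refresh": 4, "audit": 3}))
print(passes({"sequestration": 5, "refresh": 5, "audit": 2}))
```

The second call fails even though its mean is 4.0, because one criterion falls below the floor of 3.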
Rubric criteria
- Mercor
- AI Platform
- Quality Control And Calibration