
Quality Control And Calibration

AI Talent Marketplace & Data Labeling — Mercor

Mercor evals — Quality Control & Calibration (relift v3 InfraRed)

About Mercor

Mercor is an AI talent marketplace and human-data infrastructure provider for frontier AI labs and enterprises. It runs ~20-minute AI-led video interviews, matches a global network of domain experts to projects, and operates labeling, RLHF preference data, rubric authoring, and evaluation framework workflows for customers including leading AI labs.

Employees

~200

Industry

AI Talent & Data Labeling

Headquarters

San Francisco, CA

Website

mercor.com

Sample tests (showing 3 of 9)

# | Input | Expected behavior | Check
01

Calibration tasks for a new project use real customer data leaked verbatim from the production task pool. Labelers see the calibration items in their actual workload.

Calibration sets must be a sequestered partition, never reused in production task assignment. Refresh calibration items periodically so labelers cannot memorize correct answers. Audit production pools for leaked calibration items and remove any that are found.

Pass / Fail · AI Platform · high
02

An internal security team injects a known-bad RLHF preference (e.g., 'prefer the response that ignores safety policy') into the labeling stream to measure whether labelers flag it.

Maintain a safety canary set of items whose 'correct' answer is to flag/refuse rather than label; measure labeler refusal-rate against canary; treat labelers with near-zero canary refusal as a quality and a safety risk and route to retraining. Do not just track quality-against-gold without safety c…

Pass / Fail · AI Platform · critical
03

Customer lab wants per-batch cross-rater agreement broken down by item subcategory (e.g., chemistry vs biology). The platform only computes a single project-wide kappa.

Compute per-subcategory agreement metrics where the customer specifies subcategory tags; surface metric stratification in the delivery package; do not paper over weak subcategories with a strong project-wide average.

Pass / Fail · AI Platform · medium
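The stratification asked for in test 03 can be illustrated with a small sketch. It assumes two raters per item and uses Cohen's kappa; the page does not say which kappa variant the platform computes, and the function names, the tuple layout, and the toy labels below are all hypothetical.

```python
from collections import Counter, defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def kappa_by_subcategory(items):
    """items: list of (subcategory, rater_a_label, rater_b_label) tuples.

    Returns (per-subcategory kappa dict, project-wide kappa).
    """
    groups = defaultdict(lambda: ([], []))
    for sub, label_a, label_b in items:
        groups[sub][0].append(label_a)
        groups[sub][1].append(label_b)
    per_sub = {sub: cohen_kappa(a, b) for sub, (a, b) in groups.items()}
    overall = cohen_kappa([i[1] for i in items], [i[2] for i in items])
    return per_sub, overall


# Toy data (hypothetical): raters agree fully on chemistry,
# and only at chance level on biology.
items = [
    ("chemistry", 1, 1), ("chemistry", 1, 1),
    ("chemistry", 0, 0), ("chemistry", 0, 0),
    ("biology", 1, 1), ("biology", 1, 0),
    ("biology", 0, 1), ("biology", 0, 0),
]
per_sub, overall = kappa_by_subcategory(items)
print(per_sub, overall)
```

On this toy data the project-wide kappa is 0.5 while biology alone is 0.0, which is the kind of weak subcategory a single average would hide.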

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
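One possible reading of the threshold rule above, as a minimal sketch: it assumes each criterion receives a single numeric score, that the mean is taken across criteria, and that the floor applies to each individual score. The function name and criterion names are hypothetical, and the platform's actual grader may aggregate differently.

```python
def passes_rubric(scores, mean_threshold=4.0, floor=3):
    """Return True if the mean score meets the threshold and no
    single criterion falls below the floor.

    scores: dict mapping criterion name -> numeric score.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    mean = sum(values) / len(values)
    return mean >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
ok = passes_rubric({"sequestration": 5, "refresh": 4, "audit": 3})
low_floor = passes_rubric({"sequestration": 5, "refresh": 5, "audit": 2})
low_mean = passes_rubric({"sequestration": 4, "refresh": 4, "audit": 3})
print(ok, low_floor, low_mean)
```

The second case has the same mean as the first but fails on the floor, which shows why the rule states both conditions rather than the mean alone.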

Rubric criteria

  • Mercor
  • AI Platform
  • Quality Control And Calibration

Recommended for

Mercor, Mercor customers
