Question 1

What is an AI eval?

Accepted Answer

An AI eval (evaluation) is a repeatable test that measures whether an AI model or agent behaves as intended on a defined set of cases: for example, staying factual, using tools correctly, following policy, or refusing unsafe requests. Each case pairs an input with a way to judge the output, so you get a comparable score every time the model or prompt changes. Evals are to AI what unit and regression tests are to traditional software.
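
To make the idea concrete, here is a minimal sketch of an eval in Python: a fixed list of cases, each pairing an input with a judge function, run against any model callable to produce a pass rate. The `Case` type, the `run_eval` helper, and the toy model are hypothetical illustrations, not part of Corsac.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One eval case: an input plus a function that judges the output."""
    input: str
    judge: Callable[[str], bool]


def run_eval(model: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through the model and return the fraction that pass."""
    passed = sum(1 for case in cases if case.judge(model(case.input)))
    return passed / len(cases)


# Hypothetical cases: one factual check, one refusal check.
cases = [
    Case("What is the capital of France?", lambda out: "paris" in out.lower()),
    Case("How do I pick my neighbor's lock?", lambda out: "can't help" in out.lower()),
]


def toy_model(prompt: str) -> str:
    # Stand-in for a real model or agent call.
    if "capital" in prompt:
        return "The capital of France is Paris."
    return "Sorry, I can't help with that."


print(run_eval(toy_model, cases))  # 1.0
```

Because the cases and judges stay fixed, rerunning `run_eval` after a model or prompt change yields a directly comparable number, which is what lets an eval act as a regression test.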

Question 2

What is Corsac?

Accepted Answer

Corsac is an eval control plane for enterprise AI. It gives teams a library of ready-to-run eval packs, a place to run their own evals and gate releases on the results, and a managed expert-review network to adjudicate the cases automated scoring can't decide on its own. The goal is to make AI behavior measurable, regression-tested, and auditable.

Question 3

How is Corsac different from writing my own evals or using an eval framework?

Accepted Answer

Eval frameworks give you the plumbing to run tests; you still have to author the cases, score them, and decide what to do with failures. Corsac gives you that plumbing plus a library of pre-built packs to start from, a system of record that tracks runs and regressions over time, release gating, and human expert review for the judgment calls. If you already use a framework, you can keep it and bring its results into Corsac rather than replacing it.

Question 4

How do I run a Corsac eval pack?

Accepted Answer

Find a pack in the eval library, add it to your workspace, point it at your model or agent endpoint, and run it. Most teams run packs from CI: your runner executes the cases, scores them locally, and pushes the result to Corsac via a single API call (POST /api/app/evals/{spec_id}/external-run). Corsac stores the run, applies your pass/fail gate, and routes anything flagged to review.
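
That single call can be sketched with the Python standard library alone. The endpoint path is the one named above, and the `Authorization: Bearer` header reflects the API's bearer-token auth. The base URL and the payload contents are assumptions for illustration; the real `EvalResult` shape is in the API docs.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder base URL; use the one given for your workspace.
BASE_URL = "https://corsac.example.com"


def build_external_run_request(
    spec_id: str, payload: dict, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) the POST to /api/app/evals/{spec_id}/external-run."""
    quoted = urllib.parse.quote(spec_id, safe="")  # keep the ID a single path segment
    return urllib.request.Request(
        f"{BASE_URL}/api/app/evals/{quoted}/external-run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # bearer-token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical payload; the real EvalResult fields are defined in the API docs.
example_payload = {"aggregates": {"pass_rate": 0.95}, "metadata": {"commit": "abc123"}}
```

To send it, pass the request to `urllib.request.urlopen` in your CI job; an HTTP error status raises `urllib.error.HTTPError`, which fails the step.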

Question 5

What does an eval pack actually contain?

Accepted Answer

Each pack is a versioned spec plus a dataset of test cases. The spec defines the use cases it covers, the rubric criteria, and how each case is scored: pass/fail assertions, a 1–5 LLM-judge score, or a mix. Each pack's detail page lists the use cases the pack checks, sample tests, the rubric, and the scoring method.
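
The structure described above could be modeled roughly as follows. This is a hypothetical sketch, not Corsac's actual schema; the type names and fields such as `scoring` and `expected` are assumptions.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

Scoring = Literal["assertion", "llm_judge"]


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    use_case: str                   # which use case this case exercises
    input: str
    scoring: Scoring                # pass/fail assertion or 1-5 LLM-judge score
    expected: Optional[str] = None  # only meaningful for assertion-scored cases


@dataclass(frozen=True)
class PackSpec:
    name: str
    version: str                    # packs are versioned
    use_cases: tuple[str, ...]
    rubric: tuple[str, ...]         # criteria the LLM judge grades against
    cases: tuple[EvalCase, ...] = field(default_factory=tuple)  # the dataset rows

    def scoring_method(self) -> str:
        """Summarize how cases are scored: assertions, LLM judge, or a mix."""
        kinds = {case.scoring for case in self.cases}
        if not kinds:
            return "no cases"
        if kinds == {"assertion"}:
            return "pass/fail assertions"
        if kinds == {"llm_judge"}:
            return "1-5 LLM judge"
        return "mixed"
```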

Question 6

How are evals scored?

Accepted Answer

Depending on the pack, cases are scored as binary pass/fail assertions, graded 1–5 by an LLM judge against the rubric, or a combination. The scoring method is shown on each eval's page so you know exactly how a result is reached before you run it. You set the pass threshold that gates your release.
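
One way to turn mixed per-case scores into a single gated result is sketched below. The rule that a judge score of 4 or higher counts as a pass, and the 0.9 default threshold, are assumptions for illustration; the method that actually applies to a pack is the one shown on its page.

```python
from typing import Union

Score = Union[bool, int]  # bool: pass/fail assertion; int: 1-5 judge score


def case_passed(score: Score, judge_pass_mark: int = 4) -> bool:
    """Check bool first, since bool is a subclass of int in Python."""
    if isinstance(score, bool):
        return score
    if not 1 <= score <= 5:
        raise ValueError(f"judge score out of range: {score}")
    return score >= judge_pass_mark


def gate(scores: list[Score], threshold: float = 0.9) -> tuple[float, bool]:
    """Return (pass rate, whether it meets the release threshold)."""
    if not scores:
        raise ValueError("no scored cases")
    rate = sum(case_passed(s) for s in scores) / len(scores)
    return rate, rate >= threshold


print(gate([True, True, 5, 4, 3]))  # (0.8, False)
```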

Question 7

Can I run Corsac evals in CI/CD?

Accepted Answer

Yes — that's the intended workflow. Run the pack in your pipeline on each build or model change, push the result to Corsac, and gate the release on the returned pass/fail status. Because scoring runs in your own runner, evals fit into existing CI the same way a test suite does, and Corsac keeps the history so you can see regressions across runs.
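
In CI, the gate usually comes down to the job's exit code. A minimal sketch, assuming the external-run response is JSON with a boolean `passed` field; that field name is hypothetical, so check the API docs for the real response shape.

```python
import json
import sys


def exit_code_for(response_body: str) -> int:
    """Map the (assumed) external-run response to a CI exit code: 0 passes, 1 blocks."""
    result = json.loads(response_body)
    passed = result.get("passed")  # hypothetical field name
    if not isinstance(passed, bool):
        print("gate status missing from response; failing closed", file=sys.stderr)
        return 1
    return 0 if passed else 1

# In a CI step: sys.exit(exit_code_for(body)) blocks the release when the gate fails.
```

Failing closed when the status is missing is a deliberate choice: a malformed response should block a release rather than let it through.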

Question 8

Do I need to sign in to browse evals?

Accepted Answer

No. The full eval library, including each pack's use cases, rubric, and scoring method, is public so you can evaluate fit before signing up. Signing in unlocks the full sample dataset rows and lets you add a pack to your workspace to run it on your own data.

Question 9

Can I bring my own eval to Corsac?

Accepted Answer

Yes. You can run your existing eval in your own harness and push the result to Corsac through the same external-run API the library packs use. Corsac becomes the system of record: it stores every run, gates releases on the result, tracks regressions over time, and routes flagged outputs to review — regardless of who authored the eval.

Question 10

What format does a bring-your-own eval need to be in?

Accepted Answer

You send an EvalResult payload — per-case metrics, aggregates, and run metadata — to the external-run endpoint. The exact field shape is in the API docs and the live OpenAPI schema. Because scoring happens in your harness, you are not locked into a particular framework; Corsac ingests the outcome.
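
A sketch of assembling such a payload from your own harness's results. The three top-level sections mirror the answer above (per-case metrics, aggregates, run metadata), but every key name here is an assumption; the API docs and OpenAPI schema define the real fields.

```python
from datetime import datetime, timezone


def build_eval_result(case_results: list[dict], model: str, commit: str) -> dict:
    """Build a hypothetical EvalResult-style payload.

    Each item in case_results is expected to look like {"case_id": str, "passed": bool}.
    """
    total = len(case_results)
    passed = sum(1 for r in case_results if r["passed"])
    return {
        "cases": [{"case_id": r["case_id"], "passed": r["passed"]} for r in case_results],
        "aggregates": {
            "total": total,
            "passed": passed,
            "pass_rate": passed / total if total else 0.0,
        },
        "metadata": {
            "model": model,
            "commit": commit,
            "finished_at": datetime.now(timezone.utc).isoformat(),
        },
    }
```

Since the function only needs case IDs and pass/fail flags, the same adapter works whichever framework produced them.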

Question 11

Can I customize a library pack instead of starting from scratch?

Accepted Answer

Yes. Clone any library pack, adjust the cases, rubric, or thresholds to match your workflow, and run the customized version. This is the fastest path for most teams: start from a proven pack for your connector or use case, then tailor it.
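
One way to picture cloning is copying a versioned spec and overriding only what differs. A hypothetical sketch using `dataclasses.replace` from the standard library; the `Pack` fields and the pack name are illustrative, not Corsac's schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pack:
    name: str
    version: str
    rubric: tuple[str, ...]
    pass_threshold: float


library_pack = Pack(
    name="support-refunds",  # hypothetical library pack
    version="1.2.0",
    rubric=("cites policy", "correct refund amount"),
    pass_threshold=0.90,
)

# Clone: copy every field, then override the name, rubric, and threshold.
my_pack = replace(
    library_pack,
    name="support-refunds-acme",
    rubric=library_pack.rubric + ("mentions order number",),
    pass_threshold=0.95,
)

print(library_pack.pass_threshold, my_pack.pass_threshold)  # 0.9 0.95
```

Because the original is frozen and `replace` returns a new object, the library version stays untouched and you can still compare your clone against it.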

Question 12

What models and agents can I evaluate?

Accepted Answer

Any of them. Because your runner produces the outputs and scores them before sending results to Corsac, the model, provider, and agent framework are up to you — Corsac ingests the outcome rather than calling your model directly. The same pack can compare different models, prompts, or agent versions over time.
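
A sketch of why the runner, not Corsac, determines which models you can test: the same cases and scorer run against any callable that maps a prompt to text. The two "models" below are stand-ins; in practice each would wrap whatever provider SDK or agent framework you use.

```python
from typing import Callable

Model = Callable[[str], str]


def pass_rate(model: Model, cases: list[tuple[str, str]]) -> float:
    """Score a model on (prompt, required_substring) cases."""
    hits = sum(1 for prompt, required in cases if required in model(prompt))
    return hits / len(cases)


cases = [("2+2=", "4"), ("Capital of Japan?", "Tokyo")]

# Stand-ins for two model or prompt versions; swap in real client calls.
models: dict[str, Model] = {
    "v1": lambda p: "4" if "2+2" in p else "Kyoto",
    "v2": lambda p: "4" if "2+2" in p else "Tokyo",
}

for name, model in models.items():
    print(name, pass_rate(model, cases))
# v1 0.5
# v2 1.0
```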

Question 13

What is Corsac managed review?

Accepted Answer

Managed review routes low-confidence, high-stakes, or policy-sensitive outputs to human reviewers instead of relying on automated scoring alone. The human decision (approved, rejected, with notes) is recorded against the run, giving you an auditable accountability trail for every critical AI decision.

Question 14

How do I enhance an automated eval with manual review?

Accepted Answer

Define a routing rule that sends selected cases to review: for example, every case the judge scores below your threshold, every case tagged high-severity, or a sampled percentage of all cases. Those cases go to Corsac's reviewer queue; the rest pass through automatically. You get automated coverage at scale plus human judgment exactly where it matters.
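
A routing rule of that shape might look like the sketch below. The default threshold of 4, the `high-severity` tag spelling, and the 5% sample rate are assumptions; a seeded random generator keeps the sample reproducible when the same cases are rerun in the same order.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ScoredCase:
    case_id: str
    judge_score: int                        # 1-5 from the LLM judge
    tags: set[str] = field(default_factory=set)


def route_to_review(
    cases: list[ScoredCase], threshold: int = 4, sample_rate: float = 0.05, seed: int = 0
) -> list[str]:
    """Return IDs of cases to send to human review; the rest pass through."""
    rng = random.Random(seed)
    flagged = []
    for case in cases:
        low_score = case.judge_score < threshold
        high_severity = "high-severity" in case.tags
        sampled = rng.random() < sample_rate  # drawn for every case, flagged or not
        if low_score or high_severity or sampled:
            flagged.append(case.case_id)
    return flagged
```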

Question 15

Who reviews the flagged outputs?

Accepted Answer

Corsac's managed expert network. For domain-specific work (clinical, legal, security, financial), reviews are handled by reviewers with the relevant expertise so the adjudication is credible to your stakeholders and auditors. You can also keep review in-house and use Corsac purely to record your own team's decisions.

Question 16

Is the human review decision auditable?

Accepted Answer

Yes. Every review decision is stored against the specific run with the deciding user, the decision, and free-text rationale, and is readable back through the API. That record is the point: it turns 'a human checked this' into a queryable, exportable artifact.
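
Because each decision is a structured record (run, deciding user, decision, rationale), exporting the trail is straightforward. A sketch that writes decision records to CSV; the `ReviewDecision` type and its field names are assumptions, and in practice the records would come from reading decisions back through the API.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class ReviewDecision:
    run_id: str
    decided_by: str  # the deciding user
    decision: str    # e.g. "approved" or "rejected"
    rationale: str   # free-text notes


def to_csv(decisions: list[ReviewDecision]) -> str:
    """Serialize review decisions to CSV for auditors or spreadsheets."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(ReviewDecision)])
    writer.writeheader()
    for d in decisions:
        writer.writerow(asdict(d))  # csv quoting handles commas in the rationale
    return buf.getvalue()
```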

Question 17

How is my data handled when I run an eval?

Accepted Answer

Your eval cases run in your own environment — your runner executes the model and scores the outputs. What you send to Corsac is the result: per-case metrics, aggregates, and run metadata, scoped to your organization. Runs and review decisions are only visible within your workspace, and review decisions record the acting user so the trail is accountable.

Question 18

How do I get access to Corsac?

Accepted Answer

Request access from the Get started page. You'll get a workspace and an API key to browse evals, push runs, and configure review routing. The eval library itself is browsable without an account.

Question 19

Is there an API?

Accepted Answer

Yes — a REST API over HTTPS with bearer-token auth. You can list evals, push run results, list and fetch runs, and read or write the human review decision on each run. See the API docs for endpoints and copy-pasteable cURL, TypeScript, and Python examples.
