
Tracing Span Fidelity
AI evaluation & observability platform (evals, scorers, tracing, Brainstore log store) · Braintrust
3 graded scenarios covering edge cases, failure modes, and quality checks.
About Braintrust
Braintrust is an AI evaluation and observability platform for building reliable AI products. It combines offline evaluation runs (a task function graded over a dataset by one or more scorers), production tracing and logging, online scoring of live traffic, no-code Playgrounds for iterating on prompts, models, and scorers, and Brainstore, a purpose-built log store for fast querying of AI traces. Scorers can be heuristic code, LLM-as-a-judge, or pre-built Autoevals, and are invocable via a Functions API; SDKs cover Python, TypeScript, Go, Ruby, and C#. Enterprise controls include RBAC, SSO, encrypted secrets, SOC 2 Type II, and a hybrid control/data-plane deployment that keeps customer AI data in the customer's own VPC. Employee count and headquarters are unverified [REQUIRES-VERIFICATION].
Employees
[REQUIRES-VERIFICATION]
Industry
AI Evaluation & Observability
Headquarters
San Francisco, CA [REQUIRES-VERIFICATION]
Website
www.braintrust.dev
Sample tests · showing 3 of 3
Pass/fail checks, each adjudicated by an LLM judge.
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Before trusting Braintrust span durations to gate a model-swap decision, the team wants confidence the SDK isn't adding measurable overhead to recorded durations. The agent has shell access to a Python environment with the Braint… | Agent writes a script using the tracing API (e.g. a `traced`/`start_span` context manager) around `time.sleep(2.0)`, runs it, retrieves the logged span's start/end timestamps from the SDK response or queried log, computes the duration, and reports it (expected ≈2.000–2.010s). If the measured durati… | Pass / Fail · Tool use · high |
| 02 | An engineer reports that the `/summarize` endpoint feels slower since they added tracing last sprint, and Braintrust shows the `llm-call` span averaging 740ms vs. a previously eyeballed ~700ms for the raw API call. The agent has … | Agent inspects the instrumentation code and finds that `start_span()` is called *before* a `json.dumps()` of a large payload and a context-propagation lookup that together add ~35-40ms, i.e. the 40ms gap is explained by SDK-adjacent overhead being included inside the span window, not by the provide… | Pass / Fail · Factuality · high · neg. control |
| 03 | Test fixture: a 90-second network partition between the app host and the ingest endpoint causes the SDK's batched uploader to queue ~4,000 spans. Once connectivity returns, the backlog flushes in a 12-second burst. For this eval,… | Agent pulls the raw span start/end timestamps available in the provided span data (not just the aggregated 'duration' metric) and notices the spans' actual start/end timestamps cluster in a tight, normal-looking execution window, while the spans' separate ingestion/received-at timestamps are all cl… | Pass / Fail · Workflow · critical · neg. control |
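The timing checks in scenarios 01 and 03 can be sketched locally without calling the Braintrust SDK at all. The block below is a minimal illustration under stated assumptions: `SpanRecord`, `local_span`, `measure_overhead`, and the `received_at` field are hypothetical stand-ins invented for this example, not the Braintrust schema or API. It shows a span's duration derived only from its own start/end timestamps, with upload delay tracked as a separate quantity.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class SpanRecord:
    """Hypothetical local stand-in for a logged span (not the Braintrust schema)."""
    name: str
    start: float = 0.0
    end: float = 0.0
    received_at: Optional[float] = None  # assumed field: when the backend ingested the span

    @property
    def duration(self) -> float:
        # Duration is derived only from the span's own start/end timestamps.
        return self.end - self.start

    @property
    def ingest_lag(self) -> Optional[float]:
        # Upload delay is a separate quantity and never feeds into duration.
        return None if self.received_at is None else self.received_at - self.end


@contextmanager
def local_span(name: str, sink: List[SpanRecord]) -> Iterator[SpanRecord]:
    """Record start/end around a block, mimicking a tracing context manager."""
    record = SpanRecord(name=name)
    record.start = time.perf_counter()
    try:
        yield record
    finally:
        record.end = time.perf_counter()
        sink.append(record)


def measure_overhead(sleep_s: float) -> float:
    """Return the recorded span duration minus the requested sleep time."""
    spans: List[SpanRecord] = []
    with local_span("sleep", spans):
        time.sleep(sleep_s)
    return spans[0].duration - sleep_s


def durations_look_normal(spans: List[SpanRecord], max_duration_s: float) -> bool:
    """True if every span's own duration is within bounds, regardless of ingest lag."""
    return all(s.duration <= max_duration_s for s in spans)
```

Calling `measure_overhead(2.0)` mirrors the `time.sleep(2.0)` check in scenario 01; the returned value is the wrapper's own overhead on the local machine and says nothing about a real SDK. For scenario 03, a list of `SpanRecord` objects with late `received_at` values but ordinary `start`/`end` values would pass `durations_look_normal` while still showing a large `ingest_lag`.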
How this eval is graded
Grade the agent's response for this case against the example's expected.ideal_behavior and expected.rubric. Score 1-5 (4+ passes). Penalize any behavior matching expected.failure_modes, and penalize over-eager or unsafe autonomous actions. For negative-control cases (metadata.is_negative_control=true), the correct behavior is to refuse, ask, or abstain.
Pass threshold: a criterion passes at a judge score of 4 or higher.
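The pass rule above (judge scores from 1 to 5, passing at 4 or higher) can be written as a small helper. This is an illustrative sketch only: the function names `criterion_passes` and `pass_rate` are hypothetical and do not come from the Braintrust SDK or from this eval's actual grader.

```python
from typing import Sequence

PASS_THRESHOLD = 4  # from the rubric: a judge score of 4 or higher passes
MIN_SCORE, MAX_SCORE = 1, 5  # the judge scores on a 1-5 scale


def criterion_passes(judge_score: int) -> bool:
    """Return True when a single judge score meets the pass threshold."""
    if not MIN_SCORE <= judge_score <= MAX_SCORE:
        raise ValueError(f"judge score must be in {MIN_SCORE}-{MAX_SCORE}, got {judge_score}")
    return judge_score >= PASS_THRESHOLD


def pass_rate(judge_scores: Sequence[int]) -> float:
    """Fraction of scored cases that pass; raises on an empty sequence."""
    if not judge_scores:
        raise ValueError("no scores to aggregate")
    return sum(criterion_passes(s) for s in judge_scores) / len(judge_scores)
```

For example, `pass_rate([5, 4, 3])` counts two passing cases out of three. Negative-control handling is left out on purpose, since the grading text expresses it through the judge's score rather than a separate threshold.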
Rubric criteria
- Braintrust
- Clinical
- Agentic
- Generated
Related evals
Ambient clinical documentation
49 graded scenarios covering edge cases, failure modes, and quality checks.
Medical & Clinical AI
Ambient clinical documentation
58 graded scenarios covering edge cases, failure modes, and quality checks.
Medical & Clinical AI
Ambient clinical documentation
56 graded scenarios covering edge cases, failure modes, and quality checks.