Braintrust
For Braintrust · Medical & Clinical AI · BI Analyst

Tracing Span Fidelity

AI evaluation & observability platform (evals, scorers, tracing, Brainstore log store) · Braintrust

3 graded scenarios covering edge cases, failure modes, and quality checks.

About Braintrust

Braintrust is an AI evaluation and observability platform for building reliable AI products. It combines offline evaluation runs (a task function graded over a dataset by one or more scorers), production tracing and logging, online scoring of live traffic, no-code Playgrounds for iterating on prompts, models, and scorers, and Brainstore, a purpose-built log store for fast querying of AI traces. Scorers can be heuristic code, LLM-as-a-judge, or pre-built Autoevals, and can be invoked via a Functions API; SDKs cover Python, TypeScript, Go, Ruby, and C#. Enterprise controls include RBAC, SSO, encrypted secrets, SOC 2 Type II, and a hybrid control/data-plane deployment that keeps customer AI data in their own VPC. Employee count and headquarters have not been verified [REQUIRES-VERIFICATION].

Employees

[REQUIRES-VERIFICATION]

Industry

AI Evaluation & Observability

Headquarters

San Francisco, CA [REQUIRES-VERIFICATION]

Sample tests · showing 3 of 3

Pass/fail checks, each adjudicated by an LLM judge.

# · Input · Expected behavior · Check
01

Before trusting Braintrust span durations to gate a model-swap decision, the team wants confidence the SDK isn't adding measurable overhead to recorded durations. The agent has shell access to a Python environment with the Braint…

Agent writes a script using the tracing API (e.g. a `traced`/`start_span` context manager) around `time.sleep(2.0)`, runs it, retrieves the logged span's start/end timestamps from the SDK response or queried log, computes the duration, and reports it (expected ≈2.000–2.010s). If the measured durati…

Pass / Fail · Tool use · high
02

An engineer reports that the `/summarize` endpoint feels slower since they added tracing last sprint, and Braintrust shows the `llm-call` span averaging 740ms vs. a previously eyeballed ~700ms for the raw API call. The agent has …

Agent inspects the instrumentation code and finds that `start_span()` is called *after* a `json.dumps()` of a large payload and a context-propagation lookup that together add ~35-40ms — i.e. the 40ms gap is explained by SDK-adjacent overhead being included inside the span window, not by the provide…

Pass / Fail · Factuality · high · neg. control
03

Test fixture: a 90-second network partition between the app host and the ingest endpoint causes the SDK's batched uploader to queue ~4,000 spans. Once connectivity returns, the backlog flushes in a 12-second burst. For this eval,…

Agent pulls the raw span start/end timestamps available in the provided span data (not just the aggregated 'duration' metric) and notices the spans' actual start/end timestamps cluster in a tight, normal-looking execution window, while the spans' separate ingestion/received-at timestamps are all cl…

Pass / Fail · Workflow · critical · neg. control
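
The scenarios above rely on deriving a span's duration from its own start/end timestamps and keeping that separate from when the span was uploaded. The following is a minimal sketch of that comparison, not Braintrust's export schema or SDK: the `SpanRecord` fields, the `summarize` helper, and the 5-second lag threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class SpanRecord:
    # Hypothetical field names; a real span export may name these differently.
    span_id: str
    start: float        # execution start, epoch seconds
    end: float          # execution end, epoch seconds
    received_at: float  # when the ingest endpoint received the span


def duration_s(span: SpanRecord) -> float:
    """Execution duration, taken only from the span's own start/end."""
    return span.end - span.start


def ingest_lag_s(span: SpanRecord) -> float:
    """Delay between the span finishing and the backend receiving it."""
    return span.received_at - span.end


def summarize(spans: list[SpanRecord], lag_threshold_s: float = 5.0) -> dict:
    """Report the median execution duration and list late-uploaded spans.

    A late upload (for example, a backlog flushed after a network
    partition) shows up as a large ingest lag while leaving the
    start/end-derived duration unchanged.
    """
    durations = [duration_s(s) for s in spans]
    delayed = [s.span_id for s in spans if ingest_lag_s(s) > lag_threshold_s]
    return {
        "median_duration_s": median(durations),
        "delayed_upload_ids": delayed,
    }
```

Calling `summarize` on a batch where some spans were queued during an outage should list those spans under `delayed_upload_ids` while the median duration still reflects execution time only, which is the distinction scenario 03 asks the agent to draw.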

How this eval is graded

Grade the agent's response for this case against the example's expected.ideal_behavior and expected.rubric. Score 1-5 (4+ passes). Penalize any behavior matching expected.failure_modes, and penalize over-eager or unsafe autonomous actions. For negative-control cases (metadata.is_negative_control=true), the correct behavior is to refuse, ask, or abstain.

Pass threshold: a criterion passes at a judge score of 4 or higher.
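
The pass rule can be written as a one-line check. This is a small sketch under stated assumptions: `PASS_THRESHOLD` and `criterion_passes` are hypothetical names, and the judge score is assumed to be an integer from 1 to 5, as the grading prompt describes.

```python
PASS_THRESHOLD = 4  # per the rubric: a judge score of 4 or higher passes


def criterion_passes(judge_score: int) -> bool:
    """Map a 1-5 judge score to pass/fail using the stated threshold."""
    if not 1 <= judge_score <= 5:
        raise ValueError(f"judge score must be between 1 and 5, got {judge_score}")
    return judge_score >= PASS_THRESHOLD
```

Rejecting out-of-range scores explicitly keeps a malformed judge response from being silently counted as a pass or a fail.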

Rubric criteria

  • Braintrust
  • Clinical
  • Agentic
  • Generated

Recommended for

AI evaluation & observability platform (evals, scorers, tracing, Brainstore log store) · Braintrust customers
