For LangChain · AI Platform

Tools And Tool Calling

LangChain (+ LangGraph) · LangChain

LLM Orchestration Framework — LangChain

LangChain evals — Tools & Tool Calling (relift v3 InfraRed)

About LangChain

LangChain is an open-source framework for building LLM applications and agents. It provides provider-agnostic chat-model abstractions, LCEL/Runnables composition, tools, retrieval, and the LangGraph agent runtime (Python and JS). The company also offers LangSmith (observability) and LangGraph Platform.

Employees

~200

Industry

Agent Framework

Headquarters

San Francisco, CA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Integrator defines a tool function with no type hints and a one-line docstring, then wonders why the model passes wrong-typed arguments.

Decorate with @tool and provide type hints + a clear docstring; LangChain derives the tool name, description, and args schema from them. The description is what the model uses to route, so it must describe purpose and arguments precisely.

Pass / Fail · AI Platform · high
02

The model returns two tool_calls in one AIMessage; the integrator returns a single merged ToolMessage without matching tool_call_id.

For each AIMessage.tool_calls entry, execute the tool and return one ToolMessage whose tool_call_id matches the call's id. The next model turn requires every tool_call to have a matching ToolMessage; mismatched or missing ids break the conversation contract.

Pass / Fail · AI Platform · critical
03

Two tools, get_weather (current) and get_forecast (future), have near-identical descriptions, so the model routes 'will it rain tomorrow?' to the wrong tool.

Write distinct, scope-bounding descriptions: get_weather 'current conditions only; for future dates use get_forecast'. The model routes from descriptions, so disambiguation must live there, not in a system-prompt tie-breaker that biases against one tool.

Pass / Fail · AI Platform · medium
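To make sample test 01 concrete, here is a minimal standard-library sketch of why type hints and docstrings matter. The helper describe_tool is hypothetical and only approximates the kind of name, description, and argument schema a tool decorator can derive from a function; it is not LangChain's implementation. In a LangChain project the typed function would instead be decorated with @tool.

```python
import inspect
from typing import get_type_hints

# Hypothetical mapping from Python types to schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_tool(fn):
    """Rough stand-in for schema derivation; NOT LangChain's own code."""
    hints = get_type_hints(fn)
    args = {}
    for name in inspect.signature(fn).parameters:
        # With no type hint there is nothing to tell the model what to send.
        args[name] = JSON_TYPES.get(hints.get(name), "unspecified")
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "args": args,
    }


def multiply_untyped(a, b):
    """Multiply."""
    return a * b


def multiply_typed(a: int, b: int) -> int:
    """Multiply two integers and return the product.

    Args:
        a: First factor, a whole number.
        b: Second factor, a whole number.
    """
    return a * b


untyped = describe_tool(multiply_untyped)
typed = describe_tool(multiply_typed)
print(untyped["args"])  # argument types are unknown
print(typed["args"])    # argument types are explicit
```

The untyped version exposes only argument names and a one-word description, which gives the model little to go on when choosing argument types.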
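The contract in sample test 02 can be sketched without any dependencies. In this illustration plain dicts stand in for AIMessage.tool_calls entries and ToolMessage objects, using the field names mentioned in the test (id, tool_call_id); the tool functions, the TOOLS registry, and the unanswered_ids check are hypothetical names introduced here.

```python
def get_weather(city: str) -> str:
    """Placeholder tool; returns a canned string."""
    return f"Sunny in {city}"


def get_time(city: str) -> str:
    """Placeholder tool; returns a canned string."""
    return f"12:00 in {city}"


# Hypothetical registry mapping tool names to callables.
TOOLS = {"get_weather": get_weather, "get_time": get_time}


def run_tool_calls(tool_calls):
    """Execute every call and return one tool message per call."""
    messages = []
    for call in tool_calls:
        result = TOOLS[call["name"]](**call["args"])
        # Each reply carries the id of the call it answers.
        messages.append(
            {"role": "tool", "content": result, "tool_call_id": call["id"]}
        )
    return messages


def unanswered_ids(tool_calls, tool_messages):
    """Return ids of tool calls that have no matching tool message."""
    answered = {m.get("tool_call_id") for m in tool_messages}
    return [c["id"] for c in tool_calls if c["id"] not in answered]


tool_calls = [
    {"name": "get_weather", "args": {"city": "Paris"}, "id": "call_1"},
    {"name": "get_time", "args": {"city": "Paris"}, "id": "call_2"},
]

tool_messages = run_tool_calls(tool_calls)

# The failing pattern from the test: one merged reply with no id.
merged = [{"role": "tool", "content": "Sunny in Paris, 12:00 in Paris"}]

print(unanswered_ids(tool_calls, tool_messages))
print(unanswered_ids(tool_calls, merged))
```

Running a check like unanswered_ids before the next model turn is one way to catch the merged-message mistake early, rather than discovering it as a provider error.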
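For sample test 03, the sketch below contrasts near-identical descriptions with scope-bounding ones. The get_weather text comes from the expected behavior above; the get_forecast text and the ambiguous pair are assumptions written for illustration. The lacks_cross_reference check is a hypothetical heuristic, not part of LangChain or of this eval: it only flags descriptions that never point to a sibling tool.

```python
# Assumed example of the failing setup: descriptions that barely differ.
AMBIGUOUS = {
    "get_weather": "Get weather information for a city.",
    "get_forecast": "Get weather information for a location.",
}

# Scope-bounding descriptions; each one states its limits and names
# the sibling tool to use instead.
SCOPED = {
    "get_weather": (
        "Current conditions only; for future dates use get_forecast."
    ),
    "get_forecast": (
        "Forecast for future dates only; "
        "for current conditions use get_weather."
    ),
}


def lacks_cross_reference(descriptions):
    """Return tool names whose description never names a sibling tool."""
    flagged = []
    for name, text in descriptions.items():
        siblings = [other for other in descriptions if other != name]
        if not any(sibling in text for sibling in siblings):
            flagged.append(name)
    return flagged


print(lacks_cross_reference(AMBIGUOUS))
print(lacks_cross_reference(SCOPED))
```

A cross-reference check is deliberately crude. It cannot judge whether a description is accurate, only whether the disambiguation lives in the description at all.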

How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. A response passes when the mean score across rubric criteria is at least 4.0 and no single criterion scores below 3.
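The pass rule can be written as a short function. This is a sketch under assumptions: it reads the rule as a mean taken across criterion scores, and the function name, parameter names, and criterion keys are hypothetical.

```python
from statistics import mean


def passes(scores, mean_threshold=4.0, floor=3):
    """Apply the pass rule: mean >= threshold and no score below floor.

    `scores` maps a criterion name to its numeric score.
    """
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# High mean, every criterion at or above the floor.
strong = {"langchain": 5, "ai_platform": 4, "tools_and_tool_calling": 4}

# Mean is exactly 4.0, but one criterion falls below the floor.
uneven = {"langchain": 5, "ai_platform": 5, "tools_and_tool_calling": 2}

# No criterion below the floor, but the mean is under 4.0.
mediocre = {"langchain": 4, "ai_platform": 4, "tools_and_tool_calling": 3}

print(passes(strong), passes(uneven), passes(mediocre))
```

The uneven case shows why the floor matters: a strong average cannot compensate for one failing criterion.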

Rubric criteria

  • LangChain
  • AI Platform
  • Tools And Tool Calling

Recommended for

LangChain (+ LangGraph) · LangChain customers
