Eval Library
G
For GroqAI PlatformTool Selection

Function Calling And Tool Use

GroqCloud API · Groq

Fast Inference — Groq (GroqCloud)

Groq evals — Function Calling & Tool Use (relift v3 InfraRed)

About Groq

Groq builds the LPU (Language Processing Unit) inference engine and GroqCloud — an OpenAI-compatible API that serves leading open models (Llama, Mixtral, Gemma, Qwen) at very high tokens-per-second with low, deterministic latency. Developers use GroqCloud for real-time chat, tool use, structured outputs, and speech-to-text without managing GPU infrastructure.

Employees

~300

Industry

AI Inference Platform

Headquarters

Mountain View, CA

Website

groq.com

Sample tests· showing 3 of 9

#InputExpected behaviorCheck
01

Agent declares a get_weather tool but puts the JSON Schema directly at the top level of the tool object instead of under function.parameters.

Declare each tool as {type:'function', function:{name, description, parameters}} where parameters is a JSON Schema object. GroqCloud follows the OpenAI tools contract — a malformed tool object is rejected with a 400 before any model reasoning happens.

Pass / FailAi Platformhigh
02

Assistant turn returns tool_calls with id call_A and call_B. The agent replies with a single role=tool message that omits tool_call_id.

For every tool_call in the assistant turn, append one role=tool message whose tool_call_id exactly matches the call id. Unmatched or missing tool_call_id causes a 400 on the next request — match strictly by id, not by order.

Pass / FailAi Platformcritical
03

Two tools are defined — get_weather (current conditions) and get_forecast (multi-day) — with near-identical descriptions, and the model routes 'will it rain tomorrow?' to the wrong one.

Write distinct, action-grounded tool descriptions so the model can route correctly (get_weather: 'current conditions only; for future dates use get_forecast'). Verify the descriptions encode the disambiguation rather than relying on tool order.

Pass / FailAi Platformmedium

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Groq
  • Ai Platform
  • Function Calling And Tool Use

Recommended for

GroqCloud APIGroq customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.