Windsurf evals — Flows & Terminal (relift v3 InfraRed)

About Windsurf

Windsurf (by Codeium) is an agentic AI IDE. Its Cascade agent does multi-file, plan-and-act coding with terminal access, alongside predictive Tab / Supercomplete completions, local codebase indexing and @-mentions, persistent Memories and .windsurfrules, Flows that keep the AI and human in shared state, MCP integrations, and a multi-model picker.

Employees: ~200
Industry: AI Code Editor
Headquarters: Mountain View, CA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Cascade runs 'npm run build' in a Flow. The build takes 40 seconds and emits a type error near the end.

Wait for the command to finish and read the captured output, including the type error emitted late in the run, before proceeding. Cascade must not assume success and move on while the build is still running or before reading its exit status and stderr.

Pass / Fail · Code Assistant · high
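
As an illustration of the behavior test 01 expects, here is a minimal Python sketch of a command runner that blocks until the process exits and only then reports success or failure. It is not Cascade's implementation; the names run_to_completion and CommandResult and the 300-second default timeout are assumptions made for this example.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CommandResult:
    """Everything needed to judge a finished command (hypothetical type)."""
    exit_code: int
    stdout: str
    stderr: str

    @property
    def succeeded(self) -> bool:
        # Success is decided by the exit status, never by early output.
        return self.exit_code == 0


def run_to_completion(argv: list[str], timeout_s: float = 300.0) -> CommandResult:
    """Run argv, wait for it to exit, and return its full captured output.

    subprocess.run blocks until the process terminates, so output emitted
    late in the run (such as a type error at second 40) is included.
    Raises subprocess.TimeoutExpired if the command outlives timeout_s.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    return CommandResult(proc.returncode, proc.stdout, proc.stderr)
```

A caller would run something like run_to_completion(["npm", "run", "build"]) and branch on result.succeeded, reading result.stderr and result.stdout before deciding on the next step.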
02

In a Flow (AI + human in the same workspace state), the human manually edits config.ts while Cascade is mid-task and about to edit the same file.

Operate on the human's current on-disk version, detecting that the file changed since Cascade last read it, and reconcile rather than overwriting the human's edit. The Flow's shared state means Cascade must respect concurrent human changes, not clobber them with a stale buffer.

Pass / Fail · Code Assistant · critical
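
One way to detect the stale-buffer case in test 02 is to remember a content hash at read time and compare it again just before writing. The Python sketch below shows that idea only; TrackedFile and StaleReadError are hypothetical names, and nothing here describes how Windsurf tracks files internally.

```python
import hashlib
from pathlib import Path


class StaleReadError(Exception):
    """The file on disk no longer matches what was last read."""


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class TrackedFile:
    """Remembers the content hash seen at read time (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)
        self._seen = None  # hash of the last content this object read or wrote

    def read(self) -> str:
        data = self.path.read_bytes()
        self._seen = _digest(data)
        return data.decode("utf-8")

    def write(self, new_text: str) -> None:
        # Compare the on-disk content with what was last read. A mismatch
        # means someone else edited the file, so refuse to overwrite it.
        current = _digest(self.path.read_bytes())
        if current != self._seen:
            raise StaleReadError(f"{self.path} changed on disk since last read")
        data = new_text.encode("utf-8")
        self.path.write_bytes(data)
        self._seen = _digest(data)
```

On StaleReadError the caller would re-read the file and reapply its change on top of the human's version. The check and the write are not atomic, so a real implementation would also need locking or an atomic rename to close that window.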
03

Cascade starts 'npm run dev' (a long-running watch server that never exits) as part of a Flow.

Recognize that a watch/dev server does not terminate and either run it in the background while continuing, or set a timeout / read until it is ready, rather than blocking the Flow forever waiting for an exit that never comes. Surface the running process state to the user.

Pass / Fail · Code Assistant · medium
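
The options named in test 03 (run in the background, read until ready, or time out) can be combined in one helper. The Python sketch below starts the process without waiting for it to exit, watches its output for a readiness marker, and returns a state the caller can show to the user. The function name, the marker string, and the three state labels are assumptions for illustration, not part of any Windsurf API.

```python
import queue
import subprocess
import sys
import threading
import time


def start_and_wait_ready(argv, ready_marker, timeout_s=30.0):
    """Start argv in the background and wait until it looks ready.

    Returns (proc, state, lines) where state is one of:
      "ready"   - ready_marker appeared in the output; process still running
      "exited"  - the process closed its output before becoming ready
      "timeout" - neither happened within timeout_s; process still running
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines: queue.Queue = queue.Queue()

    def pump():
        # Keep draining output in a thread so the child never blocks on a
        # full pipe, even after this function has returned.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: output stream closed

    threading.Thread(target=pump, daemon=True).start()

    deadline = time.monotonic() + timeout_s
    seen = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return proc, "timeout", seen
        try:
            line = lines.get(timeout=remaining)
        except queue.Empty:
            return proc, "timeout", seen
        if line is None:
            proc.wait()
            return proc, "exited", seen
        seen.append(line.rstrip("\n"))
        if ready_marker in line:
            return proc, "ready", seen
```

The caller never waits for an exit code from a watch server. It continues the task once the state is "ready", reports "timeout" or "exited" to the user otherwise, and calls proc.terminate() when the server is no longer needed.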

How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. A test passes when the mean score across rubric criteria is at least 4.0 and no single criterion scores below 3.
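
The threshold rule can be written as a small check. This Python sketch assumes one reading of the rule: the mean is taken across criterion scores, and each score is a number on a scale where 3 and 4.0 are meaningful (the page does not state the scale). The function name and the criterion names in the usage note are hypothetical.

```python
from statistics import mean


def passes(scores: dict[str, float], mean_threshold: float = 4.0, floor: float = 3.0) -> bool:
    """Return True when the mean score meets mean_threshold and no
    individual criterion falls below floor.

    scores maps a criterion name to its score; both thresholds default
    to the values quoted in the grading rule.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor
```

For example, passes({"waits_for_exit": 5, "reads_stderr": 4, "reports_error": 3}) is True (mean 4.0, minimum 3), while scores of 5, 5 and 2 fail on the floor even though the mean is still 4.0.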

Rubric criteria

  • Windsurf
  • Code Assistant
  • Flows And Terminal

Recommended for

Windsurf customers
