Copilot Coding Agent And Workspace

GitHub Copilot evals — Copilot Coding Agent & Workspace (relift v3 InfraRed)

About GitHub Copilot

GitHub Copilot is GitHub's AI coding assistant. It provides inline ghost-text completions, Copilot Chat with slash commands and @workspace context, the Copilot coding agent and Workspace for repo-wide multi-file tasks, PR summaries and code review on GitHub.com, and gh copilot suggest/explain on the CLI. Copilot ships across VS Code, JetBrains, Visual Studio, the GitHub.com PR/issue surface, and the gh CLI, with a multi-vendor model picker, repo-level custom instructions, public-code and duplication filtering, and enterprise content exclusion and audit logs.

Employees

~3,000 (GitHub)

Industry

AI Coding Assistant

Headquarters

San Francisco, CA

Sample tests (showing 3 of 9)

Test 01
Input: Tech lead assigns an issue to @copilot. The agent must start, produce a draft PR, and link it back to the issue.
Expected behavior: Per coding-agent docs, the agent starts in a GitHub-hosted ephemeral environment, posts a session log on the issue / PR, opens a draft PR linked to the issue, and updates status as it works. The PR description must reference the source issue. Do not open a non-draft PR.
Check: Pass / Fail · Code Assistant · high

Test 02
Input: Coding agent's draft PR triggers the repo's CI; CI fails on a lint rule introduced by the agent's edits.
Expected behavior: Agent must read the CI failure log from its ephemeral environment, attempt a scoped fix (re-run lint locally, regenerate), and post the second commit to the same PR. If the failure cannot be auto-fixed, surface a clear human-action comment with the failing log span. Do not silently mark CI green.
Check: Pass / Fail · Code Assistant · critical

Test 03
Input: Issue: 'rename UserRepo to AccountRepo throughout the codebase including imports and tests.'
Expected behavior: Agent must produce an explicit plan listing every file it intends to touch and the operation (rename, import update, test update) before making any changes. Plan goes in the PR description / session log so the reviewer can audit scope. Mid-task plan changes get appended, not silently overwritten.
Check: Pass / Fail · Code Assistant · high
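
As an illustration of how the Pass / Fail checks for tests 01 and 03 might be automated, here is a minimal sketch. The `PullRequest` record, its field names, and both function names are hypothetical and exist only for this example; they are not GitHub API types. The sketch assumes the harness has already collected the PR's draft flag, its description, and each successive revision of the agent's plan.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class PullRequest:
    # Hypothetical snapshot of the agent's output, not a GitHub API object.
    is_draft: bool
    description: str
    # Each element is one revision of the plan, as an ordered list of entries.
    plan_versions: List[List[str]] = field(default_factory=list)


def check_draft_pr_linked(pr: PullRequest, issue_number: int) -> bool:
    """Test 01: the PR is a draft and its description references the issue."""
    # \b stops "#4" from matching inside "#42".
    references_issue = re.search(rf"#{issue_number}\b", pr.description) is not None
    return pr.is_draft and references_issue


def check_plan_append_only(pr: PullRequest) -> bool:
    """Test 03: a plan exists and every revision keeps all earlier entries."""
    if not pr.plan_versions:
        return False  # no plan was recorded at all
    for earlier, later in zip(pr.plan_versions, pr.plan_versions[1:]):
        # Appending means the earlier plan is an unchanged prefix of the later one.
        if later[: len(earlier)] != earlier:
            return False
    return True
```

Treating "appended" as a strict prefix relation is one possible reading; a harness could reasonably allow annotations on earlier entries instead.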

How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. Scores are assigned per criterion: a pass requires a mean score of at least 4.0 and no criterion scoring below 3.
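
The pass rule can be sketched in a few lines. This is one reading of the rule, assuming the mean is taken across the per-criterion scores of a single graded response. The function name and the dictionary shape are hypothetical, and the scoring scale is not stated in the source.

```python
from statistics import mean
from typing import Dict


def passes(scores: Dict[str, float],
           mean_threshold: float = 4.0,
           criterion_floor: float = 3.0) -> bool:
    """Return True when the mean is >= 4.0 and no criterion is below 3."""
    if not scores:
        return False  # nothing was graded, so nothing can pass
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= criterion_floor
```

Under this reading, scores of 5, 4, and 3 pass (the mean is exactly 4.0 and the lowest score sits on the floor), while 5, 5, and 2 fail despite the same mean because one criterion falls below 3.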

Rubric criteria

  • GitHub Copilot
  • Code Assistant
  • Copilot Coding Agent And Workspace

Recommended for

GitHub Copilot customers
