Evals for GitHub Copilot
8 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for GitHub Copilot AI products.
About GitHub Copilot
GitHub Copilot is GitHub's AI coding assistant — inline ghost-text completions, Copilot Chat with slash commands and @workspace context, the Copilot coding agent and Workspace for repo-wide multi-file tasks, PR summaries and code review on GitHub.com, and gh copilot suggest/explain on the CLI. Copilot ships across VS Code, JetBrains, Visual Studio, the GitHub.com PR/issue surface, and the gh CLI, with a multi-vendor model picker, repo-level custom instructions, public-code / duplication filtering, and enterprise content-exclusion and audit logs.
Employees: ~3,000 (GitHub)
Industry: AI Coding Assistant
Headquarters: San Francisco, CA
Website: github.com/features/copilot
Available eval packs for GitHub Copilot
8 packs ready to run.
Copilot Chat in the IDE
GitHub Copilot evals — Copilot Chat in the IDE (relift v3 InfraRed)
Copilot CLI (gh copilot)
GitHub Copilot evals — Copilot CLI (gh copilot) (relift v3 InfraRed)
Copilot Coding Agent and Workspace
GitHub Copilot evals — Copilot Coding Agent & Workspace (relift v3 InfraRed)
Copilot in GitHub.com and PR Review
GitHub Copilot evals — Copilot in GitHub.com & PR Review (relift v3 InfraRed)
Inline Completions and Ghost Text
GitHub Copilot evals — Inline Completions & Ghost Text (relift v3 InfraRed)
Knowledge and Context Selection
GitHub Copilot evals — Knowledge & Context Selection (relift v3 InfraRed)
Model Picker and Customization
GitHub Copilot evals — Model Picker & Customization (relift v3 InfraRed)
Safety, Privacy, and Governance
PII Leakage
GitHub Copilot evals — Safety, Privacy & Governance (relift v3 InfraRed)
Why eval GitHub Copilot AI
GitHub Copilot's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.
The Corsac eval library for GitHub Copilot measures four dimensions teams care about most when deploying code assistant agents:
- Adversarial robustness — does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality — does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates — does it escalate or refuse when it should, and only then?
- Operator quality — does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against GitHub Copilot's public product surface and runnable in Corsac with your own data.