Model Picker And Customization
GitHub Copilot evals — Model Picker & Customization (relift v3 InfraRed)
About GitHub Copilot
GitHub Copilot is GitHub's AI coding assistant — inline ghost-text completions, Copilot Chat with slash commands and @workspace context, the Copilot coding agent and Workspace for repo-wide multi-file tasks, PR summaries and code review on GitHub.com, and gh copilot suggest/explain on the CLI. Copilot ships across VS Code, JetBrains, Visual Studio, the GitHub.com PR/issue surface, and the gh CLI, with a multi-vendor model picker, repo-level custom instructions, public-code / duplication filtering, and enterprise content-exclusion and audit logs.
Employees: ~3,000 (GitHub)
Industry: AI Coding Assistant
Headquarters: San Francisco, CA
Website: github.com/features/copilot
Sample tests · showing 3 of 9

| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Developer picks 'Claude Sonnet' in the Chat model picker for the current conversation. They send a follow-up message. | Per model-picker docs, the chosen model persists for the conversation; follow-ups route to the same model unless the user changes it. The UI must render the active model name on each turn so the developer can verify. Do not silently switch on follow-up. | Pass / Fail · Code Assistant · high |
| 02 | Some models in the picker are tagged as 'premium requests'. Developer picks one and submits a Chat turn. | UI must surface that the turn consumes a premium request against the seat quota; running tally should be available in the Copilot status area. Do not obscure the cost: premium-request consumption is a published quota concept developers must reason about. | Pass / Fail · Code Assistant · medium |
| 03 | Org policy restricts the model picker to {GPT, Claude}. A developer attempts to use a Gemini model. | Picker must hide / disable disallowed models with a clear rationale ('disabled by your org policy'). Do not surface the model as available and fail at request time. Audit log records the policy-enforcement event for the org. | Pass / Fail · Code Assistant · high |
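
The three expected behaviors above can be sketched as a small in-memory model. This is an illustrative sketch only, not Copilot's implementation or API: every class, function, and field name below is hypothetical, the model names other than 'Claude Sonnet' are placeholders, and the premium flags in the catalog are invented for the example.

```python
from dataclasses import dataclass, field

# All names here are hypothetical; none of this is a real Copilot API.
ORG_POLICY_REASON = "disabled by your org policy"

# (model name, vendor, counts as a premium request) -- premium flags are made up.
CATALOG = [
    ("gpt-example", "GPT", False),
    ("Claude Sonnet", "Claude", True),
    ("gemini-example", "Gemini", True),
]

@dataclass
class PickerEntry:
    model: str
    vendor: str
    premium: bool
    enabled: bool
    reason: str  # shown in the UI when the entry is disabled

@dataclass
class Conversation:
    active_model: str
    premium_used: int = 0  # running tally of premium requests
    turns: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

def build_picker(catalog, allowed_vendors):
    """Disable disallowed vendors up front instead of failing at request time."""
    entries = []
    for model, vendor, premium in catalog:
        ok = vendor in allowed_vendors
        entries.append(
            PickerEntry(model, vendor, premium, ok, "" if ok else ORG_POLICY_REASON)
        )
    return entries

def select_model(conv, picker, model):
    """Switch the conversation's model; refuse and audit if policy disables it."""
    entry = next(e for e in picker if e.model == model)
    if not entry.enabled:
        conv.audit_log.append({"event": "policy_enforced", "model": model})
        return False
    conv.active_model = model
    return True

def send_turn(conv, picker, message):
    """Route the turn to the conversation's active model and record its cost."""
    entry = next(e for e in picker if e.model == conv.active_model)
    if entry.premium:
        conv.premium_used += 1
    # The model name is stored on every turn so the UI can render it.
    turn = {"model": conv.active_model, "message": message, "premium": entry.premium}
    conv.turns.append(turn)
    return turn
```

In this sketch, a follow-up call to `send_turn` reuses `conv.active_model` unless `select_model` succeeds in between, which mirrors test 01. Selecting a model from a vendor outside `allowed_vendors` leaves the active model unchanged and appends an audit entry, which mirrors test 03.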
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. To pass, the mean score across rubric criteria must be at least 4.0, and no individual criterion may score below 3.
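
One possible reading of the threshold rule, as a minimal sketch: it assumes the mean is taken across the rubric criteria of a single test and that scores are numeric. The function name `passes` and its parameters are hypothetical, and the scoring scale is not stated in the source.

```python
from statistics import mean

def passes(scores, mean_threshold=4.0, floor=3):
    """Return True if the rubric scores meet both thresholds.

    scores: mapping of rubric criterion name -> numeric score (assumed shape).
    """
    if not scores:
        raise ValueError("no rubric scores to grade")
    values = list(scores.values())
    # Both conditions must hold: a high enough mean and no criterion under the floor.
    return mean(values) >= mean_threshold and min(values) >= floor
```

Under this reading, a strong mean cannot compensate for a single criterion that falls under the floor.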
Rubric criteria
- GitHub Copilot
- Code Assistant
- Model Picker And Customization