Models CLI now auto-generates test cases and an evaluator
You can now automatically generate robust test suites and an evaluator for your prompts using the GitHub CLI and the PromptPex methodology. The new `generate` command analyzes your `.prompt.yml` file and automatically creates test cases and a corresponding evaluator to assess prompt behavior across a wide range of scenarios and edge cases. This helps ensure your prompts are reliable and makes it easier to evaluate output correctness consistently.
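For context, a prompt file pairs a model with templated messages. The sketch below is illustrative only: the name, description, and message content are hypothetical placeholders, not output of the CLI.

```yaml
# Hypothetical .prompt.yml example; field values are placeholders.
name: Text Summarizer
description: Summarizes the input text in a single sentence.
model: openai/gpt-4o
messages:
  - role: system
    content: You are a concise summarizer. Respond with exactly one sentence.
  - role: user
    content: "{{input}}"
```

Running `generate` against a file like this produces test cases that exercise the prompt's stated intent, plus an evaluator for scoring responses.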
The generate command is based on PromptPex, a Microsoft Research framework for systematic prompt testing. PromptPex follows a structured approach:
- Intent analysis: Understanding what your prompt is designed to achieve
- Input specification: Defining the expected input format and constraints
- Output rules: Establishing what constitutes correct output
- Inverse output rules: Generating negated output rules to test the prompt with invalid inputs
- Test generation: Creating diverse test cases that cover happy paths and edge cases
- Evaluator generation: Building an evaluator that scores prompt responses based on the output rules
Get started with:
```shell
gh models generate my_example_prompt_file.prompt.yml
```
You can customize test generation with advanced options, including:
- Setting the effort level (`--effort min|low|medium|high`) to control test coverage and resource usage
- Using a specific model for groundtruth generation (`--groundtruth-model`)
- Disabling groundtruth generation (`--groundtruth-model "none"`)
- Managing sessions with `--session-file`
- Adding custom instructions for each test generation phase (such as `--instruction-intent`, `--instruction-inputspec`, `--instruction-outputrules`, `--instruction-inverseoutputrules`, and `--instruction-tests`)
For example:
```shell
gh models generate --effort high --groundtruth-model "openai/gpt-4.1" --instruction-intent "Focus on edge cases" my_prompt.prompt.yml
```
Once generation is complete, you can run your new tests with:
```shell
gh models eval my_example_prompt_file.prompt.yml
```
These updates make it easier to automate test creation, evaluate prompt performance, and improve quality with less manual effort.
Learn more in our open-source CLI README or our documentation, or join the community discussion to help guide our roadmap!