You can now automatically generate robust test suites and an evaluator for your prompts using the GitHub CLI and the PromptPex methodology. The new generate command analyzes your .prompt.yml file and creates test cases plus a corresponding evaluator to assess prompt behavior across a wide range of scenarios and edge cases. This helps ensure your prompts are reliable and makes it easier to evaluate output correctness consistently.

The generate command is based on PromptPex, a Microsoft Research framework for systematic prompt testing. PromptPex follows a structured approach:

  • Intent analysis: Understanding what your prompt is designed to achieve
  • Input specification: Defining the expected input format and constraints
  • Output rules: Establishing what constitutes correct output
  • Inverse output rules: Generating negated output rules to test the prompt with invalid inputs
  • Test generation: Creating diverse test cases that cover happy paths and edge cases
  • Evaluator generation: Building an evaluator that scores prompt responses based on the output rules
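To make the last step concrete, a rule-based evaluator conceptually scores a response by checking it against each generated output rule. The sketch below is a simplified, hypothetical illustration of that idea; the rules, function name, and scoring scheme are illustrative and not the actual PromptPex implementation:

```python
# Illustrative sketch only: a toy rule-based evaluator in the spirit of
# PromptPex's output rules. The rules and scoring scheme are hypothetical.
def evaluate(response: str, rules: list) -> float:
    """Score a response as the fraction of output rules it satisfies."""
    if not rules:
        return 0.0
    passed = sum(1 for rule in rules if rule(response))
    return passed / len(rules)

# Example output rules for a prompt that must answer in one plain-text sentence.
rules = [
    lambda r: len(r.strip()) > 0,             # response is non-empty
    lambda r: r.strip().count(".") <= 1,      # at most one sentence
    lambda r: not r.strip().startswith("{"),  # not raw JSON
]

score = evaluate("The capital of France is Paris.", rules)  # satisfies all rules
```

An inverse output rule would be the negation of one of these checks, used to generate test inputs that probe how the prompt behaves when a rule is violated.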

Get started with:

gh models generate my_example_prompt_file.prompt.yml
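For reference, a minimal .prompt.yml might look something like the following. This is a hedged sketch: the prompt name, model, and message contents are made up, and you should check the GitHub Models documentation for the full file schema.

```yaml
# Hypothetical minimal prompt file; verify field names against the docs.
name: Summarizer
description: Summarize the given text in one sentence.
model: openai/gpt-4o
messages:
  - role: system
    content: You are a concise summarizer.
  - role: user
    content: "Summarize this: {{input}}"
```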

You can customize test generation with advanced options, including:

  • Setting the effort level (--effort min|low|medium|high) to control test coverage and resource usage
  • Using a specific model for groundtruth generation (--groundtruth-model)
  • Disabling groundtruth generation (--groundtruth-model "none")
  • Managing sessions with --session-file
  • Adding custom instructions for each test generation phase (such as --instruction-intent, --instruction-inputspec, --instruction-outputrules, --instruction-inverseoutputrules, and --instruction-tests)

For example:

gh models generate --effort high --groundtruth-model "openai/gpt-4.1" --instruction-intent "Focus on edge cases" my_prompt.prompt.yml

Once generation is complete, you can run your new tests with:

gh models eval my_example_prompt_file.prompt.yml

These updates make it easier to automate test creation, evaluate prompt performance, and improve quality with less manual effort.

Learn more in our open-source CLI README and our documentation, or join the community discussion to help guide our roadmap!