You can now automatically generate robust test suites and an evaluator for your prompts using the GitHub CLI and the PromptPex methodology. The new generate command analyzes your .prompt.yml file and creates test cases plus a corresponding evaluator to assess prompt behavior across a wide range of scenarios and edge cases. This helps ensure your prompts are reliable and makes it easier to evaluate output correctness consistently.

The generate command is based on PromptPex, a Microsoft Research framework for systematic prompt testing. PromptPex follows a structured approach:

  • Intent analysis: Understanding what your prompt is designed to achieve
  • Input specification: Defining the expected input format and constraints
  • Output rules: Establishing what constitutes correct output
  • Inverse output rules: Generating negated output rules to test the prompt with invalid inputs
  • Test generation: Creating diverse test cases that cover happy paths and edge cases
  • Evaluator generation: Building an evaluator that scores prompt responses based on the output rules
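To make the last step concrete, a rule-based evaluator conceptually scores a response by checking it against each generated output rule. The sketch below is a simplified, hypothetical illustration of that idea; the rules, function name, and scoring scheme are illustrative and not the actual PromptPex implementation:

```python
# Illustrative sketch only: a toy rule-based evaluator in the spirit of
# PromptPex's output rules. The rules and scoring scheme are hypothetical.
def evaluate(response: str, rules: list) -> float:
    """Score a response as the fraction of output rules it satisfies."""
    if not rules:
        return 0.0
    passed = sum(1 for rule in rules if rule(response))
    return passed / len(rules)

# Example output rules for a prompt that must answer in one plain-text sentence.
rules = [
    lambda r: len(r.strip()) > 0,             # response is non-empty
    lambda r: r.strip().count(".") <= 1,      # at most one sentence
    lambda r: not r.strip().startswith("{"),  # not raw JSON
]

score = evaluate("The capital of France is Paris.", rules)  # satisfies all rules
```

An inverse output rule would be the negation of one of these checks, used to generate test inputs that probe how the prompt behaves when a rule is violated.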

Get started with:

gh models generate my_example_prompt_file.prompt.yml
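For reference, a minimal .prompt.yml might look something like the following. This is a hedged sketch: the prompt name, model, and message contents are made up, and you should check the GitHub Models documentation for the full file schema.

```yaml
# Hypothetical minimal prompt file; verify field names against the docs.
name: Summarizer
description: Summarize the given text in one sentence.
model: openai/gpt-4o
messages:
  - role: system
    content: You are a concise summarizer.
  - role: user
    content: "Summarize this: {{input}}"
```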

You can customize test generation with advanced options, including:

  • Setting the effort level (--effort min|low|medium|high) to control test coverage and resource usage
  • Using a specific model for groundtruth generation (--groundtruth-model)
  • Disabling groundtruth generation (--groundtruth-model "none")
  • Managing sessions with --session-file
  • Adding custom instructions for each test generation phase (such as --instruction-intent, --instruction-inputspec, --instruction-outputrules, --instruction-inverseoutputrules, and --instruction-tests)

For example:

gh models generate --effort high --groundtruth-model "openai/gpt-4.1" --instruction-intent "Focus on edge cases" my_prompt.prompt.yml

Once generation is complete, you can run your new tests with:

gh models eval my_example_prompt_file.prompt.yml

These updates make it easier to automate test creation, evaluate prompt performance, and improve quality with less manual effort.

Learn more in our open-source CLI README and our documentation, or join the community discussion to help guide our roadmap!