Language models and prompts are magic in a world of deterministic software. As prompts change and use cases evolve, it can be hard to stay confident in a model's output. Building a library of example inputs for your model+prompt combination, with annotated expected outputs, is critical to evolving the prompt in a controlled way: it ensures quality doesn't drift or regress as you try to improve overall performance.
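One way to put this into practice is a small eval harness: a set of annotated examples that every prompt revision runs against before it ships. The sketch below is a minimal illustration under assumed names, not a prescription; `call_model` is a stand-in for whichever client you actually use, and the pass/fail check here is a deliberately simple substring match.

```python
from dataclasses import dataclass
from typing import Callable


# A single annotated example: the input you feed the prompt, plus the
# behaviour you expect back (here, substrings the output must contain).
@dataclass
class Example:
    input_text: str
    must_contain: list[str]


def call_model(prompt_template: str, input_text: str) -> str:
    # Placeholder that echoes the input so the harness runs end to end.
    # Replace this with a real call to your model of choice.
    return input_text


def run_evals(
    prompt_template: str,
    examples: list[Example],
    model: Callable[[str, str], str] = call_model,
) -> bool:
    """Run every annotated example through the model+prompt and report failures."""
    failures = []
    for ex in examples:
        output = model(prompt_template, ex.input_text)
        missing = [s for s in ex.must_contain if s not in output]
        if missing:
            failures.append((ex.input_text, missing))

    for input_text, missing in failures:
        print(f"FAIL: {input_text!r} -> output missing {missing}")
    print(f"{len(examples) - len(failures)}/{len(examples)} examples passed")
    return not failures


if __name__ == "__main__":
    examples = [
        Example("Summarize: the meeting moved to Tuesday.", ["Tuesday"]),
        Example("Summarize: the budget was approved at $10k.", ["$10k"]),
    ]
    PROMPT = "Summarize the following in one sentence:\n{input}"
    run_evals(PROMPT, examples)
```

Run against a real client, a harness like this can sit in CI so that a prompt tweak which breaks an existing example fails loudly instead of silently degrading your outputs.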