
Testing prompts, metrics, and iterative improvement.
Systematic methods for evaluating prompts and pipelines: metric definitions, test harnesses, A/B testing, human annotation, and debugging workflows for iteratively improving outputs. Emphasizes a scientific approach to prompt iteration.
Topics
Choosing metrics aligned with task goals (see the metric sketch after this list).
Building test sets and gold-standard labels.
Running controlled experiments and interpreting results.
Human evaluation protocols and inter-rater reliability basics (a Cohen's kappa sketch follows the list).
Debugging prompt pipelines: isolation, reproduction, and fixes (a logging sketch follows the list).
Documentation and changelogs for iterative improvement.
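
To make the first two topics concrete, here is a minimal metric sketch in Python, assuming model outputs and gold labels are plain strings; the helper names (`exact_match`, `token_f1`) and the tiny two-item test set are illustrative, not a prescribed template.

```python
# Minimal metric sketch: exact match and token-level F1 over a small
# gold-labeled test set. Names and data are illustrative only.

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Count overlapping tokens, consuming gold tokens as they are matched.
    common = 0
    gold_remaining = list(gold_tokens)
    for tok in pred_tokens:
        if tok in gold_remaining:
            gold_remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

test_set = [  # (model output, gold label) pairs -- illustrative data
    ("Paris", "paris"),
    ("The capital is Paris", "Paris"),
]
for metric in (exact_match, token_f1):
    avg = sum(metric(p, g) for p, g in test_set) / len(test_set)
    print(f"{metric.__name__}: {avg:.2f}")
```

Which metric to report depends on the task: exact match suits short factual answers, while token F1 tolerates paraphrase in longer outputs.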
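For the inter-rater reliability topic, here is a minimal sketch of Cohen's kappa, which corrects the raw agreement rate between two annotators for the agreement expected by chance; the label lists are made-up example data.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters always use the same label
    return (p_o - p_e) / (1 - p_e)

# Illustrative annotations from two raters on ten items.
a = ["good", "good", "bad", "good", "bad", "good", "bad", "bad", "good", "good"]
b = ["good", "bad",  "bad", "good", "bad", "good", "good", "bad", "good", "good"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```

Low kappa despite high raw agreement usually signals that the annotation guidelines, not the annotators, need fixing.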
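For the debugging topic, one way to get isolation and reproduction is to log every model call with its inputs and parameters so a failing item can be replayed on its own. A minimal sketch follows; the `run_prompt` stub is a placeholder for whatever client call a real pipeline makes.

```python
import json
import time

def run_prompt(prompt: str, temperature: float) -> str:
    """Stand-in for the real model call; replace with your pipeline's client."""
    return "stub output"

def logged_run(prompt: str, temperature: float = 0.0,
               log_path: str = "runs.jsonl") -> str:
    """Run one pipeline step and append a replayable record to a JSONL log."""
    output = run_prompt(prompt, temperature)
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "temperature": temperature,  # keep at 0.0 while debugging for repeatability
        "output": output,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return output

logged_run("Summarize: ...", temperature=0.0)
```

Replaying a single logged record against a candidate fix isolates the failing step from the rest of the pipeline.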
Activities
Run an end-to-end experiment comparing two prompt designs on a 50–100 item test set; present results with simple statistical summaries and recommendations.
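
As one way to produce the simple statistical summaries this activity asks for, a paired bootstrap over per-item scores yields a confidence interval for the difference between the two prompt designs; the score arrays below are placeholders for real per-item results.

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000,
                      alpha=0.05, seed=0):
    """Percentile CI for mean(scores_a) - mean(scores_b), resampling items in pairs."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping A/B scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Placeholder per-item scores (e.g., 0/1 exact match) for two prompt designs.
scores_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 6   # 60 items
scores_b = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1] * 6
lo, hi = bootstrap_diff_ci(scores_a, scores_b)
print(f"95% CI for accuracy difference: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, there is some evidence of a real difference; with only 50–100 items, wide intervals are expected, which is exactly what the reading on small-sample interpretation addresses.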
📦 Deliverable
Experiment report, dataset, and recommended next-step plan.
Resources
Evaluation script templates, examples of human-evaluation forms, and readings on interpreting small-sample experiments.
Prerequisites
Modules 1–5.
Why it matters
Demonstrates that student work is measured and improved using data, providing tangible evidence of learning and reliability.