Catches the regressions that aggregate metrics hide.
Success rate up, cost down. Every dashboard would green-light this deploy.
A per-task breakdown reveals the regressions. New easy tasks masked the damage in the average.
The quality gate caught it. Exit code 1. Deploy blocked.
Groups traces by task ID and compares success rates. Finds the 5 tasks that broke even when the aggregate improved. The thing dashboards don't show you.
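A minimal sketch of the per-task comparison, not the tool's actual implementation. It assumes each trace is a dict with hypothetical `task_id` and `success` fields. The names `runs`, `success_by_task`, `regressed_tasks`, `overall_rate` and the `min_drop` threshold are illustrative only.

```python
from collections import defaultdict


def runs(task_id, ok, total):
    """Build `total` illustrative traces for one task, `ok` of them successful."""
    return [{"task_id": task_id, "success": i < ok} for i in range(total)]


def success_by_task(traces):
    """Map task_id -> (successes, total)."""
    counts = defaultdict(lambda: [0, 0])
    for t in traces:
        counts[t["task_id"]][0] += 1 if t["success"] else 0
        counts[t["task_id"]][1] += 1
    return {task: tuple(c) for task, c in counts.items()}


def overall_rate(traces):
    """Aggregate success rate across all traces."""
    return sum(1 for t in traces if t["success"]) / len(traces)


def regressed_tasks(baseline, candidate, min_drop=0.2):
    """Tasks present in both runs whose success rate fell by at least min_drop."""
    base, cand = success_by_task(baseline), success_by_task(candidate)
    out = []
    for task in sorted(base.keys() & cand.keys()):
        b = base[task][0] / base[task][1]
        c = cand[task][0] / cand[task][1]
        if b - c >= min_drop:
            out.append((task, b, c))
    return out


# Task "a" breaks, but a new easy task "c" lifts the average.
baseline = runs("a", 9, 10) + runs("b", 5, 10)
candidate = runs("a", 5, 10) + runs("b", 5, 10) + runs("c", 20, 20)

print(overall_rate(baseline), overall_rate(candidate))  # aggregate goes up
print(regressed_tasks(baseline, candidate))             # task "a" still regressed
```

Only tasks present in both runs are compared, so tasks that exist in the candidate alone cannot hide a drop elsewhere.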
Bootstrap confidence intervals and z-tests on every metric. Separates real changes from random variance.
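One way these two checks could look, using only the standard library. This is a sketch under assumptions: a percentile bootstrap and a pooled two-proportion z-test are common choices, but the source does not say which variants the tool uses. The function names and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def bootstrap_diff_ci(base, cand, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(cand) - mean(base)."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    diffs = []
    for _ in range(n_resamples):
        b = rng.choices(base, k=len(base))  # resample with replacement
        c = rng.choices(cand, k=len(cand))
        diffs.append(sum(c) / len(c) - sum(b) / len(b))
    diffs.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return diffs[lo_idx], diffs[hi_idx]


def two_proportion_z_test(s1, n1, s2, n2):
    """Two-sided pooled z-test for a difference between two success rates."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # no variance at all: nothing to distinguish
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative data: 90/100 successes before, 50/100 after.
base = [1] * 90 + [0] * 10
cand = [1] * 50 + [0] * 50
lo, hi = bootstrap_diff_ci(base, cand)
z, p = two_proportion_z_test(90, 100, 50, 100)
print(f"diff CI: [{lo:.3f}, {hi:.3f}]  z={z:.2f}  p={p:.2g}")
```

If the interval excludes zero and the p-value is small, the drop is unlikely to be random variance.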
Exit 1 on violation. Markdown PR comments. GitHub Action included.
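A rough sketch of how a gate like this can map violations to an exit code and a Markdown report. The metric names, thresholds and function names are hypothetical, not the tool's actual configuration or output format.

```python
def check_thresholds(metrics, minimums):
    """Return (name, value, minimum) for every metric missing or below its minimum."""
    violations = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            violations.append((name, value, minimum))
    return violations


def markdown_report(violations):
    """Render violations as a Markdown table suitable for a PR comment."""
    if not violations:
        return "Quality gate passed."
    lines = ["| Metric | Value | Minimum |", "| --- | --- | --- |"]
    for name, value, minimum in violations:
        lines.append(f"| {name} | {value} | {minimum} |")
    return "\n".join(lines)


def exit_code(violations):
    """1 blocks the deploy, 0 lets it through."""
    return 1 if violations else 0


# Hypothetical metrics from a candidate run.
metrics = {"success_rate": 0.75, "min_task_success_rate": 0.5}
minimums = {"success_rate": 0.7, "min_task_success_rate": 0.8}
violations = check_thresholds(metrics, minimums)
print(markdown_report(violations))
# In a CI script, finish with: sys.exit(exit_code(violations))
```

A non-zero exit code is what makes a CI step fail, so the gate needs no integration beyond being run as a step.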
Flat or nested. --suggest scans field names and gives you a config.
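To illustrate the idea of scanning field names, here is a sketch that assumes "flat or nested" refers to the shape of trace records. It does not show how `--suggest` works internally. The keyword hints, role names and functions are hypothetical.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical keyword hints for each role a config needs.
HINTS = {
    "task_id": ("task",),
    "success": ("success", "passed"),
    "cost": ("cost",),
}


def suggest_fields(record):
    """Guess which dotted path fills each role by matching leaf field names."""
    paths = flatten(record)
    suggestion = {}
    for role, keywords in HINTS.items():
        for path in paths:
            leaf = path.rsplit(".", 1)[-1].lower()
            if any(k in leaf for k in keywords):
                suggestion[role] = path
                break  # first match wins
    return suggestion


nested = {"meta": {"task_id": "t1"}, "result": {"passed": True, "cost_usd": 0.02}}
flat = {"task_id": "t1", "success": True, "cost": 0.02}
print(suggest_fields(nested))
print(suggest_fields(flat))
```

Flattening first means the same matching logic covers both record shapes.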
Same data, same results. No LLM-as-judge.
2 deps. No API keys. Installs in seconds.