You changed the prompt.
Did it actually get better?

Catches the regressions that aggregate metrics hide.

Terminal
Kalibra Compare ──────────────────────────────────────────────
Baseline  100 traces (baseline.jsonl)
Current   100 traces (current.jsonl)
Gates     ✗ 1/2 failed

Success rate          50.0% → 75.0%    +25.0 pp
Cost (median)        $0.036 → $0.021    -40.5%
Duration (median)      7.6s → 15.2s     +99.1%
Error rate             0.2% → 4.3%      +4.1 pp

Trace breakdown
  Per trace   20 matched — ✓ 10 improved, ✗ 5 regressed

Quality gates
  [ OK ] success_rate_delta >= -5
  [FAIL] duration_delta_pct <= 30
──────────────────────────────────────────────
FAILED — quality gate violation (exit code 1)
+25 pp

The aggregate looks great

Success rate up, cost down. Every dashboard would green-light this deploy.

5/5 → 0/5

Five tasks completely broke

Per-task breakdown reveals them. New easy tasks masked the damage in the average.

+99%

Duration doubled

The quality gate caught it. Exit code 1. Deploy blocked.

Per-task regression detection

Groups traces by task ID and compares success rates. Finds the 5 tasks that broke even when the aggregate improved. The thing dashboards don't show you.
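The grouping idea is simple enough to sketch. This is a minimal illustration, not Kalibra's implementation, and the "task_id" and "success" field names are assumptions about the JSONL schema:

```python
from collections import defaultdict

def per_task_regressions(baseline, current):
    """Group traces by task id and flag tasks whose success rate dropped.

    `baseline` and `current` are lists of dicts; "task_id" and "success"
    are hypothetical field names standing in for your actual schema.
    """
    def rates(traces):
        hits, totals = defaultdict(int), defaultdict(int)
        for t in traces:
            totals[t["task_id"]] += 1
            hits[t["task_id"]] += bool(t["success"])
        return {k: hits[k] / totals[k] for k in totals}

    before, after = rates(baseline), rates(current)
    # Only tasks present in both runs can regress.
    return {k: (before[k], after[k])
            for k in before.keys() & after.keys()
            if after[k] < before[k]}
```

A task that went 5/5 → 0/5 surfaces here as (1.0, 0.0), even when the overall average moved up.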

Statistical significance, not just deltas

Bootstrap confidence intervals and z-tests on every metric. They separate real changes from random variance.

Quality gates

Exit 1 on violation. Markdown PR comments. GitHub Action included.
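A gate check boils down to evaluating thresholds against computed deltas. A minimal sketch, assuming a metric-name → (operator, threshold) gate spec; the real config syntax is not shown here:

```python
def check_gates(metrics, gates):
    """Return names of violated gates; a CI wrapper would exit 1 if any remain.

    `gates` maps metric name -> (operator, threshold); both the shape of
    this spec and the metric names are assumptions for illustration.
    """
    ops = {">=": lambda v, t: v >= t, "<=": lambda v, t: v <= t}
    return [name for name, (op, threshold) in gates.items()
            if not ops[op](metrics[name], threshold)]

# Thresholds mirror the demo report above.
gates = {"success_rate_delta": (">=", -5.0),
         "duration_delta_pct": ("<=", 30.0)}
metrics = {"success_rate_delta": 25.0, "duration_delta_pct": 99.1}
print(check_gates(metrics, gates))  # → ['duration_delta_pct']
```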

Any JSONL

Flat or nested. --suggest scans field names and gives you a config.
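Flat-or-nested support usually means mapping a dotted path onto nested keys. A sketch of the idea, with invented field names, not Kalibra's config format:

```python
import json

def get_field(record, path):
    """Resolve a possibly nested field via a dotted path, e.g. "metrics.cost".

    Lets one config address both flat and nested JSONL layouts; the field
    names below are hypothetical.
    """
    for key in path.split("."):
        record = record[key]
    return record

line = '{"task_id": "t1", "metrics": {"cost": 0.021, "success": true}}'
record = json.loads(line)
print(get_field(record, "metrics.cost"))  # → 0.021
```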

Deterministic

Same data, same results. No LLM-as-judge.

Lightweight

2 deps. No API keys. Installs in seconds.

$ pip install kalibra