You changed the prompt.
Did it actually get better?

Catches the regressions that aggregate metrics hide.

Terminal
Kalibra Compare ──────────────────────────────────────────────
Baseline  100 traces (baseline.jsonl)
Current   100 traces (current.jsonl)
Gates     ✗ 1/2 failed

Success rate          50.0% → 75.0%    +25.0 pp
Cost (median)        $0.036 → $0.021    -40.5%
Duration (median)      7.6s → 15.2s     +99.1%
Error rate             0.2% → 4.3%      +4.1 pp

Trace breakdown
  Per trace   20 matched — ✓ 10 improved, ✗ 5 regressed

Quality gates
  [ OK ] success_rate_delta >= -5
  [FAIL] duration_delta_pct <= 30
──────────────────────────────────────────────
FAILED — quality gate violation (exit code 1)
+25 pp

The aggregate looks great

Success rate up, cost down. Every dashboard would green-light this deploy.

5/5 → 0/5

Five tasks completely broke

Per-task breakdown reveals them. New easy tasks masked the damage in the average.

+99%

Duration doubled

The quality gate caught it. Exit code 1. Deploy blocked.

Per-task regression detection

Groups traces by task ID and compares success rates. Finds the 5 tasks that broke even when the aggregate improved. The thing dashboards don't show you.
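The grouping idea is simple enough to sketch. This is a minimal illustration, not Kalibra's implementation, and the "task_id" and "success" field names are assumptions about the JSONL schema:

```python
from collections import defaultdict

def per_task_regressions(baseline, current):
    """Group traces by task id and flag tasks whose success rate dropped.

    `baseline` and `current` are lists of dicts; "task_id" and "success"
    are hypothetical field names standing in for your actual schema.
    """
    def rates(traces):
        hits, totals = defaultdict(int), defaultdict(int)
        for t in traces:
            totals[t["task_id"]] += 1
            hits[t["task_id"]] += bool(t["success"])
        return {k: hits[k] / totals[k] for k in totals}

    before, after = rates(baseline), rates(current)
    # Only tasks present in both runs can regress.
    return {k: (before[k], after[k])
            for k in before.keys() & after.keys()
            if after[k] < before[k]}
```

A task that went 5/5 → 0/5 surfaces here as (1.0, 0.0), even when the overall average moved up.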

Statistical significance, not just deltas

Bootstrap confidence intervals and z-tests on every metric. They separate real changes from random variance.

Quality gates

Exit 1 on violation. Markdown PR comments. GitHub Action included.
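A gate check boils down to evaluating thresholds against computed deltas. A minimal sketch, assuming a metric-name → (operator, threshold) gate spec; the real config syntax is not shown here:

```python
def check_gates(metrics, gates):
    """Return names of violated gates; a CI wrapper would exit 1 if any remain.

    `gates` maps metric name -> (operator, threshold); both the shape of
    this spec and the metric names are assumptions for illustration.
    """
    ops = {">=": lambda v, t: v >= t, "<=": lambda v, t: v <= t}
    return [name for name, (op, threshold) in gates.items()
            if not ops[op](metrics[name], threshold)]

# Thresholds mirror the demo report above.
gates = {"success_rate_delta": (">=", -5.0),
         "duration_delta_pct": ("<=", 30.0)}
metrics = {"success_rate_delta": 25.0, "duration_delta_pct": 99.1}
print(check_gates(metrics, gates))  # → ['duration_delta_pct']
```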

Any JSONL

Flat or nested. --suggest scans field names and gives you a config.
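Flat-or-nested support usually means mapping a dotted path onto nested keys. A sketch of the idea, with invented field names, not Kalibra's config format:

```python
import json

def get_field(record, path):
    """Resolve a possibly nested field via a dotted path, e.g. "metrics.cost".

    Lets one config address both flat and nested JSONL layouts; the field
    names below are hypothetical.
    """
    for key in path.split("."):
        record = record[key]
    return record

line = '{"task_id": "t1", "metrics": {"cost": 0.021, "success": true}}'
record = json.loads(line)
print(get_field(record, "metrics.cost"))  # → 0.021
```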

Deterministic

Same data, same results. No LLM-as-judge.

Lightweight

2 deps. No API keys. Installs in seconds.

$ pip install kalibra