# Performance Harness

This page describes how to profile and benchmark `calibrated-explanations`.
The repository ships several purpose-built scripts; choose the one that matches
your goal.

---

## Quick-start: micro benchmarks

`scripts/perf/run_micro_benchmarks.py` uses `WrapCalibratedExplainer` (CE-first
pattern) to time `fit`, `calibrate`, and `predict` on both classification and
regression tasks.

```bash
python scripts/perf/run_micro_benchmarks.py --output reports/perf/micro.json --pretty
```

---

## Full baseline snapshot

`scripts/perf/collect_baseline.py` records import time, RSS memory (`psutil`),
`tracemalloc` peak, public API symbol inventory, and runtime benchmarks.  Run
this before and after a change to get a before/after pair.

```bash
python scripts/perf/collect_baseline.py \
    --output tests/benchmarks/baseline_$(date +%Y%m%d).json \
    --pretty
```

Windows PowerShell:

```powershell
python scripts/perf/collect_baseline.py `
    --output tests/benchmarks/baseline_$(Get-Date -Format yyyyMMdd).json `
    --pretty
```

---

## Regression check

`scripts/perf/check_perf_regression.py` compares a saved baseline against
current metrics using the thresholds in `tests/benchmarks/perf_thresholds.json`.

```bash
python scripts/perf/check_perf_regression.py \
    --baseline tests/benchmarks/baseline_YYYYMMDD.json \
    --thresholds tests/benchmarks/perf_thresholds.json
```

Exits non-zero if any threshold is exceeded or if public API symbols were
removed.

---

## Legacy vs modern pipeline comparison

`evaluation/scripts/compare_explain_performance.py` benchmarks five strategy
variants (legacy, modern, cached, parallel, cache+parallel) across
classification and regression and prints speedup tables.  This is the source
of the numbers in `evaluation/explain_performance.md`.

```bash
python evaluation/scripts/compare_explain_performance.py
# optional: save results
python evaluation/scripts/compare_explain_performance.py \
    --output reports/perf/pipeline_comparison.json
```

Prerequisites: the same environment used for the rest of the test suite
(`pip install -e .[dev]`).

---

## Streaming serialization benchmark

`scripts/perf/stream_benchmark.py` measures elapsed time and peak memory for
serialising `N` synthetic explanations through the streaming API.

```bash
python scripts/perf/stream_benchmark.py --n 10000 --chunk 256 --format jsonl
```

---

## Committed baselines

Baseline snapshots live in `tests/benchmarks/`.  The thresholds in
`tests/benchmarks/perf_thresholds.json` govern import time and explanation
latency.  Update the baseline file and commit it when a deliberate performance
change lands.

---

## When to run

| Scenario | Recommended script |
|---|---|
| Quick sanity check after a change | `run_micro_benchmarks.py` |
| Before/after comparison for a PR | `collect_baseline.py` + `check_perf_regression.py` |
| Reproducing pipeline speedup numbers | `compare_explain_performance.py` |
| Streaming throughput check | `stream_benchmark.py` |