Performance Harness
This page describes how to profile and benchmark calibrated-explanations.
The repository ships several purpose-built scripts; choose the one that matches
your goal.
Quick-start: micro benchmarks
scripts/perf/run_micro_benchmarks.py uses WrapCalibratedExplainer (CE-first
pattern) to time fit, calibrate, and predict on both classification and
regression tasks.
python scripts/perf/run_micro_benchmarks.py --output reports/perf/micro.json --pretty
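The per-stage timing idea can be sketched with the standard library alone. This is a minimal illustration, not the script's actual implementation: the helper `time_stage` and the stand-in workloads are hypothetical, and the real script times the fit, calibrate, and predict calls of WrapCalibratedExplainer instead.

```python
import json
import statistics
import time


def time_stage(fn, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls of fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workloads (hypothetical). In a real run these would be the
# explainer's fit, calibrate, and predict calls.
stages = {
    "fit": lambda: sorted(range(50_000), reverse=True),
    "calibrate": lambda: sum(i * i for i in range(50_000)),
    "predict": lambda: [i % 7 for i in range(50_000)],
}

results = {name: time_stage(fn) for name, fn in stages.items()}
print(json.dumps(results, indent=2))
```

Taking the median over a few repeats keeps a single slow run (for example a cold cache) from dominating the reported number.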
Full baseline snapshot
scripts/perf/collect_baseline.py records import time, RSS memory (psutil),
tracemalloc peak, public API symbol inventory, and runtime benchmarks. Run
it once before and once after a change to get a pair of snapshots to compare.
python scripts/perf/collect_baseline.py \
--output tests/benchmarks/baseline_$(date +%Y%m%d).json \
--pretty
Windows PowerShell:
python scripts/perf/collect_baseline.py `
--output tests/benchmarks/baseline_$(Get-Date -Format yyyyMMdd).json `
--pretty
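Two of the recorded metrics, import time and tracemalloc peak, can be approximated with the standard library as below. This is a sketch under assumptions: the function name `measure_import` and the output keys are hypothetical, `colorsys` stands in for the package under test so the example runs anywhere, and RSS is omitted because it needs psutil.

```python
import importlib
import sys
import time
import tracemalloc


def measure_import(module_name):
    """Import a module once and report wall time and tracemalloc peak."""
    # Drop any cached copy so the import work is actually repeated.
    # Dependencies stay cached, so a fresh interpreter gives a cleaner number.
    sys.modules.pop(module_name, None)
    tracemalloc.start()
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"import_seconds": elapsed, "tracemalloc_peak_bytes": peak}


# "colorsys" is a small stdlib module used here as a stand-in.
snapshot = measure_import("colorsys")
print(snapshot)
```

Note that tracemalloc only sees allocations made through Python's allocator, which is one reason a snapshot may record RSS alongside it.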
Regression check
scripts/perf/check_perf_regression.py compares a saved baseline against
current metrics using the thresholds in tests/benchmarks/perf_thresholds.json.
python scripts/perf/check_perf_regression.py \
--baseline tests/benchmarks/baseline_YYYYMMDD.json \
--thresholds tests/benchmarks/perf_thresholds.json
The script exits with a non-zero status if any threshold is exceeded or if any public API symbol was removed.
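The comparison logic described above might look roughly like the following. The JSON layout (`metrics`, `api_symbols`), the ratio-style thresholds, and the function name `find_regressions` are all assumptions made for illustration; the real baseline and threshold files may be structured differently, and the numbers are made up.

```python
def find_regressions(baseline, current, thresholds):
    """Return human-readable failures; an empty list means the check passes.

    `thresholds` maps a metric name to the largest allowed ratio
    current / baseline (for example, 1.2 allows a 20% slowdown).
    """
    failures = []
    for metric, max_ratio in thresholds.items():
        old = baseline["metrics"][metric]
        new = current["metrics"][metric]
        if old > 0 and new / old > max_ratio:
            failures.append(f"{metric}: {old:.3f} -> {new:.3f} exceeds x{max_ratio}")
    # Any symbol present in the baseline but missing now counts as a failure.
    removed = set(baseline["api_symbols"]) - set(current["api_symbols"])
    if removed:
        failures.append(f"removed public symbols: {sorted(removed)}")
    return failures


# Illustrative values only (hypothetical layout, not real measurements).
baseline = {"metrics": {"import_seconds": 0.50, "explain_seconds": 2.0},
            "api_symbols": ["CalibratedExplainer", "WrapCalibratedExplainer"]}
current = {"metrics": {"import_seconds": 0.55, "explain_seconds": 3.0},
           "api_symbols": ["WrapCalibratedExplainer"]}
thresholds = {"import_seconds": 1.2, "explain_seconds": 1.2}

failures = find_regressions(baseline, current, thresholds)
for line in failures:
    print(line)
# A command-line script would pass this value to sys.exit().
exit_code = 1 if failures else 0
```

Here the import time stays within its 20% allowance, while the explanation latency and the missing symbol each produce a failure, so `exit_code` is 1.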
Legacy vs modern pipeline comparison
evaluation/scripts/compare_explain_performance.py benchmarks five strategy
variants (legacy, modern, cached, parallel, cache+parallel) across
classification and regression and prints speedup tables. This is the source
of the numbers in evaluation/explain_performance.md.
python evaluation/scripts/compare_explain_performance.py
# optional: save results
python evaluation/scripts/compare_explain_performance.py \
--output reports/perf/pipeline_comparison.json
Prerequisites: the same environment used for the rest of the test suite
(pip install -e ".[dev]"; the quotes stop shells such as zsh from treating
the brackets as a glob pattern).
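A speedup table of the kind the script prints can be derived from per-variant timings as sketched below. The function `speedup_table` is hypothetical, the choice of `legacy` as the reference is an assumption, and the timings are placeholders rather than measured results.

```python
def speedup_table(timings, reference="legacy"):
    """Format each variant's time and its speedup relative to `reference`."""
    base = timings[reference]
    rows = [f"{'variant':<16}{'seconds':>10}{'speedup':>10}"]
    for name, seconds in timings.items():
        rows.append(f"{name:<16}{seconds:>10.3f}{base / seconds:>9.2f}x")
    return "\n".join(rows)


# Placeholder timings for illustration only, not measured results.
timings = {"legacy": 4.0, "modern": 2.0, "cached": 1.0,
           "parallel": 1.6, "cache+parallel": 0.8}
print(speedup_table(timings))
```

Speedup is computed as reference time divided by variant time, so values above 1.00x mean the variant is faster than the reference.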
Streaming serialization benchmark
scripts/perf/stream_benchmark.py measures elapsed time and peak memory for
serialising N synthetic explanations through the streaming API.
python scripts/perf/stream_benchmark.py --n 10000 --chunk 256 --format jsonl
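The measurement can be pictured with a stdlib-only sketch that writes synthetic records as JSON Lines in chunks while tracking elapsed time and peak memory. Everything here is an assumption for illustration: the record shape, the helper names, and the in-memory sink are hypothetical and do not use the library's streaming API.

```python
import io
import json
import time
import tracemalloc


def synthetic_explanations(n):
    """Yield n small made-up records one at a time (never a full list)."""
    for i in range(n):
        yield {"index": i, "prediction": i % 2, "weights": [0.1, -0.2, 0.3]}


def stream_jsonl(records, sink, chunk_size):
    """Write records as JSON Lines, flushing to sink every chunk_size rows."""
    buffer, written = [], 0
    for record in records:
        buffer.append(json.dumps(record))
        if len(buffer) == chunk_size:
            sink.write("\n".join(buffer) + "\n")
            written += len(buffer)
            buffer.clear()
    if buffer:  # final partial chunk
        sink.write("\n".join(buffer) + "\n")
        written += len(buffer)
    return written


tracemalloc.start()
start = time.perf_counter()
sink = io.StringIO()
count = stream_jsonl(synthetic_explanations(10_000), sink, chunk_size=256)
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"{count} rows in {elapsed:.3f}s, peak {peak / 1024:.0f} KiB")
```

Because the `io.StringIO` sink keeps all output in memory, the peak here includes the serialized text; writing to a file instead would show the benefit of chunking more clearly.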
Committed baselines
Baseline snapshots live in tests/benchmarks/. The thresholds in
tests/benchmarks/perf_thresholds.json govern import time and explanation
latency. Update the baseline file and commit it when a deliberate performance
change lands.
When to run
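| Scenario | Recommended script |
|---|---|
| Quick sanity check after a change | scripts/perf/run_micro_benchmarks.py |
| Before/after comparison for a PR | scripts/perf/collect_baseline.py, then scripts/perf/check_perf_regression.py |
| Reproducing pipeline speedup numbers | evaluation/scripts/compare_explain_performance.py |
| Streaming throughput check | scripts/perf/stream_benchmark.py |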