Performance Harness

This page describes how to profile and benchmark calibrated-explanations. The repository ships several purpose-built scripts; choose the one that matches your goal.


Quick-start: micro benchmarks

scripts/perf/run_micro_benchmarks.py uses WrapCalibratedExplainer (CE-first pattern) to time fit, calibrate, and predict on both classification and regression tasks.

python scripts/perf/run_micro_benchmarks.py --output reports/perf/micro.json --pretty

Full baseline snapshot

scripts/perf/collect_baseline.py records import time, RSS memory (psutil), tracemalloc peak, public API symbol inventory, and runtime benchmarks. Run this before and after a change to get a before/after pair.

python scripts/perf/collect_baseline.py \
    --output tests/benchmarks/baseline_$(date +%Y%m%d).json \
    --pretty

Windows PowerShell:

python scripts/perf/collect_baseline.py `
    --output tests/benchmarks/baseline_$(Get-Date -Format yyyyMMdd).json `
    --pretty

Regression check

scripts/perf/check_perf_regression.py compares a saved baseline against current metrics using the thresholds in tests/benchmarks/perf_thresholds.json.

python scripts/perf/check_perf_regression.py \
    --baseline tests/benchmarks/baseline_YYYYMMDD.json \
    --thresholds tests/benchmarks/perf_thresholds.json

Exits non-zero if any threshold is exceeded or if public API symbols were removed.


Legacy vs modern pipeline comparison

evaluation/scripts/compare_explain_performance.py benchmarks five strategy variants (legacy, modern, cached, parallel, cache+parallel) across classification and regression and prints speedup tables. This is the source of the numbers in evaluation/explain_performance.md.

python evaluation/scripts/compare_explain_performance.py
# optional: save results
python evaluation/scripts/compare_explain_performance.py \
    --output reports/perf/pipeline_comparison.json

Prerequisites: the same environment used for the rest of the test suite (pip install -e .[dev]).


Streaming serialization benchmark

scripts/perf/stream_benchmark.py measures elapsed time and peak memory for serialising N synthetic explanations through the streaming API.

python scripts/perf/stream_benchmark.py --n 10000 --chunk 256 --format jsonl

Committed baselines

Baseline snapshots live in tests/benchmarks/. The thresholds in tests/benchmarks/perf_thresholds.json govern import time and explanation latency. Update the baseline file and commit it when a deliberate performance change lands.


When to run

Scenario

Recommended script

Quick sanity check after a change

run_micro_benchmarks.py

Before/after comparison for a PR

collect_baseline.py + check_perf_regression.py

Reproducing pipeline speedup numbers

compare_explain_performance.py

Streaming throughput check

stream_benchmark.py