Parallel execution playbook (opt-in)
Experienced teams can squeeze additional throughput from calibrated explanations by tuning the parallel executor. This playbook distils the ADR-004-complete heuristics so you can decide when the extra complexity is worthwhile. Pair it with the configuration steps in Tune runtime performance (opt-in) and roll back to the sequential baseline whenever the gains are marginal.
Tip
Keep these controls off until you have baseline explanations, calibration checks, and governance sign-off. Parallel overhead can exceed any speed-up on small workloads, especially on Windows.
Quick decision matrix
Stay sequential when n_instances × n_features < 50,000 or when Windows deployments cannot rely on thread-safe payloads.
Use instance-parallel when instances dominate (n_instances ≥ 4 × workers × 64) and feature counts stay below ~64; aim for chunks between 256 and 1,024 rows.
Do not plan around feature-parallel: it is deprecated and falls back to instance-parallel.
Avoid process pools on Windows unless work items take hundreds of milliseconds and payloads are shared via memory mapping.
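The matrix above can be read as a small decision function. The sketch below is illustrative only: `choose_strategy` and the strategy labels it returns are hypothetical names, not part of the library, and the "~64 features" guideline is approximated as a strict `< 64` check. The Windows thread-safety and process-pool conditions are judgment calls that are not encoded here.

```python
def choose_strategy(n_instances: int, n_features: int, workers: int) -> str:
    """Hypothetical helper mirroring the quick decision matrix.

    Returns "sequential" or "instance-parallel". These labels are
    illustrative and are not taken from the library's API.
    """
    if n_instances < 0 or n_features < 0 or workers < 1:
        raise ValueError("counts must be non-negative and workers >= 1")

    # Rule 1: small workloads stay sequential.
    if n_instances * n_features < 50_000:
        return "sequential"

    # Rule 2: instance-parallel when instances dominate and features are few.
    instances_dominate = n_instances >= 4 * workers * 64
    if instances_dominate and n_features < 64:
        return "instance-parallel"

    # Everything else falls back to the sequential reference path.
    return "sequential"


if __name__ == "__main__":
    print(choose_strategy(n_instances=10_000, n_features=20, workers=4))
```

With 4 workers the instance threshold is 4 × 4 × 64 = 1,024 rows, so a 10,000 × 20 batch qualifies for instance-parallel while a 500 × 200 batch does not.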
Strategy checkpoints
Sequential (SequentialExplainExecutor)
Sequential execution remains the reference path. It wins by default on small or
medium datasets and whenever CE_PARALLEL stays disabled. Use it to validate
correctness before enabling any parallel executor.
Instance-parallel (InstanceParallelExplainExecutor)
Best for large batches of instances with relatively few features.
Chunk size – min_batch_size doubles as the chunk size. Target max(256, ceil(n_instances / (workers × 3))) and keep the lower bound at 128 to avoid thrashing.
Workers – favour threads on Windows. On Linux or macOS, threads work well when NumPy dominates; processes only help when Python control flow is the bottleneck and payloads serialise cheaply.
Fallback – switch back to sequential when chunk sizes shrink below 128 or when feature counts top 128 (the deprecated feature-parallel path used to cover that case and now falls back to instance-parallel).
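The chunk-size target above is simple arithmetic, sketched here as a minimal example. `instance_chunk_size` and `chunk_bounds` are hypothetical helpers written for illustration; the only thing taken from the text is the formula max(256, ceil(n_instances / (workers × 3))). How the result is passed to the executor (for example as min_batch_size) depends on your configuration and is not shown.

```python
import math
from typing import List, Tuple


def instance_chunk_size(n_instances: int, workers: int) -> int:
    """Apply the target max(256, ceil(n_instances / (workers * 3)))."""
    if n_instances < 1 or workers < 1:
        raise ValueError("n_instances and workers must be >= 1")
    return max(256, math.ceil(n_instances / (workers * 3)))


def chunk_bounds(n_instances: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split range(n_instances) into half-open (start, stop) row slices."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size, n_instances))
        for start in range(0, n_instances, chunk_size)
    ]


if __name__ == "__main__":
    size = instance_chunk_size(n_instances=10_000, workers=4)
    print(size, len(chunk_bounds(10_000, size)))
```

For 10,000 instances and 4 workers the formula gives ceil(10,000 / 12) = 834 rows per chunk, which sits inside the 256 to 1,024 range recommended in the decision matrix.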
Backend recommendations
Windows – use the thread backend. Joblib defaults to processes and inherits heavy spawn costs.
Linux/macOS – threads work best when NumPy releases the GIL; processes help only when the workload is Python-bound and payloads are light or shared.
Joblib – consider it experimental until ADR-004 rolls out batching. Move immutable arrays to module-level globals to let joblib memmap instead of pickling per task.
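As a rough sketch, the platform guidance above can be expressed as a helper that picks between threads and processes. `recommend_backend`, its parameters, and the returned labels are all hypothetical; they only restate the recommendations in this section and do not correspond to a documented library setting.

```python
import platform
from typing import Optional


def recommend_backend(
    python_bound: bool,
    light_payloads: bool,
    system: Optional[str] = None,
) -> str:
    """Hypothetical helper: return "threads" or "processes".

    python_bound   -- True when Python control flow, not NumPy, is the bottleneck.
    light_payloads -- True when per-task payloads serialise cheaply or are shared.
    system         -- override for platform.system(), mainly for testing.
    """
    system = system or platform.system()

    # Windows: process spawn costs are heavy, so stay on threads.
    if system == "Windows":
        return "threads"

    # Linux/macOS: processes only pay off for Python-bound work with light payloads.
    if python_bound and light_payloads:
        return "processes"

    return "threads"


if __name__ == "__main__":
    print(recommend_backend(python_bound=False, light_payloads=True))
```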
Parameter tuning checklist
Estimate total work (n_instances × n_features).
Choose the strategy using the decision matrix above.
Select worker counts (logical cores for threads, physical cores for processes).
Derive chunk sizes from the recommended bounds; clamp instance chunks to 256–1,024 rows.
(Deprecated) Feature-parallel tuning is no longer relevant as it falls back to instance-parallel.
Benchmark against sequential; adopt the executor only when you see ≥1.2× speed-up.
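The final benchmarking step can be sketched with the standard library alone. `best_of` and `should_adopt` are hypothetical helpers; the callables you time (one wrapping the sequential path, one wrapping the parallel executor) are yours to supply and are not shown, and the 1.2× threshold comes from the checklist above.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` runs of fn."""
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def should_adopt(
    sequential_seconds: float,
    parallel_seconds: float,
    threshold: float = 1.2,
) -> bool:
    """True when the parallel run is at least `threshold` times faster."""
    if parallel_seconds <= 0:
        raise ValueError("parallel_seconds must be positive")
    return sequential_seconds / parallel_seconds >= threshold


if __name__ == "__main__":
    # Stand-in workloads; replace with calls into your own explain pipeline.
    seq = best_of(lambda: sum(range(200_000)))
    par = best_of(lambda: sum(range(200_000)))
    print(should_adopt(seq, par))
```

Taking the minimum over several repeats reduces the influence of warm-up and scheduler noise, which matters when the expected gain is as small as 1.2×.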
Operational guardrails
The auto strategy now looks at workload hints. Provide work_items when calling domain wrappers (e.g. instance × feature counts) to let ADR-004’s chooser bias toward threads for light work and processes for large, instance-heavy payloads.
Use task_size_hint_bytes to steer away from process backends when payloads exceed ~10 MB per task; the auto chooser will return sequential for very small batches to keep overhead low.
Telemetry emits duration, worker counts, and work-item hints per execution so you can validate parallel wins before rolling out broadly.
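One way to estimate the two hints is sketched below. This is an assumption-laden illustration: `workload_hints` and `exceeds_process_payload_limit` are hypothetical helpers, the payload estimate assumes a dense chunk of 8-byte values, and "~10 MB" is taken as 10 × 1024 × 1024 bytes. How work_items and task_size_hint_bytes are actually passed to the domain wrappers is not shown here.

```python
from typing import Dict

# Assumption: "~10 MB" interpreted as 10 MiB.
PROCESS_PAYLOAD_LIMIT_BYTES = 10 * 1024 * 1024


def workload_hints(
    n_instances: int,
    n_features: int,
    chunk_rows: int,
    bytes_per_value: int = 8,
) -> Dict[str, int]:
    """Estimate hint values for one call (hypothetical helper).

    work_items is the instance x feature count for the whole call;
    task_size_hint_bytes approximates the payload of a single chunk,
    assuming a dense array of `bytes_per_value`-byte entries.
    """
    rows_per_task = min(chunk_rows, n_instances)
    return {
        "work_items": n_instances * n_features,
        "task_size_hint_bytes": rows_per_task * n_features * bytes_per_value,
    }


def exceeds_process_payload_limit(task_size_hint_bytes: int) -> bool:
    """True when the per-task payload is large enough to avoid process backends."""
    return task_size_hint_bytes > PROCESS_PAYLOAD_LIMIT_BYTES


if __name__ == "__main__":
    hints = workload_hints(n_instances=10_000, n_features=64, chunk_rows=512)
    print(hints, exceeds_process_payload_limit(hints["task_size_hint_bytes"]))
```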