Parallel execution playbook (opt-in)¶

Experienced teams can squeeze additional throughput from calibrated explanations by tuning the parallel executor. This playbook distils the ADR-004-complete heuristics so you can decide when the extra complexity is worthwhile. Pair it with the configuration steps in Tune runtime performance (opt-in) and roll back to the sequential baseline whenever the gains are marginal.

Tip

Keep these controls off until you have baseline explanations, calibration checks, and governance sign-off. Parallel overhead can exceed any speed-up on small workloads, especially on Windows.

Quick decision matrix¶

Stay sequential when n_instances × n_features < 50,000 or when Windows deployments cannot rely on thread-safe payloads.
Use instance-parallel when instances dominate (n_instances ≥ 4 × workers × 64) and feature counts stay below ~64; aim for chunks between 256 and 1,024 rows.
Do not select feature-parallel: it is deprecated and falls back to instance-parallel.
Avoid process pools on Windows unless work items take hundreds of milliseconds and payloads are shared via memory mapping.
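The matrix above can be sketched as a small helper. This is only an illustration: choose_strategy is a hypothetical name, not part of the library, and defaulting to sequential for workloads that match neither rule is an assumption made here for safety.

```python
def choose_strategy(n_instances: int, n_features: int, workers: int) -> str:
    """Hypothetical helper mirroring the quick decision matrix."""
    # Small workloads: parallel overhead tends to exceed any speed-up.
    if n_instances * n_features < 50_000:
        return "sequential"
    # Instances dominate and the feature count stays below roughly 64.
    if n_instances >= 4 * workers * 64 and n_features < 64:
        return "instance_parallel"
    # Assumption: anything else stays on the reference path until benchmarked.
    return "sequential"


if __name__ == "__main__":
    print(choose_strategy(n_instances=10_000, n_features=20, workers=4))
```

The Windows guidance is deliberately left out of this helper because it concerns the backend rather than the strategy.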

Strategy checkpoints¶

Sequential (`SequentialExplainExecutor`)¶

Sequential execution remains the reference path. It wins by default on small or medium datasets and whenever CE_PARALLEL stays disabled. Use it to validate correctness before enabling any executor.

Instance-parallel (`InstanceParallelExplainExecutor`)¶

Best for large batches of instances with relatively few features.

Chunk size – min_batch_size doubles as the chunk size. Target max(256, ceil(n_instances / (workers × 3))), and never let a manually chosen value drop below 128, because smaller chunks cause thrashing.
Workers – favour threads on Windows. On Linux or macOS, threads work well when NumPy dominates; processes only help when Python control flow is the bottleneck and payloads serialise cheaply.
Fallback – switch back to sequential when chunk sizes shrink below 128 or when feature counts exceed 128. Feature-parallel used to cover the high-feature case, but it is deprecated and now falls back to instance-parallel.
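As a rough illustration of the chunk-size rule, the sketch below computes the target and clamps it to the 256 to 1,024 row range recommended elsewhere in this playbook. The function name and its floor and ceiling parameters are hypothetical; only min_batch_size is a name taken from this page.

```python
import math


def instance_chunk_size(n_instances: int, workers: int,
                        floor: int = 256, ceiling: int = 1024) -> int:
    """Hypothetical helper: a candidate value for min_batch_size."""
    # Aim for roughly three chunks per worker, but never below the floor.
    target = max(floor, math.ceil(n_instances / (workers * 3)))
    # Cap the result so a single chunk never exceeds the ceiling.
    return min(target, ceiling)


if __name__ == "__main__":
    print(instance_chunk_size(n_instances=10_000, workers=4))
```

With 10,000 instances and 4 workers the target is ceil(10,000 / 12) = 834 rows, which already sits inside the recommended range.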

Backend recommendations¶

Windows – use the thread backend. Joblib defaults to processes and inherits heavy spawn costs.
Linux/macOS – threads work best when NumPy releases the GIL; processes help only when the workload is Python-bound and payloads are light or shared.
Joblib – consider it experimental until ADR-004 rolls out batching. Move immutable arrays to module-level globals to let joblib memmap instead of pickling per task.
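The recommendations above amount to a short rule of thumb, sketched below. pick_backend and its arguments are hypothetical and exist only for illustration; the library's own chooser may weigh more signals than this.

```python
import sys


def pick_backend(platform: str = sys.platform,
                 python_bound: bool = False,
                 light_or_shared_payload: bool = False) -> str:
    """Hypothetical helper: return 'threads' or 'processes'."""
    # Windows: process spawn costs are heavy, so stay on threads.
    if platform.startswith("win"):
        return "threads"
    # Linux/macOS: processes pay off only for Python-bound work
    # whose payloads are light or shared.
    if python_bound and light_or_shared_payload:
        return "processes"
    # Otherwise threads, which do well when NumPy releases the GIL.
    return "threads"


if __name__ == "__main__":
    print(pick_backend())
```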

Parameter tuning checklist¶

Estimate total work (n_instances × n_features).
Choose the strategy using the decision matrix above.
Select worker counts (logical cores for threads, physical cores for processes).
Derive chunk sizes from the recommended bounds; clamp instance chunks to 256–1,024 rows.
Skip feature-parallel tuning: the strategy is deprecated and falls back to instance-parallel.
Benchmark against sequential; adopt the executor only when you see ≥1.2× speed-up.
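The final benchmarking step could look like the following minimal sketch. It uses only the standard library; best_time and worth_adopting are hypothetical helpers, and the callables you pass in stand for whatever sequential and parallel explanation calls you are comparing.

```python
import time


def best_time(fn, repeats: int = 3) -> float:
    """Return the fastest wall-clock duration of fn() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def worth_adopting(sequential_seconds: float, parallel_seconds: float,
                   min_speedup: float = 1.2) -> bool:
    """True when the parallel run is at least min_speedup times faster."""
    if parallel_seconds <= 0:
        return False  # A non-positive timing is not a usable measurement.
    return sequential_seconds / parallel_seconds >= min_speedup


if __name__ == "__main__":
    seq = best_time(lambda: sum(range(100_000)))
    par = best_time(lambda: sum(range(100_000)))  # Stand-in for the parallel path.
    print(worth_adopting(seq, par))
```

Taking the best of several runs reduces noise from warm-up and scheduling; run both paths on the same data and machine so the comparison is fair.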

Operational guardrails¶

The auto strategy now considers workload hints. Provide work_items (e.g. instance × feature counts) when calling domain wrappers so that ADR-004’s chooser can bias toward threads for light work and processes for large, instance-heavy payloads.
Use task_size_hint_bytes to steer away from process backends when payloads exceed ~10MB per task; the auto chooser will return sequential for very small batches to keep overhead low.
Telemetry emits duration, worker counts, and work-item hints per execution so you can validate parallel wins before rolling out broadly.
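One way to obtain a value for task_size_hint_bytes is to measure how large a representative task payload is once pickled, since that approximates what a process backend must serialise per task. The sketch below assumes this approach; both helper names are hypothetical, and the 10 MB limit simply mirrors the approximate threshold quoted above.

```python
import pickle

# Approximate threshold from the guardrail above (~10 MB per task).
PROCESS_PAYLOAD_LIMIT_BYTES = 10 * 1024 * 1024


def estimate_task_size_bytes(payload) -> int:
    """Size of the payload after pickling with the highest protocol."""
    return len(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))


def should_avoid_processes(task_size_hint_bytes: int,
                           limit: int = PROCESS_PAYLOAD_LIMIT_BYTES) -> bool:
    """True when a per-task payload is too heavy for a process backend."""
    return task_size_hint_bytes > limit


if __name__ == "__main__":
    sample = list(range(100_000))  # Stand-in for one task's payload.
    hint = estimate_task_size_bytes(sample)
    print(hint, should_avoid_processes(hint))
```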
