# Parallel execution playbook (opt-in)

Experienced teams can squeeze additional throughput from calibrated explanations by tuning the parallel executor. This playbook distils the ADR-004-complete heuristics so you can decide when the extra complexity is worthwhile. Pair it with the configuration steps in {doc}`../../foundations/how-to/tune_runtime_performance` and roll back to the sequential baseline whenever the gains are marginal.

```{tip}
Keep these controls off until you have baseline explanations, calibration checks, and governance sign-off. Parallel overhead can exceed any speed-up on small workloads, especially on Windows.
```

## Quick decision matrix

- **Stay sequential** when ``n_instances × n_features < 50,000`` or when Windows deployments cannot rely on thread-safe payloads.
- **Use instance-parallel** when instances dominate (``n_instances ≥ 4 × workers × 64``) and feature counts stay below ~64; aim for chunks between 256 and 1,024 rows.
- **Feature-parallel (deprecated)** – feature-parallel execution is deprecated and falls back to instance-parallel.
- **Avoid process pools on Windows** unless work items take hundreds of milliseconds and payloads are shared via memory mapping.

## Strategy checkpoints

### Sequential (`SequentialExplainExecutor`)

Sequential execution remains the reference path. It wins by default on small or medium datasets and whenever ``CE_PARALLEL`` stays disabled. Use it to validate correctness before enabling any executor.

### Instance-parallel (`InstanceParallelExplainExecutor`)

Best for large batches of instances with relatively few features.

- **Chunk size** – ``min_batch_size`` doubles as the chunk size. Target ``max(256, ceil(n_instances / (workers × 3)))`` and keep the lower bound at 128 to avoid thrashing.
- **Workers** – favour threads on Windows. On Linux or macOS, threads work well when NumPy dominates; processes only help when Python control flow is the bottleneck and payloads serialise cheaply.
- **Fallback** – switch back to sequential when chunk sizes shrink below 128 or when feature counts top 128.

## Backend recommendations

- **Windows** – use the thread backend. Joblib defaults to processes and inherits heavy spawn costs.
- **Linux/macOS** – threads work best when NumPy releases the GIL; processes help only when the workload is Python-bound and payloads are light or shared.
- **Joblib** – consider it experimental until ADR-004 rolls out batching. Move immutable arrays to module-level globals to let joblib memmap instead of pickling per task.

## Parameter tuning checklist

1. Estimate total work (`n_instances × n_features`).
2. Choose the strategy using the decision matrix above.
3. Select worker counts (logical cores for threads, physical cores for processes).
4. Derive chunk sizes from the recommended bounds; clamp instance chunks to 256–1,024 rows.
5. (Deprecated) Feature-parallel tuning no longer applies because that strategy falls back to instance-parallel.
6. Benchmark against sequential; adopt the executor only when you see a speed-up of at least 1.2×.

## Operational guardrails

- The ``auto`` strategy now looks at workload hints. Provide ``work_items`` (e.g. instance × feature counts) when calling domain wrappers to let ADR-004’s chooser bias toward threads for light work and processes for large, instance-heavy payloads.
- Use ``task_size_hint_bytes`` to steer away from process backends when payloads exceed ~10 MB per task. The auto chooser also returns ``sequential`` for very small batches to keep overhead low.
- Telemetry emits duration, worker counts, and work-item hints per execution so you can validate parallel wins before rolling out broadly.

## Related resources

- {doc}`../../foundations/how-to/tune_runtime_performance` – enable/disable the executor and cache.
- {doc}`../../foundations/concepts/telemetry` – instrument runtime metrics when you opt in to parallel execution.
- `docs/improvement/parallel_execution_improvement_plan.md` – internal task breakdown tracking the ADR-004 remediation.
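
## Illustrative sketch of the heuristics

The decision matrix, chunk-size bounds, and 1.2× adoption threshold from this playbook can be expressed as a few lines of Python. This is a minimal sketch for reasoning about the numbers, not the library's chooser: the function names, the `thread_safe_payloads` flag, and the returned strategy labels are all hypothetical, and treating every case the matrix does not cover as sequential is an assumption.

```python
import math


def choose_strategy(n_instances: int, n_features: int, workers: int,
                    thread_safe_payloads: bool = True) -> str:
    """Hypothetical helper mirroring the quick decision matrix."""
    # Small workloads, or payloads that are not thread-safe, stay sequential.
    if n_instances * n_features < 50_000 or not thread_safe_payloads:
        return "sequential"
    # Instance-parallel needs enough rows per worker and a modest feature
    # count (the playbook's "~64" is approximate; 64 is used as a hard cut).
    if n_instances >= 4 * workers * 64 and n_features < 64:
        return "instance_parallel"
    # Assumption: anything the matrix does not cover stays on the baseline.
    return "sequential"


def instance_chunk_size(n_instances: int, workers: int) -> int:
    """Target max(256, ceil(n / (workers * 3))), clamped to 256-1,024 rows."""
    target = max(256, math.ceil(n_instances / (workers * 3)))
    return min(target, 1024)


def meets_speedup_threshold(sequential_seconds: float,
                            parallel_seconds: float,
                            threshold: float = 1.2) -> bool:
    """Adopt the executor only when the measured speed-up reaches the threshold."""
    return sequential_seconds / parallel_seconds >= threshold


if __name__ == "__main__":
    print(choose_strategy(100, 10, workers=4))       # sequential
    print(choose_strategy(10_000, 20, workers=4))    # instance_parallel
    print(instance_chunk_size(10_000, workers=4))    # 834
    print(meets_speedup_threshold(10.0, 8.0))        # True (1.25x)
```

Keeping the thresholds in one small, testable helper makes it easier to compare them against real benchmark timings before enabling the executor.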