# Difficulty-Normalized Reject for Conformal Classification

This page is the research-facing summary for CE's experimental difficulty-normalized reject-option conformal classification work. It documents what has been implemented and evaluated without promoting a new public nonconformity function (NCF) beyond the stable `default` and `ensured` choices.

The companion practitioner guide is {doc}`../../practitioner/advanced/reject-policy`.

## What CE reject-option classification does

For classification calls with reject enabled, CE builds conformal prediction sets from calibrated class probabilities. The prediction-set geometry determines whether the instance can be routed automatically or should be rejected:

| Prediction set | CE route | Interpretation |
| --- | --- | --- |
| Singleton set (`\|S(x)\| = 1`) | accept | One class remains plausible at the selected confidence. |
| Empty set (`\|S(x)\| = 0`) | novelty reject | No class is supported strongly enough; treat as non-covered or novel. |
| Multi-label set (`\|S(x)\| >= 2`) | ambiguity reject | More than one class remains plausible; defer or review. |

These routes are exposed through reject metadata including `reject_rate`, `ambiguity_rate`, `novelty_rate`, `prediction_set_size`, `ambiguity_mask`, and `novelty_mask`. The aggregate relation is `reject_rate = ambiguity_rate + novelty_rate`.

## Existing difficulty-aware Venn-Abers path

CE already has a difficulty-aware classification path before direct reject-score normalization is selected:

1. `difficulty_estimator` is accepted by `CalibratedExplainer`.
2. The interval plugin context carries the estimator to the built-in legacy interval plugin.
3. The built-in plugin passes the estimator into `VennAbers`.
4. `VennAbers` applies difficulty through probability scaling before Venn-Abers calibration.
5. The reject framework then computes `default` or `ensured` reject scores from the calibrated probabilities.

This path is difficulty-aware indirectly.
It changes the probabilities consumed by reject, but the reject nonconformity formulas themselves remain the public `default` or `ensured` formulas.

## New direct reject-score normalization

The experimental direct strategy is selected through the reject strategy registry:

```python
strategy="experimental.difficulty_normalized"
```

It changes the reject score definition by normalizing calibration and test nonconformity scores with per-instance difficulty before conformal p-values and prediction sets are computed. This keeps the CE-first API and `RejectPolicy` contracts unchanged while making the experimental behavior explicit and opt-in.

The novelty-aware research variant is:

```python
strategy="experimental.ambiguity_normalized_novelty_penalized"
```

That variant remains diagnostic. It combines ambiguity difficulty normalization with an additional novelty penalty to explore whether empty-set novelty rejects can be separated from multi-label ambiguity rejects more cleanly.

## Why normalization must happen before p-values

Difficulty normalization is part of the nonconformity score definition. The conformal p-values and prediction sets are calibrated from the score distribution, so the normalization must be applied to both calibration and test scores before p-value computation.

Applying difficulty only as a post-hoc reject threshold is a different operation. It moves a final decision cutoff but leaves the conformal scores, p-values, and prediction sets unchanged. That may be a useful heuristic, but it is not difficulty-normalized conformal scoring.

## When to use each mode

Use `ncf="default"` when you need the stable public baseline and want to preserve current CE reject behavior.

Use `ncf="ensured"` when interval-width-aware scoring is desired and the blend weight `w` is part of the experiment or operating-point tuning. This is still a stable public NCF mode.
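
As a reference point when choosing between these modes, the following minimal sketch illustrates why normalization has to precede p-value computation. It is not CE's implementation: the hinge score `1 - p`, division by difficulty, the helper names, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero; an illustrative choice


def hinge_scores(probabilities):
    """Hinge nonconformity 1 - p (an assumed score, not CE's exact formula)."""
    return 1.0 - np.asarray(probabilities, dtype=float)


def conformal_p_values(cal_scores, test_scores):
    """p-value = (count of calibration scores >= test score, plus 1) / (n_cal + 1)."""
    cal = np.asarray(cal_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    counts = (cal[None, None, :] >= test[:, :, None]).sum(axis=2)
    return (counts + 1.0) / (cal.size + 1.0)


def route(p_values, confidence):
    """Route by prediction-set size: 1 -> accept, 0 -> novelty, >= 2 -> ambiguity."""
    set_size = (p_values > 1.0 - confidence).sum(axis=1)
    return np.where(set_size == 1, "accept",
                    np.where(set_size == 0, "novelty", "ambiguity"))


# Toy calibration data: probability of the true class and a difficulty per instance.
cal_true_prob = np.array([0.9, 0.8, 0.6, 0.5])
cal_difficulty = np.array([0.5, 0.5, 2.0, 2.0])

# Toy test data: per-class probabilities and a difficulty per instance.
test_prob = np.array([[0.65, 0.35], [0.92, 0.08]])
test_difficulty = np.array([2.0, 0.5])

cal_raw = hinge_scores(cal_true_prob)
test_raw = hinge_scores(test_prob)

# Normalize BOTH calibration and test scores before computing p-values.
cal_norm = cal_raw / (cal_difficulty + EPS)
test_norm = test_raw / (test_difficulty[:, None] + EPS)

raw_routes = route(conformal_p_values(cal_raw, test_raw), confidence=0.7)
norm_routes = route(conformal_p_values(cal_norm, test_norm), confidence=0.7)
print(raw_routes.tolist(), norm_routes.tolist())
```

In this toy run the first test instance, which has high difficulty, is accepted under raw scores but routed to ambiguity reject once scores are normalized, because its p-values themselves change. A post-hoc difficulty threshold would leave the output of `conformal_p_values(cal_raw, test_raw)` untouched and only move the final cutoff.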

Use `strategy="experimental.difficulty_normalized"` for research and ablation runs where difficulty should directly shape reject scoring. Keep VA difficulty off for the cleanest primary contrast unless you are explicitly studying double-counting.

Use `strategy="experimental.ambiguity_normalized_novelty_penalized"` only for diagnostics of ambiguity-vs-novelty separation. Current Scenario 10 evidence did not justify promoting it over the simpler difficulty-normalized strategy.

## Validity and methodology caveats

- Fit and freeze the difficulty estimator before reject calibration alphas are estimated.
- Avoid fitting the estimator on calibration labels or calibration residuals unless explicit cross-fitting is used.
- Treat empirical selectivity, accepted accuracy, and difficulty alignment as empirical usefulness metrics, not new formal coverage guarantees.
- Coverage claims remain tied to the conformal score pipeline and its exchangeability assumptions. Difficulty normalization changes that pipeline.
- Experimental strategies are not public NCF contract expansion; `default` and `ensured` remain the public-facing NCF modes.

## Evaluation summary: Scenarios 8-11

The evaluation artifacts live under `evaluation/reject/artifacts/`.

### Scenario 8: existing indirect difficulty effect

Scenario 8 measured the existing path `difficulty_estimator -> VennAbers probability scaling -> reject NCF`.

For `default`, enabling VA difficulty changed accept rate by -22.2 percentage points, rejected-error capture by +11.5 percentage points, and accepted accuracy by -10.9 percentage points. For `ensured`, accept rate changed by -9.7 percentage points, rejected-error capture by +4.9 percentage points, and accepted accuracy by -12.2 percentage points.

Interpretation: the indirect VA difficulty path acted mainly as a stricter reject gate in this run. It captured more errors but accepted far fewer instances and reduced accepted accuracy.
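
For readers who want to reproduce this kind of comparison on their own data, the sketch below shows one plausible way to compute the three metrics and their percentage-point deltas from a reject mask. The metric definitions, the function name `reject_metrics`, and the toy arrays are assumptions for illustration; they are not taken from the evaluation artifacts and do not reproduce the Scenario 8 numbers.

```python
import numpy as np


def reject_metrics(y_true, y_pred, rejected):
    """Assumed definitions of accept rate, accepted accuracy, rejected-error capture."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rejected = np.asarray(rejected, dtype=bool)
    accepted = ~rejected
    errors = y_pred != y_true
    return {
        # Share of instances routed automatically.
        "accept_rate": float(accepted.mean()),
        # Accuracy restricted to accepted instances.
        "accepted_accuracy": float((~errors[accepted]).mean()) if accepted.any() else float("nan"),
        # Share of all errors that ended up rejected.
        "rejected_error_capture": float(rejected[errors].mean()) if errors.any() else float("nan"),
    }


# Toy labels and predictions (errors at positions 2, 5, and 7).
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

# Hypothetical reject masks without and with VA difficulty enabled.
rejected_without = [False, False, True, False, False, False, False, False]
rejected_with = [False, False, True, False, True, True, False, False]

without = reject_metrics(y_true, y_pred, rejected_without)
with_difficulty = reject_metrics(y_true, y_pred, rejected_with)

# Deltas in percentage points: 100 * (with - without).
delta_pp = {name: 100.0 * (with_difficulty[name] - without[name]) for name in without}
for name, value in delta_pp.items():
    print(f"{name}: {value:+.1f} pp")
```

Reporting the three deltas together matters because a stricter reject gate can raise error capture while still lowering accept rate, which is the pattern described in the interpretation above.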

### Scenario 9: direct difficulty-normalized reject scores

Scenario 9 compared six arms. The primary research contrast was A vs C: `builtin.default`, `ncf=default`, no VA difficulty, against `experimental.difficulty_normalized`, `ncf=default`, no VA difficulty.

Direct normalization changed reject rate by +1.08 percentage points, the rejected-minus-accepted difficulty gap by +0.3416, and `difficulty_reject_auc` by +0.2012. Matched reject-rate bins showed a mean accepted-accuracy delta of -0.0089 for C minus A, so the evidence favors difficulty-aligned routing rather than a blanket accepted-accuracy improvement.

Arms with both VA difficulty and direct score normalization were diagnostic for double-counting. The artifact recommends arm C as the next experimental baseline because it gives the cleanest direct-normalization contrast without VA double-count risk.

### Scenario 10: ambiguity-normalized novelty-penalized variant

Scenario 10 compared the built-in baseline, arm C, and the novelty-aware arm G.

Relative to C, G changed novelty rate by +0.0019, ambiguity rate by +0.0047, and accepted accuracy by +0.0037. It also changed novelty reject AUC by -0.0371, a decrease.

Interpretation: the novelty-aware variant did not clearly improve ambiguity-vs-novelty separation. Arm C remains the simpler recommended experimental baseline.

### Scenario 11: matched operating-point selection

Scenario 11 selected confidence values closest to target reject rates `0.10`, `0.20`, `0.30`, and `0.40` instead of averaging across the confidence grid. This is the decision-gate scenario for public API promotion.

For A vs C, accepted-accuracy deltas by target reject rate were +0.0012, -0.0029, -0.0070, and -0.0089. The best matched operating point was target `0.10`, but the overall evidence was mixed. Across targets, C minus A mean `difficulty_reject_auc` was -0.0040.
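
The matched operating-point selection used in this scenario can be sketched as a nearest-neighbor lookup on the confidence grid. The function name `select_operating_points`, the grid, and the observed reject rates below are hypothetical; the evaluation scripts may break ties or interpolate differently.

```python
import numpy as np


def select_operating_points(confidences, reject_rates, targets):
    """For each target reject rate, pick the grid confidence whose observed rate is closest."""
    confidences = np.asarray(confidences, dtype=float)
    reject_rates = np.asarray(reject_rates, dtype=float)
    selected = {}
    for target in targets:
        # argmin returns the first index on ties, i.e. the lowest confidence here.
        idx = int(np.argmin(np.abs(reject_rates - target)))
        selected[target] = (float(confidences[idx]), float(reject_rates[idx]))
    return selected


# Hypothetical confidence grid and reject rates observed for one arm.
confidences = [0.80, 0.85, 0.90, 0.95, 0.99]
reject_rates = [0.08, 0.17, 0.26, 0.41, 0.63]
targets = [0.10, 0.20, 0.30, 0.40]

selected = select_operating_points(confidences, reject_rates, targets)
for target, (confidence, observed) in selected.items():
    print(f"target={target:.2f} -> confidence={confidence:.2f}, observed reject rate={observed:.2f}")
```

Running the same selection separately for each arm and then comparing accepted accuracy at the selected points keeps the comparison at similar reject rates, rather than at the same nominal confidence.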

For C vs G, the novelty-aware variant increased novelty and empty-set rates by +0.0084 on average, changed accepted accuracy by -0.0005, and increased novelty reject AUC by +0.0845 while changing ambiguity rate by -0.0114.

Interpretation: Scenario 11 does not justify public API promotion. Direct difficulty normalization and the novelty-aware strategy should both remain experimental.

## Contribution framing

Development contribution:
: CE now has difficulty-aware reject routing that preserves the CE-first lifecycle, `RejectPolicy` output contracts, and the existing plugin architecture.

Research contribution:
: CE now has an experimental difficulty-normalized nonconformity strategy for reject-option conformal classification, evaluated against the built-in difficulty-aware VA path and a novelty-aware diagnostic variant.

## Open questions

- Ambiguity vs novelty separation: can empty-set novelty and multi-label ambiguity be separated without harming accepted decision quality?
- Double-counting difficulty: when, if ever, should VA difficulty scaling and direct reject-score normalization be combined?
- Conditional validity and Mondrian variants: can subgroup-aware calibration improve reliability without sacrificing useful reject selectivity?
- Finite-sample behavior: how stable are the observed effects across small calibration sets, high confidence regimes, and heterogeneous datasets?

## Minimal CE-first experiment snippet

```python
from calibrated_explanations import RejectPolicySpec, WrapCalibratedExplainer

wrapper = WrapCalibratedExplainer(model)
wrapper.fit(x_train, y_train)
wrapper.calibrate(
    x_cal,
    y_cal,
    feature_names=feature_names,
    difficulty_estimator=difficulty_estimator,
)

result = wrapper.predict(
    x_test,
    reject_policy=RejectPolicySpec.flag(ncf="default"),
    confidence=0.95,
    strategy="experimental.difficulty_normalized",
)

print(result.metadata["reject_rate"])
print(result.metadata["ambiguity_rate"], result.metadata["novelty_rate"])
```

The experimental difficulty strategies fail fast with `ConfigurationError` when `difficulty_estimator` is omitted. This prevents a missing estimator from silently turning the run into the built-in reject score.

## Experimental API stability contract (RT-12)

The strategy identifiers `"experimental.difficulty_normalized"` and `"experimental.ambiguity_normalized_novelty_penalized"` are experimental and carry **no compatibility guarantee** until explicitly promoted to a stable tier.

- Any rename or removal will emit a `DeprecationWarning` for at least one minor version before taking effect.
- Do not hard-code experimental strategy strings in production callers. Wrap them in a constant or configuration key so they can be updated without searching the codebase.
- Promotion to a public API tier will be recorded in `CHANGELOG.md` with a concrete target milestone.

Until then, treat experimental strategy strings as internal-only identifiers subject to change.

## Known limitations and research directions (RT-13)

**Fairness and subgroup effects.** Difficulty normalization concentrates rejection on instances the difficulty estimator scores as hard. If the estimator correlates with protected attributes (e.g., age, gender, race), it may systematically over-reject instances from certain demographic subgroups. None of the 46 datasets in Scenarios 8–11 were analyzed by subgroup.

Before deploying difficulty-normalized reject in fairness-sensitive contexts, practitioners should:

1. Audit reject rates stratified by relevant demographic attributes.
2. Check for correlation between difficulty scores and protected features.
3. Apply Mondrian conformal prediction or post-hoc calibration if subgroup reject-rate disparities are detected.

This is an open research direction. Conditional validity and Mondrian variants are listed under Open Questions above.

Entry-point tier: Tier 3