Difficulty-Normalized Reject for Conformal Classification

This page is the research-facing summary for CE’s experimental difficulty-normalized reject-option conformal classification work. It documents what has been implemented and evaluated without promoting a new public nonconformity function (NCF) beyond the stable default and ensured choices.

The companion practitioner guide is Reject Policy Guide.

What CE reject-option classification does

For classification calls with reject enabled, CE builds conformal prediction sets from calibrated class probabilities. The prediction-set geometry determines whether the instance can be routed automatically or should be rejected:

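Prediction set, CE route, and interpretation:

  • Singleton set (|S(x)| = 1): accept. Exactly one class remains in the prediction set, so the instance can be routed automatically.

  • Empty set (|S(x)| = 0): novelty reject. No class remains in the prediction set.

  • Multi-label set (|S(x)| >= 2): ambiguity reject. Two or more classes remain in the prediction set.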

These routes are exposed through reject metadata including reject_rate, ambiguity_rate, novelty_rate, prediction_set_size, ambiguity_mask, and novelty_mask. The aggregate relation is reject_rate = ambiguity_rate + novelty_rate.
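The routing rule and the aggregate relation can be sketched in a few lines. This is a minimal illustration, not CE's implementation: the function name route_prediction_sets and the boolean-matrix input are assumptions, and only the dictionary keys mirror the metadata names listed above.

```python
from typing import Dict, List, Sequence


def route_prediction_sets(pred_sets: Sequence[Sequence[bool]]) -> Dict[str, object]:
    """Route instances by prediction-set size (hypothetical helper, not a CE API).

    Each row of pred_sets marks which classes are in that instance's set.
    """
    sizes: List[int] = [sum(bool(member) for member in row) for row in pred_sets]
    novelty_mask = [size == 0 for size in sizes]    # empty set
    ambiguity_mask = [size >= 2 for size in sizes]  # multi-label set
    n = len(sizes)
    novelty_rate = sum(novelty_mask) / n
    ambiguity_rate = sum(ambiguity_mask) / n
    return {
        "prediction_set_size": sizes,
        "novelty_mask": novelty_mask,
        "ambiguity_mask": ambiguity_mask,
        "novelty_rate": novelty_rate,
        "ambiguity_rate": ambiguity_rate,
        # Empty and multi-label sets are disjoint, so the rates add up.
        "reject_rate": ambiguity_rate + novelty_rate,
    }


# Toy two-class example: singleton, empty, multi-label, singleton.
meta = route_prediction_sets(
    [[True, False], [False, False], [True, True], [False, True]]
)
print(meta["reject_rate"], meta["ambiguity_rate"], meta["novelty_rate"])
```

Because an instance cannot be both empty-set and multi-label, the two masks never overlap, which is why the rates sum to the reject rate.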

Existing difficulty-aware Venn-Abers path

CE already has a difficulty-aware classification path before direct reject-score normalization is selected:

  1. difficulty_estimator is accepted by CalibratedExplainer.

  2. The interval plugin context carries the estimator to the built-in legacy interval plugin.

  3. The built-in plugin passes the estimator into VennAbers.

  4. VennAbers applies difficulty through probability scaling before Venn-Abers calibration.

  5. The reject framework then computes default or ensured reject scores from the calibrated probabilities.

This path is difficulty-aware indirectly. It changes the probabilities consumed by reject, but the reject nonconformity formulas themselves remain the public default or ensured formulas.

New direct reject-score normalization

The experimental direct strategy is selected through the reject strategy registry:

strategy="experimental.difficulty_normalized"

It changes the reject score definition by normalizing calibration and test nonconformity scores with per-instance difficulty before conformal p-values and prediction sets are computed. This keeps the CE-first API and RejectPolicy contracts unchanged while making the experimental behavior explicit and opt-in.

The novelty-aware research variant is:

strategy="experimental.ambiguity_normalized_novelty_penalized"

That variant remains diagnostic. It combines ambiguity difficulty normalization with an additional novelty penalty to explore whether empty-set novelty rejects can be separated from multi-label ambiguity rejects more cleanly.

Why normalization must happen before p-values

Difficulty normalization is part of the nonconformity score definition. The conformal p-values and prediction sets are calibrated from the score distribution, so the normalization must be applied to both calibration and test scores before p-value computation.

Applying difficulty only as a post-hoc reject threshold is a different operation. It moves a final decision cutoff but leaves the conformal scores, p-values, and prediction sets unchanged. That may be a useful heuristic, but it is not difficulty-normalized conformal scoring.
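The difference can be made concrete with a toy calculation for a single candidate label. This is a sketch under stated assumptions: this page does not give CE's normalization formula, so dividing each score by its difficulty plus a small eps is an assumed form, and both helper names and all numbers are hypothetical.

```python
def conformal_p_value(cal_scores, test_score):
    """Smoothed conformal p-value: share of calibration scores >= the test score."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (at_least + 1) / (len(cal_scores) + 1)


def normalize(scores, difficulty, eps=1e-6):
    """ASSUMED normalization: divide each score by its difficulty (plus eps)."""
    return [s / (d + eps) for s, d in zip(scores, difficulty)]


# Hypothetical toy values, not taken from any CE evaluation artifact.
cal_scores = [0.1, 0.2, 0.3, 0.4]
cal_difficulty = [0.5, 1.0, 1.0, 2.0]
test_score, test_difficulty = 0.3, 2.0

# Raw pipeline: a post-hoc reject threshold would leave this p-value untouched.
p_raw = conformal_p_value(cal_scores, test_score)

# Normalized pipeline: calibration AND test scores are rescaled before ranking.
cal_norm = normalize(cal_scores, cal_difficulty)
test_norm = normalize([test_score], [test_difficulty])[0]
p_norm = conformal_p_value(cal_norm, test_norm)

print(p_raw, p_norm)
```

In this toy case the p-value for the same test instance moves from 0.6 to 1.0 once both sides are normalized, so prediction-set membership itself can change. A post-hoc threshold would only move the final accept/reject cutoff around the unchanged value of 0.6.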

When to use each mode

Use ncf="default" when you need the stable public baseline and want to preserve current CE reject behavior.

Use ncf="ensured" when interval-width-aware scoring is desired and the blend weight w is part of the experiment or operating-point tuning. This is still a stable public NCF mode.

Use strategy="experimental.difficulty_normalized" for research and ablation runs where difficulty should directly shape reject scoring. Keep VA difficulty off for the cleanest primary contrast unless you are explicitly studying double-counting.

Use strategy="experimental.ambiguity_normalized_novelty_penalized" only for diagnostics of ambiguity-vs-novelty separation. Current Scenario 10 evidence did not justify promoting it over the simpler difficulty-normalized strategy.

Validity and methodology caveats

  • Fit and freeze the difficulty estimator before reject calibration alphas are estimated.

  • Avoid fitting the estimator on calibration labels or calibration residuals unless explicit cross-fitting is used.

  • Treat empirical selectivity, accepted accuracy, and difficulty alignment as empirical usefulness metrics, not new formal coverage guarantees.

  • Coverage claims remain tied to the conformal score pipeline and its exchangeability assumptions. Difficulty normalization changes that pipeline.

  • Experimental strategies are not public NCF contract expansion; default and ensured remain the public-facing NCF modes.
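The first two caveats amount to a data-splitting discipline, sketched below. Everything here is illustrative: three_way_split and fit_knn_difficulty are hypothetical helpers, and the toy distance-based estimator stands in for whatever difficulty estimator a real run would use.

```python
import random


def three_way_split(n, n_train, n_cal, seed=0):
    """Shuffle indices once and cut them into disjoint train/cal/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:n_train + n_cal], idx[n_train + n_cal:]


def fit_knn_difficulty(x_train, k=3):
    """Toy 1-D difficulty: mean distance to the k nearest training points."""
    frozen = tuple(x_train)  # snapshot, so later data cannot change the estimator

    def difficulty(x):
        nearest = sorted(abs(x - t) for t in frozen)[:k]
        return sum(nearest) / len(nearest)

    return difficulty


xs = [i / 10 for i in range(50)]  # hypothetical 1-D features
train_idx, cal_idx, test_idx = three_way_split(len(xs), n_train=30, n_cal=10)

# Fit on the training part only; calibration labels and residuals are never seen.
difficulty = fit_knn_difficulty([xs[i] for i in train_idx])

# The frozen estimator is then applied, unchanged, to calibration and test data.
cal_difficulty = [difficulty(xs[i]) for i in cal_idx]
test_difficulty = [difficulty(xs[i]) for i in test_idx]
```

Keeping the estimator's fitting data disjoint from the calibration set is what lets the calibration and test difficulty values be treated alike by the conformal step.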

Evaluation summary: Scenarios 8-11

The evaluation artifacts live under evaluation/reject/artifacts/.

Scenario 8: existing indirect difficulty effect

Scenario 8 measured the existing path difficulty_estimator -> VennAbers probability scaling -> reject NCF.

For default, enabling VA difficulty changed accept rate by -22.2 percentage points, rejected-error capture by +11.5 percentage points, and accepted accuracy by -10.9 percentage points. For ensured, accept rate changed by -9.7 percentage points, rejected-error capture by +4.9 percentage points, and accepted accuracy by -12.2 percentage points.

Interpretation: the indirect VA difficulty path acted mainly as a stricter reject gate in this run. It captured more errors but accepted far fewer instances and reduced accepted accuracy.

Scenario 9: direct difficulty-normalized reject scores

Scenario 9 compared six arms. The primary research contrast was A vs C: builtin.default, ncf=default, no VA difficulty, against experimental.difficulty_normalized, ncf=default, no VA difficulty.

Direct normalization changed reject rate by +1.08 percentage points, the rejected-minus-accepted difficulty gap by +0.3416, and difficulty_reject_auc by +0.2012. Matched reject-rate bins showed a mean accepted-accuracy delta of -0.0089 for C minus A, so the evidence favors difficulty-aligned routing rather than a blanket accepted-accuracy improvement.

Arms with both VA difficulty and direct score normalization were diagnostic for double-counting. The artifact recommends arm C as the next experimental baseline because it gives the cleanest direct-normalization contrast without VA double-count risk.
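The two difficulty-alignment metrics reported for Scenario 9 can be read as follows. These definitions are assumptions for illustration, since this page does not spell them out: the gap is taken as mean difficulty of rejected instances minus mean difficulty of accepted instances, and difficulty_reject_auc as the probability that a randomly chosen rejected instance has higher difficulty than a randomly chosen accepted one (ties count one half). The evaluation artifacts may compute them differently.

```python
def difficulty_gap(difficulty, rejected):
    """ASSUMED definition: mean difficulty of rejected minus accepted instances."""
    rej = [d for d, r in zip(difficulty, rejected) if r]
    acc = [d for d, r in zip(difficulty, rejected) if not r]
    return sum(rej) / len(rej) - sum(acc) / len(acc)


def difficulty_reject_auc(difficulty, rejected):
    """ASSUMED definition: pairwise AUC of difficulty as a predictor of rejection."""
    rej = [d for d, r in zip(difficulty, rejected) if r]
    acc = [d for d, r in zip(difficulty, rejected) if not r]
    wins = 0.0
    for d_rej in rej:
        for d_acc in acc:
            if d_rej > d_acc:
                wins += 1.0
            elif d_rej == d_acc:
                wins += 0.5  # ties count as half a win
    return wins / (len(rej) * len(acc))


# Hypothetical toy values, not taken from the Scenario 9 artifact.
difficulty = [0.1, 0.4, 0.35, 0.8]
rejected = [False, False, True, True]

gap = difficulty_gap(difficulty, rejected)
auc = difficulty_reject_auc(difficulty, rejected)
print(gap, auc)
```

Under these readings, an AUC of 0.5 would mean rejection is unrelated to difficulty, and values closer to 1 would mean rejects concentrate on the instances the estimator scores as hard.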

Scenario 10: ambiguity-normalized novelty-penalized variant

Scenario 10 compared the built-in baseline, arm C, and the novelty-aware arm G. Relative to C, G changed novelty rate by +0.0019, ambiguity rate by +0.0047, accepted accuracy by +0.0037, and novelty reject AUC by -0.0371.

Interpretation: the novelty-aware variant did not clearly improve ambiguity-vs-novelty separation. Arm C remains the simpler recommended experimental baseline.

Scenario 11: matched operating-point selection

Scenario 11 selected confidence values closest to target reject rates 0.10, 0.20, 0.30, and 0.40 instead of averaging across the confidence grid. This is the decision-gate scenario for public API promotion.
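The selection rule can be sketched as a nearest-match lookup over the confidence grid. The function name, the dictionary input, and the grid values below are hypothetical; only the four target reject rates come from the scenario description.

```python
def select_operating_points(reject_rate_by_confidence, targets=(0.10, 0.20, 0.30, 0.40)):
    """For each target reject rate, pick the confidence whose observed rate is closest."""
    selected = {}
    for target in targets:
        selected[target] = min(
            reject_rate_by_confidence,
            key=lambda conf: abs(reject_rate_by_confidence[conf] - target),
        )
    return selected


# Hypothetical grid: confidence level -> observed reject rate for one arm.
grid = {0.80: 0.08, 0.85: 0.13, 0.90: 0.21, 0.95: 0.33, 0.99: 0.46}

selected = select_operating_points(grid)
for target, conf in selected.items():
    print(f"target reject rate {target:.2f} -> confidence {conf:.2f}")
```

Running the same lookup separately for each arm yields operating points with comparable reject rates, so accepted-accuracy deltas are compared at matched selectivity instead of being averaged over the whole grid.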

For A vs C, accepted-accuracy deltas by target reject rate were +0.0012, -0.0029, -0.0070, and -0.0089. The best matched operating point was target 0.10, but the overall evidence was mixed. Across targets, C minus A mean difficulty_reject_auc was -0.0040.

For C vs G, the novelty-aware variant increased novelty and empty-set rates by +0.0084 on average, changed accepted accuracy by -0.0005, increased novelty reject AUC by +0.0845, and changed ambiguity rate by -0.0114.

Interpretation: Scenario 11 does not justify public API promotion. Direct difficulty normalization and the novelty-aware strategy should both remain experimental.

Contribution framing

Development contribution: CE now has difficulty-aware reject routing that preserves the CE-first lifecycle, RejectPolicy output contracts, and the existing plugin architecture.

Research contribution: CE now has an experimental difficulty-normalized nonconformity strategy for reject-option conformal classification, evaluated against the built-in difficulty-aware VA path and a novelty-aware diagnostic variant.

Open questions

  • Ambiguity vs novelty separation: can empty-set novelty and multi-label ambiguity be separated without harming accepted decision quality?

  • Double-counting difficulty: when, if ever, should VA difficulty scaling and direct reject-score normalization be combined?

  • Conditional validity and Mondrian variants: can subgroup-aware calibration improve reliability without sacrificing useful reject selectivity?

  • Finite-sample behavior: how stable are the observed effects across small calibration sets, high confidence regimes, and heterogeneous datasets?

Minimal CE-first experiment snippet

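from calibrated_explanations import RejectPolicySpec, WrapCalibratedExplainer

# model, x_train, y_train, x_cal, y_cal, x_test, feature_names, and
# difficulty_estimator are assumed to be defined by the caller.
wrapper = WrapCalibratedExplainer(model)
wrapper.fit(x_train, y_train)

# Pass an estimator that was fitted and frozen before calibration.
wrapper.calibrate(
    x_cal,
    y_cal,
    feature_names=feature_names,
    difficulty_estimator=difficulty_estimator,
)

# Opt in to the experimental strategy; the NCF stays on the public default.
result = wrapper.predict(
    x_test,
    reject_policy=RejectPolicySpec.flag(ncf="default"),
    confidence=0.95,
    strategy="experimental.difficulty_normalized",
)

# Aggregate relation: reject_rate = ambiguity_rate + novelty_rate.
print(result.metadata["reject_rate"])
print(result.metadata["ambiguity_rate"], result.metadata["novelty_rate"])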

The experimental difficulty strategies fail fast with ConfigurationError when difficulty_estimator is omitted. This prevents a missing estimator from silently turning the run into the built-in reject score.

Experimental API stability contract (RT-12)

The strategy identifiers "experimental.difficulty_normalized" and "experimental.ambiguity_normalized_novelty_penalized" are experimental and carry no compatibility guarantee until explicitly promoted to a stable tier.

  • Any rename or removal will emit a DeprecationWarning for at least one minor version before taking effect.

  • Do not hard-code experimental strategy strings in production callers. Wrap them in a constant or configuration key so they can be updated without searching the codebase.

  • Promotion to a public API tier will be recorded in CHANGELOG.md with a concrete target milestone. Until then, treat experimental strategy strings as internal-only identifiers subject to change.
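One way to follow the advice about not hard-coding strategy strings is a small constants module with a configuration override. The two identifiers are quoted from this page; the constant names, the environment variable CE_REJECT_STRATEGY, and the resolver function are hypothetical and not part of CE.

```python
import os

# Strategy identifiers as documented on this page (experimental, may change).
DIFFICULTY_NORMALIZED = "experimental.difficulty_normalized"
AMBIGUITY_NOVELTY_PENALIZED = "experimental.ambiguity_normalized_novelty_penalized"

# Hypothetical configuration key; any project-specific name would do.
STRATEGY_ENV_VAR = "CE_REJECT_STRATEGY"


def resolve_reject_strategy(env=None):
    """Return the configured strategy, falling back to the constant default."""
    env = os.environ if env is None else env
    return env.get(STRATEGY_ENV_VAR, DIFFICULTY_NORMALIZED)


# Callers reference the resolver, never the raw string.
print(resolve_reject_strategy({}))
print(resolve_reject_strategy({STRATEGY_ENV_VAR: AMBIGUITY_NOVELTY_PENALIZED}))
```

If an identifier is renamed, only the constant (or the deployed configuration value) needs to change.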

Known limitations and research directions (RT-13)

Fairness and subgroup effects. Difficulty normalization concentrates rejection on instances the difficulty estimator scores as hard. If the estimator correlates with protected attributes (e.g., age, gender, race), it may systematically over-reject instances from certain demographic subgroups. None of the 46 datasets in Scenarios 8–11 were analyzed by subgroup.

Before deploying difficulty-normalized reject in fairness-sensitive contexts, practitioners should:

  1. Audit reject rates stratified by relevant demographic attributes.

  2. Check for correlation between difficulty scores and protected features.

  3. Apply Mondrian conformal prediction or post-hoc calibration if subgroup reject-rate disparities are detected.
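The first audit step can be sketched without any CE-specific API, assuming a per-instance boolean reject indicator and a group label for each instance are available. The helper names and the toy data are hypothetical.

```python
from collections import defaultdict


def reject_rate_by_group(rejected, groups):
    """Compute the reject rate within each subgroup."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_rejected, n_total]
    for is_rejected, group in zip(rejected, groups):
        counts[group][0] += int(is_rejected)
        counts[group][1] += 1
    return {group: n_rej / n_total for group, (n_rej, n_total) in counts.items()}


def max_disparity(rates):
    """Largest difference in reject rate between any two subgroups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: reject flags and a two-valued subgroup attribute.
rejected = [True, False, False, False, True, True, False, True]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = reject_rate_by_group(rejected, groups)
print(rates, max_disparity(rates))
```

What counts as an unacceptable disparity is a domain decision; the sketch only surfaces the per-group rates so that decision can be made explicitly.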

This is an open research direction. Conditional validity and Mondrian variants are listed under Open Questions above.

Entry-point tier: Tier 3