Phase 12 — window-stability verdict

Phase 12 ran the empirical-bound calibration on shuffled_target and applied the window-stability gate to the L50 dropout candidate from Phase 11.E. Both findings were structural enough to ship as v0.6.

12.A diagnostic — L53 empirical bound

The shuffled_target bound 2 * _sharpe_se(n_holdout, n_stocks) is too tight at small n. On noise data, the empirical standard deviation across seeds is about 6× the theoretical SE: at n_holdout=21 × 46 stocks, the theoretical bound is 2·SE = 1.0 (so SE = 0.5), but the empirical 95% interval across 10 seeds is [-6.3, +5.0] (std = 3.15, i.e. 3.15 / 0.5 ≈ 6.3×). Only 20% of seeds pass the theoretical bound on pure shuffled noise.

L26 confirmed at scale. Fix direction: bootstrap-calibrate the bound per (n_holdout, n_stocks, hp) regime instead of using the closed-form SE. At n_holdout=180 the bound starts holding (shuffled_target = +0.07 on the dropout=0.5 candidate).
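
The bootstrap-calibration idea can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `calibrate_noise_bound`, `within_bound`, and `toy_noise_run` are hypothetical names, and `toy_noise_run` is only a stand-in for running the real pipeline on shuffled targets in one (n_holdout, n_stocks, hp) regime.

```python
import numpy as np


def calibrate_noise_bound(run_on_noise, n_seeds=200, coverage=0.95):
    """Empirical (lo, hi) Sharpe interval from repeated pure-noise runs.

    run_on_noise(seed) -> float is a hypothetical callable that runs the
    pipeline on shuffled targets for one regime and returns the holdout Sharpe.
    """
    sharpes = np.array([run_on_noise(seed) for seed in range(n_seeds)])
    tail = (1.0 - coverage) / 2.0
    lo, hi = np.quantile(sharpes, [tail, 1.0 - tail])
    return float(lo), float(hi)


def within_bound(sharpe, bound):
    """True if a shuffled-target Sharpe lies inside the calibrated interval."""
    lo, hi = bound
    return lo <= sharpe <= hi


def toy_noise_run(seed, n_holdout=21, n_stocks=46):
    """Stand-in for the real pipeline: random positions on random returns."""
    rng = np.random.default_rng(seed)
    positions = rng.standard_normal((n_holdout, n_stocks))
    returns = 0.01 * rng.standard_normal((n_holdout, n_stocks))
    daily_pnl = (positions * returns).mean(axis=1)
    return float(daily_pnl.mean() / daily_pnl.std(ddof=1) * np.sqrt(252))


# One calibrated bound per regime, keyed here by (n_holdout, n_stocks).
bounds = {(21, 46): calibrate_noise_bound(toy_noise_run)}
print(bounds)
```

The design choice is to take the interval from the observed seed-to-seed spread rather than from a closed-form SE, so the bound widens automatically in regimes where training noise dominates.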

12.A window check — L54 the dropout candidate is an artifact

The dropout=0.5 + position-floor candidate that appeared to pass 5/6 leakage tests at n=21 (Sharpe ~5) is a window-position artifact, NOT real edge.

window_robustness_test shows:

| Holdout | Sharpe |
| --- | --- |
| `sharpe_short(n=21)` | +2.44 |
| `sharpe_long(n=180)` | -0.14 |

Per-ticker position analysis confirms this: positions correlate negatively with same-day returns (-0.26) but are near-zero against next-day returns (+0.03). The model is fitting mean-reversion within the short window, and that pattern does not generalise.
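
A same-day versus next-day check of this kind could look like the sketch below. It assumes positions and returns are arrays of shape (n_days, n_tickers), computes a Pearson correlation per ticker, and averages; `per_ticker_corr` is a hypothetical helper and the synthetic data is for illustration only, not the project's numbers.

```python
import numpy as np


def per_ticker_corr(positions, returns, lag=0):
    """Mean over tickers of corr(position at day t, return at day t + lag)."""
    positions = np.asarray(positions, dtype=float)
    returns = np.asarray(returns, dtype=float)
    if lag > 0:
        # Align day-t positions with day-(t + lag) returns.
        positions = positions[:-lag]
        returns = returns[lag:]
    corrs = [
        np.corrcoef(positions[:, j], returns[:, j])[0, 1]
        for j in range(positions.shape[1])
    ]
    return float(np.mean(corrs))


# Synthetic example: positions that lean against same-day returns.
rng = np.random.default_rng(0)
rets = rng.standard_normal((500, 5))
pos = -rets + 0.5 * rng.standard_normal((500, 5))

print("same-day:", per_ticker_corr(pos, rets, lag=0))
print("next-day:", per_ticker_corr(pos, rets, lag=1))
```

A strongly negative same-day value combined with a next-day value near zero is the signature described above: the positions track returns that have already happened and carry no information about the following day.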

With the corrected window-stability gate as a 7th leakage test in series, the candidate fails honestly. Same family as L33 / L34 / S26 pre-Phase-6.
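
As a rough illustration of what such a gate might check, the sketch below compares the short-window and long-window Sharpe. The function name and both thresholds (`min_long_sharpe`, `max_gap`) are assumptions made for this example; the source does not state the actual pass criteria.

```python
def window_stability_gate(sharpe_short, sharpe_long,
                          min_long_sharpe=0.0, max_gap=1.0):
    """Pass only if the edge survives on the longer holdout window.

    Thresholds are hypothetical: the long-window Sharpe must exceed
    min_long_sharpe, and the short-window Sharpe may not exceed the
    long-window one by more than max_gap.
    """
    survives_long = sharpe_long > min_long_sharpe
    gap_ok = (sharpe_short - sharpe_long) <= max_gap
    return survives_long and gap_ok


# Values from the window_robustness_test table for the dropout candidate.
print(window_stability_gate(sharpe_short=2.44, sharpe_long=-0.14))  # False
```

Under these assumed thresholds the dropout candidate fails on both counts: the long-window Sharpe is negative and the gap between windows is about 2.6.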

v0.6 release-ready

After Phase 12, the methodology is the deliverable. Best honest results:

nifty_50_linear_momentum + position-floor: Sharpe -0.35, mean_pnl -1.4e-3, static_features PASS (+0.35), all classic leakage tests PASS. This is the cleanest leakage-clean result with a small, honest Sharpe that the project has produced; a linear model can't construct per-stock biases.
nifty_50_per_stock_momentum + negative_sortino (post-L47 fix) — Sharpe -0.51, leakage-clean, non-degenerate.

The best apparent-but-artifact results, all flagged by either the window-stability gate or the shuffled_target empirical-bound recalibration, are catalogued on the "What didn't work" page.

L23 vindicated: the realistic post-cost Sharpe ceiling of 0.4-0.7 held up. Every result above that band has now been traced to a harness gap or a scale artifact. The harness gaps got fixed (L24, L27, L30, L31, L32, L37, L44, L45, L47, L53). The scale artifacts get caught (L33, L34, L41, L42, L43, L54).
