Harness evolution

The methodology shipped in v0.6 is the result of patching the harness ten times in response to ten different classes of failure the original Spike 9 design did not catch. Each entry below names the learning, summarises the gap, and points at the phase that introduced the fix.

L24 — Causal standardisation is mandatory (Phase 5 / S21)

When raw features sit at magnitudes well above the planted cheat (fundamentals at 5-30 versus returns at about 0.01), the cheat column is dwarfed and look_ahead_cheat falsely PASSES. Spike 21 added per-feature causal standardisation using statistics from [:n_train] only; the cheat run went from sharpe=1.0 to sharpe=8.2. Causal standardisation is mandatory before merging any strategy.
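The idea can be sketched in plain Python for a single feature column stored as a list. The function name, the `eps` floor, and the sample values are illustrative assumptions, not the Spike 21 code; the only point is that the mean and standard deviation come from the first `n_train` rows, so holdout rows never influence the scaling.

```python
from statistics import fmean, pstdev

def causal_standardise(column, n_train, eps=1e-3):
    """Z-score a column using statistics from column[:n_train] only."""
    train = column[:n_train]
    mu = fmean(train)
    sigma = max(pstdev(train), eps)  # assumed floor to avoid dividing by ~0
    return [(x - mu) / sigma for x in column]

# Hypothetical fundamentals-like feature; the last two rows are holdout.
fundamentals = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 12.0, 28.0]
z = causal_standardise(fundamentals, n_train=6)
```

Changing the holdout rows leaves the standardised training rows untouched, which is what makes the transform causal.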

L30 — Cheat columns must travel through the same transform (Phase 6.A)

After L24, the opposite failure appeared: cheat columns left at raw scale were dwarfed by z-scored honest features, so the harness under-lifted cheats and silently produced false negatives on leaks. Phase 6.A made the cheat tests accept an optional transform_fn so that cheat columns are standardised on the same footing as honest features. _causal_standardise was moved to lib/preprocessing.py and now returns the transform.
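A minimal sketch of that contract follows, assuming the cheat test appends the planted column to the raw features and then applies `transform_fn` to the augmented matrix. `build_cheat_inputs`, `causal_standardise_rows`, and the `(rows, n_train)` calling convention are hypothetical; the real signature in lib/preprocessing.py may differ.

```python
from statistics import fmean, pstdev

def causal_standardise_rows(rows, n_train, eps=1e-3):
    """Z-score every column using statistics from rows[:n_train] only."""
    stats = []
    for col in zip(*rows):
        train = col[:n_train]
        stats.append((fmean(train), max(pstdev(train), eps)))
    return [[(x - mu) / sd for x, (mu, sd) in zip(row, stats)] for row in rows]

def build_cheat_inputs(features, next_returns, n_train, transform_fn=None):
    """Append the planted cheat column, then optionally standardise all columns."""
    augmented = [list(row) + [r] for row, r in zip(features, next_returns)]
    if transform_fn is None:
        return augmented  # cheat column stays at raw scale
    return transform_fn(augmented, n_train)

# Hypothetical data: one honest feature at 5-30 scale, cheat at ~0.01 scale.
features = [[5.0], [12.0], [30.0], [18.0], [9.0], [22.0]]
next_returns = [0.01, -0.02, 0.015, -0.005, 0.02, -0.01]

raw = build_cheat_inputs(features, next_returns, n_train=4)
scaled = build_cheat_inputs(features, next_returns, n_train=4,
                            transform_fn=causal_standardise_rows)
```

In `scaled`, the honest column and the cheat column both have unit standard deviation over the training rows, so neither dwarfs the other.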

L31 — Mean-PnL ratio co-condition (Phase 6.A)

Sharpe alone could not distinguish "real lift from cheat" from "degenerate-to-degenerate". Phase 6.A added |mean_pnl_cheat| / max(|mean_pnl_honest|, 1e-8) >= 10 as a co-condition. It catches the negative_sharpe scale-invariance pathology where positions collapse to near-zero.
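The ratio check is small enough to write out. The sketch below implements only the formula quoted above; the function name and the sample PnL series are hypothetical, and in the harness the result would be combined with the existing Sharpe condition rather than used alone.

```python
from statistics import fmean

def mean_pnl_ratio_ok(pnl_cheat, pnl_honest, min_ratio=10.0, floor=1e-8):
    """Return (passes, ratio) for the mean-PnL co-condition."""
    ratio = abs(fmean(pnl_cheat)) / max(abs(fmean(pnl_honest)), floor)
    return ratio >= min_ratio, ratio

# Hypothetical case 1: the cheat lifts mean PnL about 20x, so the check passes.
ok_real, ratio_real = mean_pnl_ratio_ok(
    pnl_cheat=[0.003, 0.005, 0.004, 0.004],
    pnl_honest=[0.0001, 0.0003, 0.0002, 0.0002],
)

# Hypothetical case 2: both runs have near-zero PnL; the lift is only about 3x.
ok_degenerate, ratio_degenerate = mean_pnl_ratio_ok(
    pnl_cheat=[2e-6, 4e-6, 3e-6, 3e-6],
    pnl_honest=[1e-6, -3e-6, 5e-6, 1e-6],
)
```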

L32 — Position-aware losses (Phase 6.E)

loss_fn(pnl) is structurally pnl-only — no loss can regulate position magnitude when it doesn't see positions. walk_forward now uses signature inspection (_call_loss) to pass pos to losses that declare it. Position-floor regularisation belongs in the harness contract, not in any single loss function. This was the structural blocker visible across 160 rejected nodes.
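Signature-based dispatch of this kind can be sketched with the standard `inspect` module. This is a guess at the shape of `_call_loss`, not the harness code; the two loss functions, the `floor=0.05` value, and the penalty form are illustrative assumptions.

```python
import inspect

def _call_loss(loss_fn, pnl, pos):
    """Pass `pos` only to losses whose signature declares a `pos` parameter."""
    if "pos" in inspect.signature(loss_fn).parameters:
        return loss_fn(pnl, pos=pos)
    return loss_fn(pnl)

def pnl_only_loss(pnl):
    # Legacy-style loss: cannot see positions at all.
    return -sum(pnl) / len(pnl)

def position_aware_loss(pnl, pos, alpha=0.5, floor=0.05):
    # Illustrative penalty: grows as mean |position| falls below `floor`.
    mean_abs_pos = sum(abs(p) for p in pos) / len(pos)
    return -sum(pnl) / len(pnl) + alpha * max(0.0, floor - mean_abs_pos)

pnl = [0.01, -0.02, 0.03]
flat = [0.0, 0.0, 0.0]  # fully degenerate positions
legacy = _call_loss(pnl_only_loss, pnl, flat)
aware = _call_loss(position_aware_loss, pnl, flat)
```

Both losses go through the same call site, so existing pnl-only losses keep working while position-aware ones receive the extra argument.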

L37 — Position-floor strength softened (Phase 6.E retry)

sharpe_with_position_floor(α=5.0) was too aggressive: it forced position sizes beyond what the feature signal supported and produced an artifact Sharpe of +3.897. Softened to α=0.5 (10× gentler). The penalty when positions degenerate is now ~0.025, comparable to small Sharpe magnitudes, which makes it a meaningful tiebreaker without overwhelming the Sharpe objective.

L44 — Per-stock rolling z-score (Phase 8.A)

Addresses cross-stock memorisation (L43) at Nifty 50 scale. per_stock_rolling_zscore(features, window=60) in lib/preprocessing.py removes the per-(stock, feature) baseline with a causal 60-day rolling z-score. It cuts shuffled_target Sharpe by ~50% (5.0 → 2.5) and triples mean_pnl_mean (0.0008 → 0.0025). The direction is validated but the magnitude is insufficient on its own, so pair it with L45.
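A plain-Python sketch of a causal rolling z-score is shown below. It assumes features are stored as `{ticker: {feature_name: [values over time]}}`, that the window includes the current day, and that sigma is floored at `eps`; all three are assumptions, and the function is deliberately named differently from the real `per_stock_rolling_zscore`.

```python
from statistics import fmean, pstdev

def rolling_zscore(series, window=60, eps=1e-3):
    """Z-score each point against its trailing window (today and earlier only)."""
    out = []
    for t, x in enumerate(series):
        hist = series[max(0, t - window + 1): t + 1]  # never looks ahead
        sd = max(pstdev(hist), eps)
        out.append((x - fmean(hist)) / sd)
    return out

def per_stock_rolling_zscore_sketch(features, window=60):
    """Apply the rolling z-score independently per (stock, feature)."""
    return {
        ticker: {name: rolling_zscore(vals, window) for name, vals in cols.items()}
        for ticker, cols in features.items()
    }

# Hypothetical tickers with the same time pattern on very different baselines.
pattern = [0.0, 1.0, 0.5, 2.0, 1.5, 3.0]
features = {
    "AAA": {"f": [100.0 + p for p in pattern]},
    "BBB": {"f": [5.0 + p for p in pattern]},
}
z = per_stock_rolling_zscore_sketch(features, window=3)
```

After the transform the two tickers are indistinguishable by level, so a model can no longer use the baseline as a ticker identifier.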

L45 — `static_features_test` (Phase 8.A)

A fifth leakage test that replaces every feature column with its per-stock time-mean. A time-edge model produces near-zero positions on constant inputs; a ticker-memoriser keeps its baseline allocations. This directly catches the ticker-identity memorisation that shuffled_target only hints at indirectly.
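The substitution at the heart of the test can be sketched as follows, assuming features are stored as `{ticker: [row per day]}`. The two toy "models" are hypothetical stand-ins used only to show the contrast; they are not StrategyNet.

```python
from statistics import fmean

def to_static_features(features):
    """Replace every feature column with its per-stock time-mean."""
    static = {}
    for ticker, rows in features.items():
        col_means = [fmean(col) for col in zip(*rows)]
        static[ticker] = [list(col_means) for _ in rows]
    return static

def time_edge_model(rows):
    # Toy model that trades on day-over-day changes in feature 0.
    return [0.0] + [rows[t][0] - rows[t - 1][0] for t in range(1, len(rows))]

def ticker_memoriser(rows):
    # Toy model that keys off the feature level, which identifies the ticker.
    return [1.0 if row[0] > 50.0 else -1.0 for row in rows]

features = {
    "AAA": [[100.0], [101.0], [99.0]],
    "BBB": [[5.0], [4.0], [6.0]],
}
static = to_static_features(features)
edge_positions = time_edge_model(static["AAA"])
memo_positions = ticker_memoriser(static["AAA"])
```

On the static inputs the time-edge toy goes flat, while the memoriser keeps its per-ticker allocation, which is the signature the test looks for.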

L47 — Per-stock z-score bug + raw-features fix (Phase 9.A)

_per_stock_zscore used sigma.clamp(min=1e-8), which blew up for warm-up-period features: vol_60d ≈ 0 for the first 60 days gives a tiny std and z-scores of 10^7 or more. LayerNorm inside StrategyNet absorbed the scaling in the honest run (so it looked fine at -0.4 Sharpe) but drowned the cheat column (~1.0 scale) inside look_ahead_cheat_test. In addition, _run_per_stock passed already-z-scored features to the leakage tests, causing transform_fn to corrupt the head columns. Both are fixed: the clamp is now clamp(min=1e-3), and the cheat tests receive RAW features. Post-fix, look_ahead_cheat = +30 (was -2.28), and the harness is now trustworthy for per-stock evaluation.
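The effect of the clamp floor is easy to see on a single number. The sketch below uses plain Python `max` in place of the tensor `clamp`, and the warm-up values are hypothetical, chosen only to show the order of magnitude.

```python
def zscore(x, mu, sigma, sigma_floor):
    """Z-score with a floor on sigma, mirroring sigma.clamp(min=sigma_floor)."""
    return (x - mu) / max(sigma, sigma_floor)

# Hypothetical warm-up row: the rolling std is effectively zero.
x, mu, sigma = 0.1, 0.0, 1e-10

z_old = zscore(x, mu, sigma, sigma_floor=1e-8)  # on the order of 1e7
z_new = zscore(x, mu, sigma, sigma_floor=1e-3)  # on the order of 1e2
```

With the larger floor, warm-up rows stay within a few orders of magnitude of a ~1.0-scale cheat column instead of exceeding it by seven.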

L53 — Empirical-bound recalibration on `shuffled_target` (Phase 12.A)

The 2 * _sharpe_se(n_holdout, n_stocks) bound is too tight at small n. Empirical seed std on noise data is 6× the theoretical SE at n_holdout=21 × 46 stocks — 2·SE = 1.0 vs empirical 95% interval [-6.3, +5.0] (std=3.15). Only 20% of seeds pass the theoretical bound on pure shuffled noise. Bootstrap-calibrate the bound per regime; at n_holdout=180 the closed-form bound starts holding.
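One way to bootstrap-calibrate the bound is to collect the Sharpe from many seeded runs on shuffled targets and read the interval off the empirical quantiles. In the sketch below the null Sharpes are synthetic Gaussian draws with std 3.15, standing in for real per-seed results; `empirical_bound` is a hypothetical helper, not part of the harness.

```python
import random
from statistics import quantiles

def empirical_bound(null_sharpes):
    """Return the empirical 2.5% and 97.5% quantiles of the null Sharpes."""
    cuts = quantiles(null_sharpes, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]

# Synthetic stand-in for per-seed shuffled_target Sharpes (assumption).
rng = random.Random(0)
null_sharpes = [rng.gauss(0.0, 3.15) for _ in range(1000)]

lo, hi = empirical_bound(null_sharpes)

# Share of null seeds that would pass a closed-form bound of +/- 1.0.
theory_half_width = 1.0
frac_inside_theory = sum(
    abs(s) <= theory_half_width for s in null_sharpes
) / len(null_sharpes)
```

A candidate is then compared against `(lo, hi)` for its own regime; the small `frac_inside_theory` shows why the closed-form bound rejects most pure-noise seeds when the true spread is this wide.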

L54 — Window-position artifact, the dropout=0.5 candidate (Phase 12.A)

The dropout=0.5 + position-floor candidate that appeared to pass 5/6 leakage tests at n=21 (Sharpe ~5) is a window-position artifact, not real edge. window_robustness_test shows sharpe_short(n=21) = +2.44 collapsing to sharpe_long(n=180) = -0.14. Same family as L33/L34/S26 pre-Phase-6. Per-ticker position analysis confirms: positions correlate negatively with same-day returns (-0.26) but near-zero with next-day returns (+0.03) — the model is doing in-sample mean-reversion that doesn't generalise. With the corrected window-stability gate as a 7th leakage test in series, the candidate fails honestly.
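The per-ticker position analysis reduces to two correlations, sketched below. It assumes `positions[t]` is the position held on day `t`, so predictive skill would show up as correlation with `returns[t + 1]`; the helper names and the toy series are hypothetical.

```python
from statistics import fmean

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def position_return_alignment(positions, returns):
    """Correlate positions with same-day returns and with next-day returns."""
    same_day = pearson(positions, returns)
    next_day = pearson(positions[:-1], returns[1:])
    return same_day, next_day

returns = [0.01, -0.02, 0.015, -0.005, 0.02, -0.01]

# Toy model that fades today's move: strong same-day anti-correlation.
fade_today = [-r for r in returns]
# Toy model with real foresight: positions equal tomorrow's return.
foresight = returns[1:] + [0.0]

fade_same, fade_next = position_return_alignment(fade_today, returns)
fore_same, fore_next = position_return_alignment(foresight, returns)
```

A candidate whose same-day correlation is clearly non-zero while its next-day correlation sits near zero is reacting to returns it has already seen rather than predicting the next ones.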

The pattern: each layer of leakage testing reveals the next escape route the optimiser takes. That whack-a-mole behaviour is itself the L42 finding — expected when the universe lacks genuine cross-sectional time-edge.
