What didn't work

A catalogue of the strategies that looked like edge and turned out to be artifacts. Each entry below is one of the harness "tells" — a class of failure the methodology learned to detect.

Scale-invariance pathology (S20, S21, S26 pre-Phase-6)

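The strategy reports a non-zero Sharpe, but actual P&L is zero to seven decimal places because positions have collapsed to near-zero. The negative_sharpe loss is scale-invariant: shrinking every position by the same factor scales the mean and the standard deviation of P&L equally, so the ratio stays numerically defined but is economically meaningless.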

S21 quality-tilted low-volatility: +0.76 ± 4.96 Sharpe across 10 seeds, but mean_pnl ≈ 0 to 7 decimals.
negative_sharpe family: ~160 rejected orchestrator nodes all decomposed to mean_pnl = 0 with sharpe_std >> sharpe_mean.

Caught by: the L31 mean-PnL ratio co-condition (|mean_pnl_cheat| / max(|mean_pnl_honest|, 1e-8) >= 10). Fixed structurally by: position-aware losses (L32, Phase 6.E).
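
A minimal sketch of why the Sharpe survives a position collapse, and of the ratio from the co-condition above. The function names, the sample returns, and the 1e-9 position scale are illustrative assumptions, not the harness's actual code; only the ratio formula, the 1e-8 floor, and the threshold of 10 come from the text.

```python
import statistics

def sharpe(pnl):
    """Per-period Sharpe: mean P&L over its sample standard deviation."""
    sd = statistics.stdev(pnl)
    return statistics.mean(pnl) / sd if sd > 0 else 0.0

def mean_pnl_ratio(mean_pnl_cheat, mean_pnl_honest, floor=1e-8):
    """|mean_pnl_cheat| / max(|mean_pnl_honest|, floor)."""
    return abs(mean_pnl_cheat) / max(abs(mean_pnl_honest), floor)

def ratio_co_condition(mean_pnl_cheat, mean_pnl_honest, threshold=10.0):
    """True when the cheat's mean P&L is at least `threshold` times the honest one."""
    return mean_pnl_ratio(mean_pnl_cheat, mean_pnl_honest) >= threshold

# Hypothetical per-period asset returns, held at full size vs. a collapsed size.
asset_returns = [0.012, -0.008, 0.015, -0.004, 0.009, -0.011, 0.007]
full_pnl = [1.0 * r for r in asset_returns]
collapsed_pnl = [1e-9 * r for r in asset_returns]

sharpe_full = sharpe(full_pnl)
sharpe_collapsed = sharpe(collapsed_pnl)

# The two Sharpes agree up to floating-point error, yet the collapsed
# strategy's mean P&L is zero to well beyond seven decimal places.
print(sharpe_full, sharpe_collapsed, statistics.mean(collapsed_pnl))
```

The Sharpe alone cannot separate the two strategies, which is why the co-condition looks at mean P&L magnitudes directly. The 1e-8 floor keeps the ratio finite when the honest mean P&L is itself zero.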

Ticker memorisation (L41-L43)

At 46 stocks, the model learns "always go long high-density tickers" — features like vol_60d, fund_*, and gkg_n_articles_t carry persistent stock-identity that correlates with shuffled returns because density doesn't change under time shuffling. Every Nifty 50 overnight run in Phase 7 produced shuffled_target test value ≈ 4-5 (vs bound 1.02).

Caught by: the static_features_test (L45, Phase 8.A) and permutation_invariance test (Phase 11.B). Partially mitigated by: per-stock rolling z-score (L44, Phase 8.A). Not yet fully fixed — sector-demean alone is insufficient (L52); combined sector-demean + ticker-dropout untested.
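
The memorisation mechanism can be illustrated with a toy panel. This sketch assumes the shuffle permutes each ticker's returns in time, which may not be exactly how shuffled_target is implemented. The tickers, returns, and feature values are made up.

```python
import random
import statistics

def shuffle_in_time(returns_by_ticker, seed=0):
    """Permute each ticker's return series in time, independently per ticker."""
    rng = random.Random(seed)
    shuffled = {}
    for ticker, rets in returns_by_ticker.items():
        perm = list(rets)
        rng.shuffle(perm)
        shuffled[ticker] = perm
    return shuffled

def per_ticker_mean(returns_by_ticker):
    """Average return per ticker, ignoring time order."""
    return {t: statistics.mean(r) for t, r in returns_by_ticker.items()}

# Hypothetical two-stock panel: AAA drifts up, BBB drifts down.
returns_by_ticker = {
    "AAA": [0.004, 0.001, 0.006, 0.002, 0.005],
    "BBB": [-0.003, 0.000, -0.002, -0.004, -0.001],
}
# A static, density-like feature that only encodes which stock this is.
static_feature = {"AAA": 1800.0, "BBB": 12.0}

before = per_ticker_mean(returns_by_ticker)
after = per_ticker_mean(shuffle_in_time(returns_by_ticker))

# The shuffle destroys timing but leaves each ticker's average return
# untouched, so "long the high-feature ticker" still pays after shuffling.
long_pick = max(static_feature, key=static_feature.get)
print(long_pick, before[long_pick], after[long_pick])
```

Any feature that is roughly constant per stock keeps its relationship with per-stock average returns under this kind of shuffle. That would be consistent with shuffled_target staying far above its bound.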

News-channel contrarian over-fit (L46, L48, L51)

GDELT daily news on Indian large-caps is a weak contrarian signal, not momentum (corr(gkg_avg_tone_t, next_ret) = -0.0164, strengthening to -0.11 at strong tone). When given an explicit gkg_neg_tone_t = -1 * gkg_avg_tone_t feature, the model goes deep contrarian and fails most leakage tests (sharpe = -4.99, mean_pnl = -2.3e-3).

Reading: the L48 signal is real, but the model amplifies a small contrarian edge until it is mostly fitting noise. Use neg_tone with stronger regularisation, not as a dominant signal.
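
A sketch of the two correlation checks behind the numbers above: the overall correlation, and the correlation restricted to strong-tone days. The data and the min_abs_tone threshold are invented for illustration, since the source does not say how "strong tone" is defined. The last lines show that the negated feature carries the same information with the sign flipped.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def conditional_corr(tone, next_ret, min_abs_tone):
    """Correlation using only observations with |tone| >= min_abs_tone."""
    pairs = [(t, r) for t, r in zip(tone, next_ret) if abs(t) >= min_abs_tone]
    ts, rs = zip(*pairs)
    return pearson(ts, rs)

# Hypothetical observations: contrarian on strong-tone days, noise otherwise.
tone = [-3.0, -0.1, 0.2, 3.0, -0.2, 0.1, 2.5, -2.5]
next_ret = [0.02, -0.01, -0.01, -0.02, 0.01, 0.01, -0.015, 0.015]

overall = pearson(tone, next_ret)
strong = conditional_corr(tone, next_ret, min_abs_tone=2.0)

# gkg_neg_tone_t = -1 * gkg_avg_tone_t: same information, opposite sign.
neg_tone = [-t for t in tone]
flipped = pearson(neg_tone, next_ret)
print(overall, strong, flipped)
```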

Window-position artifacts (S26 pre-Phase-6, L54 dropout=0.5)

The strategy looks great on a 180-day evaluation window but produces near-zero or opposite-sign Sharpe on a 21-day window over the same data. A real edge must be consistent across nested windows.

Candidate	Sharpe at n=21	Sharpe at n=180
S26 PEAD (post-L24)	+0.16 ± 3.59	+1.34 ± 0.10
dropout=0.5 + position-floor (L54)	+2.44	-0.14

Caught by: window_robustness_test (L33, Phase 6.B). The dropout=0.5 candidate was the most misleading result of v0.5 — it passed 5/6 leakage tests at n=21 and looked like the project's first plausibly-real edge until the window-stability gate ran.
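
A minimal sketch of a nested-window check in the spirit of window_robustness_test. The pass rule (same sign and a minimum absolute Sharpe on every window) and the min_abs_sharpe value are assumptions, since the actual gate's criteria are not given here. The Sharpe is per-period and not annualised, and the P&L series are synthetic.

```python
import statistics

def sharpe(pnl):
    """Per-period Sharpe: mean P&L over its sample standard deviation."""
    sd = statistics.stdev(pnl)
    return statistics.mean(pnl) / sd if sd > 0 else 0.0

def nested_window_sharpes(daily_pnl, windows=(21, 180)):
    """Sharpe over the trailing n days; windows nest because they share an end date."""
    return {n: sharpe(daily_pnl[-n:]) for n in windows}

def is_window_stable(sharpes, min_abs_sharpe=0.05):
    """Assumed rule: every window has the same sign and is not near zero."""
    values = list(sharpes.values())
    same_sign = all(v > 0 for v in values) or all(v < 0 for v in values)
    large_enough = all(abs(v) >= min_abs_sharpe for v in values)
    return same_sign and large_enough

# Synthetic candidate whose edge reverses in the final 21 days.
early = [0.002 + 0.001 * (-1) ** i for i in range(159)]
late = [-0.002 + 0.001 * (-1) ** i for i in range(21)]
unstable = nested_window_sharpes(early + late)

# Synthetic candidate with the same drift throughout.
stable = nested_window_sharpes([0.002 + 0.001 * (-1) ** i for i in range(180)])

print(unstable, is_window_stable(unstable))
print(stable, is_window_stable(stable))
```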

Feature-magnitude artifacts (S20 pre-L24)

S20 raw features span >5 orders of magnitude — gkg_density_x_tone reaches ~1800 while planted cheat returns are ~0.01. The planted signal is swamped, so the cheat tests can't detect it, and the model's first-layer linear weights are dominated by the largest-magnitude feature regardless of what it actually predicts. S20 honest Sharpe collapsed from +2.55 to ~+1.0 with mean_pnl = 0 once features were properly standardised post-L24.

Caught by: L24 causal standardisation. Generalised by: L30 transform_fn on cheat columns so they travel through the same standardisation.
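
A sketch of causal standardisation under one common reading: each value is z-scored against only the observations that precede it. The expanding window, the min_history warm-up, and the transform_fn signature are assumptions. The source names transform_fn but does not show it. The column values are hypothetical.

```python
import statistics

def causal_standardise(column, min_history=2):
    """z-score each value using the mean and std of strictly earlier values."""
    out = []
    for t, x in enumerate(column):
        past = column[:t]
        if len(past) < min_history:
            out.append(0.0)  # warm-up: not enough history to standardise
            continue
        mu = statistics.mean(past)
        sd = statistics.pstdev(past)
        out.append((x - mu) / sd if sd > 0 else 0.0)
    return out

# Assumed shape of transform_fn: one function applied to every column.
transform_fn = causal_standardise

# Same underlying pattern at two very different magnitudes.
shape = [0.3, -0.5, 0.8, 0.1, -0.2, 0.6]
big_feature = [1800.0 * s for s in shape]   # e.g. a density-times-tone scale
cheat_column = [0.01 * s for s in shape]    # e.g. a planted-return scale

z_big = transform_fn(big_feature)
z_cheat = transform_fn(cheat_column)

# After the shared transform the two columns are on the same footing.
print(z_big)
print(z_cheat)
```

Routing the cheat column through the same function as the features means the planted signal arrives at the model on the same scale as everything else. Because only past values enter each z-score, editing a later observation cannot change an earlier output.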

Empirical-bound mis-calibration (L53)

At small n, the closed-form 2·SE bound on shuffled_target is too tight by ~6× compared to empirical seed std. At n_holdout=21 × 46 stocks, only 20% of seeds pass the theoretical bound on PURE shuffled noise. This means honest candidates were sometimes failing shuffled_target for noise reasons alone.

Fix direction: bootstrap-calibrate the bound per regime instead of using closed-form. At n_holdout=180 the closed-form bound starts holding. Not yet shipped — L53 is on the open Phase 13 to-do list.
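
One way the fix direction could look, as a sketch only: simulate the test statistic on pure noise many times and use an empirical quantile as the bound. Here noise_sharpe is a stand-in null statistic (a per-period Sharpe of Gaussian noise), not the harness's shuffled_target statistic. The draw count, quantile, and noise scale are arbitrary choices.

```python
import math
import random
import statistics

def empirical_bound(null_statistic, n_draws=500, quantile=0.95, seed=0):
    """Empirical quantile of |statistic| over repeated pure-noise draws."""
    rng = random.Random(seed)
    draws = sorted(abs(null_statistic(rng)) for _ in range(n_draws))
    idx = min(n_draws - 1, math.ceil(quantile * n_draws) - 1)
    return draws[idx]

def noise_sharpe(rng, n_days):
    """Stand-in null statistic: per-period Sharpe of i.i.d. Gaussian P&L."""
    pnl = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
    return statistics.mean(pnl) / statistics.stdev(pnl)

# Calibrate one bound per regime (here, per holdout length).
bound_21 = empirical_bound(lambda rng: noise_sharpe(rng, 21))
bound_180 = empirical_bound(lambda rng: noise_sharpe(rng, 180))
print(bound_21, bound_180)
```

By construction, about 95% of noise-only seeds pass a bound calibrated this way, whatever the window length. In the real harness the null draws would come from rerunning the shuffled pipeline, not from a Gaussian stand-in.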

The pattern (L42)

Each layer of leakage testing reveals the next escape route the optimiser takes:

Layer	Caught	Surfaced
Standardise features	scale artifacts	`mean_pnl = 0` pathology (L24/L30)
Ratio co-condition	scale-invariance	loss pathology (L31)
Position-floor	degenerate positions	cross-stock memorisation (L41)
Larger universe	sample-size limit	memorisation persists (L43)
Per-stock z-score	level memorisation	covariance memorisation (L44)
`static_features_test`	constant-input edge	linear is the only safe arch (L45/L49)
Window-stability	in-sample mean-reversion	dropout candidate is artifact (L54)

That whack-a-mole behaviour is itself the L42 finding — expected when the universe lacks genuine cross-sectional time-edge.
