What didn't work
A catalogue of the strategies that looked like edge and turned out to be artifacts. Each entry below is one of the harness "tells" — a class of failure the methodology learned to detect.
Scale-invariance pathology (S20, S21, S26 pre-Phase-6)
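The strategy reports a non-zero Sharpe, but actual P&L is zero to seven decimal places because positions have collapsed to near-zero. The negative_sharpe loss is scale-invariant: multiplying every position by the same constant leaves the mean/std ratio unchanged, so the Sharpe stays numerically defined even when the P&L behind it is negligible, and is therefore meaningless.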
- S21 quality-tilted low-volatility: +0.76 ± 4.96 Sharpe across 10 seeds, but mean_pnl ≈ 0 to 7 decimals.
- negative_sharpe family: ~160 rejected orchestrator nodes all decomposed to mean_pnl = 0 + sharpe_std >> sharpe_mean.
Caught by: the L31 mean-PnL ratio co-condition (|mean_pnl_cheat| / max(|mean_pnl_honest|, 1e-8) >= 10). Fixed structurally by: position-aware losses (L32, Phase 6.E).
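A minimal sketch of the two ideas in this entry, using only the standard library. It first shows that shrinking positions by a constant leaves the Sharpe unchanged while mean P&L vanishes, then evaluates the ratio quoted above. The function names, the per-period (non-annualised) Sharpe, and the sample returns are assumptions for illustration. How the harness combines the ratio with its Sharpe comparison is not specified here.

```python
import statistics


def sharpe(pnl):
    """Per-period Sharpe ratio: mean over sample std (0.0 if std is zero)."""
    sd = statistics.stdev(pnl)
    return statistics.fmean(pnl) / sd if sd > 0 else 0.0


def mean_pnl_ratio(pnl_cheat, pnl_honest, eps=1e-8):
    """|mean_pnl_cheat| / max(|mean_pnl_honest|, eps), the ratio quoted above."""
    honest = max(abs(statistics.fmean(pnl_honest)), eps)
    return abs(statistics.fmean(pnl_cheat)) / honest


def ratio_co_condition(pnl_cheat, pnl_honest, min_ratio=10.0):
    """True when the cheat run's mean P&L is at least min_ratio times the honest one."""
    return mean_pnl_ratio(pnl_cheat, pnl_honest) >= min_ratio


# Hypothetical daily returns and two position scales (illustrative values only).
returns = [0.012, -0.008, 0.005, -0.003, 0.009, -0.011, 0.004]
full_size = [1.0 * r for r in returns]
collapsed = [1e-9 * r for r in returns]  # positions shrunk to near zero

# Same Sharpe at both scales, but the collapsed book earns essentially nothing.
print(sharpe(full_size), sharpe(collapsed))
print(statistics.fmean(full_size), statistics.fmean(collapsed))
```

The `eps` floor matters in exactly this pathology: when honest mean P&L is near zero, the denominator is clamped at 1e-8 instead of letting the ratio blow up.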
Ticker memorisation (L41-L43)
At 46 stocks, the model learns "always go long high-density tickers": features like vol_60d, fund_*, and gkg_n_articles_t carry persistent stock identity that correlates with shuffled returns, because density doesn't change under time shuffling. Every Nifty 50 overnight run in Phase 7 produced a shuffled_target test value of ≈ 4-5 (vs a bound of 1.02).
Caught by: the static_features_test (L45, Phase 8.A) and permutation_invariance test (Phase 11.B). Partially mitigated by: per-stock rolling z-score (L44, Phase 8.A). Not yet fully fixed — sector-demean alone is insufficient (L52); combined sector-demean + ticker-dropout untested.
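The per-stock rolling z-score mitigation can be sketched as follows. This is an illustrative version, not the project's implementation: the function name, the `(date, ticker, value)` row layout, and the `window` and `min_periods` defaults are all assumptions. Each value is scaled against that ticker's own trailing history, excluding the current observation, so the persistent level that identifies a ticker is removed.

```python
import statistics
from collections import defaultdict, deque


def per_stock_rolling_zscore(rows, window=60, min_periods=5):
    """Z-score each value against its own ticker's previous `window` values.

    rows: iterable of (date, ticker, value), sorted by date.
    Returns a list of (date, ticker, z); z is None until min_periods past
    values exist. The current value never enters its own mean/std.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for date, ticker, value in rows:
        past = list(history[ticker])
        if len(past) >= min_periods:
            mu = statistics.fmean(past)
            sd = statistics.pstdev(past)
            z = (value - mu) / sd if sd > 0 else 0.0
        else:
            z = None
        out.append((date, ticker, z))
        history[ticker].append(value)  # update only after scoring
    return out


# Two hypothetical tickers with very different levels but the same dynamics.
pattern = [0, 1, -1, 2, -2, 3, 1, 0, 2, -1]
rows = []
for day, k in enumerate(pattern):
    rows.append((day, "A", 100.0 + k))
    rows.append((day, "B", 1.0 + 0.01 * k))

scored = per_stock_rolling_zscore(rows, window=5, min_periods=3)
z_a = [z for _, t, z in scored if t == "A"]
z_b = [z for _, t, z in scored if t == "B"]
```

In this toy case the two tickers produce matching z-scores despite a 100x difference in level. That is what removing level identity means, and it leaves any covariance-based identity untouched.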
News-channel contrarian over-fit (L46, L48, L51)
GDELT daily news on Indian large-caps is a weak contrarian signal, not momentum (corr(gkg_avg_tone_t, next_ret) = -0.0164, strengthening to -0.11 at strong tone). When given an explicit gkg_neg_tone_t = -1 * gkg_avg_tone_t feature, the model goes deep contrarian and fails most leakage tests (sharpe = -4.99, mean_pnl = -2.3e-3).
Reading: the L48 signal is real, but the model overweights a small contrarian edge into noise-amplification territory. Use neg_tone with stronger regularisation, not as a dominant signal.
Window-position artifacts (S26 pre-Phase-6, L54 dropout=0.5)
The strategy looks great on a 180-day evaluation window but produces near-zero or opposite-sign Sharpe on a 21-day window over the same data. A real edge must be consistent across nested windows.
| Candidate | Sharpe at n=21 | Sharpe at n=180 |
|---|---|---|
| S26 PEAD (post-L24) | +0.16 ± 3.59 | +1.34 ± 0.10 |
| dropout=0.5 + position-floor (L54) | +2.44 | -0.14 |
Caught by: window_robustness_test (L33, Phase 6.B). The dropout=0.5 candidate was the most misleading result of v0.5 — it passed 5/6 leakage tests at n=21 and looked like the project's first plausibly-real edge until the window-stability gate ran.
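A nested-window check of this kind can be sketched as below. The actual pass criterion of window_robustness_test is not given here, so this version assumes the weakest plausible rule, sign agreement of the Sharpe across trailing windows. The function names and the synthetic P&L series are hypothetical.

```python
import statistics


def sharpe(pnl):
    """Per-period Sharpe ratio: mean over sample std (0.0 if std is zero)."""
    sd = statistics.stdev(pnl)
    return statistics.fmean(pnl) / sd if sd > 0 else 0.0


def window_sharpes(pnl, windows=(21, 180)):
    """Sharpe over the trailing n observations for each n (nested windows)."""
    return {n: sharpe(pnl[-n:]) for n in windows if len(pnl) >= n}


def is_window_stable(pnl, windows=(21, 180)):
    """Assumed criterion: every window's Sharpe has the same strict sign."""
    values = list(window_sharpes(pnl, windows).values())
    if not values:
        return False
    return all(v > 0 for v in values) or all(v < 0 for v in values)


# Synthetic daily P&L: one series is steady, the other turns in its last 21 days.
steady = [0.02, -0.01] * 100
flipped = [0.02, -0.01] * 80 + [-0.02, 0.01] * 10 + [-0.02]

print(window_sharpes(steady), is_window_stable(steady))
print(window_sharpes(flipped), is_window_stable(flipped))
```

Sign agreement alone would flag the dropout=0.5 row in the table but not the S26 row, where both Sharpes are positive and the short-window value is indistinguishable from zero. A practical gate would presumably also compare each window's Sharpe against its seed-to-seed spread.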
Feature-magnitude artifacts (S20 pre-L24)
S20 raw features span >5 orders of magnitude — gkg_density_x_tone reaches ~1800 while planted cheat returns are ~0.01. The planted signal is swamped, so the cheat tests can't detect it, and the model's first-layer linear weights are dominated by the largest-magnitude feature regardless of what it actually predicts. S20 honest Sharpe collapsed from +2.55 to ~+1.0 with mean_pnl = 0 once features were properly standardised post-L24.
Caught by: L24 causal standardisation. Generalised by: L30 transform_fn on cheat columns so they travel through the same standardisation.
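As a rough illustration of the fix, the sketch below fits standardisation statistics on a training period only and returns a transform that can be applied to later data, with the planted cheat column going through the same kind of transform as the real features. The name `transform_fn` comes from the text above, but its signature here, the helper `fit_standardiser`, and the sample values are assumptions.

```python
import statistics


def fit_standardiser(train_values):
    """Build a transform from training-period statistics only (no look-ahead)."""
    mu = statistics.fmean(train_values)
    sd = statistics.pstdev(train_values)
    scale = sd if sd > 0 else 1.0

    def transform_fn(values):
        # Later data is scaled with the frozen training mean and std.
        return [(v - mu) / scale for v in values]

    return transform_fn


# Hypothetical columns: a large-magnitude feature and a planted cheat at ~0.01.
feature_train = [1500.0, 1800.0, 900.0, 1200.0, 1650.0, 600.0]
cheat_train = [0.010, -0.008, 0.012, -0.011, 0.009, -0.007]
feature_test = [1700.0, 800.0]
cheat_test = [0.011, -0.009]

feature_z = fit_standardiser(feature_train)(feature_test)
cheat_z = fit_standardiser(cheat_train)(cheat_test)

raw_gap = max(abs(v) for v in feature_test) / max(abs(v) for v in cheat_test)
scaled_gap = max(abs(v) for v in feature_z) / max(abs(v) for v in cheat_z)
print(raw_gap, scaled_gap)
```

Before scaling the two columns differ by about five orders of magnitude. Afterwards they are of comparable size, so a planted signal is no longer drowned out by the largest raw feature.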
Empirical-bound mis-calibration (L53)
At small n, the closed-form 2·SE bound on shuffled_target is too tight by ~6× compared to empirical seed std. At n_holdout=21 × 46 stocks, only 20% of seeds pass the theoretical bound on PURE shuffled noise. This means honest candidates were sometimes failing shuffled_target for noise reasons alone.
Fix direction: bootstrap-calibrate the bound per regime instead of using closed-form. At n_holdout=180 the closed-form bound starts holding. Not yet shipped — L53 is on the open Phase 13 to-do list.
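One possible shape for the empirical calibration, sketched under stated assumptions: the statistic is a per-period Sharpe on a single series, the null is built by permuting returns against fixed positions, and the bound is k times the standard deviation of the resulting Sharpes. The real shuffled_target statistic and its closed-form bound are not defined in this section, so every name and the synthetic data below are hypothetical.

```python
import random
import statistics


def sharpe(pnl):
    """Per-period Sharpe ratio: mean over sample std (0.0 if std is zero)."""
    sd = statistics.stdev(pnl)
    return statistics.fmean(pnl) / sd if sd > 0 else 0.0


def empirical_noise_bound(positions, returns, n_shuffles=500, k=2.0, seed=0):
    """k * std of the Sharpe obtained when returns are shuffled against positions."""
    rng = random.Random(seed)
    shuffled = list(returns)
    sharpes = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # destroys any position/return alignment
        pnl = [p * r for p, r in zip(positions, shuffled)]
        sharpes.append(sharpe(pnl))
    return k * statistics.stdev(sharpes)


# Synthetic pure-noise data at the small holdout size discussed above.
rng = random.Random(42)
n_holdout = 21
positions = [rng.choice([-1.0, 1.0]) for _ in range(n_holdout)]
returns = [rng.gauss(0.0, 0.01) for _ in range(n_holdout)]

bound = empirical_noise_bound(positions, returns)
print(bound)
```

Running the same function once per regime (holdout length, universe size) gives a bound that reflects the actual noise in that regime rather than an asymptotic approximation.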
The pattern (L42)
Each layer of leakage testing reveals the next escape route the optimiser takes:
| Layer | Caught | Surfaced |
|---|---|---|
| Standardise features | scale artifacts | mean_pnl = 0 pathology (L24/L30) |
| Ratio co-condition | scale-invariance | loss pathology (L31) |
| Position-floor | degenerate positions | cross-stock memorisation (L41) |
| Larger universe | sample-size limit | memorisation persists (L43) |
| Per-stock z-score | level memorisation | covariance memorisation (L44) |
| static_features_test | constant-input edge | linear is the only safe arch (L45/L49) |
| Window-stability | in-sample mean-reversion | dropout candidate is artifact (L54) |
That whack-a-mole behaviour is itself the L42 finding — expected when the universe lacks genuine cross-sectional time-edge.