Phase 7 — Nifty 50 universe expansion
Spike 17 expanded the universe from 5 stocks to 46 (the Nifty 50 minus 4 tickers with missing OHLCV or zero news matches). Added GDELT GKG via a single BQ-aggregated query (1.68M events, 17 MB parquet). Built a unified feature tensor of shape (742, 46, 17) with 0 NaNs. Wired the orchestrator to consume the new universe via per-spike tickers and artifacts_dir overrides on SPIKE_DEFAULTS. Seeded 15 diverse nodes across the loss, architecture, and features teams.
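A sanity check along these lines could guard the unified artifact before the orchestrator consumes it. This is a minimal sketch, not the project's actual loader: the function name validate_features is hypothetical, and only the shape (742, 46, 17) and the zero-NaN requirement come from the notes above. That the middle axis indexes the 46 tickers is an assumption.

```python
import numpy as np

def validate_features(features, tickers, expected_shape=(742, 46, 17)):
    """Raise ValueError if the feature tensor violates the Phase 7 invariants.

    Assumption: axis 1 of `features` indexes tickers (46 of them).
    """
    if features.shape != expected_shape:
        raise ValueError(f"shape {features.shape} != expected {expected_shape}")
    if len(tickers) != features.shape[1]:
        raise ValueError("ticker list length does not match the ticker axis")
    if len(set(tickers)) != len(tickers):
        raise ValueError("duplicate tickers in the universe")
    n_nan = int(np.isnan(features).sum())
    if n_nan > 0:
        raise ValueError(f"{n_nan} NaNs found, expected 0")
    return {"shape": features.shape, "n_tickers": len(tickers), "n_nan": n_nan}
```

Failing loudly here is cheaper than discovering a misaligned ticker axis after an overnight run.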
What we found
Position-floor works at scale (L40). mean_pnl_mean moved from a flat 0.0 (rigid pathology at 5-bench) to 5e-4 to 8e-4 (non-degenerate) across every overnight run. This confirms L32 + L37: position-aware losses plus a larger universe let the optimiser find real positions.
But: shuffled_target now fails consistently (L41). Every single overnight node produced Sharpe ≥ 1.8 with shuffled_target leakage test value ≈ 4-5 (vs bound 1.02). The model has learned to exploit structural cross-stock differences (vol, news density, fundamentals) that persist under time permutation. This is structural leakage, not time-leakage — even time-permuted returns leave the cross-sectional level alive.
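The mechanism can be reproduced on synthetic data. The sketch below is an illustration, not the project's shuffled_target test: the helper names, the synthetic return model, and the choice to permute each stock's returns independently along time are all assumptions. It shows that a time-invariant cross-sectional tilt earns the same mean P&L before and after the shuffle, because permuting a column leaves its mean unchanged.

```python
import numpy as np

def sharpe(pnl):
    """Per-period Sharpe ratio of a 1-D P&L series (not annualised)."""
    return pnl.mean() / pnl.std(ddof=1)

def shuffle_time_per_stock(returns, rng):
    """Permute the time axis independently for each stock column."""
    shuffled = np.empty_like(returns)
    for j in range(returns.shape[1]):
        shuffled[:, j] = rng.permutation(returns[:, j])
    return shuffled

rng = np.random.default_rng(0)
T, N = 742, 46  # time steps and stocks, matching the Phase 7 tensor

# Synthetic returns: each stock has its own persistent mean (assumption).
stock_mean = rng.normal(0.0, 1e-3, size=N)
returns = stock_mean + rng.normal(0.0, 1e-2, size=(T, N))

# A static feature that encodes stock identity, standing in for something
# like a per-ticker baseline rate. It does not vary over time.
static_feature = stock_mean + rng.normal(0.0, 2e-4, size=N)

# Time-invariant tilt: long high-feature stocks, short low-feature stocks.
weights = static_feature - static_feature.mean()
weights /= np.abs(weights).sum()

pnl_real = returns @ weights
pnl_shuffled = shuffle_time_per_stock(returns, rng) @ weights

print("real:     mean", pnl_real.mean(), "sharpe", sharpe(pnl_real))
print("shuffled: mean", pnl_shuffled.mean(), "sharpe", sharpe(pnl_shuffled))
```

Any leakage test that only destroys time alignment cannot separate this static tilt from genuine predictive signal, which matches the failure pattern described above.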
No proven_accepted strategy at scale. Goal not met. 257 nodes explored, 176 proven_rejected, 76 inconclusive, 1 pending, 4 blocked.
The bigger pattern (L42)
Each layer of leakage testing reveals the next escape route the optimiser takes:
- Standardise features → mean_pnl=0 pathology (L24/L30)
- Ratio co-condition → loss pathology exposed (L31)
- Position-floor → cross-stock memorisation exposed (L41)
- Larger universe → memorisation persists
The project's binding constraint is no longer "universe size" — it's structural feature design. Features that carry persistent stock-identity (vol_60d, fund_*, gkg_n_articles_t baseline rates) are exploitable as cross-stock structural signal regardless of time alignment.
L43 — the structural blocker named
The remaining structural blocker at 46 stocks is cross-stock memorisation: features like vol_60d, fund_*, and even gkg_n_articles_t (which has wildly different baseline rates per ticker — LT 15k/day vs HDFCBANK 2/day) carry persistent stock-identity signal. The model learns "always go long high-density tickers" which correlates with shuffled returns because density doesn't change under time shuffling.
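One way to quantify how much stock identity a feature carries is the share of its total variance explained by per-stock means. The sketch below is a hypothetical diagnostic, not something the project is known to run. It assumes a balanced feature tensor of shape (T, N, F) with tickers on axis 1, and identity_variance_share is an illustrative name.

```python
import numpy as np

def identity_variance_share(features):
    """Between-stock share of variance for each feature.

    features: array of shape (T, N, F), assumed balanced with no NaNs.
    Returns an array of shape (F,) with values in [0, 1]. Values near 1
    mean the feature is mostly a per-stock constant (pure identity);
    values near 0 mean it varies mainly through time.
    """
    stock_mean = features.mean(axis=0)                       # (N, F)
    between = stock_mean.var(axis=0)                         # (F,)
    flat = features.reshape(-1, features.shape[-1])          # (T*N, F)
    total = flat.var(axis=0)                                 # (F,)
    return between / np.maximum(total, 1e-12)
```

A feature whose share is close to 1 behaves like a ticker label, so a model can use it to build a static tilt that survives any time shuffle.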
Fix directions named at the end of Phase 7:
- Per-stock causal z-scoring of features (Phase 8)
- Cross-stock contrastive regulariser
- Ticker dropout during training (Phase 11.E)
- Single-stock baseline (Phase 9)
- Permutation-invariance audit as a 5th leakage test (Phase 11.B)
- Even-larger universe (Nifty 100 or BSE 500) — not yet done
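The first fix direction, per-stock causal z-scoring, can be sketched as follows. This is a minimal illustration under stated assumptions, not the Phase 8 implementation: it assumes a (T, N, F) tensor with time on axis 0, uses an expanding window over strictly past rows, and the names causal_zscore, min_periods, and eps are hypothetical.

```python
import numpy as np

def causal_zscore(features, min_periods=20, eps=1e-8):
    """Z-score each (stock, feature) series using only strictly past rows.

    features: array of shape (T, N, F), time on axis 0.
    Row t is normalised by the expanding mean and std of rows 0..t-1, so
    no same-day or future information enters the statistics. Rows with
    fewer than min_periods past observations are returned as NaN.
    """
    x = np.asarray(features, dtype=float)
    T = x.shape[0]
    out = np.full_like(x, np.nan)
    if T < 2:
        return out

    csum = np.cumsum(x, axis=0)
    csum2 = np.cumsum(x * x, axis=0)

    # For target row t (t = 1..T-1) there are exactly t past rows.
    n = np.arange(1, T, dtype=float).reshape(-1, 1, 1)
    mean = csum[:-1] / n
    # Clip at zero to absorb small negative values from cancellation.
    var = np.maximum(csum2[:-1] / n - mean**2, 0.0)

    z = (x[1:] - mean) / np.sqrt(var + eps)
    z[: max(min_periods - 1, 0)] = np.nan  # too little history
    out[1:] = z
    return out
```

Because every series is centred on its own history, a ticker with a baseline of 15k/day and one with 2/day both end up near zero on average, which removes the level difference the model was exploiting. The cumulative-sum variance is numerically crude for very large baselines; a Welford-style update would be a safer choice in production code.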
What it produced
L40, L41, L42, L43 — the four learnings that scoped Phases 8-12. Phase 7 also established the unified Nifty 50 artifacts that every later phase consumes.