Phase 7 — Nifty 50 universe expansion

Spike 17 expanded the universe from 5 stocks to 46 (Nifty 50 minus the 4 constituents with missing OHLCV or zero news matches). It added GDELT GKG via a single BQ-aggregated query (1.68M events, 17 MB parquet), built a unified feature tensor of shape (742, 46, 17) with 0 NaNs, wired the orchestrator to consume the new universe via per-spike tickers and artifacts_dir overrides on SPIKE_DEFAULTS, and seeded 15 diverse nodes across the loss, architecture, and features teams.

What we found

Position-floor works at scale (L40). mean_pnl_mean moved from a flat 0.0 (the rigid pathology at 5-bench) to 5e-4..8e-4 (non-degenerate) across every overnight run. This confirms L32 + L37: position-aware losses plus a larger universe let the optimiser find real positions.

But: shuffled_target now fails consistently (L41). Every single overnight node produced Sharpe ≥ 1.8 with shuffled_target leakage test value ≈ 4-5 (vs bound 1.02). The model has learned to exploit structural cross-stock differences (vol, news density, fundamentals) that persist under time permutation. This is structural leakage, not time-leakage — even time-permuted returns leave the cross-sectional level alive.
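The mechanism can be illustrated on synthetic data. The sketch below is not the project's shuffled_target test; it assumes the shuffle permutes the time axis of returns independently per ticker, and all names (static_pos, dynamic_pos, the drift and noise scales) are hypothetical. It shows that a position depending only on ticker identity keeps its mean PnL under the shuffle, while a time-aligned position loses it.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 742, 46  # illustrative sizes only (assumed: days x tickers)

# Synthetic returns: a persistent per-ticker drift plus i.i.d. daily noise.
drift = rng.normal(0.0, 5e-4, size=N)
returns = drift + rng.normal(0.0, 0.01, size=(T, N))

# Assumed form of the shuffle: permute the time axis independently per ticker.
shuffled = rng.permuted(returns, axis=0)

# Static position: depends only on ticker identity, never on the date.
static_pos = np.sign(drift) / N
static_orig = float((returns @ static_pos).mean())
static_shuf = float((shuffled @ static_pos).mean())

# Time-aligned position: uses same-day information (here, perfect foresight).
dynamic_pos = np.sign(returns - returns.mean(axis=0)) / N
dynamic_orig = float((dynamic_pos * returns).sum(axis=1).mean())
dynamic_shuf = float((dynamic_pos * shuffled).sum(axis=1).mean())

print(f"static : {static_orig:.2e} -> {static_shuf:.2e}")
print(f"dynamic: {dynamic_orig:.2e} -> {dynamic_shuf:.2e}")
```

Because a time permutation leaves each ticker's mean return unchanged, the static position's mean PnL is identical before and after the shuffle. That is the sense in which the cross-sectional level stays alive.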

No proven_accepted strategy at scale. Goal not met. 257 nodes explored, 176 proven_rejected, 76 inconclusive, 1 pending, 4 blocked.

The bigger pattern (L42)

Each layer of leakage testing reveals the next escape route the optimiser takes:

Standardise features → mean_pnl=0 pathology (L24/L30)
Ratio co-condition → loss pathology exposed (L31)
Position-floor → cross-stock memorisation exposed (L41)
Larger universe → memorisation persists

The project's binding constraint is no longer "universe size" — it's structural feature design. Features that carry persistent stock-identity (vol_60d, fund_*, gkg_n_articles_t baseline rates) are exploitable as cross-stock structural signal regardless of time alignment.

L43 — the structural blocker named

The remaining structural blocker at 46 stocks is cross-stock memorisation: features like vol_60d, fund_*, and even gkg_n_articles_t (which has wildly different baseline rates per ticker — LT 15k/day vs HDFCBANK 2/day) carry persistent stock-identity signal. The model learns "always go long high-density tickers" which correlates with shuffled returns because density doesn't change under time shuffling.

Fix directions named at the end of Phase 7:

Per-stock causal z-scoring of features (Phase 8)
Cross-stock contrastive regulariser
Ticker dropout during training (Phase 11.E)
Single-stock baseline (Phase 9)
Permutation-invariance audit as a 5th leakage test (Phase 11.B)
Even-larger universe (Nifty 100 or BSE 500) — not yet done
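As a rough illustration of the first direction, per-stock causal z-scoring could look like the sketch below. It is a minimal version under stated assumptions, not the Phase 8 implementation: the function name causal_zscore, the 60-day window, the min_periods value, and the synthetic dates are all hypothetical, and the Poisson article counts only mimic the LT and HDFCBANK baseline rates quoted above.

```python
import numpy as np
import pandas as pd


def causal_zscore(feature: pd.DataFrame, window: int = 60,
                  min_periods: int = 20) -> pd.DataFrame:
    """Z-score each column (ticker) against its own trailing history.

    The rolling statistics are shifted by one row, so the value at date t
    is scaled using only dates strictly before t.
    """
    rolling = feature.rolling(window, min_periods=min_periods)
    mu = rolling.mean().shift(1)
    sd = rolling.std().shift(1).replace(0.0, np.nan)  # avoid division by zero
    return (feature - mu) / sd


rng = np.random.default_rng(0)
dates = pd.date_range("2022-01-03", periods=742, freq="B")  # synthetic dates
counts = pd.DataFrame(
    rng.poisson(lam=[15000, 2], size=(742, 2)).astype(float),
    index=dates,
    columns=["LT", "HDFCBANK"],
)

z = causal_zscore(counts)
print(counts.mean().round(1))  # raw levels differ by orders of magnitude
print(z.mean().round(3))       # z-scored levels are comparable across tickers
```

After scaling, both tickers sit near zero on average, so the feature no longer encodes which ticker it came from through its level alone. The one-row shift is the design choice that keeps the transform causal: changing a later observation cannot alter any earlier z-score.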

What it produced

L40, L41, L42, L43 — the four learnings that scoped Phases 8-12. Phase 7 also established the unified Nifty 50 artifacts that every later phase consumes.
