Learnings — L24 through L54 (31 entries)

Auto-generated from docs/decisions/003-phase-5-stage1-2-status.md at 2026-05-17 05:58:47 UTC.

Each learning is a numbered carry-forward from a concrete dead-end. The full table appears below; the methodology pages cross-reference these by number. L1-L23 predate the status doc's tabular learnings section and live inside docs/decisions/003-phase-5-stage1-2-status.md narrative — see the source file for those.

#	Learning	Source
L24	Leakage tests REQUIRE per-feature causal standardisation using `[:n_train]` stats. Raw features at magnitudes well above the planted cheat (e.g. fundamentals 5-30 vs return 0.01) cause `look_ahead_cheat` to falsely PASS. S21 added this and the cheat went from sharpe=1.0 to sharpe=8.2. Mandatory before merging any strategy.	S21
L25	T from `feature_engineering_v2.build_unified_features` can be >> the price-axis days (S13 forward-fills across all dates in S1's window). Our run had T=742, not the assumed T=81. Sharpe SE is bounded by `n_holdout × S`, not T. Use that for window-sizing decisions.	S21
L26	Per-seed Sharpe std (4.96) is ~3× theoretical SE (1.55). Walk-forward retrains amplify seed sensitivity beyond IID-sample theory. Always bootstrap; never trust closed-form SE alone.	S21
L27	`future_news_cheat` threshold (5.0) was calibrated for 768-dim dense text embeddings (Spike 9 era). For sparse event-driven features (count, tone, flags), the cheat detects the leak (3.93 vs honest ~0) but doesn't cross 5.0. Replace absolute threshold with a relative metric (`sharpe_with_cheat / sharpe_honest`).	S26
L28	ITC's SUE max = 114.87 (hotels-business divestiture caused a one-off Net Profit jump). Raw Net Profit creates "mechanical surprises" that don't carry the slow-drift signal PEAD targets. Fix: use earnings-from-core-operations OR winsorise SUE at ±5σ.	S26
L29	HDFCBANK SUE range is 50× tighter than ITC's (banks' Net Profit σ is small in denominator). PEAD applied uniformly across the universe is questionable for Nifty composition with conglomerates + banks + pure-plays. Sector-grouped SUE z-scoring is the principled fix.	S26
L30	Phase 6.A — cheat tests must apply the same `causal_standardise` transform to augmented features as honest features got. Without `transform_fn`, raw-scale cheat columns are dwarfed by z-scored honest features → cheat tests under-lift → false negatives. Fixed in `leakage_tests.py`.	Phase 6.A
L31	Phase 6.A — `\|mean_pnl_cheat\| / max(\|mean_pnl_honest\|, 1e-8) >= 10` co-condition catches the `negative_sharpe` scale-invariance pathology that Sharpe-only thresholds missed. When honest positions are ~0, ratio test correctly flags cheat as FAIL regardless of Sharpe value.	Phase 6.A
L32	Phase 6.E — `loss_fn(pnl)` is structurally pnl-only. NO loss can regulate position magnitude when it doesn't see positions. walk_forward now uses signature inspection (`_call_loss`) to pass `pos` to losses that declare it. Position-floor regularization belongs in the harness contract, NOT in any single loss function.	Phase 6.E
L33	Phase 6.D orchestrator — backward rules MUST (a) early-return on `inconclusive` verdicts and (b) cap depth at ≤3 generations from seed. Otherwise: tool-execution failures cascade infinitely, and proven_rejected in a small loss pool ping-pongs forever. Both bugs hit production within minutes. Fixed via `_depth(parent.id)` guard.	Phase 6.D
L34	At 5 stocks × 3y, the `negative_sharpe` scale-invariance pathology is STRUCTURAL — verified across 160 proven_rejected orchestrator nodes spanning 5 axes (loss swaps, feature ablations, architecture variants). Every high-Sharpe result decomposes to `mean_pnl=0` + `sharpe_std >> sharpe_mean`. Fix is harness-level (position regularizer in walk_forward), not strategy-level.	Phase 6.D
L35	Phase 6.D found the first leakage-clean strategies in the codebase: `arch-r1-hidden-8` family (StrategyNet with hidden=8). All 4 leakage tests pass; honest Sharpe = +0.124 with `mean_pnl_mean = 9e-7` (non-degenerate positions). Marked `proven_rejected` only because `post_cost_proxy = max(sharpe - 0.5, 0) = 0` falls below the 0.4 threshold. The methodology is sound at this scale; the universe is the blocker.	Phase 6.D arch team
L36	Architecture variants `hidden ∈ {8, 16, 32}` all converge to similar near-zero honest Sharpe with leakage-clean verdicts. This is evidence that feature signal is exhausted at 5 stocks × 3y, not loss/arch quality. The architecture team genuinely covered the ground available.	Phase 6.D arch team
L37	`sharpe_with_position_floor(α=5.0)` is too aggressive — forces position size beyond what feature signal supports, producing artifact Sharpe = +3.897 (above L23 bug-suspect ceiling). Softened to `α=0.5` (10× gentler). Penalty when positions degenerate is now ~0.025 instead of 0.25, comparable to small Sharpe magnitudes — meaningful tiebreaker without overwhelming the Sharpe objective.	Phase 6.E retry
L38	`post_cost_sharpe_proxy = max(sharpe_mean - 0.5, 0)` is a coarse proxy that rejects honest small-positive Sharpe results (e.g. +0.124 leakage-clean strategies). Real research output is real even when the proxy says zero. Future refinement: use a more nuanced cost model (e.g. `max(sharpe - 1.5×turnover_avg×rho_per_step, 0)` or empirical TCA).	Phase 6.D verdict review
L39	`negative_sortino` on `spike_22` (raw returns-scale features) produces a leakage-clean verdict with non-degenerate positions: Sharpe = -0.507, mean_pnl = -1.24e-5. Another data point that the universe — not the loss family — is the binding constraint at 5×3y. Anti-edge of -0.5 is well within the realistic [-0.7, +0.7] honest noise band.	Phase 6.D loss team
L40	Position-floor loss breaks the scale-invariance pathology at scale (Nifty 50 = 46 tickers): `mean_pnl_mean` moves from a flat 0.0 at 5-bench to non-degenerate 5e-4..8e-4 across 15+ overnight runs. Confirms L32+L37: position-aware losses + larger universe = optimiser finds real positions.	S17 overnight
L41	At 46-stock universe, `shuffled_target` leakage fails CONSISTENTLY across all loss/architecture/feature axes. Model produces Sharpe ≈ +5 on time-permuted returns — meaning the network locks onto cross-stock structural relationships (e.g. RELIANCE-vs-ITC vol differential) that survive time shuffling. This is structural leakage, not time-leakage.	S17 overnight
L42	The four "fixes" in Phase 5/6/S17 each surfaced a new class of overfit: standardise→mean_pnl=0, ratio-gate→exposed loss pathology, position-floor→exposed memorization at scale, larger universe→still memorizes. Each layer of leakage testing reveals the optimiser's next escape route. This is whack-a-mole behaviour expected when the universe lacks genuine cross-sectional time-edge.	S17 overnight (synthesis)
L43	The remaining structural blocker at 46 stocks is cross-stock memorisation: features like `vol_60d`, `fund_*`, and even `gkg_n_articles_t` (which has wildly different baseline rates per ticker — LT 15k/day vs HDFCBANK 2/day from L37) carry persistent stock-identity signal. The model learns "always go long high-density tickers" which correlates with shuffled returns because density doesn't change under time shuffling. Fix direction: time-variation-only features (per-stock z-scoring across rolling windows), or ticker-agnostic features only.	S17 overnight
L44	Phase 8 — `per_stock_rolling_zscore(features, window=60)` (in `lib/preprocessing.py`) removes per-(stock, feature) baseline by causal 60-day rolling z-score. At Nifty 50 scale this cuts `shuffled_target` Sharpe by ~50% (5.0 → 2.5, still FAIL vs bound 1.0) and triples `mean_pnl_mean` (0.0008 → 0.0025, positions no longer pinned near zero). Direction validated, magnitude insufficient. Per-stock z-scoring removes the level component of cross-stock memorisation (L43) but the model still extracts ticker-identity signal from feature covariance structure. Z-scoring is a necessary ingredient in the L43 fix, not the whole fix — pair it with the static-features test (L45) and stronger feature de-identification before declaring the regime clean.	Phase 8 zscored Nifty 50 nodes
L45	Phase 8 — 5th leakage test `static_features_test` (in `stability_tests.py`) replaces every feature column with its per-(stock) time-mean, holding it constant across the holdout. A time-edge model should produce ~zero positions on constant inputs; a ticker-memoriser keeps its baseline allocations. Pass iff `\|sharpe\| < 2·SE`. Directly catches ticker-identity memorisation that `shuffled_target` only indirectly hints at (shuffled-target leaves the cross-sectional level alive; static-features kills it outright). First measurements across zscored Nifty 50 nodes: best `\|sharpe\| = 1.66`, all FAIL. Confirms the model still extracts edge from constant stock-identity signal even after per-stock z-scoring — feature engineering still leaks identity into the model's input.	Phase 8
L46	Phase 8 — across several `p8-zscored-news-pf-` runs (position-floor on z-scored news features), 5 seeds produced consistently negative Sharpe ≈ −3.4 with `mean_pnl_mean = −1.4e-3` (non-degenerate positions, no scale-invariance pathology). Tight cluster across seeds rules out random noise. Two viable readings: (a) real anti-edge in news-driven Indian large-caps that this codebase has been finding the wrong sign of all along — historically every news-channel result has been near-zero or negative (S20 −3.90 at 21d, post-L24 S20 collapse, etc.), and a consistent sign-flip across this many independent attempts points at signed-feature physics; (b) feature-engineering sign-flip bug in something like `gkg_avg_tone_t` (positive tone → wrong-sign return). Worth a hand-walk audit of one (date, ticker) cell from raw GKG → feature_engineering_v2 → model input before discarding the news channel.	Phase 8 zscored-news-pf seeds
L47	Phase 9.A bug — `_per_stock_zscore` used `sigma.clamp(min=1e-8)` which blew up for warmup-period features (vol_60d ≈ 0 for first 60d → tiny std → 10^7+ z-scores). LayerNorm inside StrategyNet absorbed the scaling for the honest run (so it looked OK at -0.4 Sharpe) but DROWNED the cheat column (~1.0 scale) inside `look_ahead_cheat_test`. Plus `_run_per_stock` passed already-z-scored features to leakage tests, causing transform_fn to corrupt the head columns. Both fixed: clamp(min=1e-3), and pass RAW features to cheat tests. Post-fix: `look_ahead_cheat = +30` (was -2.28); harness is now trustworthy for per-stock evaluation.	Phase 9.A diagnostic
L48	Phase 9.B news audit — `corr(gkg_avg_tone_t, next_ret)` = -0.0164 overall (N=25,795), strengthening to -0.11 at \|tone\| > 5 and \|tone\| > 8. Per-ticker mean correlation = -0.023 (median -0.021). Sign-alignment at strong-tone cells (\|tone\| > 8): 44.4% (vs 50% noise). NO code bug — feature_engineering_v2 preserves sign+magnitude+date alignment (Audit 2 + 4). The L46 -3.4 Sharpe cluster was the model correctly extracting this weak contrarian signal, just at magnitude too large to be honest at this sample size. GDELT daily news on Indian large-caps is a weak contrarian signal, not momentum. Plausible mechanism: news reports react to returns rather than predict them — by the time GDELT indexes a story, the move has happened, so the next day mean-reverts. Practical implication: explicit `-1 * tone` features would produce honest Sharpe ~0.0-0.1 after costs, in noise band. Not enough to be the project's edge, but a real signed finding worth preserving as a feature transform when news is used downstream.	News audit on full Nifty 50 sample
L49	Phase 11.A — pure-linear baseline (LinearStrategy, 1 layer, cross-sectionally de-meaned) at Nifty 50 with per-stock z-score + position-floor produces honest small Sharpe = -0.347 with `static_features` PASS (sharpe=+0.353). First leakage-clean-on-static-features verdict at Nifty 50 scale. Linear models structurally cannot construct the per-stock biases that MLPs use to fail the static_features test. `shuffled_target` still fails (2.76, vs 2-3 for MLP variants).	Phase 11.A
L50	Phase 11.E — ticker_dropout=0.3 during training (per-step random masking of ~30% tickers with re-de-mean) at Nifty 50 + per-stock z-score + position-floor produces sharpe=+6.29 (L23 bug-suspect) but `static_features` PASS (-0.137) AND `permutation_invariance` PASS (+0.351). 4 of 6 leakage tests pass — most-comprehensive leakage pass yet for an MLP at scale. The remaining failures (shuffled_target +3.18, sharpe artifact) suggest the dropout-induced gradient noise inflates Sharpe even when ticker-identity leakage is removed. Tighter weight_decay or smaller hidden size on top of dropout is the natural followup.	Phase 11.E
L51	Phase 11.C — explicit `gkg_neg_tone_t = -1 * gkg_avg_tone_t` feature (L48 action) on Nifty 50 + position-floor produces sharpe=-4.99 with mean_pnl=-2.3e-3 — model goes deep contrarian and fails most leakage tests (static_features=5.14, shuffled_target=-3.92). L48 signal is real (corr=-0.02 to -0.11) but an explicit feature lets the model overweight a small contrarian edge into noise-amplification territory. Use neg_tone with stronger regularisation or as a minority ingredient in a larger feature set, not as a dominant signal.	Phase 11.C
L52	Phase 11.F — sector_demean preprocessing (`lib.sectors.sector_demean` applied to features BEFORE per-stock z-score) at Nifty 50 + position-floor: sharpe=+2.09, mean_pnl=+1.07e-3, but `shuffled_target` = +4.93 FAIL (worse than baseline), `static_features` = +3.81 FAIL, `permutation_invariance` PASS (+1.07). Sector-demean alone does not fix the L43 leakage; it removes sector-mean structure but residual within-sector ticker identity still leaks. Combined sector_demean + ticker_dropout may be the next natural test.	Phase 11.F sector_demean smoke
L53	Phase 12.A — shuffled_target bound `2 * _sharpe_se(n_holdout, n_stocks)` is too tight at small n. Empirical seed std on noise data is 6× the theoretical SE: at n_holdout=21 × 46 stocks, theoretical 2·SE = 1.0 but empirical 95% interval across 10 seeds is [-6.3, +5.0] (std=3.15). Only 20% of seeds pass the theoretical bound on PURE shuffled noise. L26 confirmed at scale. Fix: bootstrap-calibrate the bound per (n_holdout, n_stocks, hp) regime instead of using closed-form SE. At n_holdout=180 the bound starts holding (shuffled_target = +0.07 on the dropout=0.5 candidate).	Phase 12.A diagnostic
L54	Phase 12.A — the dropout=0.5 + position-floor candidate that appeared to pass 5/6 leakage tests at n=21 (Sharpe ~5) is a window-position artifact, NOT real edge. `window_robustness_test` shows sharpe_short(n=21) = +2.44 collapsing to sharpe_long(n=180) = -0.14. Same family as L33/L34/S26 pre-Phase-6. Per-ticker position analysis confirms: positions correlate negatively with same-day returns (-0.26) but near-zero with next-day returns (+0.03) — the model is doing mean-reversion on the in-sample window that doesn't generalise. With the corrected window-stability gate as a 6th leakage test in series, the candidate fails honestly.	Phase 12.A window check
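
Learnings L24, L30 and L47 all rest on the same transform: fit per-feature mean and standard deviation on the training prefix only, then apply those statistics to every row, including any cheat column appended later. The sketch below illustrates that idea under stated assumptions. The function names `fit_causal_stats` and `apply_stats` and the list-of-rows layout are hypothetical and are not the signatures used in `leakage_tests.py`; the `1e-3` floor mirrors the clamp value quoted in L47.

```python
import statistics


def fit_causal_stats(rows, n_train, min_std=1e-3):
    """Return per-feature (mean, std) computed from rows[:n_train] only.

    Hypothetical helper: holdout rows never touch the statistics, and the
    std is floored so near-constant warmup features cannot explode.
    """
    train = rows[:n_train]
    stats = []
    for col in zip(*train):  # iterate over feature columns
        mean = statistics.fmean(col)
        std = statistics.pstdev(col)
        stats.append((mean, max(std, min_std)))
    return stats


def apply_stats(rows, stats):
    """Standardise every row (train and holdout) with the training stats."""
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]


# Column 0 is a fundamentals-scale feature, column 1 a returns-scale
# cheat column (illustrative numbers only).
rows = [
    [10.0, 0.01],
    [20.0, -0.01],
    [30.0, 0.02],
    [12.0, -0.02],
    [500.0, 0.03],  # holdout row: must not influence the stats
]
stats = fit_causal_stats(rows, n_train=4)
z = apply_stats(rows, stats)
```

After the transform both columns sit on a comparable scale over the training rows, so the planted cheat is no longer dwarfed by the raw feature. Applying the same `stats` to augmented columns is the point of the `transform_fn` fix described in L30.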

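L32 describes `walk_forward` inspecting each loss's signature so that losses declaring `pos` receive positions while pnl-only losses keep working unchanged. A minimal sketch of that dispatch with the standard `inspect` module follows. Only the name `_call_loss` and the `pos` parameter come from the table; the function body, the two example losses, and the penalty form `alpha * max(floor - mean|pos|, 0)` are assumptions for illustration. The `floor=0.05` value is chosen only so that fully degenerate positions cost 0.025 at `alpha=0.5`, the magnitude quoted in L37; the real functional form is not given here.

```python
import inspect
import statistics


def _call_loss(loss_fn, pnl, pos):
    """Pass `pos` only to losses whose signature declares it."""
    params = inspect.signature(loss_fn).parameters
    if "pos" in params:
        return loss_fn(pnl, pos=pos)
    return loss_fn(pnl)


def negative_sharpe(pnl):
    """pnl-only loss: scale-invariant, so it cannot see position size."""
    mean = statistics.fmean(pnl)
    std = statistics.pstdev(pnl)
    return -mean / max(std, 1e-8)


def sharpe_with_floor(pnl, pos, alpha=0.5, floor=0.05):
    """Hypothetical position-aware loss: adds a penalty when the mean
    absolute position falls below `floor`."""
    mean_abs_pos = statistics.fmean([abs(p) for p in pos])
    penalty = alpha * max(floor - mean_abs_pos, 0.0)
    return negative_sharpe(pnl) + penalty


pnl = [0.01, -0.005, 0.02, 0.0]
flat_pos = [0.0, 0.0, 0.0, 0.0]      # degenerate positions
real_pos = [0.1, -0.2, 0.1, 0.3]     # non-degenerate positions

base = _call_loss(negative_sharpe, pnl, flat_pos)
penalised = _call_loss(sharpe_with_floor, pnl, flat_pos)
unpenalised = _call_loss(sharpe_with_floor, pnl, real_pos)
```

Here `penalised - base` equals the floor penalty, while `unpenalised` matches `base` because the positions clear the floor. Keeping the dispatch in the harness rather than in each loss is what lets existing pnl-only losses run without modification.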