Learnings — L24 through L54 (31 entries)

Auto-generated from docs/decisions/003-phase-5-stage1-2-status.md at 2026-05-17 05:58:47 UTC.

Each learning is a numbered carry-forward from a concrete dead-end. The full table appears below; the methodology pages cross-reference these by number. L1-L23 predate the status doc's tabular learnings section and live in the narrative of docs/decisions/003-phase-5-stage1-2-status.md; see the source file for those.

| # | Learning | Source |
|---|---|---|
| L24 | Leakage_tests REQUIRE per-feature causal standardisation using [:n_train] stats. Raw features at magnitudes well above the planted cheat (e.g. fundamentals 5-30 vs return 0.01) cause look_ahead_cheat to falsely PASS. S21 added this and the cheat went from sharpe=1.0 to sharpe=8.2. Mandatory before merging any strategy. | S21 |
| L25 | T from feature_engineering_v2.build_unified_features can be >> the price-axis days (S13 forward-fills across all dates in S1's window). Our run had T=742 not the assumed T=81. Sharpe SE is bound by n_holdout × S, not T. Use that for window-sizing decisions. | S21 |
| L26 | Per-seed Sharpe std (4.96) is ~3× theoretical SE (1.55). Walk-forward retrains amplify seed sensitivity beyond IID-sample theory. Always bootstrap; never trust closed-form SE alone. | S21 |
| L27 | future_news_cheat threshold (5.0) was calibrated for 768-dim dense text embeddings (Spike 9 era). For sparse event-driven features (count, tone, flags), the cheat detects the leak (3.93 vs honest ~0) but doesn't cross 5.0. Replace absolute threshold with a relative metric (sharpe_with_cheat / sharpe_honest). | S26 |
| L28 | ITC's SUE max = 114.87 (hotels-business divestiture caused a one-off Net Profit jump). Raw Net Profit creates "mechanical surprises" that don't carry the slow-drift signal PEAD targets. Fix: use earnings-from-core-operations OR winsorise SUE at ±5σ. | S26 |
| L29 | HDFCBANK SUE range is 50× tighter than ITC's (banks' Net Profit σ is small in denominator). PEAD applied uniformly across the universe is questionable for Nifty composition with conglomerates + banks + pure-plays. Sector-grouped SUE z-scoring is the principled fix. | S26 |
| L30 | Phase 6.A: cheat tests must apply the same causal_standardise transform to augmented features as honest features got. Without transform_fn, raw-scale cheat columns are dwarfed by z-scored honest features → cheat tests under-lift → false negatives. Fixed in leakage_tests.py. | Phase 6.A |
| L31 | Phase 6.A: \|mean_pnl_cheat\| / max(\|mean_pnl_honest\|, 1e-8) >= 10 co-condition catches the negative_sharpe scale-invariance pathology that Sharpe-only thresholds missed. When honest positions are ~0, ratio test correctly flags cheat as FAIL regardless of Sharpe value. | Phase 6.A |
| L32 | Phase 6.E: loss_fn(pnl) is structurally pnl-only. NO loss can regulate position magnitude when it doesn't see positions. walk_forward now uses signature inspection (_call_loss) to pass pos to losses that declare it. Position-floor regularization belongs in the harness contract, NOT in any single loss function. | Phase 6.E |
| L33 | Phase 6.D orchestrator: backward rules MUST (a) early-return on inconclusive verdicts and (b) cap depth at ≤3 generations from seed. Otherwise: tool-execution failures cascade infinitely, and proven_rejected in a small loss pool ping-pongs forever. Both bugs hit production within minutes. Fixed via _depth(parent.id) guard. | Phase 6.D |
| L34 | At 5 stocks × 3y, the negative_sharpe scale-invariance pathology is STRUCTURAL, verified across 160 proven_rejected orchestrator nodes spanning 5 axes (loss swaps, feature ablations, architecture variants). Every high-Sharpe result decomposes to mean_pnl=0 + sharpe_std >> sharpe_mean. Fix is harness-level (position regularizer in walk_forward), not strategy-level. | Phase 6.D |
| L35 | Phase 6.D found the first leakage-clean strategies in the codebase: arch-r1-hidden-8 family (StrategyNet with hidden=8). All 4 leakage tests pass; honest Sharpe = +0.124 with mean_pnl_mean = 9e-7 (non-degenerate positions). Marked proven_rejected only because post_cost_proxy = max(sharpe - 0.5, 0) = 0 falls below the 0.4 threshold. The methodology is sound at this scale; the universe is the blocker. | Phase 6.D arch team |
| L36 | Architecture variants hidden ∈ {8, 16, 32} all converge to similar near-zero honest Sharpe with leakage-clean verdicts. This is evidence that feature signal is exhausted at 5 stocks × 3y, not loss/arch quality. The architecture team genuinely covered the ground available. | Phase 6.D arch team |
| L37 | sharpe_with_position_floor(α=5.0) is too aggressive: it forces position size beyond what feature signal supports, producing artifact Sharpe = +3.897 (above L23 bug-suspect ceiling). Softened to α=0.5 (10× gentler). Penalty when positions degenerate is now ~0.025 instead of 0.25, comparable to small Sharpe magnitudes, so it is a meaningful tiebreaker without overwhelming the Sharpe objective. | Phase 6.E retry |
| L38 | post_cost_sharpe_proxy = max(sharpe_mean - 0.5, 0) is a coarse proxy that rejects honest small-positive Sharpe results (e.g. +0.124 leakage-clean strategies). Real research output is real even when the proxy says zero. Future refinement: use a more nuanced cost model (e.g. max(sharpe - 1.5×turnover_avg×rho_per_step, 0) or empirical TCA). | Phase 6.D verdict review |
| L39 | negative_sortino on spike_22 (raw returns-scale features) produces a leakage-clean verdict with non-degenerate positions: Sharpe = -0.507, mean_pnl = -1.24e-5. Another data point that the universe, not the loss family, is the binding constraint at 5×3y. Anti-edge of -0.5 is well within the realistic [-0.7, +0.7] honest noise band. | Phase 6.D loss team |
| L40 | Position-floor loss breaks the scale-invariance pathology at scale (Nifty 50 = 46 tickers): mean_pnl_mean moves from a flat 0.0 at 5-bench to non-degenerate 5e-4..8e-4 across 15+ overnight runs. Confirms L32+L37: position-aware losses + larger universe = optimiser finds real positions. | S17 overnight |
| L41 | At 46-stock universe, shuffled_target leakage fails CONSISTENTLY across all loss/architecture/feature axes. Model produces Sharpe ≈ +5 on time-permuted returns, meaning the network locks onto cross-stock structural relationships (e.g. RELIANCE-vs-ITC vol differential) that survive time shuffling. This is structural leakage, not time-leakage. | S17 overnight |
| L42 | The four "fixes" in Phase 5/6/S17 each surfaced a new class of overfit: standardise→mean_pnl=0, ratio-gate→exposed loss pathology, position-floor→exposed memorization at scale, larger universe→still memorizes. Each layer of leakage testing reveals the optimiser's next escape route. This is whack-a-mole behaviour expected when the universe lacks genuine cross-sectional time-edge. | S17 overnight (synthesis) |
| L43 | The remaining structural blocker at 46 stocks is cross-stock memorisation: features like vol_60d, fund_*, and even gkg_n_articles_t (which has wildly different baseline rates per ticker: LT 15k/day vs HDFCBANK 2/day from L37) carry persistent stock-identity signal. The model learns "always go long high-density tickers" which correlates with shuffled returns because density doesn't change under time shuffling. Fix direction: time-variation-only features (per-stock z-scoring across rolling windows), or ticker-agnostic features only. | S17 overnight |
| L44 | Phase 8: per_stock_rolling_zscore(features, window=60) (in lib/preprocessing.py) removes per-(stock, feature) baseline by causal 60-day rolling z-score. At Nifty 50 scale this cuts shuffled_target Sharpe by ~50% (5.0 → 2.5, still FAIL vs bound 1.0) and triples mean_pnl_mean (0.0008 → 0.0025, positions no longer pinned near zero). Direction validated, magnitude insufficient. Per-stock z-scoring removes the level component of cross-stock memorisation (L43) but the model still extracts ticker-identity signal from feature covariance structure. Z-scoring is a necessary ingredient in the L43 fix, not the whole fix; pair it with the static-features test (L45) and stronger feature de-identification before declaring the regime clean. | Phase 8 zscored Nifty 50 nodes |
| L45 | Phase 8: 5th leakage test static_features_test (in stability_tests.py) replaces every feature column with its per-(stock) time-mean, holding it constant across the holdout. A time-edge model should produce ~zero positions on constant inputs; a ticker-memoriser keeps its baseline allocations. Pass iff \|sharpe\| < 2·SE. Directly catches ticker-identity memorisation that shuffled_target only indirectly hints at (shuffled-target leaves the cross-sectional level alive; static-features kills it outright). First measurements across zscored Nifty 50 nodes: best \|sharpe\| = 1.66, all FAIL. Confirms the model still extracts edge from constant stock-identity signal even after per-stock z-scoring; feature engineering still leaks identity into the model's input. | Phase 8 |
| L46 | Phase 8: across several p8-zscored-news-pf-* runs (position-floor on z-scored news features), 5 seeds produced consistent negative Sharpe ≈ −3.4 with mean_pnl_mean = −1.4e-3 (non-degenerate positions, no scale-invariance pathology). Tight cluster across seeds rules out random noise. Two viable readings: (a) real anti-edge in news-driven Indian large-caps that this codebase has been finding the wrong sign of all along: historically every news-channel result has been near-zero or negative (S20 −3.90 at 21d, post-L24 S20 collapse, etc.), and a consistent sign-flip across this many independent attempts points at signed-feature physics; (b) feature-engineering sign-flip bug in something like gkg_avg_tone_t (positive tone → wrong-sign return). Worth a hand-walk audit of one (date, ticker) cell from raw GKG → feature_engineering_v2 → model input before discarding the news channel. | Phase 8 zscored-news-pf seeds |
| L47 | Phase 9.A bug: _per_stock_zscore used sigma.clamp(min=1e-8) which blew up for warmup-period features (vol_60d ≈ 0 for first 60d → tiny std → 10^7+ z-scores). LayerNorm inside StrategyNet absorbed the scaling for the honest run (so it looked OK at -0.4 Sharpe) but DROWNED the cheat column (~1.0 scale) inside look_ahead_cheat_test. Plus _run_per_stock passed already-z-scored features to leakage tests, causing transform_fn to corrupt the head columns. Both fixed: clamp(min=1e-3), and pass RAW features to cheat tests. Post-fix: look_ahead_cheat = +30 (was -2.28); harness is now trustworthy for per-stock evaluation. | Phase 9.A diagnostic |
| L48 | Phase 9.B news audit: corr(gkg_avg_tone_t, next_ret) = -0.0164 overall (N=25,795), strengthening to -0.11 at \|tone\| > 5 and \|tone\| > 8. Per-ticker mean correlation = -0.023 (median -0.021). Sign-alignment at strong-tone cells (\|tone\| > 8): 44.4% (vs 50% noise). NO code bug: feature_engineering_v2 preserves sign+magnitude+date alignment (Audit 2 + 4). The L46 -3.4 Sharpe cluster was the model correctly extracting this weak contrarian signal, just at magnitude too large to be honest at this sample size. GDELT daily news on Indian large-caps is a weak contrarian signal, not momentum. Plausible mechanism: news reports react to returns rather than predict them; by the time GDELT indexes a story, the move has happened, so the next day mean-reverts. Practical implication: explicit -1 * tone features would produce honest Sharpe ~0.0-0.1 after costs, in noise band. Not enough to be the project's edge, but a real signed finding worth preserving as a feature transform when news is used downstream. | News audit on full Nifty 50 sample |
| L49 | Phase 11.A: pure-linear baseline (LinearStrategy, 1 layer, cross-sectionally de-meaned) at Nifty 50 with per-stock z-score + position-floor produces honest small Sharpe = -0.347 with static_features PASS (sharpe=+0.353). First leakage-clean-on-static-features verdict at Nifty 50 scale. Linear models structurally cannot construct the per-stock biases that MLPs use to fail the static_features test. shuffled_target still fails (2.76, vs 2-3 for MLP variants). | Phase 11.A |
| L50 | Phase 11.E: ticker_dropout=0.3 during training (per-step random masking of ~30% tickers with re-de-mean) at Nifty 50 + per-stock z-score + position-floor produces sharpe=+6.29 (L23 bug-suspect) but static_features PASS (-0.137) AND permutation_invariance PASS (+0.351). 4 of 6 leakage tests pass, the most comprehensive leakage pass yet for an MLP at scale. The remaining failures (shuffled_target +3.18, sharpe artifact) suggest the dropout-induced gradient noise inflates Sharpe even when ticker-identity leakage is removed. Tighter weight_decay or smaller hidden size on top of dropout is the natural followup. | Phase 11.E |
| L51 | Phase 11.C: explicit gkg_neg_tone_t = -1 * gkg_avg_tone_t feature (L48 action) on Nifty 50 + position-floor produces sharpe=-4.99 with mean_pnl=-2.3e-3; the model goes deep contrarian and fails most leakage tests (static_features=5.14, shuffled_target=-3.92). L48 signal is real (corr=-0.02 to -0.10) but explicit feature lets the model overweight a small contrarian edge into noise-amplification territory. Use neg_tone with stronger regularisation or as a minority ingredient in a larger feature set, not as a dominant signal. | Phase 11.C |
| L52 | Phase 11.F: sector_demean preprocessing (lib.sectors.sector_demean applied to features BEFORE per-stock z-score) at Nifty 50 + position-floor: sharpe=+2.09, mean_pnl=+1.07e-3, but shuffled_target=+4.93 FAIL (worse than baseline), static_features=+3.81 FAIL, permutation_invariance PASS (+1.07). Sector-demean alone does not fix the L43 leakage; it removes sector-mean structure but residual within-sector ticker identity still leaks. Combined sector_demean + ticker_dropout may be the next natural test. | Phase 11.F sector_demean smoke |
| L53 | Phase 12.A: shuffled_target bound 2 * _sharpe_se(n_holdout, n_stocks) is too tight at small n. Empirical seed std on noise data is 6× the theoretical SE: at n_holdout=21 × 46 stocks, theoretical 2·SE = 1.0 but empirical 95% interval across 10 seeds is [-6.3, +5.0] (std=3.15). Only 20% of seeds pass the theoretical bound on PURE shuffled noise. L26 confirmed at scale. Fix: bootstrap-calibrate the bound per (n_holdout, n_stocks, hp) regime instead of using closed-form SE. At n_holdout=180 the bound starts holding (shuffled_target = +0.07 on the dropout=0.5 candidate). | Phase 12.A diagnostic |
| L54 | Phase 12.A: the dropout=0.5 + position-floor candidate that appeared to pass 5/6 leakage tests at n=21 (Sharpe ~5) is a window-position artifact, NOT real edge. window_robustness_test shows sharpe_short(n=21) = +2.44 collapsing to sharpe_long(n=180) = -0.14. Same family as L33/L34/S26 pre-Phase-6. Per-ticker position analysis confirms: positions correlate negatively with same-day returns (-0.26) but near-zero with next-day returns (+0.03); the model is doing mean-reversion on the in-sample window that doesn't generalise. With the corrected window-stability gate as a 6th leakage test in series, the candidate fails honestly. | Phase 12.A window check |
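
Learnings L24 and L31 each reduce to a few lines of arithmetic. The following is a minimal sketch, assuming a feature column is a plain Python list; the function names `causal_standardise_sketch` and `pnl_ratio_condition` are hypothetical and are not the project's `causal_standardise` in leakage_tests.py, which may differ. It only evaluates the L31 inequality and makes no claim about how the harness combines it with the Sharpe threshold.

```python
from statistics import fmean, pstdev


def causal_standardise_sketch(column, n_train, eps=1e-8):
    """Z-score a feature column using statistics from the first n_train rows only."""
    train = column[:n_train]
    mu = fmean(train)
    sigma = max(pstdev(train), eps)  # eps is an assumed guard against zero variance
    return [(x - mu) / sigma for x in column]


def pnl_ratio_condition(mean_pnl_cheat, mean_pnl_honest, min_ratio=10.0):
    """Evaluate |mean_pnl_cheat| / max(|mean_pnl_honest|, 1e-8) >= min_ratio."""
    return abs(mean_pnl_cheat) / max(abs(mean_pnl_honest), 1e-8) >= min_ratio


# Illustrative raw values on the 5-30 scale mentioned in L24 (made-up numbers).
fundamentals = [5.0, 12.0, 30.0, 18.0, 25.0]
z = causal_standardise_sketch(fundamentals, n_train=3)
print([round(v, 2) for v in z])

print(pnl_ratio_condition(5e-4, 0.0))   # honest PnL ~0, so the ratio is huge: True
print(pnl_ratio_condition(1e-3, 5e-4))  # ratio of 2 is below 10: False
```

Holdout rows are scaled with training-window statistics only, so no future information enters the transform.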
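
The signature-inspection dispatch described in L32 can be illustrated as follows. This is a sketch under stated assumptions, not the project's `_call_loss`: the helper `call_loss_sketch` and both example losses are hypothetical, and `floor=0.05` is chosen only so that the degenerate-position penalty at α=0.5 matches the ~0.025 quoted in L37.

```python
import inspect
from statistics import fmean


def call_loss_sketch(loss_fn, pnl, pos):
    """Call loss_fn with pos only if its signature declares a `pos` parameter."""
    if "pos" in inspect.signature(loss_fn).parameters:
        return loss_fn(pnl, pos=pos)
    return loss_fn(pnl)


def negative_mean_pnl(pnl):
    """A pnl-only loss: it never sees positions, so it cannot regulate them."""
    return -fmean(pnl)


def negative_mean_pnl_with_floor(pnl, pos, alpha=0.5, floor=0.05):
    """Hypothetical position-aware loss: penalise mean |position| below `floor`."""
    mean_abs_pos = fmean([abs(p) for p in pos])
    shortfall = max(floor - mean_abs_pos, 0.0)
    return -fmean(pnl) + alpha * shortfall


flat_pnl, flat_pos = [0.0, 0.0], [0.0, 0.0]
print(call_loss_sketch(negative_mean_pnl, flat_pnl, flat_pos))
print(call_loss_sketch(negative_mean_pnl_with_floor, flat_pnl, flat_pos))
```

Keeping the dispatch in the harness means existing pnl-only losses keep working unchanged, while any loss that declares `pos` receives it automatically.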
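
The warm-up blow-up in L47 is easy to reproduce on a toy series. The sketch below assumes the rolling statistics are taken from strictly earlier observations and that a point with no history maps to 0.0; neither detail is confirmed by the table, and `rolling_zscore_sketch` is a hypothetical stand-in for the project's `per_stock_rolling_zscore` and `_per_stock_zscore`.

```python
from statistics import fmean, pstdev


def rolling_zscore_sketch(series, window=60, min_sigma=1e-3):
    """Z-score each point against mean/std of up to `window` strictly earlier points."""
    out = []
    for t, x in enumerate(series):
        history = series[max(0, t - window):t]
        if not history:
            out.append(0.0)  # no past data yet: emit a neutral value (assumption)
            continue
        # The floor plays the role of sigma.clamp(min=...) in L47.
        sigma = max(pstdev(history), min_sigma)
        out.append((x - fmean(history)) / sigma)
    return out


# A vol-like feature that is ~0 during warm-up and then takes a real value.
warmup = [0.0, 0.0, 0.0, 0.02]
print(rolling_zscore_sketch(warmup, min_sigma=1e-8)[-1])  # on the order of 1e6
print(rolling_zscore_sketch(warmup, min_sigma=1e-3)[-1])  # about 20
```

To apply it per stock, map the function over each ticker's series, for example `{ticker: rolling_zscore_sketch(s) for ticker, s in features.items()}` with `features` a hypothetical dict of ticker to series.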
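
The input transformation behind the static-features test in L45 can be sketched as below. It assumes features are stored as a dict mapping each ticker to a list of per-date rows and that the time-mean is taken over the rows supplied; `to_static_features` and `passes_static_test` are hypothetical names, not the contents of stability_tests.py. The value `se=0.5` is illustrative only, taken from the "2·SE = 1.0" figure in L53.

```python
from statistics import fmean


def to_static_features(features_by_ticker):
    """Replace each ticker's feature rows with that ticker's per-column time-mean."""
    static = {}
    for ticker, rows in features_by_ticker.items():
        col_means = [fmean(col) for col in zip(*rows)]
        # Every date gets the same row, so only stock identity survives.
        static[ticker] = [list(col_means) for _ in rows]
    return static


def passes_static_test(sharpe, se):
    """Pass rule quoted in L45: |sharpe| < 2 * SE."""
    return abs(sharpe) < 2.0 * se


toy = {"AAA": [[1.0, 10.0], [3.0, 30.0]], "BBB": [[0.0, 4.0], [2.0, 8.0]]}
print(to_static_features(toy))
print(passes_static_test(0.353, se=0.5))  # the L49 linear-baseline value
print(passes_static_test(1.66, se=0.5))   # the best L45 value
```

A model with genuine time-edge has nothing to trade on once every row is constant, so any sizeable Sharpe on these inputs points at ticker-identity memorisation.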
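
The closed-form figures quoted in L26 and L53 are consistent with an annualised standard error of sqrt(252 / (n_holdout × n_stocks)). That form is inferred from the numbers in the table, not from the source of `_sharpe_se`, so treat it as an assumption. The sketch also shows a simple empirical alternative: a bound of 2× the seed-to-seed standard deviation of Sharpe on shuffled targets. This is a stand-in for the bootstrap calibration that L53 recommends, not the calibration itself, and both function names are hypothetical.

```python
import math
from statistics import stdev


def sharpe_se_sketch(n_holdout, n_stocks, periods_per_year=252):
    """Assumed closed-form SE of an annualised Sharpe over n_holdout * n_stocks samples."""
    return math.sqrt(periods_per_year / (n_holdout * n_stocks))


def empirical_bound(null_sharpes, width=2.0):
    """Bound from the spread of Sharpe values measured on shuffled (noise) targets."""
    return width * stdev(null_sharpes)


print(round(sharpe_se_sketch(21, 5), 2))       # 1.55, the SE quoted in L26
print(round(2 * sharpe_se_sketch(21, 46), 2))  # 1.02, the "2·SE = 1.0" in L53

# Made-up null Sharpes standing in for one shuffled-target run per seed.
null_sharpes = [-4.1, 2.9, 0.3, -1.7, 3.6, -2.2]
print(round(empirical_bound(null_sharpes), 2))
```

When the empirical bound is several times the closed-form one, as L53 reports at n_holdout=21, a pass or fail against the closed-form bound says little about leakage.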

OpenBracket v0.6 — methodology release-ready; v1 forecaster in active build.