Phase 6 — harness recalibration
Phase 6 turned the regime trade-off finding from Phase 5 into five concrete patches (6.A through 6.E), then launched the autograd-style orchestrator (6.D) that ran every subsequent phase. A huddle (Nina + Suren + Wei) drove the punch list.
6.A — transform_fn + mean-PnL ratio co-condition (L30, L31)
Three issues closed:
- _causal_standardise lifted to src/stock_core/lib/preprocessing.py as a returns-the-transform function. Single source of truth.
- look_ahead_cheat_test + future_news_cheat_test accept an optional transform_fn so cheat features travel through the same standardisation as honest features. Closes the regime trade-off where standardised honest features made raw-scale cheats invisibly small.
- Mean-PnL ratio co-condition |mean_pnl_cheat| / max(|mean_pnl_honest|, 1e-8) >= 10 added. Catches the negative_sharpe scale-invariance pathology that Sharpe-only thresholds missed.
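A minimal sketch of how the ratio co-condition could be combined with a Sharpe gate. The helper names (apply_transform, mean_pnl_ratio, cheat_detected) and the sharpe_threshold parameter are assumptions for illustration; only the ratio formula and the >= 10 cut-off come from the notes above.

```python
from statistics import fmean
from typing import Callable, Optional, Sequence


def apply_transform(
    features: Sequence[float],
    transform_fn: Optional[Callable[[Sequence[float]], Sequence[float]]] = None,
) -> Sequence[float]:
    """Send cheat features through the same transform as honest ones, if one is given."""
    return transform_fn(features) if transform_fn is not None else features


def mean_pnl_ratio(
    pnl_cheat: Sequence[float], pnl_honest: Sequence[float], eps: float = 1e-8
) -> float:
    """|mean_pnl_cheat| / max(|mean_pnl_honest|, eps), as stated in the co-condition."""
    return abs(fmean(pnl_cheat)) / max(abs(fmean(pnl_honest)), eps)


def cheat_detected(
    sharpe_cheat: float,
    pnl_cheat: Sequence[float],
    pnl_honest: Sequence[float],
    sharpe_threshold: float,  # hypothetical: the real Sharpe gate is not given here
    min_ratio: float = 10.0,
) -> bool:
    """Both conditions must hold: a high Sharpe alone is not enough."""
    ratio_ok = mean_pnl_ratio(pnl_cheat, pnl_honest) >= min_ratio
    return sharpe_cheat >= sharpe_threshold and ratio_ok
```

The point of the second condition is that a cheat model whose mean PnL collapses to zero produces a ratio near 0.00 and fails, regardless of what its Sharpe figure looks like.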
Merged at 743d301.
6.B — window-stability test
New src/stock_core/spikes/spike_09_walk_forward/stability_tests.py with window_robustness_test — runs walk_forward on two holdout sizes against the same data and gates on |sharpe_short - sharpe_long| < 2·sqrt(SE²_short + SE²_long). A real edge survives the nested-window comparison; an artifact does not.
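The gate itself can be sketched in a few lines. This assumes the two Sharpe estimates and their standard errors are already computed by walk_forward; the function name window_gate and the parameter k are hypothetical, and how the standard errors are estimated is not specified here.

```python
import math


def window_gate(
    sharpe_short: float,
    se_short: float,
    sharpe_long: float,
    se_long: float,
    k: float = 2.0,
) -> bool:
    """Pass when the two Sharpe estimates agree within k combined standard errors."""
    gap = abs(sharpe_short - sharpe_long)
    tolerance = k * math.sqrt(se_short**2 + se_long**2)
    return gap < tolerance


# Estimates that agree within noise pass; a large gap does not.
stable = window_gate(1.0, 0.3, 0.8, 0.2)    # gap 0.2 vs tolerance of about 0.72
unstable = window_gate(2.5, 0.3, 0.4, 0.2)  # gap 2.1 vs tolerance of about 0.72
```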
Merged at 73b716e.
6.C — prod validation re-run
Re-ran S22 + S20 against the recalibrated harness. Verdict:
| Spike | shuffled_target | look_ahead_cheat | future_news_cheat |
|---|---|---|---|
| S22 (raw features) | FAIL (-2.25) | PASS (6.33, ratio 18.57) | PASS (9.76, ratio 18.07) |
| S20 (standardised) | PASS (0.94) | FAIL (-4.81, ratio 0.00) | FAIL (0.54, ratio 0.00) |
The 0.00 ratios on S20 correctly flagged the negative_sharpe pathology that the pre-recalibration harness silently allowed.
6.D — the autograd-style orchestrator
Sqlite-backed multi-agent orchestrator at src/stock_core/orchestrator/. Five parallel teams (Loss, Features, Architecture, Universe, Personality) running as tmux sessions on prod under the stockwork service account. Each team's bash driver claims pending nodes from state.db (atomic), invokes claude --print to run the experiment, writes a verdict back. Backward rules apply autograd-style child-spawning. Persister tmux session commits + pushes + rclone-syncs every 30 min. Supervisor tmux session self-heals dead teams every 5 min.
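One way an atomic claim against state.db could look, sketched in Python rather than the bash the drivers actually use. The nodes table, its columns (id, team, status) and the 'running' status value are assumptions; the real schema is not shown here.

```python
import sqlite3
from typing import Optional


def claim_pending_node(conn: sqlite3.Connection, team: str) -> Optional[str]:
    """Claim one pending node for a team, or return None if there is none.

    Expects a connection opened with isolation_level=None so that the
    transaction boundaries below are explicit.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two drivers cannot
    # both read the same pending row and then both mark it as theirs.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM nodes WHERE team = ? AND status = 'pending' "
            "ORDER BY id LIMIT 1",
            (team,),
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE nodes SET status = 'running' WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Taking the lock before the read is the design choice that matters: a plain SELECT followed by UPDATE in separate transactions would let two teams' drivers claim the same node.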
Two bugs caught + fixed in production (L33):
- Inconclusive cascade: rules fired on tool-failure verdicts → infinite ever-longer node IDs.
- Ping-pong cycle: rules fired on proven_rejected too → 2-loss pool oscillated forever. Added _depth(parent.id) >= 3 depth-cap guard.
Merged at cfc518f. Patches b91df6a, 03195f5. See Orchestrator guide.
6.E — position-aware loss (L32)
The structural finding from 160 rejected nodes: every high-Sharpe verdict decomposed to mean_pnl = 0 and sharpe_std >> sharpe_mean. The blocker was the loss-fn contract: loss_fn(pnl) was pnl-only and could not regulate position magnitude.
Fix: walk_forward._call_loss(loss_fn, pnl, pos) dispatches via inspect.signature. Legacy losses keep working. New losses can declare pos and get position tensors directly. Added lib/losses.sharpe_with_position_floor(pnl, pos, floor=0.05, alpha=0.5) (initially 5.0, softened to 0.5 per L37). Penalty α·ReLU(floor − mean|pos|) has a direct gradient on positions, breaking the scale-invariance saddle.
Merged at fff370f.
What it produced
L30, L31, L32, L33, L34, L35, L36, L37, L38, L39 all originated in Phase 6. The orchestrator surfaced the first leakage-clean strategies in the codebase — arch-r1-hidden-8 (StrategyNet with hidden=8) — all four leakage tests passing with honest Sharpe = +0.124 and non-degenerate positions. Marked proven_rejected only because post_cost_proxy = max(sharpe - 0.5, 0) = 0 fell below the 0.4 threshold (L38). The methodology was sound at this scale; the universe was the blocker.
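The rejection arithmetic for arch-r1-hidden-8, as a quick check. The function and parameter names are illustrative, and treating the 0.4 threshold as inclusive is an assumption.

```python
def post_cost_proxy(sharpe: float, cost_haircut: float = 0.5) -> float:
    """max(sharpe - 0.5, 0), per the formula quoted above."""
    return max(sharpe - cost_haircut, 0.0)


THRESHOLD = 0.4

proxy = post_cost_proxy(0.124)  # 0.124 - 0.5 is negative, so the proxy clamps to 0.0
passes = proxy >= THRESHOLD     # False: 0.0 is below 0.4
```

Read the other way, a strategy needs an honest Sharpe of about 0.9 (0.5 haircut plus the 0.4 threshold) to clear this gate.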