Phase 6 — harness recalibration

Phase 6 turned the regime trade-off finding from Phase 5 into five concrete patches (6.A through 6.E), then launched the autograd-style orchestrator (6.D) that ran every subsequent phase. A huddle (Nina + Suren + Wei) drove the punch list.

6.A — transform_fn + mean-PnL ratio co-condition (L30, L31)

Three issues closed:

  1. _causal_standardise lifted to src/stock_core/lib/preprocessing.py as a function that returns the transform. Single source of truth.
  2. look_ahead_cheat_test + future_news_cheat_test accept an optional transform_fn, so cheat features travel through the same standardisation as honest features. Closes the regime trade-off where standardised honest features made raw-scale cheats invisibly small.
  3. Mean-PnL ratio co-condition |mean_pnl_cheat| / max(|mean_pnl_honest|, 1e-8) >= 10 added. Catches the negative_sharpe scale-invariance pathology that Sharpe-only thresholds missed.
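
A minimal sketch of the ratio co-condition from item 3, assuming the cheat test already produces a boolean Sharpe verdict. The function names and the sharpe_ok argument are hypothetical; only the ratio formula and the 10x threshold come from the text above.

```python
def mean_pnl_ratio(mean_pnl_cheat: float, mean_pnl_honest: float,
                   eps: float = 1e-8) -> float:
    """|mean_pnl_cheat| / max(|mean_pnl_honest|, eps), as in item 3."""
    return abs(mean_pnl_cheat) / max(abs(mean_pnl_honest), eps)


def passes_cheat_gate(sharpe_ok: bool, mean_pnl_cheat: float,
                      mean_pnl_honest: float, min_ratio: float = 10.0) -> bool:
    """Hypothetical combined gate: the Sharpe check AND the ratio check.

    sharpe_ok stands in for whatever Sharpe-based threshold the harness
    applies; how that threshold is computed is not specified here.
    """
    return sharpe_ok and mean_pnl_ratio(mean_pnl_cheat, mean_pnl_honest) >= min_ratio
```

A cheat whose mean PnL collapses to zero yields a ratio of 0.00 and fails the gate even if its Sharpe looks acceptable, which is the case a Sharpe-only threshold cannot see.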

Merged at 743d301.

6.B — window-stability test

New src/stock_core/spikes/spike_09_walk_forward/stability_tests.py with window_robustness_test — runs walk_forward on two holdout sizes against the same data and gates on |sharpe_short - sharpe_long| < 2·sqrt(SE²_short + SE²_long). A real edge survives the nested-window comparison; an artifact does not.
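
The gate itself is a single inequality. The sketch below shows it in isolation, assuming the two Sharpe values and their standard errors have already been computed by the walk-forward runs; the function name and signature are illustrative, not the actual window_robustness_test API.

```python
import math


def window_gate(sharpe_short: float, se_short: float,
                sharpe_long: float, se_long: float,
                z: float = 2.0) -> bool:
    """True when the two holdout Sharpes agree within z combined standard errors.

    Implements |sharpe_short - sharpe_long| < z * sqrt(SE_short^2 + SE_long^2).
    """
    combined_se = math.sqrt(se_short ** 2 + se_long ** 2)
    return abs(sharpe_short - sharpe_long) < z * combined_se
```

For example, Sharpes of 1.0 and 0.8 with standard errors of 0.3 each differ by 0.2, well inside the bound of about 0.85, so the gate passes. Sharpes of 2.0 and 0.1 with standard errors of 0.2 each differ by 1.9 against a bound of about 0.57, so the gate fails.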

Merged at 73b716e.

6.C — prod validation re-run

Re-ran S22 + S20 against the recalibrated harness. Verdict:

Spike               shuffled_target  look_ahead_cheat          future_news_cheat
S22 (raw features)  FAIL (-2.25)     PASS (6.33, ratio 18.57)  PASS (9.76, ratio 18.07)
S20 (standardised)  PASS (0.94)      FAIL (-4.81, ratio 0.00)  FAIL (0.54, ratio 0.00)

The 0.00 ratios on S20 correctly flagged the negative_sharpe pathology that the pre-recalibration harness silently allowed.

6.D — the autograd-style orchestrator

SQLite-backed multi-agent orchestrator at src/stock_core/orchestrator/. Five parallel teams (Loss, Features, Architecture, Universe, Personality) run as tmux sessions on prod under the stockwork service account. Each team's bash driver atomically claims pending nodes from state.db, invokes claude --print to run the experiment, and writes a verdict back. Backward rules apply autograd-style child-spawning. A persister tmux session commits, pushes, and rclone-syncs every 30 min. A supervisor tmux session self-heals dead teams every 5 min.
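
The atomic claim is what keeps parallel drivers from running the same node twice. The real drivers are bash scripts, so the following is only a Python illustration of the pattern. The nodes table, its columns, and the 'pending' / 'running' status strings are assumptions made for this sketch, not the actual state.db schema.

```python
import sqlite3
from typing import Optional


def open_state_db(path: str) -> sqlite3.Connection:
    """Open the database with a hypothetical nodes table."""
    # isolation_level=None disables implicit transactions, so the
    # BEGIN/COMMIT statements below are the only transaction control.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes ("
        " id TEXT PRIMARY KEY, team TEXT NOT NULL, status TEXT NOT NULL)"
    )
    return conn


def claim_pending_node(conn: sqlite3.Connection, team: str) -> Optional[str]:
    """Claim one pending node for a team, or return None if none is left."""
    # BEGIN IMMEDIATE takes the write lock before the SELECT, so no other
    # writer can claim the same row between the read and the update.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM nodes WHERE team = ? AND status = 'pending' "
            "ORDER BY id LIMIT 1",
            (team,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE nodes SET status = 'running' WHERE id = ?", (row[0],)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else row[0]
```

Taking the write lock up front trades a little concurrency for simplicity: a second driver blocks briefly rather than racing on the same row.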

Two bugs caught + fixed in production (L33):

  • Inconclusive cascade: rules fired on tool-failure verdicts → unbounded child-spawning with ever-longer node IDs.
  • Ping-pong cycle: rules fired on proven_rejected too → 2-loss pool oscillated forever. Added _depth(parent.id) >= 3 depth-cap guard.

Merged at cfc518f. Patches b91df6a, 03195f5. See Orchestrator guide.

6.E — position-aware loss (L32)

The structural finding from 160 rejected nodes: every high-Sharpe verdict decomposed to mean_pnl = 0 and sharpe_std >> sharpe_mean. The blocker was the loss-fn contract: loss_fn(pnl) was pnl-only and could not regulate position magnitude.

Fix: walk_forward._call_loss(loss_fn, pnl, pos) dispatches via inspect.signature. Legacy losses keep working. New losses can declare pos and get position tensors directly. Added lib/losses.sharpe_with_position_floor(pnl, pos, floor=0.05, alpha=0.5) (alpha was initially 5.0, softened to 0.5 per L37). The penalty α·ReLU(floor − mean|pos|) has a direct gradient on positions, breaking the scale-invariance saddle.

Merged at fff370f.

What it produced

L30, L31, L32, L33, L34, L35, L36, L37, L38, L39 all originated in Phase 6. The orchestrator surfaced the first leakage-clean strategies in the codebase: arch-r1-hidden-8 (StrategyNet with hidden=8), with all four leakage tests passing, honest Sharpe = +0.124, and non-degenerate positions. Marked proven_rejected only because post_cost_proxy = max(sharpe - 0.5, 0) = 0 fell below the 0.4 threshold (L38). The methodology was sound at this scale; the universe was the blocker.

OpenBracket v0.6 — methodology release-ready; v1 forecaster in active build.