Process Supervision
Process supervision is a training method that gives a model feedback on each step of its reasoning, while outcome supervision judges only the final result; in noisy domains, the difference determines whether a model learns skill or luck.
The two approaches answer different questions. Outcome supervision asks: was the answer right? Process supervision asks: was the thinking right?
What it is
Consider a math problem. Outcome supervision checks whether the final answer is correct; process supervision checks each calculation step. A student can reach the right answer through two errors that cancel out, or the wrong answer despite near-perfect reasoning with one small slip. Outcome supervision misgrades both; process supervision catches both. The reasoning, not the answer, is what generalizes to the next problem.
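The two failure cases above can be made concrete with a toy grader. This is a minimal sketch, not a real process reward model: the `Step` record, the task `(3 + 4) * 2`, and both student attempts are hypothetical and exist only to show how the two labeling schemes disagree.

```python
import operator
from dataclasses import dataclass

# Hypothetical toy setup: each step is one binary arithmetic operation.
OPS = {"+": operator.add, "*": operator.mul}

@dataclass
class Step:
    left: int
    op: str
    right: int
    claimed: int  # the value the student wrote down for this step

def outcome_label(steps, expected_answer):
    """Outcome supervision: compare only the final claimed value."""
    return steps[-1].claimed == expected_answer

def process_labels(steps):
    """Process supervision: check each step against its own inputs."""
    return [OPS[s.op](s.left, s.right) == s.claimed for s in steps]

# Task: (3 + 4) * 2, expected answer 14.
# Attempt A: two errors that cancel (3 + 4 = 8, then 8 * 2 = 14).
cancelling_errors = [Step(3, "+", 4, 8), Step(8, "*", 2, 14)]
# Attempt B: sound first step, one slip at the end (7 * 2 = 15).
one_small_slip = [Step(3, "+", 4, 7), Step(7, "*", 2, 15)]

for name, attempt in [("A", cancelling_errors), ("B", one_small_slip)]:
    print(name, outcome_label(attempt, 14), process_labels(attempt))
```

Attempt A earns a passing outcome label even though neither step is valid, while attempt B fails the outcome check despite a correct first step. The step-level labels localize the error in both cases.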
OpenAI made the case empirically in their 2023 paper "Let's Verify Step by Step": training process reward models that score each reasoning step "significantly outperforms outcome supervision" on problems from the challenging MATH dataset. The authors also argue that the benefit goes beyond accuracy: because every step is individually supervised, the resulting reasoning is more likely to be interpretable and trustworthy.
Outcome supervision remains attractive for one reason: it is cheap. You only label final results. But the cost shows up later, in models that pattern-match to correct answers without sound underlying logic, and in models that learn reward hacking: exploiting whatever the outcome metric measures rather than what you meant.
Process supervision has a data dependency: you cannot supervise steps you never recorded. It requires reasoning traces (decisions broken into explicit phases with tool calls and confidence scores) and step-level evaluations from people qualified to judge them. In finance that looks like structured feedback per step: market read correct, risk assessment slightly aggressive, timing logic solid, blind spot: ignored volume divergence. The supervised steps pay a second dividend at deployment time: because every link in the chain was individually evaluated during training, the model's reasoning stays inspectable in production, and a flawed step can be caught before its conclusion is acted on, even while the eventual outcome is still unknown.
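One way to picture the data this requires is a small record type. The sketch below is an assumption about shape, not a published schema: every class name, field name, tool name, and sample value is hypothetical. Only the evaluation verdicts and the blind spot are taken from the finance example above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolCall:
    name: str         # hypothetical tool identifier
    arguments: dict

@dataclass
class StepEvaluation:
    verdict: str      # the expert's judgment of this step
    reviewer: str     # who made the judgment

@dataclass
class TraceStep:
    phase: str        # explicit phase of the decision
    reasoning: str    # what the model said at this step
    confidence: float # model's stated confidence, 0.0 to 1.0
    tool_calls: list = field(default_factory=list)
    evaluation: Optional[StepEvaluation] = None

@dataclass
class DecisionTrace:
    steps: list
    blind_spots: list = field(default_factory=list)

    def unsupervised_phases(self):
        """Phases that were recorded but never evaluated by an expert."""
        return [s.phase for s in self.steps if s.evaluation is None]

trace = DecisionTrace(
    steps=[
        TraceStep("market read", "Trend intact above support.", 0.7,
                  [ToolCall("get_price_history", {"symbol": "XYZ"})],
                  StepEvaluation("correct", "senior trader")),
        TraceStep("risk assessment", "Size at 3% of book.", 0.6,
                  evaluation=StepEvaluation("slightly aggressive", "senior trader")),
        TraceStep("timing", "Enter on the retest.", 0.65,
                  evaluation=StepEvaluation("solid", "senior trader")),
    ],
    blind_spots=["ignored volume divergence"],
)
print(trace.unsupervised_phases())  # empty when every step has an evaluation
```

Keeping the evaluation optional makes the data dependency explicit: a step that was recorded but never judged shows up in `unsupervised_phases()` instead of silently counting as supervised.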
Why it matters for financial AI
Trading is close to the worst case for outcome supervision. A trader buys on a friend's tip and a surprise earnings beat bails them out: outcome supervision says "do more of this." Another trader does rigorous analysis and loses to an unpredictable event: outcome supervision says "never do this again." Both lessons are wrong, and a model trained on enough of them learns to replicate luck. Distinguishing genuine skill from variance on outcomes alone takes hundreds or thousands of independent trades; process supervision makes the distinction on every single decision by evaluating the reasoning directly.
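The "hundreds or thousands of trades" figure can be sanity-checked with a standard sample-size approximation. The sketch below assumes independent, identically distributed trade returns and a one-sided z-test for a positive mean; the edge ratios, significance level, and power are illustrative choices, not measurements from any real strategy.

```python
import math
from statistics import NormalDist

def trades_needed(edge_ratio, alpha=0.05, power=0.80):
    """Approximate number of trades needed to detect a positive mean return.

    edge_ratio: mean return per trade divided by its standard deviation.
    Assumes i.i.d. trades and a one-sided z-test (a simplification).
    """
    if edge_ratio <= 0:
        raise ValueError("edge_ratio must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # threshold set by the false-positive rate
    z_power = z.inv_cdf(power)      # extra margin to reach the desired power
    return math.ceil(((z_alpha + z_power) / edge_ratio) ** 2)

# Illustrative per-trade edge ratios (assumed values).
for ratio in (0.20, 0.10, 0.05):
    print(f"edge/volatility = {ratio:.2f}: about {trades_needed(ratio)} trades")
```

Under these assumptions the requirement grows with the inverse square of the edge: halving the per-trade edge roughly quadruples the number of trades, which for the ratios shown spans from the low hundreds to the low thousands. Real trades are rarely independent, so these counts are best read as optimistic lower bounds.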
Combined with counterfactuals, process evaluation produces a clean four-quadrant decomposition. Good reasoning with a good outcome is probably skill. Good reasoning with a bad outcome is unlucky, yet still excellent training data. Bad reasoning with a good outcome is lucky noise, possibly harmful to train on. Bad reasoning with a bad outcome is a deserved negative example. Outcome supervision collapses these four cases into two and gets half of them backwards.
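The decomposition is small enough to write down directly. In this sketch the two boolean inputs stand in for judgments that in practice come from expert step evaluation and realized P&L; the function names and the "reinforce"/"penalize" labels are hypothetical.

```python
from enum import Enum

class Quadrant(Enum):
    SKILL = "probably skill"
    UNLUCKY = "unlucky, still excellent training data"
    LUCKY = "lucky noise, possibly harmful to train on"
    DESERVED = "deserved negative example"

def classify(reasoning_sound, outcome_good):
    """Map a (reasoning, outcome) pair to one of the four quadrants."""
    if reasoning_sound:
        return Quadrant.SKILL if outcome_good else Quadrant.UNLUCKY
    return Quadrant.LUCKY if outcome_good else Quadrant.DESERVED

def outcome_signal(outcome_good):
    """What outcome supervision tells the model: it sees only the result."""
    return "reinforce" if outcome_good else "penalize"

def process_signal(reasoning_sound):
    """What process supervision tells the model: it sees the reasoning."""
    return "reinforce" if reasoning_sound else "penalize"

# Collect the quadrants where the two training signals disagree.
disagreements = [
    classify(reasoning_sound, outcome_good)
    for reasoning_sound in (True, False)
    for outcome_good in (True, False)
    if outcome_signal(outcome_good) != process_signal(reasoning_sound)
]
print([q.name for q in disagreements])
```

The two signals agree on the skill and deserved-negative quadrants and disagree on the unlucky and lucky ones, which are the two cases where an outcome-only signal points the wrong way.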
This is why process supervision sits at the center of financial post-training. It trains on whether the analysis identified the right factors, not just whether the trade profited; on whether risk was sized appropriately, not just whether the stop got hit. It rewards sound analysis even when the result is unfavorable and penalizes flawed reasoning even when the P&L looks good, exactly the signal RLHF-style outcome preferences cannot provide in a noisy market. UV Labs' decision episodes pair full reasoning traces with outcomes and counterfactuals precisely so labs can train this way; see Decision Data.
Outcome supervision
- Judges only the final result: answer correct, trade profitable
- Cheap labels: one judgment per example
- Conflates luck with skill in noisy domains
- Invites reward hacking on the outcome metric
- Model learns to pattern-match answers
Process supervision
- Judges every intermediate reasoning step
- Expensive labels: requires traces plus expert step evaluation
- Separates decision quality from outcome variance
- Catches flawed logic even when results look good
- Model learns how to think: more robust and interpretable
Common questions
What is the difference between process supervision and outcome supervision?
Outcome supervision judges only the final result: did the answer check out, did the trade make money. Process supervision evaluates each intermediate step of the reasoning that produced the result. Outcome labels are cheaper to collect but conflate luck with skill; process labels are more expensive but teach the model how to think, and OpenAI found that process supervision significantly outperforms outcome supervision on challenging math problems.
Why is outcome supervision unreliable in trading?
Because market outcomes are noisy. A trader who buys for terrible reasons can profit from a surprise move, and a trader who does everything right can lose to bad variance. Outcome supervision rewards the first and punishes the second, and both lessons are wrong. Statistically, separating skill from luck on outcomes alone takes hundreds or thousands of independent trades; process supervision separates them on every single decision.
What data does process supervision require?
Stepwise data. You need reasoning traces that break each decision into explicit phases with tool calls and confidence scores, plus step-level evaluations from people qualified to judge them: was the market read correct, was the risk sizing appropriate, was the stop placement sound. Decision episodes that pair traces with outcomes and counterfactuals are the data structure built for this.
UV Labs builds post-training decision data for financial AI.