RLHF
RLHF (Reinforcement Learning from Human Feedback) is a post-training technique in which humans rank model outputs, those rankings train a reward model, and the reward model guides the AI toward responses humans prefer.
It is the technique that turned raw next-word predictors into useful assistants, and the one whose assumptions financial AI most thoroughly violates.
What it is
RLHF runs in three steps. First, humans compare pairs of model responses to the same prompt and pick the better one. Second, those preferences train a reward model, a network that learns to predict how a human would rate any response. Third, the language model is optimized with reinforcement learning to maximize that reward signal. Rather than trying to specify exactly what good behavior looks like, you let humans demonstrate it through their choices.
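The second step can be sketched as a loss function. A common way to train a reward model from pairwise choices is a Bradley-Terry style objective, where the probability that the chosen response wins is the sigmoid of the reward difference. The source does not say which objective any particular lab uses, so treat this as one plausible formulation; the function names are illustrative.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one,
    assuming P(chosen wins) = sigmoid(reward_chosen - reward_rejected)."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average loss over (reward_chosen, reward_rejected) pairs."""
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)

# Hypothetical reward-model scores for two labeled comparisons.
pairs = [(1.2, 0.3), (0.1, 0.8)]
print(mean_preference_loss(pairs))
```

The loss is small when the reward model already scores the human-preferred response higher and grows when it ranks the pair the wrong way round, which is what pushes the reward model toward the labelers' choices.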
This is what transformed GPT-3 into ChatGPT. OpenAI hired roughly 40 contractors to make thousands of comparisons, and outputs from the resulting 1.3B-parameter InstructGPT model were preferred by human evaluators over those of the raw 175B-parameter GPT-3, a model more than 100 times larger. RLHF is now a standard stage of post-training, usually run after supervised fine-tuning, and an entire industry supplies the labor: generalist labelers earn $20-40/hour, specialists with doctorates $90-200/hour.
Two neighbors are worth knowing. DPO (Direct Preference Optimization) reaches similar results without the separate reward model, adjusting the language model directly from preference pairs: simpler, more stable, less compute. And every reward-model approach carries the risk of reward hacking: the model maximizing the measured signal in ways its designers never intended.
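To make the DPO idea concrete, here is a minimal sketch of the per-pair DPO loss. It assumes you already have the summed log-probabilities of each response under the model being trained and under a frozen reference model; the argument names and the default beta of 0.1 are illustrative choices, not values from the source.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))

def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen_shift - rejected_shift))."""
    # How far the trained model has moved from the reference on each response.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_shift - rejected_shift)
    return softplus(-logits)  # -log(sigmoid(logits))

# Hypothetical log-probabilities: the policy has raised the chosen response
# and lowered the rejected one relative to the reference model.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

No reward model appears anywhere: the preference pair acts on the language model's own log-probabilities, with the reference model keeping the update anchored.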
RLHF works for chatbots because its assumptions hold there: anyone can judge whether a response is helpful, judgments take seconds, feedback is immediate, and labelers are cheap and plentiful.
Why it matters for financial AI
Finance violates every one of those assumptions. Only experts can judge financial reasoning quality, and proper evaluation of a multi-factor decision takes real time. Outcomes unfold over hours, days, or weeks, so there is no immediate signal to rate. Worse, the signal is corrupted: a well-reasoned trade can lose money to variance and a reckless one can profit, so naive preference labels reward luck. Frontier labs reportedly pay expert traders well over $100/hour for financial RLHF feedback and still struggle to find qualified contributors. The pool of people who can evaluate complex trading decisions is small, and the best of them have better uses for their time.
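A small worked example shows how much variance can corrupt outcome-based labels. It assumes, purely for illustration, that a trade's return is normally distributed; the 0.5% expected return and 2% volatility are made-up numbers, not figures from the source.

```python
import math

def prob_of_loss(expected_return: float, volatility: float) -> float:
    """P(return < 0) for a normally distributed return (illustrative assumption)."""
    z = (0.0 - expected_return) / volatility
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical well-reasoned trade: +0.5% expected return, 2% volatility.
p = prob_of_loss(0.005, 0.02)
print(f"chance the good trade loses money: {p:.2f}")
```

Under these assumptions the trade loses money about 40% of the time, so a labeler who rates trades by profit and loss alone would mark a sound decision as bad in roughly two cases out of five.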
The feedback financial AI needs is also richer than "response A beats response B." Useful signal comes in layers: intent feedback (what the trader was trying to accomplish, since the same tight stop is right for a scalp and wrong for a swing trade), reasoning feedback (was the market read correct, was the risk assessment sound, what were the blind spots), decision feedback (were sizing, entry, and stop placement appropriate given the analysis), and outcome commentary (post-mortems that separate luck from skill with hindsight). Each layer trains something a single preference bit cannot.
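One way to picture the four layers is as a single record per trade. The sketch below is a hypothetical schema: the class name, field names, and example values are all assumptions made for illustration and do not describe any real system.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TradeFeedback:
    """Hypothetical record holding the four feedback layers for one trade."""
    intent: str                                                # what the trader was trying to do
    reasoning_notes: List[str] = field(default_factory=list)  # market read, risk, blind spots
    decision_notes: List[str] = field(default_factory=list)   # sizing, entry, stop placement
    outcome_commentary: Optional[str] = None                   # post-mortem, added after the trade resolves

    def layers_present(self) -> List[str]:
        """Names of the layers that actually carry feedback."""
        layers = ["intent"] if self.intent else []
        if self.reasoning_notes:
            layers.append("reasoning")
        if self.decision_notes:
            layers.append("decision")
        if self.outcome_commentary:
            layers.append("outcome")
        return layers

fb = TradeFeedback(
    intent="scalp",
    reasoning_notes=["risk assessment ignored the upcoming data release"],
    decision_notes=["tight stop fits a scalp"],
)
print(fb.layers_present())
```

The outcome layer is optional here because it can only be written with hindsight, while the other three can be recorded when the decision is made.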
The way through is to stop treating feedback as a separate labeling activity. Human judgment can be captured implicitly from trading itself: every strategy modification says "this configuration beats the last one," every manual override says "I know better here," and capital allocation is revealed preference about relative quality. Applied at the level of reasoning steps, via process supervision over reasoning traces, this feedback separates decision quality from outcome luck. Scaling that way demands quality control in place of vetted labelers: weighting feedback by the contributor's track record, downweighting feedback that contradicts itself or market outcomes, routing it by domain expertise, and statistically screening for manipulation. This is the approach behind UV Labs' Decision Data: human feedback captured as a byproduct of live trading rather than paid labeling sessions.
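The weighting idea can be sketched as a simple aggregation. This is not UV Labs' method, which the source does not describe in detail. It is a minimal illustration in which each piece of feedback carries a hypothetical track-record score and a consistency score, both in the range 0 to 1, and the two are multiplied to form a weight.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Vote:
    contributor: str
    prefers_a: bool       # True if the contributor favored option A over option B
    track_record: float   # hypothetical skill score in [0, 1]
    consistency: float    # hypothetical score in [0, 1]; lower if feedback contradicts itself

def weighted_support_for_a(votes: List[Vote]) -> float:
    """Weighted fraction of feedback favoring option A."""
    total = sum(v.track_record * v.consistency for v in votes)
    if total == 0:
        raise ValueError("no feedback with positive weight")
    support = sum(v.track_record * v.consistency for v in votes if v.prefers_a)
    return support / total

votes = [
    Vote("veteran", True, track_record=0.9, consistency=1.0),
    Vote("novice_1", False, track_record=0.3, consistency=0.5),
    Vote("novice_2", False, track_record=0.3, consistency=0.5),
]
print(weighted_support_for_a(votes))
```

A plain majority vote would favor option B two to one here, while the weighted version favors option A because the one consistent contributor with a strong record outweighs the two inconsistent ones.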
Common questions
Is RLHF the same as fine-tuning?
No. Supervised fine-tuning trains a model to imitate examples of good behavior. RLHF optimizes the model against a learned reward model built from human preference comparisons; it can improve behavior nobody could write an example for. Both are stages of post-training, and in practice RLHF usually runs after an initial supervised fine-tuning pass.
Why does standard RLHF break down for financial AI?
Chatbot RLHF assumes anyone can judge an output in seconds and feedback is immediate. Finance violates every one of those assumptions: only experts can judge trading reasoning, proper evaluation takes time, outcomes unfold over hours, days, or weeks, and good decisions can lose money through variance. Expert traders cost well over $100/hour and are scarce, so the labeling economics do not scale.
What is the difference between RLHF and DPO?
RLHF trains a separate reward model from human preferences and then optimizes the language model against it with reinforcement learning. DPO (Direct Preference Optimization) skips the reward model and adjusts the language model directly from preference pairs, increasing the probability of preferred responses. DPO is simpler and more stable; RLHF is more flexible, because a trained reward model can score new responses beyond the original preference pairs.
UV Labs builds post-training decision data for financial AI.