
What is Post-Training for LLMs? A Practical Guide

TL;DR

Post-training (RLHF, process supervision) transforms raw LLMs into useful assistants, but standard approaches break down for financial applications.

  • 40 contractors providing preference feedback transformed GPT-3 into ChatGPT; Scale AI and Surge AI built billion-dollar businesses on this insight
  • Financial RLHF is harder: rewards are delayed, good decisions can lose money from variance, and expert feedback costs $150+/hour
  • Financial post-training requires complete decision episodes, counterfactual environments, and process feedback from domain experts

In 2022, OpenAI had a problem. They'd trained GPT-3 on hundreds of billions of tokens of text. The model could complete any sentence, write in any style, and had absorbed an enormous amount of human knowledge. It was also nearly useless.

Ask GPT-3 a question and it might answer it. Or it might ramble about something tangentially related. Or it might produce racist content. Or it might helpfully explain how to build a bomb. The raw language model was powerful but unpredictable, like having a brilliant savant with no social awareness.

What transformed GPT-3 into ChatGPT wasn't more pre-training data. It was a fundamentally different kind of training: post-training.

The 40 Contractors Who Changed Everything

To build InstructGPT (the precursor to ChatGPT), OpenAI hired 40 contractors. Their job was simple: look at pairs of model outputs and judge which one was better.

That's it. Thousands and thousands of comparisons. "Given this prompt, response A is better than response B." This preference data trained what's called a reward model, which in turn guided the language model to produce responses humans actually wanted.

The process is called RLHF, Reinforcement Learning from Human Feedback. It sounds complicated, but the core idea is elegant: rather than trying to specify exactly what good behavior looks like, you let humans demonstrate it through their choices.

Today, this has become a massive industry. Surge AI bootstrapped to $1.2 billion in annual revenue by 2024. Scale AI reached $870 million. Companies like Alignerr pay $90-200 per hour for experts with doctoral degrees to provide specialized feedback. Medical labeling costs 3-5x more than general tasks.

The companies that invested early in human feedback infrastructure now form the backbone of frontier AI development.

The Core Insight

Pre-training teaches models what language looks like. Post-training teaches models what humans actually want. The second part is where value gets created.

What Actually Happens in Post-Training

Post-training typically involves a sequence of stages. The first is usually supervised fine-tuning (SFT): you show the model examples of good responses and train it to imitate them. "When someone asks for a recipe, here's what a good response looks like."
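The imitation objective behind SFT is ordinary next-token cross-entropy, with one twist: the prompt tokens are masked out of the loss, so the model is graded only on the response it should imitate. A minimal numpy sketch with toy shapes (not a real training loop):

```python
import numpy as np

def sft_loss(logits, target_ids, loss_mask):
    """Token-level cross-entropy for supervised fine-tuning.

    logits:     (seq_len, vocab) raw model scores at each position
    target_ids: (seq_len,) the token the model should predict next
    loss_mask:  (seq_len,) 1.0 for response tokens, 0.0 for prompt tokens,
                so the model is only trained to imitate the response
    """
    # numerically stable log-softmax over the vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each target token
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    # average only over the response tokens
    return (nll * loss_mask).sum() / loss_mask.sum()

# Toy example: 4 positions, vocab of 3; the first two positions are prompt.
logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 2.0, 0.1],
                   [0.1, 0.1, 2.0],
                   [2.0, 0.1, 0.1]])
targets = np.array([0, 1, 2, 0])
mask = np.array([0.0, 0.0, 1.0, 1.0])
print(round(sft_loss(logits, targets, mask), 4))  # → 0.2617
```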

But SFT has limits. You can't write an example for every possible situation. And sometimes you can't even articulate what makes a good response good. You know it when you see it.

That's where RLHF comes in. Human raters compare pairs of responses. A separate "reward model" learns to predict which responses humans prefer. Then the language model is trained to maximize that reward signal.
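The reward model is typically trained with a pairwise Bradley-Terry objective: push the score of the human-preferred response above the rejected one. A minimal sketch, assuming scalar rewards for the two responses have already been computed:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss for reward-model training.

    The loss is -log P(chosen beats rejected) = -log sigmoid(r_c - r_r),
    so the model is rewarded for scoring the preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The wider the margin in the right direction, the smaller the loss.
print(round(preference_loss(2.0, -1.0), 4))  # → 0.0486 (correct ranking)
print(round(preference_loss(-1.0, 2.0), 4))  # → 3.0486 (wrong ranking)
```

Once trained, this reward model stands in for the human raters: the language model is then optimized (e.g. with PPO) to produce responses the reward model scores highly.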

The results are dramatic. OpenAI's 2022 paper on InstructGPT showed that humans preferred outputs from their RLHF-trained 1.3B parameter model over the raw 175B parameter GPT-3. The smaller, post-trained model was more useful than the larger base model by a wide margin.

Process Supervision: The Next Frontier

In 2023, OpenAI published research on what they called "process reward models." Instead of only judging final outputs, these systems provide feedback at each step of the model's reasoning.

Their paper, "Let's Verify Step by Step," reported that rewarding each correct step "significantly outperforms outcome supervision" on challenging math problems. The model didn't just get better at math. The reasoning became more interpretable and trustworthy.

This matters for financial AI. A trading decision involves many steps: analyzing the market, forming a thesis, sizing the position, setting stops, timing the entry. Outcome-based feedback (did you make money?) conflates luck with skill. Process-based feedback (was each step of your reasoning sound?) separates them.
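The distinction can be made concrete with a toy scorer. The episode structure and expert ratings below are hypothetical; the point is only that the two reward signals disagree on exactly the cases that matter:

```python
def outcome_reward(episode):
    """Outcome supervision: one sparse signal that conflates luck and skill."""
    return 1.0 if episode["pnl"] > 0 else 0.0

def process_reward(episode):
    """Process supervision: an expert rating for each reasoning step,
    independent of how the trade happened to resolve."""
    ratings = [step["expert_rating"] for step in episode["steps"]]  # each in [0, 1]
    return sum(ratings) / len(ratings)

# A well-reasoned trade that lost money to variance:
episode = {
    "pnl": -1200.0,
    "steps": [
        {"name": "market_analysis", "expert_rating": 0.9},
        {"name": "thesis",          "expert_rating": 0.8},
        {"name": "position_sizing", "expert_rating": 1.0},
        {"name": "stop_placement",  "expert_rating": 0.9},
    ],
}
print(outcome_reward(episode))            # → 0.0, looks like a bad decision
print(round(process_reward(episode), 2))  # → 0.9, the reasoning was sound
```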

Recent OpenAI research on reasoning models found something remarkable: when models reason through chain-of-thought traces, they often explicitly reveal their intent in natural language. When a model is about to exploit a flaw in its reward function, the reasoning trace often says so directly, something like "I can game this by..."

This makes reasoning traces valuable not just for training but for monitoring. You can see how the model is thinking, not just what it concludes.

Why Finance Breaks the Standard Playbook

The standard RLHF approach works great for chatbots. A human can immediately tell if a response is helpful, harmless, and honest. The feedback loop is tight.

Finance is different in several uncomfortable ways.

Rewards are delayed. A trade you enter today might not resolve for weeks. The feedback signal is sparse and noisy.

Good decisions have bad outcomes. A perfectly reasoned trade can lose money due to variance. A terrible decision can profit from luck. Behavioral finance research shows that distinguishing skill from luck requires hundreds or thousands of trades, not one.
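A quick simulation illustrates the sample-size problem. Assume a strategy with a genuine edge, say a 55% win rate with symmetric payoffs, so its expected value is +0.10 per trade (the numbers here are illustrative, not from any cited study):

```python
import random

random.seed(7)

def run_trades(n, win_rate=0.55, win=1.0, loss=-1.0):
    """Simulate n trades of a genuinely positive-EV strategy."""
    return sum(win if random.random() < win_rate else loss for _ in range(n))

# Over 20 trades, the skilled strategy still loses money in a large
# fraction of runs; over 1,000 trades, almost never.
short_runs = [run_trades(20) for _ in range(10_000)]
long_runs = [run_trades(1_000) for _ in range(1_000)]
frac_losing_short = sum(r < 0 for r in short_runs) / len(short_runs)
frac_losing_long = sum(r < 0 for r in long_runs) / len(long_runs)
print(f"losing 20-trade samples:   {frac_losing_short:.0%}")
print(f"losing 1000-trade samples: {frac_losing_long:.0%}")
```

An outcome-based reward signal judging this strategy on a month of trades would punish it roughly a quarter of the time for doing the right thing.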

The environment is adversarial. In a chatbot conversation, the user generally wants helpful responses. In markets, other participants are actively trying to take money from you. What works stops working once it's discovered. Edges decay.

Expert feedback is scarce and expensive. You can hire generalist labelers at $20-40/hour for simple tasks. Finding people who can evaluate trading decisions is harder. Experts who can actually generate alpha have better things to do than rate model outputs.

The stakes are real. When a chatbot produces a bad response, the user is annoyed. When a trading system produces a bad decision, someone loses money. You can't iterate freely through bad outputs when each one has a cost.

What Financial Post-Training Actually Needs

Given these constraints, financial post-training can't just copy the chatbot playbook. It needs infrastructure built for the domain.

Complete decision episodes. Not just "bought BTC at $65,000, sold at $68,000, made 4.6%." The full context: what was the thesis, what signals drove the decision, what alternatives were considered and rejected, what was the risk budget, what was the exit plan. The reasoning, not just the result.

Counterfactual environments. What would have happened with different sizing? Different timing? A tighter stop? This lets you extract maximum learning from each decision point. Instead of one sample, you get many.

Process feedback from experts. Not "good trade / bad trade" but evaluations of each step. Was the market analysis sound? Was the risk assessment appropriate? Was the entry well-timed? This separates decision quality from outcome luck.

Continuous fresh data. A model trained on 2021 bull market data has never seen the 2022 Fed hiking cycle or the FTX collapse. Markets change. Strategies decay. The training data needs to be constantly refreshed.
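To make the first two requirements concrete, here is a minimal sketch of a decision-episode record and a counterfactual replay over its logged price path. All field and function names are hypothetical, and the replay logic is deliberately simplified: it ignores slippage, fees, and intrabar ordering.

```python
from dataclasses import dataclass

@dataclass
class DecisionEpisode:
    """One complete trading decision: the reasoning, not just the result."""
    thesis: str                   # why the trade was taken
    signals: list                 # what drove the decision
    alternatives_rejected: list   # what was considered and passed on
    risk_budget_pct: float        # max % of equity at risk
    entry_price: float
    stop_price: float
    size: float                   # units bought
    price_path: list              # observed prices after entry

def replay(ep, size=None, stop=None):
    """Counterfactual replay: same recorded price path, different sizing
    or stop. One logged decision becomes many training samples."""
    size = ep.size if size is None else size
    stop = ep.stop_price if stop is None else stop
    for price in ep.price_path:
        if price <= stop:  # stop would have been hit at this point
            return size * (stop - ep.entry_price)
    return size * (ep.price_path[-1] - ep.entry_price)

ep = DecisionEpisode(
    thesis="breakout continuation",
    signals=["volume spike", "higher low"],
    alternatives_rejected=["wait for retest"],
    risk_budget_pct=1.0,
    entry_price=65_000.0,
    stop_price=63_000.0,
    size=0.5,
    price_path=[64_500.0, 66_000.0, 68_000.0],
)
print(replay(ep))                  # → 1500.0, as traded
print(replay(ep, stop=64_800.0))   # → -100.0, a tighter stop gets shaken out
print(replay(ep, size=1.0))        # → 3000.0, double size, same path
```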

The FinGPT Lesson

The open-source FinGPT project provides an instructive case study. By fine-tuning existing models on financial data with parameter-efficient techniques like LoRA, researchers achieved competitive performance on financial NLP tasks at a fraction of BloombergGPT's cost: under $300 per fine-tuning run, versus the roughly $3 million spent training BloombergGPT from scratch.
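The cost gap comes from LoRA's core trick: freeze the pretrained weight matrix W and learn only a low-rank update BA, shrinking the trainable parameter count from d² to 2dr per layer. A minimal numpy illustration with toy dimensions (real implementations, such as the peft library, apply this per attention layer):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 1024, 8                         # hidden size, LoRA rank (r << d)
W = rng.standard_normal((d, d))        # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01 # trainable, initialized small
B = np.zeros((d, r))                   # trainable, zero-initialized so
                                       # the update starts as a no-op
alpha = 16.0                           # LoRA scaling hyperparameter

def lora_forward(x):
    """y = x @ (W + (alpha / r) * B @ A).T, computed without ever
    materializing the d x d update: only A and B are trained."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params:,} vs {full_params:,} "
      f"({lora_params / full_params:.2%})")
# → trainable params: 16,384 vs 1,048,576 (1.56%)
```

Training ~1.6% of the parameters of a single layer, across a model that is itself frozen, is what pushes a fine-tuning run into the sub-$300 range.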

FinGPT performs well on structured classification tasks like sentiment analysis (87.6% F1) and headline classification (95.5% F1). But it struggles with complex reasoning and generation. Financial QA exact match is 28.5% versus GPT-4's 76%.

The pattern: fine-tuning on domain text helps models talk about finance. It doesn't teach them to reason about financial decisions. That requires a different kind of data, the kind that captures the decision-making process itself.

Where This Goes

Post-training transformed language models from impressive curiosities into useful assistants. The same transformation is needed for financial AI. The techniques exist. The compute exists. What's missing is the infrastructure to generate the right kind of training data: complete decision episodes with expert feedback and counterfactual environments. The teams building this infrastructure now will define what financial AI becomes.

Building Financial AI?

Learn how UV Labs provides the post-training infrastructure that research labs need.

Schedule a Conversation