
Post-Training

Definition

Post-training is everything done to a language model after pre-training (supervised fine-tuning, RLHF, preference optimization, and process supervision) that turns general language ability into specific, reliable behavior.

Pre-training teaches a model what language looks like. Post-training teaches it what humans actually want, and that second stage is where raw capability becomes a usable product.

What it is

In 2022, OpenAI had a model trained on hundreds of billions of tokens that could complete any sentence, yet was nearly useless as an assistant. Ask GPT-3 a question and it might answer, ramble, or produce something harmful. What turned it into ChatGPT was not more pre-training data. It was post-training: roughly 40 contractors comparing pairs of model outputs and indicating which was better, thousands of times over.

Post-training typically runs in stages. Supervised fine-tuning (SFT) shows the model examples of good responses and trains it to imitate them (see fine-tuning). But you cannot write an example for every situation, and sometimes you cannot articulate what makes a response good. That is where reinforcement learning from human feedback (RLHF) comes in: human raters compare outputs, a reward model learns to predict their preferences, and the language model is optimized against that reward. Humans get to demonstrate what good looks like through choices rather than specifications. Direct preference optimization (DPO) achieves similar results while skipping the separate reward model. The frontier is process supervision, which rewards each step of the model's reasoning rather than only the final answer.
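The two preference-based objectives can be written down in a few lines. The sketch below is illustrative rather than any lab's actual implementation: it assumes scalar rewards and sequence log-probabilities have already been computed elsewhere, uses the standard Bradley-Terry pairwise loss for the reward model, and uses the DPO loss with a hypothetical default of beta = 0.1. All function and parameter names are made up for this example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for one human comparison.

    The loss is small when the reward model scores the response the
    rater preferred (r_chosen) above the one they rejected (r_rejected).
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one comparison, with no separate reward model.

    The implicit reward of a response is beta times the log-ratio of the
    policy's probability to the frozen reference model's probability.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # A reward model that cannot tell the two responses apart.
    print(reward_model_loss(0.0, 0.0))
    # A policy that has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

In both cases the training signal is the same kind of human comparison; the difference is whether it is first distilled into a reward model or applied directly to the policy's log-probabilities.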

The results are dramatic. OpenAI's InstructGPT paper showed humans preferred outputs from a post-trained 1.3B-parameter model over the raw 175B-parameter GPT-3; the smaller, aligned model beat a base model over a hundred times its size. An industry grew around this insight: Surge AI reportedly bootstrapped to $1.2 billion in annual revenue by 2024 and Scale AI reportedly reached $870 million, both supplying human feedback to frontier labs. The economics scale with expertise: generalist labelers earn $20-40 per hour, while specialists with doctoral degrees command $90-200 per hour, and specialized domains like medicine cost 3-5x more to label than general tasks.

Why it matters for financial AI

Finance breaks the chatbot playbook in four uncomfortable ways. Rewards are delayed: a trade entered today may not resolve for weeks, so feedback is sparse. Outcomes are noisy: good decisions can lose money to variance and bad ones can profit from luck, so realized profit and loss is an unreliable teacher. The environment is adversarial: edges decay once discovered. And expert feedback is scarce: frontier labs reportedly pay expert traders well over $100/hour and still struggle to find qualified contributors.
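A back-of-the-envelope calculation shows how weak outcome feedback can be. The sketch below assumes, purely for illustration, that each trade's return is an independent draw from a normal distribution; the 0.5% mean and 5% volatility are hypothetical numbers, not figures from any real strategy, and the function names are invented for this example.

```python
import math
from statistics import NormalDist


def prob_of_loss(mean_return: float, volatility: float) -> float:
    """Probability that a single trade loses money.

    Assumes the trade's return is normally distributed with the given
    mean and standard deviation (a simplifying assumption).
    """
    return NormalDist(mu=mean_return, sigma=volatility).cdf(0.0)


def trades_to_detect_edge(mean_return: float, volatility: float, z: float = 2.0) -> int:
    """Rough number of independent trades before the average return sits
    about z standard errors above zero: n = (z * volatility / mean)^2."""
    return math.ceil((z * volatility / mean_return) ** 2)


if __name__ == "__main__":
    # Hypothetical strategy: +0.5% expected return, 5% volatility per trade.
    print(f"P(loss on one trade): {prob_of_loss(0.005, 0.05):.3f}")
    print(f"Trades needed: {trades_to_detect_edge(0.005, 0.05)}")
```

Under these assumptions a genuinely good decision still loses money about 46% of the time, and it takes on the order of 400 trades before the edge is distinguishable from luck. A reward signal built only on realized outcomes inherits all of that noise.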

Domain text does not fix this. The open-source FinGPT project fine-tunes models on financial text for under $300 per run and performs well on classification (87.6% F1 on sentiment), yet it scores 28.5% on financial QA exact match versus GPT-4's 76%. Fine-tuning on financial text helps models talk about finance; it does not teach them to reason about financial decisions. That requires data capturing the decision-making process itself: complete decision episodes with reasoning traces, counterfactual environments for exploring alternatives, process feedback from experts, and continuously fresh data as markets shift (see distributional shift).
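To make "decision episode" concrete, here is one way such a record could be laid out. This is a hypothetical schema written for illustration, not the format used by UV Labs or any other dataset; every class and field name is an assumption. It pairs each reasoning step with optional expert feedback and keeps the realized outcome separate, since that outcome may not exist until the trade resolves.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReasoningStep:
    """One step of the decision-maker's reasoning trace (hypothetical)."""
    text: str
    expert_score: Optional[float] = None  # process feedback in [0, 1]; None if unrated


@dataclass
class DecisionEpisode:
    """A single decision with its context, reasoning, and eventual outcome."""
    context: str                                   # what was known at decision time
    steps: List[ReasoningStep] = field(default_factory=list)
    action: str = ""                               # e.g. "buy", "hold", "reduce"
    realized_pnl: Optional[float] = None           # None until the trade resolves


def process_reward(episode: DecisionEpisode) -> Optional[float]:
    """Average expert score over the rated steps, or None if none are rated.

    Unlike realized_pnl, this is available immediately and does not depend
    on how the market happened to move afterwards.
    """
    scores = [s.expert_score for s in episode.steps if s.expert_score is not None]
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    episode = DecisionEpisode(
        context="Earnings due tomorrow; implied volatility elevated.",
        steps=[
            ReasoningStep("Position size is within the risk limit.", 1.0),
            ReasoningStep("Assume volatility stays high after earnings.", 0.5),
            ReasoningStep("Exit if the spread widens."),  # not yet rated
        ],
        action="hold",
    )
    print(process_reward(episode), episode.realized_pnl)
```

Keeping the process score and the realized outcome in separate fields lets a training pipeline use expert feedback immediately and treat profit and loss as a later, noisier signal.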

One more property of process-supervised post-training is worth noting. When models reason through chain-of-thought traces, they often state their intent in plain language: recent OpenAI work on reasoning models found that a model about to exploit a flaw in its reward function frequently says so directly in its trace. That makes reasoning traces valuable for monitoring deployed systems, not just for training them, which matters doubly when the system in question is managing capital.

This is the gap UV Labs exists to fill: Decision Data is post-training data for financial AI, built from 500K+ decision episodes captured in live markets.

Common questions

Is post-training the same as fine-tuning?

No. Fine-tuning is one technique within post-training: supervised training on examples of good behavior. Post-training is the umbrella term covering the whole pipeline after pre-training: supervised fine-tuning, RLHF, direct preference optimization, and process supervision. Every fine-tune is post-training, but not all post-training is fine-tuning.

Why did post-training matter more than model size for ChatGPT?

OpenAI's InstructGPT results showed that humans preferred outputs from a post-trained 1.3-billion-parameter model over the raw 175-billion-parameter GPT-3. Pre-training gives a model knowledge; post-training teaches it what humans actually want. The smaller, aligned model was more useful than a base model over a hundred times its size.

What does post-training look like for financial AI?

The chatbot playbook breaks in finance: rewards are delayed, good decisions can lose money to variance, the environment is adversarial, and expert feedback is scarce and expensive. Financial post-training instead needs complete decision episodes with reasoning traces, counterfactual environments for exploring alternatives, process feedback from domain experts, and continuously refreshed data as markets shift.
