RLHF (Reinforcement Learning from Human Feedback)
RLHF is the technique that transformed raw language models like GPT-3 into useful assistants like ChatGPT. Instead of just predicting the next word, RLHF teaches models to generate responses that humans actually find helpful, accurate, and appropriate.
The process has three steps. First, humans compare pairs of AI responses and indicate which is better. Second, these preferences are used to train a "reward model" that predicts how highly a human would rate any given response. Third, the reward model guides the language model through reinforcement learning (typically PPO), fine-tuning it to produce outputs the reward model scores highly.
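The second step can be sketched with a toy example. Below, a reward model is trained from pairwise preferences using the Bradley-Terry loss, which is the standard formulation for this step; the feature vectors, dimensions, and training settings are illustrative stand-ins for real response embeddings, not any production setup.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
true_w = rng.normal(size=dim)          # hidden "human taste" used to simulate labels

# Simulate comparisons: each pair is (chosen, rejected), where the
# chosen response has the higher true score.
pairs = []
for _ in range(200):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    pairs.append((a, b) if true_w @ a >= true_w @ b else (b, a))

# Train a linear reward model by gradient descent on the
# Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected)).
w = np.zeros(dim)
lr = 0.1
for _ in range(100):
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        margin = w @ chosen - w @ rejected
        p = 1.0 / (1.0 + np.exp(-margin))     # P(chosen preferred)
        grad += (p - 1.0) * (chosen - rejected)
    w -= lr * grad / len(pairs)

# The learned reward model should now rank chosen above rejected
# for most of the pairs it was trained on.
accuracy = np.mean([w @ c > w @ r for c, r in pairs])
print(f"training preference accuracy: {accuracy:.2f}")
```

In the third step, this learned reward function would score candidate responses during reinforcement learning; here we only verify that it recovers the simulated preferences.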
Standard RLHF doesn't work well for financial AI because rewards are delayed (you don't know if a trade was good until later), experts are scarce and expensive, and good decisions can have bad outcomes due to market randomness. This is why financial AI requires specialized approaches like decision episodes with process supervision.
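The outcome-randomness problem above can be made concrete with a small simulation. This is a hedged illustration, not the document's method: a trade with positive expected value (a "good decision") still loses money almost half the time, so a reward based on the realized outcome is noisy, while a process-based score of the decision itself stays stable. All numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A good decision: expected return +0.1 per trade, but noisy
# realized outcomes (standard deviation 1.0).
outcomes = 0.1 + rng.normal(0.0, 1.0, size=10_000)

# Fraction of good trades that still lost money.
loss_rate = np.mean(outcomes < 0)
print(f"good decisions that lost money: {loss_rate:.0%}")

# Outcome supervision rewards the noisy realized return; process
# supervision instead scores the decision's reasoning (here, its
# known expected value), which is identical across outcomes.
outcome_reward_std = outcomes.std()
process_reward_std = 0.0               # same decision -> same score
print(f"outcome reward std: {outcome_reward_std:.2f}, "
      f"process reward std: {process_reward_std:.2f}")
```

The point is not the specific numbers but the variance gap: an agent trained on outcomes alone would be punished for many good decisions, which is what motivates scoring the decision process instead.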