Synthetic Data
Synthetic data is artificially generated training data that statistically resembles real data, produced by simulators or generative models instead of being collected from the real world.
It solves a real problem (training data that is scarce, expensive, or private), but it comes with a real catch: a model trained on it ultimately learns the generator, not the world.
What it is
Instead of collecting millions of real examples, you create artificial ones with similar statistical properties, much as practice problems are built to follow the same patterns as real exam questions. The motivation is usually scarcity: flash crashes are rare but catastrophic, and if you have only ten historical examples, a generator can produce thousands of plausible crash variations, a breadth of coverage that no amount of waiting for real data will provide. Privacy and cost push in the same direction, which is why synthetic data is widespread in domains like healthcare records and computer vision, where labeled real examples are expensive and regulation restricts sharing the originals.
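As a rough illustration of the idea, the sketch below turns a handful of seed paths into many variants by rescaling them and adding noise. It is a minimal sketch under stated assumptions, not a recommended generator: the function name `synthesize_variants`, its parameters, and the seed numbers are all hypothetical and invented for illustration, and the seed values are not real market data.

```python
import random

def synthesize_variants(seed_paths, n, scale_range=(0.5, 1.5), noise=0.002, rng=None):
    """Produce n synthetic return paths by rescaling a randomly chosen seed
    path and adding Gaussian noise to each step (an assumed, simplistic scheme)."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    variants = []
    for _ in range(n):
        base = rng.choice(seed_paths)      # pick one historical example
        scale = rng.uniform(*scale_range)  # make the move milder or more severe
        variants.append([r * scale + rng.gauss(0.0, noise) for r in base])
    return variants

# Placeholder per-step returns, invented for illustration (NOT real market data).
seed_paths = [
    [-0.010, -0.030, -0.050, 0.020, 0.010],
    [-0.005, -0.020, -0.060, -0.010, 0.030],
]

variants = synthesize_variants(seed_paths, n=1000)
print(len(variants), "synthetic paths from", len(seed_paths), "seeds")
```

Note that every variant here is a perturbed copy of one of the seeds: the volume grows from two paths to a thousand, but nothing appears in the output that the seeds and the perturbation rule did not already encode.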
The catch is that synthetic data is bounded by the generator's understanding. Whatever the simulator or generative model doesn't know about the world, the training data won't contain, and the model will confidently learn its absence. In finance, the gaps are systematic. Simulated markets miss market impact: the agent's own orders never move prices, so achievable returns are overstated. They miss the human element: the decision patterns, biases, and heuristics of real participants, because the distribution of what a simulator would do is not the distribution of what traders actually do. And they idealize execution: orders fill in full at the requested price, APIs never fail, and slippage and partial fills do not exist. Even when individual scenarios look plausible, the distribution is wrong: what someone actually did under uncertainty, with money at risk, is fundamentally different from what a simulator decides to do, and models trained on the latter inherit decision patterns no real market participant exhibits.
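The effect of idealized execution on measured returns can be sketched with a toy calculation. The cost model below (a flat slippage charge plus an impact term that grows linearly with order size) is an assumption chosen for simplicity, not a model of any real venue, and the function name `round_trip_return` and all parameter values are hypothetical.

```python
def round_trip_return(entry, exit_, qty, slippage_bps=0.0, impact_bps_per_unit=0.0):
    """Return on a long round trip (buy at entry, sell at exit).

    Costs are expressed in basis points and applied to both legs.
    The linear impact term is an illustrative assumption.
    """
    cost_bps = slippage_bps + impact_bps_per_unit * qty
    cost = cost_bps / 10_000
    buy_fill = entry * (1 + cost)   # we pay more than the quoted price
    sell_fill = exit_ * (1 - cost)  # we receive less than the quoted price
    return sell_fill / buy_fill - 1

# Idealized simulator: fills at the requested price, no impact.
ideal = round_trip_return(100.0, 101.0, qty=500)

# Same trade with assumed frictions: 2 bps slippage, 0.01 bps impact per unit.
realistic = round_trip_return(
    100.0, 101.0, qty=500, slippage_bps=2.0, impact_bps_per_unit=0.01
)

print(f"ideal: {ideal:.4%}  with frictions: {realistic:.4%}")
```

Under these assumed parameters the same trade earns less once frictions are applied, and the shortfall grows with order size. A model trained only on the frictionless version never sees that relationship.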
Synthetic data should also not be confused with counterfactuals. A counterfactual starts from a real decision someone actually made and computes what alternative choices would have returned from the actual price path; it is derived from ground truth, not invented. The same goes for replayable environments, which re-evaluate real recorded decision points rather than generating scenarios.
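The distinction can be made concrete with a small sketch of a counterfactual computation: the entry and the subsequent price path are taken as recorded inputs, and only the exit rule is varied. The function name `counterfactual_returns`, the stop-loss framing, and the numbers are hypothetical placeholders rather than any particular system's method, and the sketch assumes a stop fills at the first recorded price that breaches it, which is itself a simplification.

```python
def counterfactual_returns(entry_price, path, stop_levels):
    """For each stop-loss level, compute what a long position entered at
    entry_price would have returned along the recorded price path."""
    results = {}
    for stop in stop_levels:
        exit_price = path[-1]  # default: held to the end of the recorded window
        for price in path:
            if price <= entry_price * (1 - stop):
                exit_price = price  # stopped out at the first breaching price
                break
        results[stop] = exit_price / entry_price - 1
    return results

# Placeholder values standing in for a recorded decision and the prices that followed.
entry_price = 100.0
path = [99.0, 97.5, 96.0, 98.0, 103.0]

outcomes = counterfactual_returns(entry_price, path, stop_levels=[0.02, 0.05])
for stop, ret in outcomes.items():
    print(f"stop {stop:.0%}: return {ret:+.2%}")
```

With these placeholder prices, the tighter 2% stop exits at 97.5 for a loss, while the looser 5% stop is never hit and the position rides the recovery to 103. Nothing about the path is generated; only the choice applied to it changes.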
Why it matters for financial AI
Because real decision data (reasoning, actions, and outcomes from live markets) barely exists publicly, synthesizing it is the obvious temptation for anyone training financial AI. But a model trained on synthetic scenarios learns to navigate the simulation, not the market; the gap between the two is where financial AI systems fail in production. Poor synthetic data is worse than scarce real data, because it teaches patterns that do not exist, a fast track to overfitting on artifacts and to brittleness under distributional shift.
The history of financial NLP offers a cautionary parallel about data quantity versus data kind. BloombergGPT's training corpus included 363 billion tokens of real financial text, alongside a comparable volume of general-purpose text, at an estimated cost of roughly $3M in compute, and it still cannot make a trading decision; more data of the wrong kind does not produce decision-making competence. Synthetic decision data has the same shape of problem one step earlier: it manufactures volume in a category where the binding constraint is fidelity, not quantity.
The legitimate uses are narrower: augmenting genuinely rare events, stress-testing against scenarios that haven't happened yet, and working around privacy constraints. For teaching decision-making itself, the value lives in real reasoning under real stakes. UV Labs takes the capture approach rather than the generation approach: 500K+ decision episodes recorded from 1M+ live trades, available as Decision Data.
Synthetic decision data
- Generated by a simulator or model
- Reasoning is templated or imitated, made without stakes
- Idealized execution: perfect fills, no API failures
- No market impact: returns overstated
- Quality bounded by the generator's understanding
- Unlimited volume, uncertain signal
Captured decision data
- Recorded from live markets in real time
- Real reasoning by agents risking real capital
- Real execution: slippage, partial fills, latency, errors
- Outcomes reflect actual market dynamics
- Grounded: no simulator gap to fail across
- Scarcer, but multiplied by counterfactual analysis
Common questions
Is synthetic data good enough to train financial AI?
Not on its own. Synthetic data is only as good as the generator that produced it, and in markets the generator misses what matters: market impact, real execution friction, and the way humans actually decide under stakes. Models trained on synthetic scenarios learn to navigate the simulator, not the market. It can supplement real data for rare events and stress tests, but it cannot replace captured decision data.
What is the difference between synthetic data and counterfactual data?
Synthetic data is invented: a generator produces scenarios that never happened. Counterfactual data is computed: it starts from a real decision someone actually made and derives what alternative exits, stops, or sizing would have returned from the actual price path that followed. Counterfactuals stay anchored to real market dynamics; synthetic data inherits whatever the generator got wrong.
When is synthetic data useful?
When real examples are too scarce, expensive, or sensitive to collect: augmenting rare events like flash crashes, stress-testing systems against scenarios that have not happened yet, and sharing data where privacy rules prevent using the original. The quality bar is high: poor synthetic data teaches models patterns that do not exist in real markets.
UV Labs builds post-training decision data for financial AI.