Why LLMs Still Hallucinate and How Evaluations Perpetuate It
OpenAI’s new paper says hallucinations — plausible but false model outputs — persist because pretraining optimizes next-word prediction without truth labels and because evaluation suites reward guessing. Researchers show models confidently give multiple wrong answers and propose changing evals to penalize confident errors and reward appropriate uncertainty.
OpenAI explains why LLM hallucinations persist
A new OpenAI paper and blog post tackle a question developers and product leaders keep asking: why do large language models still produce confident but false statements, aka hallucinations? The researchers define hallucinations as “plausible but false statements generated by language models,” and show that even state-of-the-art chatbots can give multiple different, all-wrong answers to simple factual prompts.
To make this concrete, the paper’s authors asked a widely used chatbot for the title of Adam Tauman Kalai’s Ph.D. dissertation and got three different wrong answers. They then asked for his birthday and received three different, all-wrong dates. Why does a model sound so certain while being so wrong?
OpenAI points to two linked causes. First, pretraining optimizes next-token prediction over vast amounts of fluent text without labels marking statements true or false. Models reliably learn patterns that are consistent, which is why spelling and punctuation errors disappear with scale, but arbitrary, low-frequency facts (like a person's birthday) cannot be inferred from patterns alone, and that gap produces confident fabrications.
Second, and crucially, OpenAI argues that current evaluation practices set bad incentives. When models are scored primarily on accuracy, that is, how often they are exactly right, guessing becomes rational: under multiple-choice-style scoring a lucky guess earns full marks while an abstention earns nothing, so models learn to produce answers even when they are unsure.
Their proposed fix isn’t to retrain from scratch but to change the tests. Evaluations should penalize confident errors more than they penalize expressions of uncertainty and should give partial credit for appropriate hedging or admitting ignorance—similar to tests that deter blind guessing with negative marking or partial credit.
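To make that incentive shift concrete, here is a minimal sketch of such a grading rule; the specific penalty and partial-credit values are illustrative assumptions, not numbers from OpenAI's paper.

```python
def score_response(answer, correct_answer, wrong_penalty=1.0, abstain_credit=0.25):
    """Score one eval item with negative marking.

    A correct answer earns 1 point, an explicit abstention earns partial
    credit, and a confident wrong answer loses points. Under plain accuracy
    scoring (wrong_penalty=0, abstain_credit=0), guessing always weakly
    dominates abstaining; with negative marking it only pays off when the
    model is sufficiently likely to be right.
    """
    if answer is None:  # the model abstained ("I don't know")
        return abstain_credit
    if answer.strip().lower() == correct_answer.strip().lower():
        return 1.0
    return -wrong_penalty  # confident error is penalized


# Break-even point: guessing beats abstaining only when the model's chance of
# being right exceeds (abstain_credit + wrong_penalty) / (1 + wrong_penalty),
# i.e. 62.5% with the defaults above, so low-confidence guesses stop paying.
```

Scored this way, a model that says "I don't know" about the dissertation title outperforms one that invents three different wrong ones.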
What does this mean for teams building and deploying LLMs? The implications are immediate and practical: models need evaluation suites that reward calibration and truthful uncertainty, product UIs that surface confidence, and retrieval or grounding systems that reduce the need to guess.
Practical steps to reduce confident hallucinations
Start by aligning evaluation incentives with real-world safety and utility. Recommended actions include:
- Use scoring that penalizes confident wrong answers and rewards calibrated uncertainty.
- Add evaluation tasks that require grounding to external sources or citations.
- Measure and report calibration metrics (confidence vs. accuracy) alongside accuracy; see the calibration sketch after this list.
- Deploy UI affordances that let models express uncertainty and surface evidence to users.
- Combine retrieval, structured knowledge, and human review for low-frequency factual queries.
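As a hedged illustration of the calibration reporting suggested above (the metric and bin count are assumptions, not prescriptions from the paper), the sketch below computes expected calibration error from a model's stated confidences and whether each answer was correct.

```python
import numpy as np

def expected_calibration_error(confidences, is_correct, n_bins=10):
    """Average gap between stated confidence and actual accuracy, weighted
    by how many answers fall into each confidence bin (lower is better)."""
    confidences = np.asarray(confidences, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        avg_confidence = confidences[in_bin].mean()  # what the model claimed
        avg_accuracy = is_correct[in_bin].mean()     # how often it was right
        ece += in_bin.mean() * abs(avg_accuracy - avg_confidence)
    return ece

# A model that answers with 90% stated confidence but is right only 60% of the
# time shows a large gap here, even if its raw accuracy number looks acceptable.
```

Reported next to accuracy in a CI gate, a metric like this catches the overconfident-guesser failure mode that accuracy alone rewards.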
Think of evaluations as the rules of a game: if the scoreboard rewards guessing, players (models) will guess. Change the scoreboard and you change behavior. That’s the core actionable insight from OpenAI’s paper.
For businesses, this translates into reduced legal, reputational, and operational risk when assistants admit uncertainty or cite sources instead of inventing facts. For regulators and governments, updated benchmarks that emphasize calibrated responses make audits and compliance clearer. For engineers, it means new evaluation tooling and CI gates focused on calibration, not just raw accuracy.
OpenAI’s conclusion is sober: hallucinations won’t vanish entirely. But by redesigning incentives—changing how models are judged and rewarded—we can make systems that are less prone to confident falsehoods and better aligned with real-world decision-making needs.
In practice, that means pairing model improvements with evaluation and product changes: better scoring, better grounding, and clearer UX for uncertainty. Organizations that adopt these changes will deploy safer, more reliable assistants that admit what they don’t know—and point users to verifiable evidence when it matters most.
QuarkyByte helps teams turn OpenAI’s insights into practical safeguards: we design uncertainty-aware evaluation frameworks, calibration audits, and risk-aligned deployment checks tailored to your models. Partner with us to shift incentives away from blind guessing and reduce confident errors in production AI systems.