Thinking Machines Tackles LLM Nondeterminism
Thinking Machines Lab, led by Mira Murati, published its first research post arguing that randomness in LLM outputs stems from how GPU kernels are orchestrated during inference. The lab proposes making inference deterministic to improve enterprise reliability and reinforcement learning. This early peek signals ambitious research and raises questions about products, openness, and real-world impact.
Thinking Machines’ Reproducible AI Push
Mira Murati’s Thinking Machines Lab, backed by a reported $2 billion seed round, has published its first research post on its new blog, Connectionism. The post, “Defeating Nondeterminism in LLM Inference,” argues that the familiar variability in large language model replies isn’t an immutable fact of life but a solvable engineering problem.
Authored by researcher Horace He, the post traces the root cause to how GPU kernels — the low-level programs running on Nvidia chips — are stitched together during inference. Small differences in kernel ordering, fusion, and numerical execution paths can nudge outputs into different answers even when the input is identical.
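The numerical piece of that claim is easy to demonstrate: floating-point addition is not associative, so the order in which a kernel accumulates partial results changes the low-order bits, and those bits can decide a near-tie between two candidate tokens. A minimal, CPU-only sketch of the effect (ours, not code from the post):

```python
import numpy as np

# 1) Summing the same float32 values in a different order rarely gives a bit-identical result.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)
a = x.sum()                              # one accumulation order
b = x[rng.permutation(x.size)].sum()     # same values, different order
print(a, b, bool(a == b))                # typically: two close-but-unequal sums

# 2) A difference that small can flip a near-tie between two candidate tokens.
logits = np.array([2.499999, 2.500001], dtype=np.float32)           # token 1 barely wins
nudged = logits + np.float32(3e-6) * np.array([1, -1], np.float32)  # tiny numerical wobble
print(int(np.argmax(logits)), int(np.argmax(nudged)))               # 1, then 0
```

In real inference the accumulation order is not something the user chooses; it falls out of which kernels the runtime selects and how work is tiled across the GPU, which is the “conductor” described next.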
Think of it like an orchestra: the score (the model) is the same, but tiny timing or instrument changes in the backend conductor (the kernel orchestration) produce noticeably different performances. He’s proposing that by taking control of that conductor — making orchestration deterministic — models can deliver reproducible responses.
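For a sense of what “taking control of the conductor” looks like today, here is a sketch of the determinism knobs PyTorch already exposes at the single-process level. This is generic framework configuration, not the approach Thinking Machines describes, and it does not cover everything a production inference stack introduces.

```python
import os

# cuBLAS reads this before CUDA initializes; it restricts GEMM to reproducible workspaces.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch

torch.manual_seed(0)                       # fix the RNG used by any sampling
torch.use_deterministic_algorithms(True)   # prefer deterministic kernels; error if none exists
torch.backends.cudnn.benchmark = False     # stop cuDNN from auto-tuning to different kernels per run
```

Flags like these cover one process in one framework; the post is arguing for determinism across the whole inference path, which is what makes it a systems problem.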
Why this matters: reproducible outputs improve enterprise reliability, make audit trails meaningful, and reduce noise during reinforcement learning (RL). RL benefits because reward signals depend on repeatable model behavior: less randomness means cleaner feedback and more efficient fine-tuning for custom models. A minimal reproducibility check is sketched after the list below.
- Compliance and auditability for regulated sectors that need consistent outputs
- Cleaner RL training with reduced label noise and faster convergence
- More dependable outputs for researchers and startups customizing models for business use cases
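As referenced above, reproducibility can also be measured before it is engineered. The sketch below is a generic audit harness of our own, not anything from the post; `generate` is a hypothetical stand-in for whatever inference call (ideally at temperature 0) you want to test.

```python
import hashlib
from collections import Counter
from typing import Callable

def reproducibility_audit(generate: Callable[[str], str], prompt: str, runs: int = 20) -> Counter:
    """Call an inference function repeatedly with an identical prompt and
    count distinct outputs. A fully deterministic stack yields exactly one."""
    digests = [
        hashlib.sha256(generate(prompt).encode("utf-8")).hexdigest()
        for _ in range(runs)
    ]
    return Counter(digests)

if __name__ == "__main__":
    # Hypothetical placeholder; swap in a real client call.
    def generate(prompt: str) -> str:
        return "Determinism means identical inputs always produce identical outputs."

    counts = reproducibility_audit(generate, "Define deterministic inference in one sentence.")
    print(f"{len(counts)} distinct output(s) across {sum(counts.values())} runs")
```

The same harness doubles as a regression check: if a driver or kernel upgrade quietly changes the distinct-output count, it shows up immediately.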
Thinking Machines says it will publish code and research frequently, framing openness as part of its culture. That promise stands in contrast to how some larger AI firms have moved toward closed models as they scaled. This post gives a rare peek into a secretive startup and signals that the team is attacking a deep systems problem rather than just model architecture.
The research is ambitious but grounded in engineering: changing kernel orchestration touches GPUs, drivers, and compiler toolchains. Challenges remain, since floating-point variation, hardware scheduling, and parallel execution models can all reintroduce variability. Whether Thinking Machines can turn this research into a product that justifies its multibillion-dollar valuation is the open question.
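One reason variability creeps back in so easily: the same reduction computed with a different degree of parallelism groups its float32 additions differently, so a scheduler or driver change that re-splits the work can shift results without any code change. A CPU-only simulation of the effect (ours, not the post’s):

```python
import numpy as np

def parallel_sum(x: np.ndarray, workers: int) -> np.float32:
    """Simulate a parallel reduction: each 'worker' sums its slice in float32,
    then the partial sums are combined. Different worker counts mean different
    grouping of additions, hence different rounding."""
    partials = [chunk.sum(dtype=np.float32) for chunk in np.array_split(x, workers)]
    return np.float32(sum(partials))

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000).astype(np.float32)

for workers in (1, 8, 32, 128):
    print(workers, parallel_sum(x, workers))   # the low-order digits typically drift as the split changes
```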
If they succeed, the impact is broad: reproducible science for labs, stable outputs for enterprises running customer-facing workflows, and cleaner RL pipelines across the industry. For now, the post is an invitation to researchers, customers, and competitors alike to watch how a high-profile team tries to turn an assumed limitation into a feature.
The next moves to watch: whether the lab releases tools or deterministic kernels, what its first product actually does, and whether reproducibility becomes a competitive differentiator. This is research moving to production in real time, and the industry stands to benefit if reproducible inference really can be engineered at scale.
AI Tools Built for Agencies That Move Fast.
QuarkyByte can design reproducibility audits, simulate deterministic inference impacts on RL training, and benchmark model stability for regulated industries like finance or pharma. Request a pilot assessment to quantify reliability gains and operational ROI from reproducible LLM deployments.