MoR Architecture Boosts LLM Efficiency and Throughput
KAIST AI and Mila researchers unveiled Mixture-of-Recursions (MoR), a novel Transformer framework that merges parameter sharing with adaptive computation. By routing tokens to variable recursion depths and applying recursion-wise KV caching, MoR matches baseline accuracy with up to 50% fewer parameters, trims training time by 19%, cuts peak memory by 25%, and more than doubles inference throughput in some configurations. Enterprises can uptrain existing models with MoR to balance performance and efficiency cost-effectively.
Researchers at KAIST AI and Mila have unveiled Mixture-of-Recursions (MoR), a new Transformer architecture that boosts LLM memory and compute efficiency while improving accuracy and throughput under fixed budgets.
Scaling Challenges of LLMs
As LLMs grow, their memory footprints and computational needs outpace many organizations’ infrastructure. Techniques like layer tying and early exiting address parts of the problem but fall short of unifying parameter sharing with adaptive compute.
Introducing Mixture-of-Recursions
MoR builds on recursive transformers by partitioning the model into shared recursion blocks. A lightweight routing mechanism assigns recursion depth per token, dynamically adjusting “thinking” depth based on complexity.
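To make the routing idea concrete, here is a minimal PyTorch-style sketch of per-token depth assignment over a single shared block. The class name, dimensions, and the argmax-based router are illustrative assumptions, not the authors' implementation; in particular, the real method avoids computing exited tokens at deeper steps, whereas this sketch simply masks their updates out.

```python
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """Minimal sketch of Mixture-of-Recursions-style routing (illustrative only).

    A single parameter-shared Transformer block is applied repeatedly; a
    lightweight linear router decides, per token, how many recursion steps
    that token receives.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, max_recursions: int = 3):
        super().__init__()
        # One shared block reused at every recursion depth (parameter sharing).
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # Router scores each token's hidden state against the possible depths.
        self.router = nn.Linear(d_model, max_recursions)
        self.max_recursions = max_recursions

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        # Assign each token a recursion depth in {1, ..., max_recursions}.
        depths = self.router(hidden).argmax(dim=-1) + 1  # (batch, seq_len)

        for step in range(1, self.max_recursions + 1):
            # Tokens whose assigned depth has not yet been reached stay active.
            active = (depths >= step).unsqueeze(-1)  # (batch, seq_len, 1)
            updated = self.shared_block(hidden)
            # Only active tokens are refined; exited tokens pass through unchanged.
            # (The actual architecture skips compute for exited tokens entirely.)
            hidden = torch.where(active, updated, hidden)
        return hidden


# Usage: route a toy batch through the shared recursion stack.
model = MoRSketch()
x = torch.randn(2, 16, 512)
out = model(x)  # (2, 16, 512)
```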
Key Components of MoR
- Dynamic routing: A router directs tokens to different recursion depths, similar to Mixture-of-Experts but with shared layers.
- Recursion-wise KV caching: Selectively caches key-value pairs only for tokens that are still active at each recursion step, slashing memory traffic (see the sketch after this list).
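The caching idea can be illustrated with a short sketch: only tokens still active at a given recursion step contribute key-value entries to that step's cache, so deeper steps attend over a shrinking cache. The helper below is a hypothetical, simplified illustration (single head, no batching), not the paper's implementation.

```python
import torch

def recursion_wise_kv_cache(keys, values, depths, max_recursions):
    """Illustrative sketch of recursion-wise KV caching (hypothetical helper).

    keys, values: (seq_len, d_head) projections for one attention head.
    depths:       (seq_len,) per-token recursion depth assigned by the router.
    Returns a dict mapping recursion step -> (keys, values) restricted to the
    tokens still active at that step, so attention at deeper steps reads a
    smaller cache instead of the full sequence.
    """
    cache = {}
    for step in range(1, max_recursions + 1):
        active = depths >= step                       # tokens still recursing
        cache[step] = (keys[active], values[active])  # cache only active tokens
    return cache

# Toy example: 6 tokens with assigned depths 1-3; deeper steps see fewer entries.
k = torch.randn(6, 64)
v = torch.randn(6, 64)
depths = torch.tensor([1, 3, 2, 1, 3, 2])
cache = recursion_wise_kv_cache(k, v, depths, max_recursions=3)
print({step: kv[0].shape[0] for step, kv in cache.items()})  # {1: 6, 2: 4, 3: 2}
```

In this toy run, step 1 caches all six tokens while step 3 caches only the two tokens routed to full depth, which is where the memory-traffic savings come from.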
Performance Gains
Benchmark tests on models from 135M to 1.7B parameters show MoR delivers higher few-shot accuracy with up to 50% fewer parameters, cuts training time by 19%, reduces peak memory by 25%, and more than doubles inference throughput in some configurations.
Enterprise Adoption Path
Rather than training from scratch, enterprises can uptrain open-source models with MoR, minimizing upfront costs. Developers gain new “knobs” to balance efficiency and performance based on specific deployment needs.
Future Outlook
Because the design is modality-agnostic, MoR could extend its efficiency gains to video, audio, and multimodal AI workflows, making large-scale AI more accessible and cost-effective for a wider range of enterprise applications.