MoR Architecture Boosts LLM Efficiency and Throughput
KAIST AI and Mila researchers unveiled Mixture-of-Recursions (MoR), a novel Transformer framework that merges parameter sharing with adaptive computation. By routing tokens to variable recursion depths and applying recursion-wise KV caching, MoR matches baseline accuracy with up to 50% fewer parameters, trims training time by 19%, cuts peak memory by 25%, and more than doubles inference throughput in some configurations. Enterprises can uptrain existing models with MoR to balance performance and efficiency cost-effectively.
Researchers at KAIST AI and Mila have unveiled Mixture-of-Recursions (MoR), a new Transformer architecture that boosts LLM memory and compute efficiency while improving accuracy and throughput under fixed budgets.
Scaling Challenges of LLMs
As LLMs grow, their memory footprints and computational needs outpace many organizations’ infrastructure. Techniques like layer tying and early exiting address parts of the problem but fall short of unifying parameter sharing with adaptive compute.
Introducing Mixture-of-Recursions
MoR builds on recursive transformers by partitioning the model into shared recursion blocks. A lightweight routing mechanism assigns recursion depth per token, dynamically adjusting “thinking” depth based on complexity.
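To make the routing idea concrete, here is a minimal PyTorch-style sketch of per-token depth assignment over a single shared block. The class name, dimensions, and the argmax-based router are illustrative assumptions, not the authors' implementation; in particular, the real method avoids computing exited tokens at deeper steps, whereas this sketch simply masks their updates out.

```python
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """Minimal sketch of Mixture-of-Recursions-style routing (illustrative only).

    A single parameter-shared Transformer block is applied repeatedly; a
    lightweight linear router decides, per token, how many recursion steps
    that token receives.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, max_recursions: int = 3):
        super().__init__()
        # One shared block reused at every recursion depth (parameter sharing).
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # Router scores each token's hidden state against the possible depths.
        self.router = nn.Linear(d_model, max_recursions)
        self.max_recursions = max_recursions

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        # Assign each token a recursion depth in {1, ..., max_recursions}.
        depths = self.router(hidden).argmax(dim=-1) + 1  # (batch, seq_len)

        for step in range(1, self.max_recursions + 1):
            # Tokens whose assigned depth has not yet been reached stay active.
            active = (depths >= step).unsqueeze(-1)  # (batch, seq_len, 1)
            updated = self.shared_block(hidden)
            # Only active tokens are refined; exited tokens pass through unchanged.
            # (The actual architecture skips compute for exited tokens entirely.)
            hidden = torch.where(active, updated, hidden)
        return hidden


# Usage: route a toy batch through the shared recursion stack.
model = MoRSketch()
x = torch.randn(2, 16, 512)
out = model(x)  # (2, 16, 512)
```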
Key Components of MoR
- Dynamic routing: A router directs tokens to different recursion depths, similar to Mixture-of-Experts but with shared layers.
- Recursion-wise KV caching: Selectively caches key-value pairs only for tokens that are still active at each recursion step, slashing memory traffic (see the sketch after this list).
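The caching idea can be illustrated with a short sketch: only tokens still active at a given recursion step contribute key-value entries to that step's cache, so deeper steps attend over a shrinking cache. The helper below is a hypothetical, simplified illustration (single head, no batching), not the paper's implementation.

```python
import torch

def recursion_wise_kv_cache(keys, values, depths, max_recursions):
    """Illustrative sketch of recursion-wise KV caching (hypothetical helper).

    keys, values: (seq_len, d_head) projections for one attention head.
    depths:       (seq_len,) per-token recursion depth assigned by the router.
    Returns a dict mapping recursion step -> (keys, values) restricted to the
    tokens still active at that step, so attention at deeper steps reads a
    smaller cache instead of the full sequence.
    """
    cache = {}
    for step in range(1, max_recursions + 1):
        active = depths >= step                       # tokens still recursing
        cache[step] = (keys[active], values[active])  # cache only active tokens
    return cache

# Toy example: 6 tokens with assigned depths 1-3; deeper steps see fewer entries.
k = torch.randn(6, 64)
v = torch.randn(6, 64)
depths = torch.tensor([1, 3, 2, 1, 3, 2])
cache = recursion_wise_kv_cache(k, v, depths, max_recursions=3)
print({step: kv[0].shape[0] for step, kv in cache.items()})  # {1: 6, 2: 4, 3: 2}
```

In this toy run, step 1 caches all six tokens while step 3 caches only the two tokens routed to full depth, which is where the memory-traffic savings come from.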
Performance Gains
Benchmark tests on models from 135M to 1.7B parameters show MoR delivers higher few-shot accuracy with up to 50% fewer parameters, cuts training time by 19%, reduces peak memory by 25%, and more than doubles inference throughput in some configurations.
Enterprise Adoption Path
Rather than training from scratch, enterprises can uptrain open-source models with MoR, minimizing upfront costs. Developers gain new “knobs” to balance efficiency and performance based on specific deployment needs.
Future Outlook
Because the design is modality-agnostic, MoR could extend its efficiency gains to video, audio, and multimodal AI workflows, making large-scale AI more accessible and cost-effective for a wider range of enterprise applications.