
MoR Architecture Boosts LLM Efficiency and Throughput

KAIST AI and Mila researchers unveiled Mixture-of-Recursions (MoR), a novel Transformer framework that merges parameter sharing with adaptive computation. By routing tokens to variable recursion depths and applying recursion-wise KV caching, MoR roughly halves parameter count, trims training time by 19%, cuts peak memory by 25%, and can more than double inference throughput. Enterprises can uptrain existing models with MoR to balance performance and efficiency cost-effectively.

Published July 27, 2025 at 01:14 PM EDT in Artificial Intelligence (AI)

Researchers at KAIST AI and Mila have unveiled Mixture-of-Recursions (MoR), a new Transformer architecture that boosts LLM memory and compute efficiency while improving accuracy and throughput under fixed budgets.

Scaling Challenges of LLMs

As LLMs grow, their memory footprints and computational needs outpace many organizations’ infrastructure. Techniques like layer tying and early exiting address parts of the problem but fall short of unifying parameter sharing with adaptive compute.

Introducing Mixture-of-Recursions

MoR builds on recursive transformers by partitioning the model into shared recursion blocks. A lightweight routing mechanism assigns a recursion depth to each token, dynamically adjusting how much "thinking" a token receives based on its complexity.
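To make the idea concrete, here is a minimal PyTorch sketch of the shared-block-plus-router pattern. It is not the authors' implementation: the class name, dimensions, and the hard argmax routing are illustrative assumptions, and the paper trains its router with differentiable routing schemes rather than a plain argmax.

import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """One shared Transformer block applied a variable number of times per token."""

    def __init__(self, d_model=256, n_heads=4, max_recursions=3):
        super().__init__()
        # A single block whose weights are reused at every recursion depth (parameter sharing).
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Lightweight router that scores each token to pick its recursion depth.
        self.router = nn.Linear(d_model, max_recursions)
        self.max_recursions = max_recursions

    def forward(self, x):  # x: (batch, seq_len, d_model)
        # Hard depth assignment for illustration only.
        depths = self.router(x).argmax(dim=-1) + 1         # (batch, seq_len), values 1..max
        for step in range(1, self.max_recursions + 1):
            still_active = (depths >= step).unsqueeze(-1)  # which tokens keep "thinking"
            refined = self.shared_block(x)                 # same weights at every step
            x = torch.where(still_active, refined, x)      # finished tokens pass through
        return x

# Toy usage: 2 sequences of 8 tokens with 256-dimensional embeddings.
out = MoRSketch()(torch.randn(2, 8, 256))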

Key Components of MoR

  • Dynamic routing: A router directs tokens to different recursion depths, similar to Mixture-of-Experts but with shared layers.
  • Recursion-wise KV caching: Selectively caches key-value pairs only for the tokens still active at each recursion step, slashing memory traffic (see the sketch after this list).
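The following sketch shows the caching idea under stated assumptions: the function name, shapes, and the generic kv_proj module are hypothetical, not the paper's API. The point it illustrates is that the cache at deeper recursion steps holds entries only for the shrinking set of active tokens rather than full-length K/V at every depth.

import torch
import torch.nn as nn

def recursionwise_kv_cache(hidden, depths, kv_proj, max_recursions):
    """Store key/value pairs only for tokens still active at each recursion step.

    hidden:  (seq_len, d_model) hidden states for one sequence
    depths:  (seq_len,) per-token recursion depth chosen by the router
    kv_proj: any module projecting hidden states to concatenated keys and values
    """
    cache = {}
    for step in range(1, max_recursions + 1):
        active_idx = (depths >= step).nonzero(as_tuple=True)[0]  # tokens reaching this depth
        if active_idx.numel() == 0:
            break
        # Project and cache K/V for the active subset only, so deeper steps touch less memory.
        cache[step] = {"indices": active_idx, "kv": kv_proj(hidden[active_idx])}
    return cache

# Toy usage: 16 tokens, random depths between 1 and 3, a linear K/V projection.
hidden = torch.randn(16, 256)
depths = torch.randint(1, 4, (16,))
cache = recursionwise_kv_cache(hidden, depths, nn.Linear(256, 2 * 256), max_recursions=3)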

Performance Gains

Benchmark tests on models from 135M to 1.7B parameters show MoR delivers higher few-shot accuracy with up to 50% fewer parameters, cuts training time by 19%, reduces peak memory by 25%, and more than doubles inference throughput in some configurations.

Enterprise Adoption Path

Rather than training from scratch, enterprises can uptrain open-source models with MoR, minimizing upfront costs. Developers gain new “knobs” to balance efficiency and performance based on specific deployment needs.
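As a rough illustration of those knobs, the configuration below is purely hypothetical; the parameter names do not correspond to any published MoR API, they simply mirror the levers described above.

# Hypothetical uptraining configuration; every key name here is illustrative.
mor_uptrain_config = {
    "base_checkpoint": "open-source-llm",  # uptrain an existing model rather than train from scratch
    "max_recursions": 3,                   # upper bound on per-token "thinking" depth
    "routing": "expert-choice",            # how tokens are assigned to recursion depths
    "kv_cache": "recursion-wise",          # cache K/V only for tokens active at each step
    "target": "throughput",                # bias the efficiency/quality trade-off
}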

Future Outlook

MoR's modality-agnostic design means its efficiency gains could extend to video, audio, and multi-modal AI workflows, making large-scale AI more accessible and cost-effective for a wider range of enterprise applications.

