
Continual Learning Improves With Sparse Rank Adaptation In Large Models
The challenge of continual learning, enabling artificial intelligence to acquire knowledge incrementally without forgetting previously learned information, remains a significant hurdle in the development of truly adaptable systems. Current approaches utilising pre-trained models often struggle with ‘catastrophic forgetting’ and ‘task interference’ as new information overwrites existing knowledge, so researchers are focusing on methods that update model parameters selectively, minimising disruption to established learning. A team comprising Haodong Lu and Dong Gong from the University of New South Wales, alongside Chongyang Zhao, Jason Xue, Kristen Moore and Lina Yao from CSIRO’s Data61, presents a novel approach in the article ‘Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning’. Their work details MoRA, a Mixture-of-Rank Adaptive learning technique that decomposes parameter updates into smaller, independent components, allowing more refined and efficient knowledge integration and retention within large language models and other pre-trained models (PTMs).
Recent attention in this area has focused on leveraging low-rank adaptation techniques and Mixture-of-Experts (MoE) architectures to improve performance and efficiency. Building on both ideas, MoRA decomposes low-rank updates into individual rank-1 components and treats each component as an independent expert, enabling a granular, adaptive learning process.
Existing continual learning systems often exhibit catastrophic forgetting, where learning a new task drastically reduces performance on previously learned tasks, and task interference, where the learning of one task hinders the performance of another. Strategies to mitigate these issues include regularization techniques, which constrain model updates to prevent drastic changes, architectural modifications, and replay-based methods, which store and revisit previously seen data. Recent advancements focus on leveraging the inherent low-rank structure within pre-trained models, adapting only a small subset of parameters while freezing the majority, reducing the risk of catastrophic forgetting and improving learning efficiency. MoE architectures further enhance this process by dynamically routing inputs to different specialized experts, allowing the model to learn diverse tasks without interfering with existing knowledge.
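To make the low-rank adaptation idea concrete, a minimal PyTorch sketch is shown below: the pre-trained weight stays frozen and only a small rank-r correction is trained for the new task. The class name, rank, initialisation and scaling are illustrative assumptions rather than any particular method’s implementation.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Minimal LoRA-style adapter (illustrative): the pre-trained weight is
    frozen and only a small rank-r update delta_W = B @ A is trained."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():        # freeze the pre-trained weights
            p.requires_grad = False
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # rank x in
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # out x rank, zero-init
        self.scale = scale

    def forward(self, x):
        # frozen path plus trainable low-rank correction
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())


# Usage: wrap a projection layer and train only A and B on the new task.
layer = LowRankAdapter(nn.Linear(768, 768), rank=8)
y = layer(torch.randn(4, 768))
```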
However, conventional MoE methods often activate entire low-rank adapters per input, leading to subspace interference and hindering the selective reuse of beneficial components. MoRA addresses this limitation by decomposing each low-rank update into individual rank-1 components, effectively creating a fine-grained mixture of experts. This granular approach allows the model to selectively activate only the most relevant components for each input, reducing interference and promoting efficient knowledge transfer. A rank-1 component is the simplest possible matrix update, formed from the outer product of two vectors, which keeps each expert cheap to store and combine.
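The rank-1 mixture idea can be sketched in a few lines, assuming a simple per-input sigmoid router over the rank-1 components; the class, router and initialisation below are illustrative assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class MixtureOfRankOne(nn.Module):
    """Sketch of a rank-1 mixture: a rank-r update B @ A is viewed as a sum of
    r rank-1 terms b_i a_i^T, each gated independently for every input."""

    def __init__(self, dim: int, num_ranks: int = 16):
        super().__init__()
        self.a = nn.Parameter(torch.randn(num_ranks, dim) * 0.01)  # "down" vectors a_i
        self.b = nn.Parameter(torch.zeros(num_ranks, dim))         # "up" vectors b_i, zero-init
        self.router = nn.Linear(dim, num_ranks)                    # one gate score per rank-1 expert

    def forward(self, x):                        # x: (batch, dim)
        gates = torch.sigmoid(self.router(x))    # (batch, num_ranks), per-input gate per expert
        h = x @ self.a.t()                       # (batch, num_ranks): contribution of each a_i
        return (gates * h) @ self.b              # weighted sum of rank-1 experts
```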
Researchers conducted a thorough analysis of activation rank within a pre-trained transformer model, revealing a surprisingly low intrinsic dimensionality across both vision and text encoding pathways. The study demonstrates that capturing 99% of the information contained within layer activations typically requires fewer than 16 ranks, indicating substantial redundancy in the model’s representational capacity. This observation holds true when examining different projection locations within layers, such as those associated with attention mechanisms and multi-layer perceptrons, suggesting a fundamental property of deep neural networks. The identified redundancy presents opportunities to reduce the number of parameters and computational cost.
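As a rough illustration of how such an intrinsic dimensionality can be measured, the snippet below counts how many singular values are needed to capture 99% of the energy in a matrix of collected activations; the study’s exact measurement protocol may differ.

```python
import torch

def effective_rank(activations: torch.Tensor, energy: float = 0.99) -> int:
    """Number of singular values needed to capture `energy` of the activation
    spectrum (squared singular values). Diagnostic sketch only."""
    # activations: (num_tokens, hidden_dim) collected at one projection site
    acts = activations - activations.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(acts)
    cum = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
    return int((cum < energy).sum().item()) + 1


# Synthetic low-rank activations: a 768-dim space with at most 12 directions of variation.
acts = torch.randn(2048, 768) @ torch.randn(768, 12) @ torch.randn(12, 768)
print(effective_rank(acts))  # prints a small number (<= 12) despite the 768-dim space
```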
In practical terms, the low activation rank opens the door to model compression and acceleration, reducing computational costs and enabling deployment on resource-constrained devices. Researchers explored various techniques for exploiting the redundancy, including pruning, which removes unimportant connections; quantization, which reduces the precision of numerical representations; and knowledge distillation, which transfers knowledge from a larger model to a smaller one. These techniques complement MoRA’s efficiency, making it even more practical for real-world applications, and the findings highlight the potential for more sustainable and accessible AI systems.
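As a toy illustration of the precision-reduction idea, the snippet below applies symmetric per-tensor int8 quantization to a weight matrix; it is a generic example and not tied to MoRA’s implementation.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Toy symmetric per-tensor int8 quantization: store weights as 8-bit
    integers plus a single floating-point scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale


w = torch.randn(768, 768)
q, scale = quantize_int8(w)
w_hat = q.float() * scale          # dequantised approximation of w
print((w - w_hat).abs().max())     # small quantization error
```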
MoRA implements rank pruning and activation budgets, allowing it to adaptively select a sparse mixture of ranks for each input, optimising resource allocation and enhancing the model’s ability to retain previously learned information while incorporating new tasks. This selective activation strategy further reduces interference and promotes efficient knowledge transfer. Activation budgets limit the number of active ranks, enforcing sparsity and preventing overfitting.
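A minimal sketch of a per-input activation budget is shown below, assuming a top-k router that keeps only a fixed number of rank-1 components active; the routing and budget mechanics are illustrative assumptions, not the authors’ exact mechanism.

```python
import torch
import torch.nn as nn

class BudgetedRankMixture(nn.Module):
    """Illustrative top-k gating over rank-1 components: only `budget` ranks
    are active per input, enforcing sparsity in the update."""

    def __init__(self, dim: int, num_ranks: int = 16, budget: int = 4):
        super().__init__()
        self.a = nn.Parameter(torch.randn(num_ranks, dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(num_ranks, dim))
        self.router = nn.Linear(dim, num_ranks)
        self.budget = budget

    def forward(self, x):                                  # x: (batch, dim)
        scores = self.router(x)                            # (batch, num_ranks)
        topk = scores.topk(self.budget, dim=-1)            # keep the `budget` best ranks
        mask = torch.zeros_like(scores).scatter_(-1, topk.indices, 1.0)
        gates = torch.softmax(scores, dim=-1) * mask       # zero out non-selected ranks
        return (gates * (x @ self.a.t())) @ self.b         # sparse sum of rank-1 updates
```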
Researchers validated MoRA on continual learning tasks utilising both CLIP (Contrastive Language-Image Pre-training) and large language models. Experiments confirm that MoRA enhances continual learning performance, improving generalisation while reducing forgetting, and the approach achieves state-of-the-art results on several benchmark datasets.
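For context on how results of this kind are commonly scored, the snippet below computes the standard average-accuracy and forgetting metrics from a task-by-task accuracy matrix; it only illustrates the evaluation criteria and does not reproduce the paper’s benchmarks or numbers.

```python
import numpy as np

def average_accuracy_and_forgetting(acc: np.ndarray):
    """Standard continual-learning metrics.
    acc[i, j] = accuracy on task j after training on task i."""
    T = acc.shape[0]
    final = acc[T - 1]                 # accuracy on each task after the last one
    avg_acc = final.mean()
    # forgetting: best accuracy ever reached on a task minus its final accuracy
    forgetting = np.mean([acc[:T - 1, j].max() - final[j] for j in range(T - 1)])
    return avg_acc, forgetting


acc = np.array([[0.90, 0.00, 0.00],
                [0.85, 0.88, 0.00],
                [0.82, 0.84, 0.91]])
print(average_accuracy_and_forgetting(acc))  # (~0.86 average accuracy, ~0.06 forgetting)
```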
Researchers believe that MoRA represents a significant step towards building more robust and adaptable AI systems. The proposed approach addresses the limitations of existing continual learning methods, enabling models to learn new tasks without forgetting previously acquired knowledge. The findings have implications for a wide range of applications, including robotics, natural language processing, and computer vision. Future work will focus on extending MoRA to more complex tasks and datasets, exploring different architectures and training strategies, and developing more efficient and scalable implementations.
The study’s findings underscore the importance of understanding the inherent structure of deep neural networks and leveraging this knowledge to develop more efficient and effective learning algorithms. Researchers plan to investigate the relationship between activation rank and model performance, exploring the theoretical limits of compression and adaptation. This research will contribute to a deeper understanding of the principles underlying deep learning and pave the way for the development of more intelligent and sustainable AI systems. The ultimate goal is to create AI systems that can learn continuously and adapt to changing environments, just like humans.