Meta proposes new scalable memory layers that improve knowledge and reduce hallucinations



As companies continue to deploy large language models (LLMs) in various applications, one of their biggest challenges is improving the models’ factual knowledge and reducing hallucinations. In a new paper, researchers at Meta AI propose “scalable memory layers,” which could be one of several possible solutions to this problem.

Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute. The architecture is useful for applications where you can spare extra memory for factual knowledge but also want the inference speed of nimbler models.

Dense layers and memory layers

Traditional language models use “dense layers” to encode vast amounts of information in their parameters. In dense layers, all parameters are used at their full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but growing them requires additional compute and energy resources.
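
For intuition, here is a minimal sketch of a standard dense feed-forward block in PyTorch; the dimensions are purely illustrative, and the point is that every weight participates in every forward pass:

```python
# Minimal sketch of a dense feed-forward block (dimensions are illustrative).
# Every parameter is used for every input token, so capacity and compute grow together.
import torch.nn as nn

dense_ffn = nn.Sequential(
    nn.Linear(1024, 4096),  # all 1024 x 4096 weights are touched per token
    nn.GELU(),
    nn.Linear(4096, 1024),
)
```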

In contrast, for simple factual knowledge, much simpler layers with associative memory architectures would be more efficient and interpretable. This is what memory layers do. They use simple sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Sparse layers take up more memory than dense layers but only use a small portion of their parameters at a time, which makes them far more compute-efficient.
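
As a rough illustration of the idea (not Meta’s implementation, which adds further tricks such as product-key lookup to make the search efficient), a key-value memory layer scores all keys but retrieves only the top-k values. The PyTorch sketch below is hypothetical; all names and sizes are illustrative:

```python
# Hypothetical sketch of a key-value memory layer with sparse activation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    def __init__(self, dim: int, num_keys: int = 4096, top_k: int = 32):
        super().__init__()
        # Learnable keys and values act as an associative memory.
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.top_k = top_k

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim). Score every key, but only retrieve the top-k values.
        scores = query @ self.keys.t()                    # (batch, num_keys)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)          # (batch, top_k)
        selected = self.values[topk_idx]                  # (batch, top_k, dim)
        # Only top_k of num_keys value vectors contribute, so compute stays low
        # even if the memory table grows very large.
        return (weights.unsqueeze(-1) * selected).sum(dim=1)
```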

Memory layers have existed for several years but are rarely used in modern deep learning architectures, in part because they are not optimized for current hardware accelerators.

Current frontier LLMs usually use some form of “mixture of experts” (MoE) architecture, which uses a mechanism vaguely similar to memory layers. MoE models are composed of many smaller expert components that specialize in specific tasks. At inference time, a routing mechanism determines which expert to activate based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts, providing more granular control over the parameters that become activated during inference.
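
For contrast, a heavily simplified top-1 MoE block might look like the sketch below; the router design and expert count are illustrative and do not correspond to any production MoE model:

```python
# Illustrative top-1 mixture-of-experts block: the router sends each input
# to a single small feed-forward expert.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Pick one expert per input based on router scores.
        gate = F.softmax(self.router(x), dim=-1)   # (batch, num_experts)
        top_prob, top_idx = gate.max(dim=-1)       # (batch,)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out
```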

Upgrading memory layers

Memory layers are light on compute but heavy on memory, which presents particular challenges for current hardware and software frameworks. In their paper, the Meta researchers propose several modifications that solve these challenges and make it possible to use memory layers at scale.

Memory layers can store knowledge in parallel across multiple GPUs without slowing down the model (source: arXiv)

First, the researchers configured the memory layers for parallelization, distributing them across several GPUs to store millions of key-value pairs without changing other layers in the model. They also implemented a dedicated CUDA kernel for handling high-memory-bandwidth operations. And they developed a parameter-sharing mechanism that supports a single set of memory parameters across multiple memory layers within a model. This means that the keys and values used for lookups are shared across the layers.
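
Conceptually, the parameter-sharing idea can be pictured as several memory layers querying one shared key-value table. The simplified sketch below only illustrates that sharing; the class names and the lookup itself are assumptions, and it leaves out the distributed CUDA machinery described above:

```python
# Illustrative parameter sharing: every memory layer reads the same key/value table.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedMemoryPool(nn.Module):
    """One key/value table shared by all memory layers in the model."""
    def __init__(self, dim: int, num_keys: int = 4096):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_keys, dim) * 0.02)

    def lookup(self, query: torch.Tensor, top_k: int = 32) -> torch.Tensor:
        scores = query @ self.keys.t()
        topk_scores, topk_idx = scores.topk(top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)
        return (weights.unsqueeze(-1) * self.values[topk_idx]).sum(dim=1)

class MemoryLayer(nn.Module):
    """A layer with its own query projection that reads the shared pool."""
    def __init__(self, dim: int, pool: SharedMemoryPool):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.pool = pool  # the same object in every layer, so parameters are shared

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pool.lookup(self.query_proj(x))

# Three memory layers, one set of keys and values.
pool = SharedMemoryPool(dim=1024)
memory_layers = nn.ModuleList([MemoryLayer(1024, pool) for _ in range(3)])
```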

These modifications make it possible to implement memory layers within LLMs without slowing down the model.

“Memory layers with their sparse activations nicely complement dense networks, providing increased capacity for knowledge acquisition while being light on compute,” the researchers write. “They can be efficiently scaled and provide practitioners with an attractive new direction to trade off memory with compute.”

To test memory layers, the researchers modified Llama models by replacing one or more dense layers with a shared memory layer. They compared the memory-enhanced models against dense LLMs as well as MoE and PEER models on several tasks, including factual question answering, scientific and common-sense world knowledge, and coding.
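
A hypothetical sketch of that substitution: swap the dense feed-forward sub-module of selected transformer blocks for a memory layer such as the one sketched earlier. The attribute names (`blocks`, `ffn`) are assumptions for a generic transformer, not Llama’s actual module names:

```python
# Hypothetical helper: replace the dense FFN in every n-th transformer block
# with a (shared) memory layer. Attribute names are assumed, not Llama's API.
import torch.nn as nn

def swap_ffn_for_memory(model: nn.Module, memory_layer: nn.Module,
                        every_n: int = 4) -> nn.Module:
    for i, block in enumerate(model.blocks):   # assumed list of transformer blocks
        if i % every_n == 0:
            block.ffn = memory_layer           # reusing one module shares its parameters
    return model
```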

A 1.3B-parameter memory model (solid line) trained on 1 trillion tokens approaches the performance of a 7B model (dashed line) on factual question-answering tasks as it is given more memory parameters (source: arXiv)

Their results show that memory models improve significantly over dense baselines and compete with models that use two to four times more compute. They also match the performance of MoE models that have the same compute budget and parameter count. The models’ performance is especially notable on tasks that require factual knowledge. For example, on factual question answering, a memory model with 1.3 billion parameters approaches the performance of Llama-2-7B, which was trained on twice as many tokens with ten times more compute.

Additionally, the researchers found that the benefits of memory models remain consistent across model sizes as they scaled their experiments from 134 million to 8 billion parameters.

“Given these findings, we strongly advocate that memory layers should be integrated into all next-generation AI architectures,” the researchers write, adding that there is still plenty of room for improvement. “In particular, we hope that new learning methods can be developed to push the effectiveness of these layers even further, enabling less forgetting, fewer hallucinations and continual learning.”


