Google’s new neural network architecture for LLMs separates memory components to rein in skyrocketing memory and compute costs

A new neural network architecture developed by researchers at Google could solve one of the major challenges for large language models (LLMs): extending their memory at inference time without skyrocketing memory and compute costs. Called Titans, the architecture allows models to find and store the small pieces of information that matter in long sequences during inference.

Titans combines traditional LLM attention blocks with “neural memory” layers that allow models to handle both short- and long-term memory tasks efficiently. According to the researchers, LLMs that leverage long-term neural memory can scale to millions of tokens and outperform both classic LLMs and alternatives such as Mamba, while having far fewer parameters.

Attention layers and linear models

The classic transformer architecture used in LLMs relies on the self-attention mechanism to compute relationships between tokens. This is an effective technique for learning complex, granular patterns in token sequences. However, as the sequence length grows, the cost of computing and storing attention increases quadratically.
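To see where that quadratic cost comes from, here is a minimal PyTorch sketch of naive single-head self-attention (illustrative only, not any production implementation): the n × n score matrix is the term whose memory and compute grow quadratically with sequence length n.

```python
# Minimal sketch: the (n, n) score matrix is why vanilla self-attention
# scales quadratically with sequence length n.
import torch

def naive_self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (n, d) token embeddings; returns (n, d) attended outputs."""
    n, d = x.shape
    q, k, v = x, x, x                       # single head, no projections, for brevity
    scores = q @ k.T / d ** 0.5             # (n, n): doubling n quadruples this matrix
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

out = naive_self_attention(torch.randn(4096, 64))
print(out.shape)                            # torch.Size([4096, 64])
```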

Recent work has proposed alternative architectures with linear complexity that can scale without exploding memory and compute costs. However, the Google researchers argue that linear models do not perform competitively with classic transformers because they compress their contextual data and tend to miss important details.

They propose that the ideal architecture should have various memory components that can be coordinated to use existing knowledge, memorize new facts, and learn abstractions from their context.

“We argue that in an effective learning paradigm, similar to the human brain, there are distinct but interconnected modules, each responsible for a component critical to the learning process,” the researchers write.

Long-term neural memory

“Memory is a confederation of systems – e.g., short-term, working, and long-term memory – each of which serves a different function with distinct neural structures, and each of which can operate independently,” the researchers write.

To fill the gap in current language models, the researchers propose a “neural long-term memory” module that can learn new information at the time of inference, without the inefficiencies of the full attention mechanism. Instead of storing information during training, the neural memory module learns a function that can remember new facts during inference and dynamically adjust the memory process based on the data it encounters. This solves the generalization problem that other neural network architectures suffer from.

To decide what information is worth remembering, the neural memory module uses the concept of “surprise.” The more a sequence of tokens deviates from the type of information stored in the model’s weights and existing memory, the more surprising it is and the more worthwhile it is to remember. This allows the module to use its limited memory efficiently and only store data that adds useful information to what the model already knows.
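As a rough illustration of that idea, the PyTorch sketch below (hypothetical names and sizes, not the paper’s exact formulation) treats surprise as the error the memory makes when recalling a new key–value association: the larger the error, the larger the gradient, and the harder the module writes.

```python
# Hedged sketch of surprise-driven memorization: inputs the memory predicts
# poorly produce a larger loss, hence a larger gradient, hence a bigger write.
import torch
import torch.nn as nn

memory = nn.Linear(64, 64, bias=False)      # tiny stand-in for the neural memory

def surprise_update(key: torch.Tensor, value: torch.Tensor, lr: float = 0.1) -> float:
    """Write (key, value) into memory in proportion to how surprising it is."""
    pred = memory(key)
    loss = ((pred - value) ** 2).mean()      # how badly memory recalls this association
    grad = torch.autograd.grad(loss, memory.weight)[0]
    with torch.no_grad():
        memory.weight -= lr * grad           # bigger surprise -> bigger update
    return loss.item()                       # the "surprise" before the write

k, v = torch.randn(64), torch.randn(64)
print(surprise_update(k, v))
```

In this toy version, an association the memory already reconstructs well barely changes the weights, while a genuinely novel one triggers a strong update.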

In order to process very long data sequences, the neural memory module has an adaptive forgetting mechanism that allows it to remove information that is no longer needed, helping to manage the limited capacity of memory.
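What such a forgetting mechanism might look like is sketched below; the data-dependent decay gate is an assumption for illustration, not the paper’s exact update rule.

```python
# Illustrative forgetting gate: a learned decay in (0, 1) scales the memory
# state down before each write, freeing capacity for new information.
import torch
import torch.nn as nn

gate = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())   # predicts how much to keep

def write_with_forgetting(mem_state: torch.Tensor, update: torch.Tensor,
                          token: torch.Tensor) -> torch.Tensor:
    keep = gate(token)                       # value in (0, 1), conditioned on the input
    return keep * mem_state + update         # decay old content, then add the new

state = torch.zeros(64)
state = write_with_forgetting(state, update=torch.randn(64), token=torch.randn(64))
```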

The memory module can complement the attention mechanism of current transformer models, which the researchers describe as “short-term memory modules that attend to the current context window size. On the other hand, our neural memory can play the role of long-term memory, with the ability to continuously learn from data and store it in its weights.”

Titan architecture

Example of Titan architecture (Source: arXiv)

The researchers describe Titans as a family of models that integrate existing transformer blocks with neural memory modules. A Titans model consists of three key components: a “core” module, which acts as short-term memory and uses the classic attention mechanism to attend to the current segment of input tokens the model is processing; a “long-term memory” module, which uses the neural memory architecture to store information beyond the current context; and a “persistent memory” module, a set of learnable parameters that remain fixed after training and store time-independent knowledge.
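The sketch below shows one plausible way these three components could be wired together in PyTorch, based solely on the description above; the module choices, names, and the scheme of prepending memory tokens to the attention context are assumptions, not Google’s released code.

```python
# Rough sketch of a Titans-style block: short-term attention over the current
# segment, a stand-in long-term memory, and fixed persistent-memory tokens.
import torch
import torch.nn as nn

class TitansBlockSketch(nn.Module):
    def __init__(self, dim: int = 64, n_persistent: int = 16):
        super().__init__()
        self.core = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)  # short-term memory
        self.long_term = nn.Linear(dim, dim)                                   # stand-in for neural memory
        self.persistent = nn.Parameter(torch.randn(1, n_persistent, dim))      # fixed after training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, segment_len, dim) for the current segment."""
        b = x.size(0)
        retrieved = self.long_term(x)                   # historical context retrieved per token
        prefix = self.persistent.expand(b, -1, -1)      # task knowledge, input-independent
        ctx = torch.cat([prefix, retrieved, x], dim=1)  # attention sees all three memory sources
        out, _ = self.core(x, ctx, ctx)                 # queries come from the current segment
        return out

block = TitansBlockSketch()
print(block(torch.randn(2, 128, 64)).shape)             # torch.Size([2, 128, 64])
```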

The researchers suggest different ways to connect the three components. But in general, the main advantage of this architecture is that the attention and memory modules can complement each other. For example, the attention layers can determine which parts of the current context window to store in long-term memory based on historical and current context. Meanwhile, long-term memory provides historical knowledge that is not present in the current attentional context.

The researchers conducted small-scale tests of Titans models with 170 million to 760 million parameters on a variety of tasks, including language modeling and long-sequence language tasks. They compared the performance of Titans with various transformer-based models, linear models such as Mamba, and hybrid models such as Samba.

Titans (red line) outperforms other models, including GPT-4, on long-sequence tasks in both few-shot and fine-tuned settings (Source: arXiv)

Titans demonstrated strong language modeling performance compared to other models, outperforming both transformers and linear models of similar sizes.

The difference in performance is particularly pronounced on long-sequence tasks such as “needle in a haystack,” where the model must retrieve bits of information from a very long sequence, and BABILong, where the model has to reason over facts spread across very long documents. On these tasks, Titans outperformed models with orders of magnitude more parameters, including GPT-4 and GPT-4o-mini, as well as a Llama-3 model enhanced with retrieval-augmented generation (RAG).

Additionally, the researchers were able to expand Titans’ context window up to 2 million tokens while keeping memory costs at a moderate level.

The models still need to be tested at larger sizes, but the results of the study suggest that the researchers have not yet reached the limit of Titans’ potential.

What does this mean for enterprise applications?

With Google at the forefront of long-context models, we can expect this technology to make its way into both its private and open models, such as Gemini and Gemma.

As LLMs support longer context windows, there is growing potential for building applications that insert new knowledge directly into the prompt rather than relying on techniques like RAG. The development cycle for building and iterating on prompt-based applications is much faster than for complex RAG pipelines. Meanwhile, architectures like Titans can help reduce inference costs for very long sequences, allowing companies to deploy LLM applications for more use cases.

Google plans to release the PyTorch and JAX code for training and evaluating Titans models.


