Beyond RAG: How cache-augmented generation reduces latency and complexity for smaller workloads
Retrieval-Augmented Generation (RAG) is the de facto method for adapting large language models (LLMs) to custom and proprietary information. However, RAG comes with upfront technical costs and can be slow. Thanks to advances in long-context LLMs, companies can now bypass RAG by including all of their proprietary information in the prompt.
A new study from National Chengchi University in Taiwan shows that by using long-context LLMs and caching techniques, you can build custom applications that outperform RAG pipelines. This approach, called Cache-Augmented Generation (CAG), can be a simple and efficient replacement for RAG in enterprise environments where the knowledge corpus fits within the context window of the model.
Limitations of RAG
RAG is an effective method for handling open-domain questions and specialized tasks. It uses retrieval algorithms to gather documents relevant to the query and adds them as context so the LLM can compose more accurate responses.
However, RAG introduces several limitations for LLM applications. The added retrieval step introduces latency that can degrade the user experience. The quality of the result also depends on the document selection and ranking step. In many cases, the limitations of the models used for retrieval require documents to be broken down into smaller chunks, which can harm the retrieval process.
And in general, RAG increases the complexity of the LLM application and requires the development, integration and maintenance of additional components. The additional overhead slows down the development process.
Cache-augmented generation
The alternative to developing a RAG pipeline is to put the entire document corpus into the prompt and let the model choose which parts are relevant to the request. This approach eliminates the complexity of the RAG pipeline and the problems caused by retrieval errors.
However, preloading all documents into the prompt comes with three key challenges. First, long prompts slow down the model and increase inference costs. Second, the length of the LLM's context window sets a limit on the number of documents that fit in the prompt. Finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its responses. So simply cramming all of your documents into the prompt instead of selecting the most relevant ones can ultimately degrade the model's performance.
The proposed CAG approach leverages three key trends to address these challenges.
First, advanced caching techniques make processing prompt templates faster and cheaper. The premise of CAG is that the knowledge documents are included in every prompt sent to the model. Therefore, the attention values of their tokens can be computed in advance rather than at request time, which cuts the time needed to process user requests.
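As a rough illustration, here is a minimal sketch of that idea using the Hugging Face transformers library: run a single forward pass over the static knowledge prefix, keep the resulting key-value cache, and reuse it for every question. The model name, document file, and prompt format are assumptions for illustration, not the paper's exact setup.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model and document file; replace with your own.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

# Pay the cost of processing the knowledge documents once:
# run a forward pass over the static prefix and keep its key-value cache.
knowledge_prefix = "Answer questions using only these documents:\n" + open("docs.txt").read()
prefix_inputs = tokenizer(knowledge_prefix, return_tensors="pt").to(model.device)
with torch.no_grad():
    prefix_cache = model(**prefix_inputs, use_cache=True).past_key_values

def answer(question: str) -> str:
    # At query time, the cached prefix is reused and only the new tokens are processed.
    full = tokenizer(
        knowledge_prefix + "\n\nQuestion: " + question + "\nAnswer:",
        return_tensors="pt",
    ).to(model.device)
    cache = copy.deepcopy(prefix_cache)  # keep the original cache intact for the next query
    output = model.generate(**full, past_key_values=cache, max_new_tokens=200)
    return tokenizer.decode(output[0, full.input_ids.shape[-1]:], skip_special_tokens=True)
```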
Leading LLM providers such as OpenAI, Anthropic, and Google offer prompt caching for the repetitive parts of your prompt, including the knowledge documents and instructions you place at the beginning of it. Anthropic says prompt caching can reduce costs by up to 90% and latency by around 85% for the cached parts of a prompt. Equivalent caching capabilities have been developed for open-source LLM hosting platforms.
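For hosted models, the same effect comes from the provider's prompt-caching API. The sketch below uses Anthropic's Python SDK and marks the static knowledge block as cacheable; the model name, file path, and question are placeholders, and the exact parameters may differ from the provider's current documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

knowledge = open("docs.txt").read()  # static corpus reused across requests

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": "Answer questions using only the documents below.\n\n" + knowledge,
            # Mark the static knowledge block as cacheable so repeated requests
            # reuse the server-side cache instead of reprocessing the documents.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does the report say about Q3 revenue?"}],
)
print(response.content[0].text)
```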
Second, long-context LLMs make it easier to fit more documents and knowledge into prompts. Claude 3.5 Sonnet supports up to 200,000 tokens, GPT-4o supports 128,000 tokens, and Gemini supports up to 2 million tokens. This makes it possible to include multiple documents or entire books in the prompt.
Finally, advanced training methods enable models to better recall, reason, and answer questions over very long sequences. Over the past year, researchers have developed several LLM benchmarks for long-sequence tasks, including BABILong, LongICLBench, and RULER. These benchmarks test LLMs on hard problems such as multiple retrieval and multi-hop question answering. There is still room for improvement in this area, but AI labs continue to make progress.
As newer generations of models continue to expand their context windows, they will be able to process larger collections of knowledge. In addition, we can expect models to keep improving at extracting and using relevant information from long contexts.
“These two trends will significantly expand the usability of our approach and enable it to handle more complex and diverse applications,” the researchers write. “Therefore, our methodology is well-positioned to become a robust and versatile solution for knowledge-intensive tasks and leverage the growing capabilities of next-generation LLMs.”
RAG vs CAG
To compare RAG and CAG, the researchers ran experiments on two widely recognized question-answering benchmarks: SQuAD, which focuses on context-aware questions and answers over single documents, and HotPotQA, which requires multi-hop reasoning across multiple documents.
They used a Llama-3.1-8B model with a 128,000-token context window. For RAG, they combined the LLM with two retrieval systems to obtain passages relevant to the question: the basic BM25 algorithm and OpenAI embeddings. For CAG, they inserted multiple documents from the benchmark into the prompt and let the model determine which passages to use to answer the question. Their experiments show that CAG outperformed both RAG systems in most situations.
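To make the contrast concrete, the sparse-retrieval step of a RAG baseline looks roughly like the sketch below, written with the rank_bm25 package rather than the authors' code; the toy passages and top-k value are illustrative.

```python
from rank_bm25 import BM25Okapi

# Toy passages standing in for the benchmark corpus (illustrative only).
passages = [
    "HotPotQA requires reasoning across multiple supporting documents.",
    "SQuAD contains questions about single Wikipedia paragraphs.",
    "BM25 is a classic sparse ranking function based on term statistics.",
]
bm25 = BM25Okapi([p.lower().split() for p in passages])

query = "which benchmark needs multi-hop reasoning?"
top_passages = bm25.get_top_n(query.lower().split(), passages, n=2)

# In RAG, only the retrieved passages go into the prompt;
# in CAG, the entire corpus is preloaded and cached instead.
prompt = "Context:\n" + "\n".join(top_passages) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)
```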
“By preloading the entire context from the test set, our system eliminates retrieval errors and ensures holistic reasoning over all relevant information,” the researchers write. “This advantage is particularly evident in scenarios where RAG systems may retrieve incomplete or irrelevant passages, resulting in suboptimal response generation.”
CAG also significantly reduces the time to generate the answer, especially as the length of the reference text increases.
However, CAG is not a panacea and should be used with caution. It is well suited to settings where the knowledge base does not change frequently and is small enough to fit within the model’s context window. Companies should also watch out for cases where their documents contain conflicting facts depending on their context, which could confuse the model and bias its conclusions.
The best way to determine if CAG is right for your use case is to do some experiments. Fortunately, implementing CAG is very simple and should always be considered as a first step before investing in more development-intensive RAG solutions.