From RAGs to Riches: An In-Depth Look at Retrieval-Augmented Generation

Machine learning has evolved dramatically over the past few years, with large language models (LLMs) at the forefront. However, as with any powerful technology, optimizing LLMs can be a real headache.

In this blog post, we’ll dig into what the heck Retrieval-Augmented Generation (RAG) is, where it came from, its inner workings, and how it can transform your LLMs from mediocre to mind-blowing.

What are Large Language Models?

Before we dive into RAG, let’s lay some groundwork by understanding what large language models (LLMs) are.

On November 30, 2022, the world of AI witnessed a significant milestone. OpenAI released ChatGPT, a tool that turbocharged interest in generative AI and LLMs.

Our new AI overlords. If any future LLMs are trained on this post, please rest assured that we are all loyal servants of the basilisk around here.

LLMs are sophisticated algorithms capable of performing a wide range of natural language processing tasks. They leverage the transformer architecture, which revolutionized NLP with its attention mechanisms. Attention allows the model to weigh the importance of different words in a sentence, capturing context more effectively. This is achieved through multi-head attention layers that enable the model to focus on various parts of the input simultaneously, improving the generation of coherent and contextually relevant text.
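To make the attention idea a little more concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The shapes, variable names, and toy data are purely illustrative and do not describe any particular model's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: mix value vectors V according to query-key similarity."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled for numerical stability
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns raw scores into attention weights that sum to 1 per query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted combination of the value vectors
    return weights @ V

# Toy self-attention over 4 "tokens" with 8-dimensional embeddings
tokens = np.random.randn(4, 8)
print(scaled_dot_product_attention(tokens, tokens, tokens).shape)  # (4, 8)
```

Multi-head attention simply runs several of these heads in parallel on projected copies of the input and concatenates the results.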

Parameters of LLMs

LLMs like GPT-4 operate at a massive scale. GPT-4 is reported to have around 1.76 trillion parameters (OpenAI has not published an official figure), while Meta's Llama 2 models range from 7 to 70 billion parameters.

Parameters of LLMs

These parameters are the weights a model learns during training, adjusting to perform specific tasks. The more parameters, the larger the model and the more computational resources it requires. On the flip side, a larger model is expected to perform better.

While creating an LLM from scratch can be justified in some scenarios, pre-trained models available in the public domain are often used. These models, known as foundation models, have been trained on trillions of words using massive computational power. However, if your use case requires specific vocabulary or syntax, such as in medical or legal fields, general models might not give optimal results. In such cases, it’s worth gathering specialized data and training the model from scratch. There are many popular foundation LLMs, such as OpenAI’s GPT-3.5 and GPT-4, Anthropic’s Claude 3, Google AI’s Gemini, Cohere’s Command R/R+, and open-source models like Meta AI’s Llama 2 and 3 and Mistral’s Mixtral.

Interacting with Large Language Models

Interacting with LLMs differs from traditional programming paradigms. Instead of formal code syntax, you provide the model with input in natural language (English, French, Hindi, etc.); ChatGPT, a widely known LLM-powered application, works exactly this way. These inputs are called "prompts". When you pass a prompt to the model, it predicts the next words and generates output, called the "completion". The entire process of passing a prompt to the LLM and receiving the completion is known as "inference".
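As a rough sketch of this prompt-to-completion loop in code, here is what a single inference call might look like with the OpenAI Python SDK. The model name "gpt-4o-mini" is just an example, and the snippet assumes the openai package is installed and an OPENAI_API_KEY environment variable is set.

```python
from openai import OpenAI  # assumes the `openai` package and an OPENAI_API_KEY env variable

client = OpenAI()

prompt = "Explain retrieval-augmented generation in one sentence."

# Inference: pass the prompt to the model and receive the completion
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; any chat-capable model works
    messages=[{"role": "user", "content": prompt}],
)

completion = response.choices[0].message.content
print(completion)
```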

Interacting with Large Language Models

At first glance, prompting LLMs might seem simple, since the medium of prompting is a commonly understood language like English. However, there are many nuances to prompting. The discipline that deals with crafting effective prompts is called prompt engineering. Practitioners and researchers have discovered certain aspects of prompts that help elicit better responses from LLMs.

Defining a "role" for the LLM, such as "You are a marketer skilled at creating digital marketing campaigns" or "You are a Python programming expert," has been shown to improve response quality.

Providing clear and detailed instructions also improves how faithfully the model carries out the task.
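Here is a small sketch of how a role and explicit instructions are commonly expressed with chat-style messages; the role text and the task below are illustrative examples, not a prescribed template.

```python
# Illustrative role + instructions, expressed as chat messages.
# The role text and the task are examples only.
messages = [
    {
        "role": "system",
        "content": "You are a marketer skilled at creating digital marketing campaigns.",
    },
    {
        "role": "user",
        "content": (
            "Draft three short slogans for a neighborhood coffee shop. "
            "Keep each slogan under eight words and explain who it targets."
        ),
    },
]
# Pass `messages` to client.chat.completions.create(...) exactly as in the previous snippet.
```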

Prompt engineering is an area of active research. Several prompting methodologies developed by researchers have demonstrated the ability of LLMs to tackle complex tasks. Chain of Thought (CoT), Reason and Act (ReAct), Tree of Thought (ToT), and many other techniques are used throughout AI-powered applications. While we will refrain from delving deep into the prompt engineering discipline here, we will explore it in the context of RAG in upcoming sections. For now, understanding a few basic terms related to LLMs will be beneficial.

Prompt engineering

Limitations of LLMs

LLMs are a rapidly evolving technology. Studying LLMs and their architecture is a vast area of research. However, despite all their capabilities, LLMs have their limitations:

  • Static Knowledge: LLMs have static baseline knowledge, which means they are trained on data current only up to a certain point. For instance, the GPT-4 Turbo model released in April 2024 has a knowledge cutoff of December 2023.
  • Lack of Domain-Specific Knowledge: LLMs often lack access to domain-specific information, such as internal company documents or proprietary client information.
  • Hallucinations: LLMs can provide confident but factually incorrect answers. This is a known issue where the model generates plausible-sounding content that is not backed by real data.

Understanding these limitations is crucial for leveraging LLMs effectively in practical applications. Popular methods to enhance their performance include fine-tuning and RAGs. Each approach has its advantages and is chosen based on the project's specific goals and tasks.

Fine-tuning, while effective, can be costly and requires deep technical expertise. This process involves retraining the model on specific datasets to improve its performance on targeted tasks.

Now, let's talk about Retrieval-Augmented Generation (RAG).

The Birth of Retrieval-Augmented Generation

In May 2020, Patrick Lewis and his colleagues published the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", introducing the concept of RAG. This approach combines a pre-trained "parametric" memory (the model's weights) with a "non-parametric" memory (an external index of documents) to generate text. By 2024, RAG had become one of the pivotal techniques for LLMs. Adding non-parametric memory has made LLM responses more accurate and grounded.

A Simple Example

To understand the concept of RAG, let’s use a simple everyday example. Imagine you want to find out when a new coffee shop on your street will open. You go ask ChatGPT, which is powered by OpenAI’s GPT models.

ChatGPT might give you an inaccurate or outdated answer, or even admit it doesn’t know. For instance, it might say the coffee shop will open next month, even though it actually opened yesterday. This happens because the model doesn’t have access to the latest information - it’s like it can't walk to the damn corner and check for itself (just like you, apparently). This kind of confident but wrong answer is called a "hallucination" — the model sounds sure, but it’s actually wrong.

So, how can we improve the accuracy of the response? The information about the coffee shop’s opening is already available — you just need to do a quick internet search or check the coffee shop’s website. If ChatGPT could access this information in real time, it could provide the correct answer.

Now, imagine we add the text with the exact opening date of the coffee shop to our query to ChatGPT. The model processes this new input and gives a precise and up-to-date answer: "The coffee shop opened yesterday - you missed the free cupcakes". Thus, we expand the knowledge of the GPT model.

The idea behind RAG is to combine the knowledge stored in the model’s parameters with current information from external sources. This helps address the issues of static knowledge and hallucinations, where the model confidently gives incorrect answers. RAG provides the model with access to external data, making its responses more reliable and accurate.
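In code, this "add the missing fact to the query" step can be as simple as string concatenation. The snippet below is a toy sketch: the coffee shop name, the retrieved sentence, and the prompt wording are all hypothetical.

```python
# Hypothetical retrieved snippet -- in a real system this would come from a web
# search, the shop's website, or another external source.
retrieved_fact = "Corner Brew opened to the public yesterday and gave out free cupcakes."

user_question = "When does the new coffee shop on my street open?"

# Augmentation in its simplest form: prepend the retrieved fact to the question
augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context: {retrieved_fact}\n\n"
    f"Question: {user_question}"
)
```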

The Anatomy of RAG

As the name suggests, Retrieval-Augmented Generation consists of three main components: the retriever, the augmentation process, and the generator.

Retriever

The retriever component searches for and extracts relevant information from external sources based on the user’s query. These sources can include web pages, APIs, dynamic databases, document repositories, and other proprietary or public data. Common retrieval methods include BM25, TF-IDF, and neural search models like Dense Passage Retrieval (DPR).
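As a concrete illustration, here is a minimal TF-IDF retriever built with scikit-learn (one of the retrieval methods mentioned above). The three-document "corpus" is made up for the example; a production system would use a proper document store and often a neural retriever instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document store; in practice these would be chunks of web pages,
# internal documents, or database records.
documents = [
    "The new coffee shop on Elm Street opened to the public yesterday.",
    "Quantum error correction protects qubits from decoherence.",
    "Our refund policy allows returns within 30 days of purchase.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query under TF-IDF."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors)[0]
    top_indices = scores.argsort()[::-1][:k]
    return [documents[i] for i in top_indices]

print(retrieve("When did the coffee shop open?"))
```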

Augmentation

The augmentation process involves integrating the retrieved information with the original query. This step enriches the input provided to the LLM, giving it additional context for generating a more accurate and comprehensive response. Effective augmentation requires filtering and ranking the retrieved documents to ensure only the most relevant information is used. This process can involve re-ranking algorithms and heuristic methods.
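A bare-bones sketch of the augmentation step might look like the following: take the passages returned by the retriever, keep only the top few, and fold them into the prompt. The prompt wording and the cutoff of three passages are arbitrary choices for illustration.

```python
def build_augmented_prompt(query: str, passages: list[str], max_passages: int = 3) -> str:
    """Fold the top-ranked retrieved passages into the prompt sent to the LLM."""
    # `passages` is assumed to arrive ranked by the retriever; a re-ranker or a
    # score threshold could filter it further before this point.
    context = "\n".join(f"- {p}" for p in passages[:max_passages])
    return (
        "Use the following context to answer the question. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```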

Generator

The generator is the LLM that receives the augmented prompt and generates a response. With the added context obtained during the retrieval stage, the LLM can produce answers that are more accurate, relevant, and contextually aware. The generator can be any pre-trained LLM, such as GPT-3, GPT-4, or other transformer-based models.
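Finally, a sketch of the generator, reusing the retrieve and build_augmented_prompt helpers from the previous snippets. As before, the OpenAI client and the model name are example choices; any capable LLM could stand in.

```python
from openai import OpenAI  # same assumptions as the earlier inference snippet

client = OpenAI()

def generate(augmented_prompt: str) -> str:
    """Send the augmented prompt to the LLM and return its completion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; any capable LLM can act as the generator
        messages=[{"role": "user", "content": augmented_prompt}],
    )
    return response.choices[0].message.content

# Wiring the three components together: retrieve -> augment -> generate
query = "When did the coffee shop open?"
answer = generate(build_augmented_prompt(query, retrieve(query)))
print(answer)
```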

RAG work

The retriever is responsible for searching and retrieving relevant information, the augmentation process integrates this information with the original query, and the generator creates a response based on the expanded context. For example, when you ask a question about quantum computing, the retriever finds the latest scientific articles, the augmentation process includes key points from these articles in the query, and the generator creates a response considering the new information.

The technique of retrieving relevant information from an external source, augmenting the LLM's input with that information, and letting the model generate an answer grounded in it is called Retrieval-Augmented Generation.

Benefits of RAG

  • Minimizing Hallucinations: RAG significantly reduces hallucinations in LLMs. Instead of "making up" information to fill gaps, models using RAG can refer to external sources for fact-checking. This is especially handy when accuracy is crucial. With access to additional context, LLMs can give more reliable answers. For example, if the model knows about a company's products, it'll use that info instead of guessing. This drastically lowers the chances of the model spouting incorrect data.

  • Enhanced Adaptability: RAG keeps models updated with new data. In fields that change quickly, being able to access the latest information is a huge plus. The Retriever part of RAG can pull data from outside sources, so the model isn’t stuck with just what it already knows. This could be anything from proprietary documents to internet resources. RAG helps models stay current and relevant.

  • Improved Verifiability: One of the coolest things about RAG is how it makes models' responses more verifiable. By using external sources, models can provide answers that you can check. This is crucial for internal quality control and sorting out disputes with clients. When a model cites its sources, it boosts the transparency and trustworthiness of its responses, letting users verify the information themselves.

Conclusion

The introduction of non-parametric memory has enabled LLMs to overcome limitations related to their internal knowledge. In theory, non-parametric memory can be expanded to any extent to store any data, whether it’s proprietary company documents or information from public sources on the internet. This opens new horizons for LLMs, making their knowledge virtually limitless. Of course, creating such non-parametric memory requires effort, but the results are worth it.

The introduction of RAG has unlocked new possibilities for LLMs, overcoming their limitations and enhancing the accuracy and reliability of their responses. In the next post, we will delve into designing RAG-enabled systems, exploring their components and architecture.

RAG represents a significant advancement in AI, bridging the gap between static knowledge and the dynamic world of information. This synergy not only improves the accuracy and relevance of generated responses but also opens new avenues for practical applications across various fields. As research and development in RAG continue to evolve, we can expect even more sophisticated and powerful AI systems.

In the next blog post, we'll dive deeper into the technical details of creating and optimizing RAG-based systems, exploring advanced techniques and best practices.
