Building a Q&A Application with Retrieval Augmented Generation (RAG)

2024/11/25

Chatbots have come a long way, and one of the most powerful applications of Large Language Models (LLMs) is the Question and Answer (Q&A) chatbot: a bot that answers questions about a specific source of information, leveraging a technique known as Retrieval Augmented Generation (RAG).

What is Retrieval Augmented Generation (RAG)?

While general language models can perform common tasks like sentiment analysis and named entity recognition without additional background knowledge, more complex and knowledge-intensive tasks require a system that can access external knowledge sources. This approach leads to more factual and reliable answers, helping to mitigate the "hallucination" problem often associated with LLMs.

Researchers at Meta AI introduced a method called Retrieval Augmented Generation (RAG) to tackle these knowledge-intensive tasks. RAG combines an information retrieval component with a text generation model. It can be fine-tuned, and its internal knowledge can be updated efficiently without the need for retraining the entire model.

RAG takes an input, retrieves a set of relevant/supporting documents from a source (e.g., Wikipedia), and combines those documents with the original prompt as context before sending everything to the text generator, which produces the final output. This makes RAG adaptable to situations where facts change over time, which is particularly useful since the parametric knowledge of LLMs is static. RAG lets a language model access the latest information without retraining, generating reliable outputs grounded in the retrieved documents.
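In pseudocode, the flow looks roughly like the following. This is a schematic sketch only: the retrieve and generate helpers are hypothetical stand-ins for a real search component and a real LLM call, not a library API.

```python
# Schematic sketch of the RAG flow; `retrieve` and `generate` are
# hypothetical placeholders, not a real library API.
def retrieve(question: str) -> list[str]:
    # In a real system: query a search index or vector store.
    return ["<passage retrieved from, e.g., Wikipedia>"]

def generate(prompt: str) -> str:
    # In a real system: call an LLM with the augmented prompt.
    return "<answer grounded in the retrieved passages>"

def rag_answer(question: str) -> str:
    docs = retrieve(question)      # retrieve relevant/supporting documents
    context = "\n".join(docs)
    # Combine the retrieved documents with the original prompt...
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)        # ...and generate the final output
```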

RAG Architecture

A typical RAG application consists of two main components:

Indexing: This is the pipeline for ingesting data from a source and indexing it. It typically happens offline.

Retrieval and Generation: This is the actual RAG chain that takes user queries at runtime, retrieves relevant data from the index, and passes it to the model.

The most common complete sequence from raw data to answers is as follows:

Indexing

  1. Loading: First, we load our data. This is done using DocumentLoaders.
  2. Splitting: A text splitter breaks large Documents into smaller chunks. This helps both with indexing the data and with passing it to the model, since large chunks are harder to search over and won't fit within a model's finite context window.
  3. Storage: We need somewhere to store and index our splits so they can be searched over later. This is typically done using a VectorStore and an Embeddings model (see the sketch after this list).
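Concretely, the indexing pipeline might look like the sketch below. It assumes LangChain as the framework (DocumentLoaders, VectorStore, and Embeddings are its abstractions); the URL, chunk sizes, and the choice of FAISS with OpenAI embeddings are illustrative, not prescribed by this post.

```python
# Minimal indexing sketch using LangChain (assumed framework; package
# names reflect recent LangChain releases and may differ in yours).
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Loading: pull raw documents from a source (a web page here;
#    the URL is a hypothetical example).
loader = WebBaseLoader("https://example.com/some-article")
docs = loader.load()

# 2. Splitting: break large documents into overlapping chunks that fit
#    the model's context window and are easier to search over.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(docs)

# 3. Storage: embed each chunk and index it in a vector store for later search.
vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
```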

Retrieval and Generation

  1. Retrieval: Given user input, relevant splits are retrieved from storage using a Retriever.
  2. Generation: A ChatModel/LLM produces the answer using a prompt that includes both the question and the retrieved data (a sketch follows below).
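Continuing from the vector store built in the indexing sketch, retrieval and generation can be wired together as a small chain. Again this assumes LangChain; the prompt text, the k value, the model name, and the example query are illustrative.

```python
# Minimal retrieval-and-generation sketch, reusing `vectorstore` from the
# indexing example above (LangChain assumed; details are illustrative).
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# 1. Retrieval: expose the vector store as a retriever over the indexed splits.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")  # any chat model works here

def format_docs(docs):
    # Join the retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

# 2. Generation: stuff the retrieved chunks and the question into the
#    prompt and let the chat model produce the final answer.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = chain.invoke("What does the article say about X?")  # hypothetical query
```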