Retrieval Augmented Generation (RAG)

Week 3: Advanced Techniques in Prompt Engineering

Understanding Retrieval Augmented Generation (RAG)

RAG is a powerful technique that combines the strengths of large language models with the ability to retrieve and use external information. This approach allows models to access up-to-date or domain-specific knowledge that may not be part of their initial training data, leading to more accurate and contextually relevant responses.

RAG enhances AI-generated content by incorporating relevant external information, improving accuracy and relevance in tasks requiring specific or current knowledge.

Key Concepts in RAG

  1. Embeddings: Dense vector representations of text that capture semantic meaning. In RAG, embeddings are used to represent both the query and the documents in the knowledge base.
  2. Vector Databases: Specialized databases designed to store and efficiently search through large collections of embeddings.
  3. Similarity Search: The process of finding documents in the knowledge base that are semantically similar to the query, typically using cosine similarity or Euclidean distance between embeddings (a short worked sketch follows this list).
  4. Prompt Engineering: The art of crafting effective prompts that combine the query, retrieved context, and instructions for the language model to generate accurate and relevant responses.
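
To make the similarity-search idea concrete, here is a minimal sketch that computes cosine similarity between two toy vectors with NumPy. The numbers are made up purely for illustration; in a real RAG system the vectors would come from an embedding model, as in the examples later in this lesson.

import numpy as np

# Toy 4-dimensional "embeddings" (real embeddings have hundreds or thousands of dimensions)
query_vec = np.array([0.2, 0.8, 0.1, 0.4])
doc_vec = np.array([0.25, 0.75, 0.05, 0.5])

# Cosine similarity = dot product divided by the product of the vector norms
cosine = np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
print(f"Cosine similarity: {cosine:.4f}")  # values close to 1 indicate semantically similar texts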

Components of a RAG System

  1. Retriever: Searches and retrieves relevant information from an external knowledge base.
  2. Generator: A language model that uses the retrieved information to generate responses.
  3. Knowledge Base: A collection of documents or data that the retriever searches through.
  4. Query Encoder: Converts input queries into embeddings suitable for information retrieval.
  5. Document Encoder: Converts documents in the knowledge base into embeddings for efficient storage and retrieval.

The RAG Pipeline

Here's how a typical RAG system works:

  1. Query Processing: The input query is encoded into an embedding using the query encoder.
  2. Retrieval: The query embedding is used to search the knowledge base for relevant documents. This is typically done using a vector database and similarity search.
  3. Context Preparation: The retrieved documents are processed and combined with the original query to create a prompt for the language model.
  4. Generation: The language model generates a response based on the provided prompt, which includes both the query and the retrieved context.
  5. Post-processing: The generated response may be further processed or filtered to ensure quality and relevance.

Simple RAG Example

Let's implement a basic RAG system using Python, OpenAI's GPT-4o-mini, and FAISS for vector search:

Example: Simple RAG System

We'll create a RAG system that answers questions based on a small corpus of text about AI concepts.


from openai import OpenAI
import numpy as np
import faiss

# Initialize OpenAI client
client = OpenAI()

# Sample corpus
corpus = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers.",
    "Natural Language Processing deals with the interaction between computers and human language.",
    "Reinforcement learning is learning what to do to maximize a reward.",
    "Computer vision is an AI field that trains computers to interpret visual data."
]

# Function to get embeddings
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Create embeddings for the corpus
embeddings = [get_embedding(text) for text in corpus]
index = faiss.IndexFlatL2(len(embeddings[0]))
index.add(np.array(embeddings).astype('float32'))

# Function to retrieve relevant text
def retrieve(query, k=1):
    query_embedding = get_embedding(query)
    _, indices = index.search(np.array([query_embedding]).astype('float32'), k)
    return [corpus[i] for i in indices[0]]

# Function to generate answer
def generate(query, context):
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the given context."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()

# RAG function
def rag(query):
    context = retrieve(query)[0]
    return generate(query, context)

# Example usage
query = "What is machine learning?"
answer = rag(query)
print(f"Question: {query}")
print(f"Retrieved Context: {retrieve(query)[0]}")
print(f"Answer: {answer}")

# Print sample embeddings
print("\nSample Embedding (first 10 dimensions):")
print(embeddings[0][:10])

Advanced RAG Example

Now, let's create a more advanced RAG system that chunks a larger corpus, retrieves multiple passages, and reranks the results:

Example: Advanced RAG System

This advanced RAG system adds features like chunking, multi-passage retrieval, and result reranking.


from openai import OpenAI
import numpy as np
import faiss
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('punkt')
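# Note: newer NLTK versions may also require nltk.download('punkt_tab') for sent_tokenize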
from nltk.tokenize import sent_tokenize

# Initialize OpenAI client
client = OpenAI()

# Larger sample corpus (you can expand this or load from a file)
corpus = [
    "Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that enable computer systems to improve their performance on a specific task through experience.",
    "Deep learning is a subfield of machine learning that uses artificial neural networks with multiple layers (hence 'deep') to progressively extract higher-level features from raw input.",
    "Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans using natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of human language in a valuable way.",
    "Reinforcement learning is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. It differs from supervised learning in that correct input/output pairs need not be presented.",
    "Computer vision is an interdisciplinary scientific field that deals with how computers can be made to gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do."
]

# Function to get embeddings
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Function to chunk text
def chunk_text(text, chunk_size=100):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_size = 0
    for sentence in sentences:
        if current_size + len(sentence.split()) > chunk_size:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_size = len(sentence.split())
        else:
            current_chunk.append(sentence)
            current_size += len(sentence.split())
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

# Create chunks and embeddings for the corpus
chunks = [chunk for text in corpus for chunk in chunk_text(text)]
embeddings = [get_embedding(chunk) for chunk in chunks]
index = faiss.IndexFlatL2(len(embeddings[0]))
index.add(np.array(embeddings).astype('float32'))

# TF-IDF vectorizer for reranking
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(chunks)

# Function to retrieve relevant text
def retrieve(query, k=3):
    query_embedding = get_embedding(query)
    _, indices = index.search(np.array([query_embedding]).astype('float32'), k)
    retrieved_chunks = [chunks[i] for i in indices[0]]
    
    # Rerank using TF-IDF and cosine similarity
    query_tfidf = tfidf_vectorizer.transform([query])
    similarities = cosine_similarity(query_tfidf, tfidf_matrix[indices[0]])
    reranked_indices = similarities.argsort()[0][::-1]
    
    return [retrieved_chunks[i] for i in reranked_indices]

# Function to generate answer
def generate(query, contexts):
    prompt = f"Contexts:\n"
    for i, context in enumerate(contexts, 1):
        prompt += f"{i}. {context}\n"
    prompt += f"\nQuestion: {query}\n\nAnswer:"
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the given contexts. Use the information from the contexts to provide accurate and detailed answers."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()

# RAG function
def rag(query):
    contexts = retrieve(query)
    return generate(query, contexts)

# Example usage
query = "How does deep learning relate to artificial neural networks?"
answer = rag(query)
print(f"Question: {query}")
print(f"Retrieved Contexts:")
for i, context in enumerate(retrieve(query), 1):
    print(f"{i}. {context}")
print(f"\nAnswer: {answer}")

# Print sample embeddings
print("\nSample Embedding (first 10 dimensions):")
print(embeddings[0][:10])

Key Concepts Explained

  • Embeddings: Dense vector representations that capture semantic meaning. In the examples, we use OpenAI's text-embedding-3-small model to create embeddings for both queries and documents.
  • FAISS: A library for efficient similarity search and clustering of dense vectors. We use it to create an index of our document embeddings and perform fast similarity searches (an alternative cosine-similarity setup is sketched after this list).
  • Chunking: The process of breaking down large documents into smaller, manageable pieces. This allows for more fine-grained retrieval and can improve the relevance of retrieved context.
  • Reranking: A second pass over retrieved documents to improve the relevance of results. In the advanced example, we use TF-IDF and cosine similarity for reranking.
  • Prompt Engineering: The craft of designing effective prompts for language models. In our examples, we structure the prompt to include the retrieved context and the user's query.
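
The examples in this lesson rank results with faiss.IndexFlatL2, i.e. Euclidean distance. If you would rather rank by cosine similarity, one common approach is to L2-normalize the vectors and use an inner-product index instead. A minimal sketch, assuming the embeddings list and get_embedding function from the simple example above are already defined:

import numpy as np
import faiss

# Normalize the document embeddings in place so that inner product equals cosine similarity
vectors = np.array(embeddings).astype('float32')
faiss.normalize_L2(vectors)

cosine_index = faiss.IndexFlatIP(vectors.shape[1])  # IP = inner product
cosine_index.add(vectors)

# The query embedding must be normalized the same way before searching
query_vec = np.array([get_embedding("What is machine learning?")]).astype('float32')
faiss.normalize_L2(query_vec)
scores, indices = cosine_index.search(query_vec, 2)  # higher score = more similar
print(scores, indices)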

Benefits of RAG

  • Up-to-date Information: Can incorporate recent information not present in the model's training data.
  • Reduced Hallucination: By grounding responses in retrieved facts, RAG can reduce false or made-up information (see the prompt sketch after this list).
  • Explainability: The retrieved passages can serve as evidence for the model's responses.
  • Customizability: The knowledge base can be tailored to specific domains or use cases.
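
One simple lever for the "Reduced Hallucination" point above is the prompt itself: explicitly instructing the model to answer only from the retrieved context and to say so when the context is insufficient. The wording below is an illustrative assumption, not a fixed recipe, and can be dropped into the generate functions shown earlier:

GROUNDED_SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer using ONLY the information in the provided context. "
    "If the context does not contain the answer, reply: 'I don't know based on the provided context.' "
    "Do not use outside knowledge or guess."
)

def build_grounded_prompt(query, contexts):
    # Number the contexts so the model (and a human reviewer) can see which passage supports the answer
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(contexts, 1))
    return f"Context:\n{numbered}\n\nQuestion: {query}\n\nAnswer:"

print(build_grounded_prompt("What is machine learning?",
                            ["Machine learning is a subset of artificial intelligence."]))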

RAG combines the strengths of retrieval-based and generation-based approaches, leading to more accurate, relevant, and controllable AI-generated content.

Practical Applications of RAG

  1. Question Answering Systems: Providing accurate answers based on large document collections.
  2. Chatbots and Virtual Assistants: Enhancing responses with real-time information.
  3. Content Generation: Creating articles or reports with up-to-date facts and figures.
  4. Research Tools: Assisting in literature review and information synthesis.

Challenges and Considerations

  • Retrieval Quality: The system's performance heavily depends on the retriever's ability to find relevant information (a quick way to inspect this is sketched after this list).
  • Knowledge Base Management: Keeping the knowledge base current and relevant can be challenging.
  • Computational Resources: RAG systems can be more resource-intensive than standard language models.
  • Integration Complexity: Combining retrieval and generation components effectively requires careful design and tuning.
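
A quick way to get a feel for retrieval quality is to look at the distances FAISS returns alongside the indices (the examples above discard them with _). A small sketch, assuming the index, chunks, and get_embedding from the advanced example are already defined:

import numpy as np

def retrieve_with_scores(query, k=3):
    # IndexFlatL2 returns squared L2 distances: smaller means closer; unusually large values hint at poor retrieval
    query_embedding = get_embedding(query)
    distances, indices = index.search(np.array([query_embedding]).astype('float32'), k)
    for dist, i in zip(distances[0], indices[0]):
        print(f"distance={dist:.4f}  chunk={chunks[i][:60]}...")

retrieve_with_scores("How does deep learning relate to neural networks?")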

Future Directions

The field of RAG is rapidly evolving. Some exciting areas of research and development include:

  • Multi-modal RAG: Incorporating image, video, and audio data into RAG systems.
  • Adaptive Retrieval: Dynamically adjusting retrieval strategies based on the query and context.
  • Personalized RAG: Tailoring retrieval and generation to individual user preferences and history.
  • Federated RAG: Enabling RAG across distributed and privacy-preserving knowledge bases.

Practice Exercises

Now it's your turn to apply what you've learned about Retrieval Augmented Generation (RAG). Try these exercises to reinforce your understanding and skills:

Exercise 1: Implement a Basic RAG System

Create a simple RAG system that answers questions about famous scientists using a small corpus of information.

import numpy as np
from openai import OpenAI
import faiss

# Initialize OpenAI client (make sure to set your API key)
client = OpenAI()

# Sample corpus about famous scientists
corpus = [
    "Albert Einstein developed the theory of relativity.",
    "Marie Curie conducted pioneering research on radioactivity.",
    "Charles Darwin proposed the theory of evolution by natural selection.",
    "Nikola Tesla was an inventor who contributed to the design of the modern alternating current electricity supply system.",
    "Stephen Hawking made groundbreaking contributions to the fields of cosmology, general relativity and quantum mechanics."
]

# TODO: Implement the following functions:
# 1. get_embedding(text): Get embeddings for a given text
# 2. create_index(corpus): Create a FAISS index from the corpus
# 3. retrieve(query, index, corpus, k=1): Retrieve relevant text from the corpus
# 4. generate(query, context): Generate an answer using the GPT model
# 5. rag(query, index, corpus): Implement the full RAG pipeline

# Test your implementation
query = "What did Albert Einstein work on?"
# TODO: Create the index and run your RAG function
# print the result

Hint: Make sure to implement all the functions (get_embedding, create_index, retrieve, generate, and rag). The rag function should use retrieve to get the context and then use generate to create the final answer. A sample solution follows.


import numpy as np
from openai import OpenAI
import faiss

client = OpenAI()

corpus = [
    "Albert Einstein developed the theory of relativity.",
    "Marie Curie conducted pioneering research on radioactivity.",
    "Charles Darwin proposed the theory of evolution by natural selection.",
    "Nikola Tesla was an inventor who contributed to the design of the modern alternating current electricity supply system.",
    "Stephen Hawking made groundbreaking contributions to the fields of cosmology, general relativity and quantum mechanics."
]

def get_embedding(text, model="text-embedding-3-small"):
    print(f"Getting embedding for: {text[:30]}...")
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def create_index(corpus):
    print("Creating FAISS index...")
    embeddings = [get_embedding(text) for text in corpus]
    index = faiss.IndexFlatL2(len(embeddings[0]))
    index.add(np.array(embeddings).astype('float32'))
    print("Index created successfully.")
    return index

def retrieve(query, index, corpus, k=1):
    print(f"Retrieving relevant text for query: '{query}'")
    query_embedding = get_embedding(query)
    _, indices = index.search(np.array([query_embedding]).astype('float32'), k)
    return [corpus[i] for i in indices[0]]

def generate(query, context):
    print("Generating answer...")
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the given context."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()

def rag(query, index, corpus):
    context = retrieve(query, index, corpus)[0]
    print(f"Retrieved context: {context}")
    return generate(query, context)

# Create index
index = create_index(corpus)

# Test the RAG system
query = "What did Albert Einstein work on?"
print(f"\nTesting RAG system with query: '{query}'")
result = rag(query, index, corpus)
print(f"\nFinal RAG Response: {result}")

# Try with different queries to test the system further
query = "Who studied radioactivity?"
result = rag(query, index, corpus)
print(f"\nQuery: {query}")
print(f"RAG Response: {result}")

Exercise 2: Implement RAG with Chunking

Enhance your RAG system by implementing text chunking to handle longer documents and improve retrieval accuracy.


import numpy as np
from openai import OpenAI
import faiss
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize

client = OpenAI()

# Larger corpus with longer texts
corpus = [
    "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics. His work is also known for its influence on the philosophy of science. He is best known to the general public for his mass–energy equivalence formula E = mc2.",
    "Marie Skłodowska Curie was a Polish and naturalized-French physicist and chemist who conducted pioneering research on radioactivity. She was the first woman to win a Nobel Prize, the first person and the only woman to win the Nobel Prize twice, and the only person to win the Nobel Prize in two scientific fields.",
    "Charles Robert Darwin was an English naturalist, geologist and biologist, best known for his contributions to the science of evolution. He established that all species of life have descended over time from common ancestors and, in a joint publication with Alfred Russel Wallace, introduced his scientific theory that this branching pattern of evolution resulted from a process that he called natural selection.",
    "Nikola Tesla was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist best known for his contributions to the design of the modern alternating current (AC) electricity supply system. Tesla conducted a range of experiments with mechanical oscillators/generators, electrical discharge tubes, and early X-ray imaging.",
    "Stephen William Hawking was an English theoretical physicist, cosmologist, and author who was director of research at the Centre for Theoretical Cosmology at the University of Cambridge. Hawking was born on the 300th anniversary of Galileo's death and died on the 139th anniversary of Einstein's birth. His scientific works included a collaboration with Roger Penrose on gravitational singularity theorems in the framework of general relativity and the theoretical prediction that black holes emit radiation, often called Hawking radiation."
]

# TODO: Implement the following functions:
# 1. chunk_text(text, chunk_size=100): Split text into chunks
# 2. get_embedding(text): Get embeddings for a given text
# 3. create_index(chunks): Create a FAISS index from the chunks
# 4. retrieve(query, index, chunks, k=3): Retrieve relevant chunks
# 5. generate(query, context): Generate an answer using the GPT model
# 6. rag_with_chunking(query, index, chunks): Implement the full RAG pipeline with chunking

# Test the RAG system with chunking
query = "What were Einstein's main contributions to physics?"
# TODO: Create chunks, build the index, and run your RAG with chunking function
# print the result

Hint: Use sent_tokenize to split the text into sentences, then combine sentences into chunks. In the rag_with_chunking function, retrieve multiple chunks and combine them before generating the answer. A sample solution follows.


import numpy as np
from openai import OpenAI
import faiss
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize

client = OpenAI()

corpus = [
    "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics. His work is also known for its influence on the philosophy of science. He is best known to the general public for his mass–energy equivalence formula E = mc2.",
    "Marie Skłodowska Curie was a Polish and naturalized-French physicist and chemist who conducted pioneering research on radioactivity. She was the first woman to win a Nobel Prize, the first person and the only woman to win the Nobel Prize twice, and the only person to win the Nobel Prize in two scientific fields.",
    "Charles Robert Darwin was an English naturalist, geologist and biologist, best known for his contributions to the science of evolution. He established that all species of life have descended over time from common ancestors and, in a joint publication with Alfred Russel Wallace, introduced his scientific theory that this branching pattern of evolution resulted from a process that he called natural selection.",
    "Nikola Tesla was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist best known for his contributions to the design of the modern alternating current (AC) electricity supply system. Tesla conducted a range of experiments with mechanical oscillators/generators, electrical discharge tubes, and early X-ray imaging.",
    "Stephen William Hawking was an English theoretical physicist, cosmologist, and author who was director of research at the Centre for Theoretical Cosmology at the University of Cambridge. Hawking was born on the 300th anniversary of Galileo's death and died on the 139th anniversary of Einstein's birth. His scientific works included a collaboration with Roger Penrose on gravitational singularity theorems in the framework of general relativity and the theoretical prediction that black holes emit radiation, often called Hawking radiation."
]

def chunk_text(text, chunk_size=100):
    print(f"Chunking text: {text[:30]}...")
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_size = 0
    for sentence in sentences:
        if current_size + len(sentence.split()) > chunk_size:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_size = len(sentence.split())
        else:
            current_chunk.append(sentence)
            current_size += len(sentence.split())
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    print(f"Created {len(chunks)} chunks.")
    return chunks

def get_embedding(text, model="text-embedding-3-small"):
    print(f"Getting embedding for: {text[:30]}...")
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def create_index(chunks):
    print(f"Creating FAISS index for {len(chunks)} chunks...")
    embeddings = [get_embedding(chunk) for chunk in chunks]
    index = faiss.IndexFlatL2(len(embeddings[0]))
    index.add(np.array(embeddings).astype('float32'))
    print("Index created successfully.")
    return index

def retrieve(query, index, chunks, k=3):
    print(f"Retrieving relevant chunks for query: '{query}'")
    query_embedding = get_embedding(query)
    _, indices = index.search(np.array([query_embedding]).astype('float32'), k)
    return [chunks[i] for i in indices[0]]

def generate(query, context):
    print("Generating answer...")
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the given context."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()

def rag_with_chunking(query, index, chunks):
    print(f"\nProcessing query: '{query}'")
    retrieved_chunks = retrieve(query, index, chunks, k=3)
    print("Retrieved chunks:")
    for i, chunk in enumerate(retrieved_chunks, 1):
        print(f"{i}. {chunk[:50]}...")
    context = " ".join(retrieved_chunks)
    return generate(query, context)

# Create chunks and index
print("Chunking corpus and creating index...")
all_chunks = [chunk for text in corpus for chunk in chunk_text(text)]
index = create_index(all_chunks)

# Test the RAG system with chunking
query = "What were Einstein's main contributions to physics?"
print(f"\nTesting RAG system with query: '{query}'")
result = rag_with_chunking(query, index, all_chunks)
print(f"\nFinal RAG Response: {result}")

# Try with different queries to test the system further
query = "What did Marie Curie research?"
result = rag_with_chunking(query, index, all_chunks)
print(f"\nQuery: {query}")
print(f"RAG with Chunking Response: {result}")

Summary

Congratulations on completing these RAG exercises! Between the worked examples and the exercises, you have now built three different RAG systems, each with increasing complexity and capabilities:

  1. A basic RAG system that retrieves and generates responses based on a simple corpus.
  2. A multi-step RAG system that improves retrieval accuracy through reranking.
  3. A RAG system with text chunking that can handle longer documents and potentially improve retrieval relevance.

These exercises have given you hands-on experience with key concepts in RAG, including:

  • Creating and using embeddings for semantic search
  • Implementing retrieval mechanisms using FAISS
  • Generating responses using language models
  • Enhancing retrieval through multi-step processes and reranking
  • Handling longer documents through text chunking

As you continue to explore and work with RAG systems, consider experimenting with:

  • Different embedding models and their impact on retrieval accuracy (a small comparison sketch follows below)
  • Various chunking strategies and their effects on retrieval and generation
  • More advanced reranking techniques, such as using machine learning models
  • Implementing RAG for specific domains or use cases
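
For the first suggestion, here is a small comparison sketch: build one index per embedding model and compare what each retrieves for the same query. Both model names are real OpenAI embedding models, but the tiny corpus and query are illustrative assumptions:

import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()

corpus = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers.",
    "Reinforcement learning maximizes cumulative reward through trial and error."
]

def build_index(texts, model):
    # Embed every document with the given model and load the vectors into a FAISS index
    embs = [client.embeddings.create(input=[t], model=model).data[0].embedding for t in texts]
    index = faiss.IndexFlatL2(len(embs[0]))
    index.add(np.array(embs).astype('float32'))
    return index

def top_hit(query, index, texts, model):
    # Embed the query with the same model used for the index, then return the closest document
    q = client.embeddings.create(input=[query], model=model).data[0].embedding
    _, idx = index.search(np.array([q]).astype('float32'), 1)
    return texts[idx[0][0]]

query = "How do neural networks learn features?"
for model in ["text-embedding-3-small", "text-embedding-3-large"]:
    print(f"{model}: {top_hit(query, build_index(corpus, model), corpus, model)}")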