Shittu Olumide
2024-11-04 08:00:00
www.kdnuggets.com
Image by Editor | Midjourney
Large Language Models (LLMs) have shown tremendous potential to users and organizations alike; their broad capabilities and generative prowess have made them popular and widely adopted. However, LLMs have notable drawbacks: they often fail to respond to prompts in a context-aware manner, sounding generic and open-ended, and the information they return can be outdated. Retrieval Augmented Generation (RAG), when implemented correctly, addresses these challenges.
RAG (Retrieval-Augmented Generation) has recently become one of the most popular ways to utilize publicly available LLMs. RAG improves the quality of the responses LLMs generate, which is why many organizations have adopted it when integrating LLMs into their software systems.
There has been a rising need for professionals capable of building highly optimized RAG systems that meet organizational needs. According to Grand View Research, the RAG market size was estimated at USD 1,042.7 million in 2023 and is projected to grow at a CAGR of 44.7% from 2024 to 2030, driven by rapid advancements in Natural Language Processing (NLP) and the growing demand for intelligent AI systems.
The flip side of RAG implementation is RAG optimization: the process of improving a RAG system's performance by making information retrieval more accurate, which leads to better overall results. In the later part of this tutorial, you will learn several techniques for this.
Prerequisites
To fully understand this technical article, you should be familiar with LLMs and how they work. You should also be comfortable with Python programming, as the code snippets and implementations in this article are written in Python.
Understanding RAG and its Components
RAG optimizes the output generated by an LLM by referencing an external, authoritative knowledge base outside the model's training data. This knowledge base contains additional data specific to a particular organization or domain.
LLMs are typically trained on large volumes of data, which enables them to perform tasks such as language translation and generating answers to questions.
RAG leverages LLMs' generative capabilities to produce custom, institutional, and domain-specific responses. In effect, RAG adds extra functionality on top of publicly available LLMs, saving the enormous amount of time and money it would take to build a custom LLM from scratch for a single purpose, say, a chatbot for a business.
Let me walk you through a high-level workflow of a RAG system:
- A prompt comes in from the user via a front-end interface
- The RAG model then ensures that the right information is retrieved from the authoritative knowledge base based on the prompt received
- The retrieved information is then used by the LLM to generate a response, which is sent back to the client
So the prompt does not go straight to the LLM, as it would without RAG. Instead, the information that is semantically in sync with the prompt is first retrieved from the authoritative knowledge base, and the LLM's generative capabilities are then used to produce a response the user can see, understand, and appreciate.
RAG plus LLM equals magic.
Image by Andy Kelly
Applications of RAG
Due to its value and impact on the Natural Language Processing landscape, RAG has attracted widespread adoption and applicability in different sectors and use cases. Even non-technical people have started integrating RAG systems into their businesses for better productivity.
Some of RAG's applications range from content creation and summarization to conversational agents and chatbots. A functional RAG system is typically made up of three components:
- Retrieval Component
- Augmentation Component
- Generation Component
Retrieval Component
This component handles retrieving pertinent information from the external authoritative knowledge base. It ensures that the information or passage retrieved is the most closely related to the prompt given. Several mechanisms can be utilized, including keyword-based search, semantic similarity search, and a neural network-based retrieval approach.
Any of these can be implemented, depending on which best suits the project.
The code snippet below shows how retrieval is done in a RAG system from an external knowledge base.
import faiss  # Handles the similarity search
import numpy as np
from transformers import AutoTokenizer, AutoModel
import torch

# Load a pre-trained embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Function to encode text into embeddings
def embed_text(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        embeddings = model(**inputs).last_hidden_state.mean(dim=1)  # Mean pooling
    return embeddings.cpu().numpy()

# Sample document corpus, also known as the authoritative
# knowledge base; in this example it belongs to a bakery shop
documents = [
    "We are open for 6 days of the week, on Monday, Tuesday, Wednesday, Thursday, Friday, Saturday",
    "The RAG system uses a retrieval component to fetch information.",
    "We are located in Lagos, our address is 456 computer Lekki-Epe Express way.",
    "Our CEO is Mr Austin, his phone number is 09090909090",
]

# Create the FAISS index; the embedding dimension depends on the model used
dimension = 384  # all-MiniLM-L6-v2 produces 384-dimensional embeddings
index = faiss.IndexFlatL2(dimension)

# Create document embeddings and add them to the FAISS index
doc_embeddings = np.vstack([embed_text(doc) for doc in documents])
index.add(doc_embeddings)

# Query given by a user
query = "Where is the location of your business?"
query_embedding = embed_text(query)

# Retrieve the top 2 documents based on similarity
top_k = 2
_, indices = index.search(query_embedding, top_k)
retrieved_docs = [documents[idx] for idx in indices[0]]

print("Your Query:", query)
print("Retrieved Documents:", retrieved_docs)
The code snippet above gives you practical insight into the inner workings of the retrieval process in RAG.
Three major things happened:
- Embedding Creation: Both the documents in the authoritative knowledge base and the query passed to the system are converted into embeddings. Don't worry too much about the concept of 'embedding' for now; it is explained in full detail later in this article
- Indexing using FAISS: The document embeddings are stored in a FAISS index, which enables speedy similarity search
- Retrieval: The top-k documents most similar to the user's query are retrieved based on L2 (Euclidean) distance in the embedding space; a sketch showing how to rank by cosine similarity instead follows below
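As an aside (this is a common alternative, not part of the original snippet): if you want the retrieval step to rank by cosine similarity rather than L2 distance, you can L2-normalize the embeddings and use an inner-product index, since the inner product of unit-length vectors equals their cosine similarity. A minimal sketch reusing the embed_text helper, documents list, and dimension defined above:

import faiss
import numpy as np

# Normalize embeddings so that inner product == cosine similarity
cos_doc_embeddings = np.vstack([embed_text(doc) for doc in documents]).astype("float32")
faiss.normalize_L2(cos_doc_embeddings)          # In-place L2 normalization
cosine_index = faiss.IndexFlatIP(dimension)     # Inner-product (IP) index
cosine_index.add(cos_doc_embeddings)

cos_query = embed_text("Where is the location of your business?").astype("float32")
faiss.normalize_L2(cos_query)
scores, cos_indices = cosine_index.search(cos_query, 2)
print([documents[idx] for idx in cos_indices[0]])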
Augmentation Component
After retrieval completes successfully, the augmentation process combines the retrieved information with the user's prompt, adding contextual meaning so that the generation step can produce a fluent, grounded response.
Generation Component
The generation process produces natural language from the augmented information, allowing humans to make sense of the retrieved content. This is made possible by using pre-trained generative models such as GPT-4 or T5.
The code snippet below provides a complete RAG pipeline, showing the Retrieval, Augmentation, and Generation processes of a RAG system using PyTorch.
from sentence_transformers import SentenceTransformer
from transformers import T5ForConditionalGeneration, T5Tokenizer
import faiss
import torch
# Load Sentence Transformer model for embeddings (using PyTorch)
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample documents for retrieval
documents = [
"We are open for 6 days of the week, on Monday, Tuesday, Wednesday, Thursday, Friday, Saturday",
"The RAG system uses a retrieval component to fetch information.",
"We are located in Lagos, our address is 456 computer Lekki-Epe Express way.",
"Our CEO is Mr. Austin, his phone number is 09090909090"
]
# Embed the documents
doc_embeddings = embed_model.encode(documents)
# Use FAISS for fast similarity search
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)
# Load T5 model and tokenizer for the generation component
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
# Define a query
query = "How does a RAG system work in machine learning?"
# Retrieve top-k relevant documents
query_embedding = embed_model.encode([query])
top_k = 2
_, indices = index.search(query_embedding, top_k)
retrieved_docs = [documents[idx] for idx in indices[0]]
# Concatenate retrieved docs to augment the query
augmented_query = query + " " + " ".join(retrieved_docs)
print("Augmented Query:", augmented_query)
# Prepare input for T5 model
input_text = f"answer_question: {augmented_query}"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
# Generate answer using T5
with torch.no_grad():
    output = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
answer = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Answer:", answer)
What are RAG Embeddings?
Embeddings in RAG are dense vector representations of text. Unlike one-hot encoding, which represents words as sparse, high-dimensional vectors, embeddings compress this information into low-dimensional, continuous vectors that capture the semantic relationships between words, helping the model understand context.
So, in short, embedding converts text into low-dimensional vector representations that capture semantic relationships.
What are you embedding in a RAG system? You are embedding the prompt passed by the user and the custom documents/authoritative domain-specific knowledge to be retrieved. This is done so that information retrieval is semantically coherent with the prompts passed.
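To make the idea concrete, here is a small illustrative sketch (not from the original article) using the sentence-transformers library: a prompt about opening days scores much higher against a sentence about opening days than against an unrelated sentence.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "What days are you open?",                 # user prompt
    "We are open Monday through Saturday.",    # semantically related document
    "Our CEO is Mr Austin.",                   # unrelated document
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity of the prompt against each document
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # high score
print(util.cos_sim(embeddings[0], embeddings[2]).item())  # low score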
The next step when developing a RAG system, once the LLM has been selected (say GPT-4), is to choose a retrieval model. Some popular models are DPR (Dense Passage Retriever), Sentence-BERT, and RoBERTa; these models handle the embeddings for you. After that, your custom documents are processed and integrated for retrieval.
So you send a prompt. The retrieval model embeds the prompt, capturing its context and semantics, retrieves the most closely related information from the embedded database, and passes it to the LLM, which uses its generative prowess to produce text aligned with the retrieved data.
The need to optimize embeddings in RAG
In a RAG system, embedding optimization plays a crucial role in ensuring the quality and relevance of the information retrieved from the knowledge source or authoritative base. As explained earlier, the prompts passed to the model are transformed into embeddings, and these embeddings capture the semantic meaning of the user's prompts before retrieval from the authoritative knowledge base is performed.
If the embeddings are properly optimized, they can boost the overall performance of the model by retrieving information that aligns closely with the user's prompts. That is why embedding optimization is vital to a RAG system.
Also, depending on the implementation of the RAG system, pre-trained embedding models are utilized. Some of the popularly used embedding models are:
- DPR (Dense Passage Retriever)
- Sentence-BERT
- RoBERTa
- intfloat/e5-large-v2
More models can be found here.
These pre-trained embedding models can handle the embedding for you (they convert your prompts into embeddings, i.e., numeric representations), but this comes with a trade-off: since they are trained on large datasets of generic data, they may not fully capture custom or domain-specific language. That is why you may need to fine-tune or optimize your embedding models.
Techniques for embedding tuning in RAG
There are various approaches to embedding tuning; below are some of the most popular techniques:
1. By Adapting to the Domain
Embeddings tuned specifically for a certain field or topic can make all the difference. For instance, training embeddings on relevant data can make the RAG model much more precise in areas like law or healthcare, where the language has unique terms and nuances. This way, when users ask questions, they get answers that resonate with the context.
2. Use Contrastive Learning
Consider contrastive learning as helping the model “hone in” on what’s similar and what’s not. By teaching the model to group related queries and answers closer together in understanding (and keep unrelated ones further apart), you’re making it easier for the model to return results that make sense for the question asked.
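As a concrete illustration (a sketch under the assumption that you have collected a handful of in-domain query-passage pairs; the pairs below are placeholders), the sentence-transformers library offers MultipleNegativesRankingLoss, a contrastive loss that pulls each query toward its paired passage and pushes it away from the other passages in the batch. Trained on domain-specific pairs, the same recipe also implements the domain adaptation described in technique 1.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical (query, relevant passage) pairs; other passages in the batch act as negatives
train_examples = [
    InputExample(texts=["What days are you open?",
                        "We are open Monday through Saturday."]),
    InputExample(texts=["Where is your business located?",
                        "We are located in Lagos, 456 Lekki-Epe Express way."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune the embedding model on the in-domain pairs
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)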
3. Add Signals from Real Data
Adding in some supervised data (like user feedback or tagged examples) can be powerful for getting the embeddings even closer to what people expect. This helps steer the model toward the patterns that matter, like recognizing which responses tend to hit the mark and which ones don’t. The more the model learns from real user interactions, the smarter it gets at delivering useful responses.
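One way to fold that feedback into embedding tuning (a sketch assuming the feedback has already been converted into graded similarity labels between 0 and 1; the pairs below are made up) is CosineSimilarityLoss from sentence-transformers, which trains the model so that the cosine similarity of a pair's embeddings matches its label:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Feedback-derived pairs: 1.0 = users confirmed the passage answered the query
labeled_examples = [
    InputExample(texts=["Where is your business located?",
                        "We are located in Lagos, 456 Lekki-Epe Express way."], label=1.0),
    InputExample(texts=["Where is your business located?",
                        "Our CEO is Mr Austin."], label=0.1),
]
dataloader = DataLoader(labeled_examples, shuffle=True, batch_size=2)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(dataloader, loss)], epochs=1, warmup_steps=10)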
4. Self-Supervised Learning
Self-supervised learning is a great option for situations when there is little labeled data to work with. This method finds patterns within the data itself, which helps build a foundation for the model without requiring as much manual tagging. It’s ideal for general-use RAG systems that need to stay flexible.
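One well-known self-supervised recipe for embedding models is TSDAE (Transformer-based Sequential Denoising Auto-Encoder), supported by sentence-transformers. The sketch below follows the library's documented pattern: the encoder is trained to help reconstruct original sentences from noised (word-deleted) versions of unlabeled, in-domain text. The sentences are placeholders, and the dataset's noise function requires nltk.

from sentence_transformers import SentenceTransformer, models, losses, datasets
from torch.utils.data import DataLoader

# Build a fresh bi-encoder from a plain transformer with CLS pooling
model_name = "bert-base-uncased"
word_embedding_model = models.Transformer(model_name)
pooling = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling])

# Unlabeled, in-domain sentences; DenoisingAutoEncoderDataset adds deletion noise
train_sentences = [
    "We are open Monday through Saturday.",
    "Our bakery is located at 456 Lekki-Epe Express way, Lagos.",
]
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)

# TSDAE objective: reconstruct the original sentence from the noisy one
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=model_name, tie_encoder_decoder=True
)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, weight_decay=0, scheduler="constantlr")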
5. Combine Embeddings for Richer Responses
Sometimes, blending multiple embeddings works wonders. For example, combining general-purpose embeddings with those fine-tuned for a specific field can create a well-rounded model that understands general and niche questions. This approach is especially helpful if you’re dealing with a wide range of topics.
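A simple, illustrative way to blend embeddings (the domain model path below is a placeholder for whatever fine-tuned checkpoint you have) is to encode each text with both a general-purpose model and a domain-tuned model and concatenate the vectors before indexing:

import numpy as np
from sentence_transformers import SentenceTransformer

general_model = SentenceTransformer("all-MiniLM-L6-v2")
domain_model = SentenceTransformer("path/to/your-domain-tuned-model")  # placeholder path

def combined_embedding(texts):
    # Concatenate general and domain-specific embeddings along the feature axis
    general = general_model.encode(texts)
    domain = domain_model.encode(texts)
    return np.concatenate([general, domain], axis=1)

doc_vectors = combined_embedding(["We are open Monday through Saturday."])
print(doc_vectors.shape)  # (1, general_dim + domain_dim)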
6. Keep Embeddings Balanced
Regularization techniques like dropout or triplet loss help the model avoid getting “stuck” on certain words or ideas, keeping its understanding broad enough to handle different queries. This ensures that the model doesn’t get too narrow in its responses, which helps it stay versatile for new or unexpected questions.
7. Challenge the Model with Hard Negatives
Hard negatives are just close enough to be tricky but still incorrect. Adding these in training encourages the model to refine its understanding, especially when dealing with subtle differences. It’s like giving it the mental reps it needs to get better at spotting the right answer in a sea of almost-right options.
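In practice, hard negatives are usually supplied as explicit (anchor, positive, negative) triplets. Here is a sketch using sentence-transformers' TripletLoss with made-up examples, where the negative is topically close to the query but does not answer it:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# (anchor, positive, hard negative): the negative mentions "open" but is not the answer
triplets = [
    InputExample(texts=[
        "What days are you open?",
        "We are open Monday through Saturday.",
        "We opened our first branch in 2015.",
    ]),
]
dataloader = DataLoader(triplets, shuffle=True, batch_size=1)
loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(dataloader, loss)], epochs=1, warmup_steps=10)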
8. Use Feedback Loops for Continuous Improvement
With active learning, you can set up a feedback loop where uncertain or challenging answers are flagged for human review. These reviews feed back into the model to keep refining its accuracy over time, which is great for fields that are always evolving or have many complex nuances.
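A minimal illustration of the flagging half of such a loop (flag_for_review is a hypothetical helper, not a standard API): retrievals whose best similarity score falls below a threshold are queued for human review, and the reviewed pairs can later be fed back into fine-tuning as labeled data.

review_queue = []

def flag_for_review(query, retrieved_docs, scores, threshold=0.5):
    # Queue low-confidence retrievals so humans can label them later
    if max(scores) < threshold:
        review_queue.append({"query": query, "candidates": retrieved_docs})
        return True
    return False

# Hypothetical usage with similarity scores produced by the retriever
flag_for_review("Do you deliver on Sundays?",
                ["We are open Monday through Saturday."], [0.32])
print(review_queue)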
9. Go Deeper with Cross-Encoder Tuning
For more nuanced queries—especially ones that require a close match between question and answer—a cross-encoder approach can help. Cross-encoders evaluate query and document pairs directly, so the model “reads” them together rather than treating them as separate entities. This often leads to a deeper understanding in fields where exact matching is key.
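A common pattern is to keep the bi-encoder for fast first-stage retrieval and rerank its candidates with a cross-encoder. Below is a minimal sketch using a publicly available MS MARCO cross-encoder checkpoint to rerank the candidates retrieved earlier:

from sentence_transformers import CrossEncoder

# The cross-encoder reads each (query, document) pair jointly and scores it
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Where is the location of your business?"
candidates = [
    "We are located in Lagos, our address is 456 computer Lekki-Epe Express way.",
    "The RAG system uses a retrieval component to fetch information.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # Most relevant candidate after reranking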
Fine-tuning embeddings this way lets RAG models deliver responses that feel more natural and on-point. In short, it’s about making AI a better listener and responder that can meet users with answers that hit home.
Methods for evaluating embedding quality in RAG
Evaluating the quality of a RAG system's embeddings, whether optimized or not, is crucial: it indicates whether the system can retrieve relevant and contextually correct data.
Listed below are methods used to evaluate embedding quality in RAG:
- Cosine Similarity and Nearest Neighbor Evaluation: This approach calculates the cosine similarity between query embeddings and their relevant documents
- Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP): In this method, the documents retrieved for a query are ranked by relevance, and MRR or MAP scores are calculated (a small MRR sketch follows after this list)
- Embedding Clustering and Visualization: This involves using techniques like t-SNE or UMAP to project embeddings in a 2D or 3D space for visualizing the similarities of how queries and documents are clustered together
- Human Judgment and Feedback Loops: This involves using humans to evaluate the quality of the retrieved information based on prompts and give feedback for possible improvements
- Domain-Specific Evaluation Metrics: This approach ensures that the embeddings perform effectively for the nuances of a particular domain, as this can negatively affect the performance of RAG systems applied in such specialized disciplines
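To illustrate just one of these metrics, the sketch below computes MRR over a tiny, hypothetical evaluation set in which each query has one known relevant document and the retriever returns a ranked list of document IDs:

def mean_reciprocal_rank(ranked_results, relevant_doc_ids):
    # ranked_results[i]: ranked list of doc IDs returned for query i
    # relevant_doc_ids[i]: ID of the single relevant document for query i
    reciprocal_ranks = []
    for ranking, relevant_id in zip(ranked_results, relevant_doc_ids):
        if relevant_id in ranking:
            reciprocal_ranks.append(1.0 / (ranking.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Hypothetical example: relevant doc found at rank 1 for query 1 and rank 3 for query 2
print(mean_reciprocal_rank([[2, 0, 1], [0, 1, 3]], [2, 3]))  # (1/1 + 1/3) / 2 ≈ 0.67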
Challenges in Embedding Tuning for RAG
Although embedding tuning can have a huge impact on the performance of RAG systems, it can sometimes be very challenging to implement. It is not straightforward or direct, and it can sometimes require iterations until the desired performance is attained.
Some of the challenges include:
- Cost: Computational cost for training and fine-tuning embeddings, especially when dealing with large datasets
- Overfitting: The model might become too tied to the training data; when a prompt differs from the exact training examples, it may fail to retrieve the right information
- Difficulty Getting High-Quality Data: Models depend heavily on their training data; if sufficient high-quality, accurate data about a particular domain or niche is not used, the model is likely to be biased and under-performant
- Managing Changes in Domain Trends: Most domains are dynamic, with frequent updates and advancements, so the models must be retrained regularly to avoid becoming outdated, which is not easy to keep up with
Conclusion
RAG optimization is crucial when developing a system that requires high accuracy since the embedding models used for developing RAG systems are mostly for generic applications. Embedding tuning is necessary to improve the retrieval accuracy of the RAG system for better performance.
After implementing any of the tuning techniques above in your RAG model, the right thing to do is test its performance to see how well it responds to different prompts. If it performs excellently and meets your expectations and requirements, well done; you did a good job developing the RAG system. If it does not give the desired responses for certain prompts, don't worry too much; you can still improve the model's performance through further optimization and fine-tuning. Thanks for reading.
Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.