Retrieval-augmented generation
If you plan to use an LLM for GenAI tasks inside your organization, there is a good chance you will adopt a technique called retrieval-augmented generation (RAG). Organizations use RAG to improve the accuracy and control of results, reduce hallucinations, and let the dataset the LLM works with be updated regularly without enormous cost.
How does RAG work? In a nutshell, a large AI model like GPT-4 has been trained on a massive collection of text, reportedly around 13 trillion tokens. That means it knows a lot! But it also means it can sometimes get things wrong. There is a lot of great text on the internet explaining why the world is round, but there is also a fair amount arguing that it’s flat. If you use GenAI with a big, messy dataset, it can provide answers that seem plausible to the system but aren’t true in real life. To counteract this problem, you narrow the dataset that the LLM queries when it looks for information to form its answer.
Here at Stack Overflow, for example, we can narrow the dataset down to the questions and answers our community has asked, answered, and vetted on our own platform.
Because the system draws on a relatively small and constrained dataset, it can also provide annotations, allowing users to check the source material. This helps users verify the accuracy and freshness of the answer and gives them the opportunity to dive deeper if they so choose.
Hidden, system-level prompts also help guide this process. When a user asks a question, these prompts might instruct the LLM to answer only from the retrieved documents, to cite its sources, and to admit when the documents don’t contain an answer.
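As a rough illustration, the hidden prompt might be assembled like the sketch below. This is a minimal example, not our production prompt; the template wording and function names are hypothetical:

```python
# A hypothetical hidden system prompt for a RAG chatbot. The retrieved
# passages are stitched into the prompt before it is sent to the LLM.
SYSTEM_PROMPT_TEMPLATE = """You are a helpful assistant.
Answer ONLY from the context below, and cite the source of each claim.
If the context does not contain the answer, say you don't know.

Context:
{context}
"""

def build_messages(question: str, retrieved_passages: list[str]) -> list[dict]:
    """Combine the hidden guidance, retrieved context, and the user's question."""
    context = "\n\n".join(retrieved_passages)
    return [
        {"role": "system", "content": SYSTEM_PROMPT_TEMPLATE.format(context=context)},
        {"role": "user", "content": question},
    ]
```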
To get RAG up and running, you’ll need to cover a few basics first.
Pick your LLM
You can work with large foundation models from providers like OpenAI, Google, Amazon, or IBM through an API. If you prefer, you can also work with open source models, like Meta’s Llama, and run them on-prem. You can also build and train your own model from scratch, although this can be quite costly and time-consuming.
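Calling a hosted model through an API can be as simple as the sketch below, which assumes the OpenAI Python client (v1+) and an OPENAI_API_KEY in your environment; the model name and prompt are placeholders:

```python
# A minimal sketch of calling a hosted foundation model through an API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: use whichever hosted model you pick
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
)
print(response.choices[0].message.content)
```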
Decide on your dataset
This could be information from your organization’s wikis, code from your company’s private repos, or documentation that provides the answers and scripts a customer service chatbot needs to assist your clients with tech support. The cleaner this data is, the better. In AI, the old mantra of garbage in, garbage out definitely holds true. Look for data that you know is accurate, doesn’t contradict itself, and has metadata like tags, votes, or labels that help the system understand it.
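As a sketch of what a clean, well-labeled record might look like, each document could keep its quality signals attached as metadata. The field names and values here are illustrative, not a required schema:

```python
# An illustrative record format for a RAG dataset: keep quality signals
# (tags, votes, acceptance) attached to each document.
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str
    tags: list[str] = field(default_factory=list)
    votes: int = 0          # community score, a rough proxy for accuracy
    accepted: bool = False  # e.g., an accepted answer on a Q&A site

docs = [
    Document(
        doc_id="kb-1042",  # hypothetical ID
        text="To reset your password, open Settings and choose Security...",
        tags=["account", "password"],
        votes=152,
        accepted=True,
    ),
]
```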
Decide how you’ll chunk that data
You’ll first need to split (chunk) your documents into passages of manageable size, then use an embedding model to vectorize them before the LLM can work with them. Like foundation models, there are lots of options: big tech companies, smaller startups, open source solutions, or in-house builds. The embedding process will take each chunk of text and convert it to a series of numbers stored in a vector database. To keep things simple for now, all you need to know is that this step is required if you want to use a conversational interface and get useful replies in natural language.
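A minimal sketch of both steps, assuming the open source sentence-transformers library (any embedding model or hosted embedding API would slot in the same way):

```python
# Chunk documents into fixed-size pieces, then embed each chunk.
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-width chunking with a little overlap between pieces."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small open source embedding model

chunks = chunk("...your wiki page, repo, or support doc text here...")
vectors = model.encode(chunks)  # one vector per chunk
print(vectors.shape)            # (number_of_chunks, embedding_dimension)
```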
Spin up a vector database
There are options offered by startups focused primarily on this space, like Weaviate or Pinecone, as well as offerings from big providers like MongoDB or AWS. Based on your embeddings, semantically similar text from your dataset will be clustered together in vector space, so the system can find the data points most relevant to any given query.
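Whichever database you choose, the core operation is the same: find the stored vectors closest to a query vector. Here is a toy version with numpy to show the idea; real vector databases do this at scale with approximate nearest-neighbor indexes:

```python
# A toy nearest-neighbor search illustrating what a vector database does:
# rank stored chunk vectors by cosine similarity to a query vector.
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k vectors most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)                       # normalize query
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # normalize docs
    return np.argsort(d @ q)[::-1][:k]  # dot product of unit vectors = cosine similarity

# Usage with the chunks and vectors from the previous step:
# query_vec = model.encode(["How do I reset my password?"])[0]
# relevant = [chunks[i] for i in top_k(query_vec, vectors)]
```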
Build out a RAG system
We have more detail on how to do that in this article on RAG.
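Tying the earlier sketches together, a bare-bones RAG loop looks something like this. It reuses the hypothetical build_messages, top_k, model, chunks, vectors, and client from the steps above; a real system would add reranking, caching, and guardrails:

```python
# A bare-bones end-to-end RAG loop: embed the question, retrieve the
# most relevant chunks, assemble the hidden prompt, and ask the LLM.
def answer(question: str) -> str:
    query_vec = model.encode([question])[0]        # embed the user's question
    best = top_k(query_vec, vectors)               # nearest chunk indices
    passages = [chunks[i] for i in best]           # retrieved context
    messages = build_messages(question, passages)  # hidden prompt + question
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

print(answer("How do I configure single sign-on?"))
```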
Benchmark your performance
Human evaluation by your own employees is certainly a good benchmark to use before pushing a RAG system from a small private alpha to a wider release. There are also organizations developing more detailed benchmarks that you can use to fine-tune your system; one example comes from the folks at Weaviate, the company we use to power our embedding and vector database.
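Before reaching for a full benchmark suite, one simple automated check you can run yourself is retrieval hit rate: for a set of test questions with known-correct sources, measure how often the right chunk shows up in the top-k results. A sketch with hypothetical test data, reusing the helpers from the earlier steps:

```python
# A simple retrieval benchmark: for each test question, check whether the
# chunk known to contain the answer appears in the top-k retrieved results.
test_cases = [
    # (question, index of the chunk that should be retrieved) -- hypothetical data
    ("How do I reset my password?", 4),
    ("What is our refund policy?", 11),
]

hits = 0
for question, expected_idx in test_cases:
    query_vec = model.encode([question])[0]
    if expected_idx in top_k(query_vec, vectors, k=3):
        hits += 1

print(f"Hit rate @3: {hits / len(test_cases):.0%}")
```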