4. Key tools, technologies, and terms

Retrieval-augmented generation

If you plan to use an LLM for GenAI tasks inside your organization, there is a good chance you will decide to utilize a technique called retrieval-augmented generation (RAG). Organizations use RAG to improve accuracy and control of results, reduce hallucinations, and allow the dataset the LLM is working with to be updated regularly without enormous cost.

How does RAG work? In a nutshell, a large AI model like GPT-4 has been trained on a massive collection of text, reportedly around 13 trillion tokens. That means it knows a lot! But it also means it can sometimes get things wrong. There is a lot of great text on the internet explaining why the world is round, but there is also a fair amount arguing that it’s flat. If you use GenAI with a big, messy dataset, it can provide answers that seem plausible to the system but aren’t true in real life. To counteract this problem, you narrow the dataset that the LLM queries when it looks for information to form its answer.

Here at Stack Overflow, for example, we can narrow the dataset down as follows:

1. A user asks a question.
2. The system retrieves relevant data, looking only at questions on Stack Overflow that have an accepted answer, and passes it to the LLM.
3. The LLM then generates an answer based on that data and provides it back to the user. This answer is a synthesis of what it has just read, and is much shorter than all the text it reviewed when seeking an answer.

Because it looked at a relatively small and constrained dataset, it can also provide annotations, allowing users to check the source material. This helps users verify the accuracy and freshness of the answer, as well as providing them the opportunity to dive deeper if they so choose.
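To make that flow concrete, here is a minimal sketch in Python. The embed_text and call_llm functions are hypothetical placeholders standing in for whichever embedding model and LLM you choose; the point is only the shape of the retrieve-then-generate step.

from typing import Dict, List

def embed_text(text: str) -> List[float]:
    """Placeholder: return a vector for this text using your embedding model."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM and return its reply."""
    raise NotImplementedError

def cosine(a: List[float], b: List[float]) -> float:
    """Similarity between two vectors, used to rank retrieved answers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def answer_question(question: str, accepted_answers: List[Dict]) -> Dict:
    # 1. Embed the user's question.
    q_vec = embed_text(question)
    # 2. Retrieve the most relevant accepted answers (the narrowed dataset).
    top = sorted(accepted_answers,
                 key=lambda d: cosine(q_vec, d["vector"]),
                 reverse=True)[:3]
    context = "\n\n".join(d["text"] for d in top)
    # 3. Ask the LLM to synthesize a short answer from that context only.
    prompt = (f"Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return {"answer": call_llm(prompt), "sources": [d["url"] for d in top]}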

There are hidden, system-level prompts that help guide this process. When a user asks a question, there might be a set of hidden prompts that help guide the LLM as follows:

Prompt 1: Take the query and use your large foundation model to process it, tokenize it, and be sure you understand it.
Prompt 2: If the query is understood, consult our chosen dataset of Stack Overflow answers.
Prompt 3: If you don’t find sufficient data for an answer, alert the user that you don’t have a viable response.
Prompt 4: If you find sufficient data to produce an answer, create a short synthesis that provides users with a helpful reply in 200-300 words. Also provide links to the data that supports your answer.
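As a purely illustrative example, guidance like this might be folded into a system prompt and message payload along the following lines. The wording here is invented for the sketch; it is not Stack Overflow’s actual hidden prompts.

SYSTEM_PROMPT = """You answer questions using only the provided context,
which comes from Stack Overflow questions with accepted answers.
If the context does not contain enough information, say that you do not
have a viable response. Otherwise, reply in 200-300 words and list the
URLs of the sources that support your answer."""

def build_messages(question: str, context: str, source_urls: list) -> list:
    """Assemble the hidden system guidance plus the user's question and retrieved context."""
    sources = "\n".join(source_urls)
    user_content = (f"Context:\n{context}\n\nSources:\n{sources}\n\n"
                    f"Question: {question}")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]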

To get RAG up and running, you’ll need to cover a few basics first.

Pick your LLM

You can work with large foundation models from providers like OpenAI, Google, Amazon, or IBM through an API. If you prefer, you can also work with open source models, like Meta’s Llama models, and run them on-prem. You can also build and train your own model from scratch, although this can be quite costly and time-consuming.
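As a rough sketch of the API route, here is what a call to a hosted model can look like using the OpenAI Python SDK. The model name is just an example; other providers, or a self-hosted open source model, would be called through their own clients.

# Minimal sketch of calling a hosted foundation model through an API.
# Assumes the openai package is installed and OPENAI_API_KEY is set.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # example model name; use whatever your provider offers
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain retrieval-augmented generation in two sentences."},
    ],
)

print(response.choices[0].message.content)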

Decide on your dataset

This could be information from your organization’s wikis, code from your company’s private repos, or documentation that provides the answers and scripts a customer service chatbot needs to assist your clients with tech support. The cleaner this data is, the better. In AI, the old mantra of garbage in, garbage out definitely holds true. Look for data that you know is accurate, doesn’t contradict itself, and has metadata like tags, votes, or labels that help the system understand it.
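For illustration, a cleaned-up record in such a dataset might look something like the sketch below. The field names, URL, and values are invented for the example; the important part is keeping trustworthy text together with the metadata you will later filter and cite on.

# Illustrative record structure; all fields and values are examples.
records = [
    {
        "id": "q-12345",
        "text": "To reverse a list in Python, use list.reverse() or slicing: my_list[::-1].",
        "source_url": "https://example.internal/wiki/python-lists",  # example URL
        "tags": ["python", "lists"],
        "votes": 42,
        "has_accepted_answer": True,
    },
]

# Keep only records you trust: accepted answers with a minimum vote count.
clean = [r for r in records if r["has_accepted_answer"] and r["votes"] >= 5]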

Decide how you’ll chunk that data

You’ll need to use an embedding model to vectorize the data before the LLM can work with it. As with foundation models, there are lots of options: big tech companies, smaller startups, open source solutions, or in-house builds. The embedding process will take your text and convert it to a series of numbers stored in a vector database. To keep things simple for now, all you need to know is that this process is required if you want to use a conversational interface and get useful replies in natural language.
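Here is a rough sketch of that step, assuming simple character-based chunks and OpenAI’s embeddings endpoint. The chunk size, overlap, and model name are all illustrative choices; many teams chunk by tokens, sentences, or document structure instead.

# Sketch of chunking and embedding. Assumes the openai package is installed
# and OPENAI_API_KEY is set; swap in your own embedding provider as needed.

from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list:
    """Split text into overlapping character-based chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed_chunks(chunks: list) -> list:
    """Return one embedding vector (a list of floats) per chunk."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in resp.data]

chunks = chunk_text("...your document text here...")
vectors = embed_chunks(chunks)  # these vectors go into your vector database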

Spin up a vector database

There are options offered by startups focused primarily on this space, like Weaviate or Pinecone, as well as offerings from big providers like MongoDB or AWS. Based on your embeddings, the text from your dataset will be clustered so that semantically similar data points sit close together, which lets the system pull out the context most relevant to a query and hand it to the LLM.
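A real vector database handles indexing, persistence, and scale for you, but the core idea can be sketched in a few lines of NumPy: store each chunk’s vector, then return the chunks whose embeddings are closest to the query embedding.

# Tiny in-memory stand-in for a vector database, for illustration only.

import numpy as np

class TinyVectorStore:
    def __init__(self):
        self.vectors = []   # one embedding (np.ndarray) per stored chunk
        self.payloads = []  # the chunk text (and any metadata) for each vector

    def add(self, vector, payload):
        self.vectors.append(np.asarray(vector, dtype=float))
        self.payloads.append(payload)

    def search(self, query_vector, k=3):
        q = np.asarray(query_vector, dtype=float)
        matrix = np.vstack(self.vectors)
        # Cosine similarity between the query and every stored vector.
        sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
        top = np.argsort(sims)[::-1][:k]
        return [(self.payloads[i], float(sims[i])) for i in top]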

Build out a RAG system

We have more detail on how to do that in this article on RAG.

Benchmark your performance

Human evaluation from your own employees is certainly a good benchmark to use before pushing a RAG system from a small private alpha to a wider release. There are also organizations developing more detailed benchmarks that you can utilize to fine-tune your system. One example comes from the folks at Weaviate, the company that we use to power our embedding and vector database.
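As a starting point, here is a small, illustrative sketch of one metric you could track yourself: retrieval hit rate over a test set of questions whose correct source document is known. The retrieve function is a placeholder for your own retrieval step, and the test questions and document IDs are invented for the example.

def retrieve(question: str, k: int = 3) -> list:
    """Placeholder: return the IDs of the top-k documents your system retrieves."""
    raise NotImplementedError

def hit_rate(test_set: list) -> float:
    """Fraction of test questions whose known source appears in the retrieved documents."""
    hits = sum(1 for case in test_set
               if case["expected_doc"] in retrieve(case["question"]))
    return hits / len(test_set)

# Example test set; questions and document IDs are invented for illustration.
test_set = [
    {"question": "How do I rotate an API key?", "expected_doc": "wiki-security-007"},
    {"question": "What is our on-call escalation policy?", "expected_doc": "wiki-ops-112"},
]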
