4. Key tools, technologies, and terms

A practical RAG example

We’ve heard from many people, including our podcast guests and clients, that they’re looking for ways to overcome the limitations of LLMs: hallucinations, outdated data, and answers without sources. For most folks, that way is retrieval-augmented generation (RAG), and below is specific guidance on how to implement a RAG system. This is based on a recommendation from Cameron Wolfe, a PhD in Deep Learning and Director of AI at Rebuy Engine.

For a RAG system, you’ll need to link a vector database to a text/object store so that each vector representation maps back to the text you’d like to retrieve. You should also consider using a specialized model to encode your text as vectors. Wolfe recommends Sentence BERT, a model trained to produce sentence embeddings that suit vector search and RAG.
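To make the link between the two stores concrete, here is a minimal sketch that keeps both in memory and ties them together with a shared ID. The `LinkedStore` class, the `cosine` helper, and the three-number vectors are all hypothetical stand-ins; a real system would use an actual vector database and real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LinkedStore:
    """Hypothetical in-memory stand-in for a vector database plus a text store."""

    def __init__(self):
        self.vectors = {}  # doc_id -> embedding (plays the vector database)
        self.texts = {}    # doc_id -> original text (plays the text/object store)

    def add(self, doc_id, text, vector):
        # The shared doc_id is what correlates a vector with its text.
        self.vectors[doc_id] = list(vector)
        self.texts[doc_id] = text

    def search(self, query_vector, k=3):
        """Return the k nearest entries as (doc_id, score, text) tuples."""
        scored = [(doc_id, cosine(query_vector, vec))
                  for doc_id, vec in self.vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return [(doc_id, score, self.texts[doc_id]) for doc_id, score in scored[:k]]


store = LinkedStore()
store.add("doc-1", "How to reset a password", [0.9, 0.1, 0.0])   # toy vectors
store.add("doc-2", "Quarterly sales figures", [0.0, 0.2, 0.9])
hits = store.search([1.0, 0.0, 0.0], k=1)
print(hits[0][0], hits[0][2])
```

The search only ever touches vectors; the text comes back through the ID lookup at the end, which is the same shape you would get with two separate services.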

BERT is an early Transformer-based model that Google developed in 2018. Instead of predicting the next token, it works by filling in the gaps within a sentence, known as infilling or cloze completions. A variation of this model, sBERT (Sentence BERT), optimizes it to match similar pairs of sentences. In the sBERT paper’s benchmark, finding the most similar pair in a collection of 10,000 sentences took about 65 hours of inference computations with BERT and about five seconds with sBERT. It’s a smaller model than most of the marquee LLMs (between 110 and 340 million parameters), so it requires fewer computing resources. That can translate to major cost savings for your organization.
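The gap between those two timings comes down to how many times the model has to run. As a rough illustration, assuming the figures refer to finding the most similar pair among 10,000 sentences, the arithmetic looks like this (the function names are made up for this sketch):

```python
def pairwise_inferences(n):
    """Model passes needed when every sentence pair goes through the model together."""
    return n * (n - 1) // 2


def embedding_inferences(n):
    """Model passes needed when each sentence is embedded once and compared afterward."""
    return n


n = 10_000
print(pairwise_inferences(n))   # 49995000
print(embedding_inferences(n))  # 10000
```

With embeddings, the roughly 50 million comparisons still happen, but each one is a cheap vector operation rather than a full pass through the model.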

Much of sBERT’s training data set and documentation focuses on sentence comparison, but you can use it to embed shorter or longer chunks of text. The right chunk size depends on your use case and documents. Using sentences as the basic unit for embedding is certainly a simple choice to make with this model, but it may not be the best for your use case. Keep in mind, though, that the runtime and memory requirements for BERT-based models typically grow quadratically with input length, so longer chunks cost disproportionately more to embed.
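If you want to experiment with chunk sizes, a sketch like the following can help. It groups whole sentences into chunks under a word budget and estimates how cost scales with length. The splitter is deliberately naive, `max_words` is only a proxy for the token count the model actually sees, and all function names here are illustrative.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(text, max_words=50):
    """Group consecutive sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks


def relative_cost(length, baseline):
    """Rough quadratic scaling: doubling the input length quadruples the estimate."""
    return (length / baseline) ** 2


text = "RAG pairs retrieval with generation. It grounds answers in sources. Chunk size matters!"
print(chunk_sentences(text, max_words=8))
print(relative_cost(256, 128))  # 4.0
```

Setting `max_words` very low gives you roughly one sentence per chunk, while raising it trades finer-grained matches for fewer, more expensive embeddings.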

Once you’ve decided the size of your text chunks, compute the embeddings and store them in your vector database. You may want to test some of these embeddings at this point by comparing the similarity between two sentences or running some basic searches with natural language queries. You can use the `util.cos_sim(A, B)` function from the sBERT package (`sentence-transformers`) to find those matches, where `A` and `B` are `torch.Tensor` objects that hold one embedding per row. If you wanted to find the sentences that most closely match in your entire set of embeddings, you could set both `A` and `B` to that set.
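Here is one way that test might look. This is a sketch, not a definitive setup: it assumes the `sentence-transformers` package is installed and uses `all-MiniLM-L6-v2`, one of the pretrained models published for it, purely as an example. The `embed`, `closest_pairs`, and `demo` helpers are hypothetical names.

```python
def embed(sentences, model_name="all-MiniLM-L6-v2"):
    """Encode sentences with a sentence-transformers model (downloaded on first use)."""
    # Imported here so closest_pairs below also works without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(sentences, convert_to_tensor=True)


def closest_pairs(scores, top_n=3):
    """Rank distinct pairs (i, j) with i < j by score, highest first.

    `scores` is a square similarity matrix, such as util.cos_sim(E, E).
    The diagonal is skipped because every sentence matches itself.
    """
    n = len(scores)
    pairs = [(float(scores[i][j]), i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)
    return [(i, j, score) for score, i, j in pairs[:top_n]]


def demo():
    """Embed a few sentences and print the closest pair."""
    from sentence_transformers import util

    sentences = [
        "How do I reset my password?",
        "I forgot my login credentials.",
        "What were last quarter's sales?",
    ]
    embeddings = embed(sentences)
    scores = util.cos_sim(embeddings, embeddings)  # both A and B are the full set
    for i, j, score in closest_pairs(scores, top_n=1):
        print(f"{sentences[i]!r} ~ {sentences[j]!r} ({score:.2f})")
```

Calling `demo()` downloads the model the first time, so expect a delay on the initial run.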

At this point, you have an effective embedding and retrieval pipeline. When a user enters a query, convert it to a vector, find the nearest vectors in your database, and look up the text they point to. You can combine this with GenAI-assisted summarization, as we do, to have the best of both worlds: the overall answer provided by an LLM with sources provided by RAG.
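One way to combine the two is to hand the retrieved passages to the LLM as numbered sources. The sketch below only assembles the prompt. The wording of the instructions and the `build_prompt` helper are illustrative assumptions, and the call to whichever LLM you use is left out.

```python
def build_prompt(question, hits):
    """Assemble an LLM prompt from retrieved passages, numbering each source.

    `hits` is a list of (source_id, text) pairs, already ordered by similarity.
    """
    lines = [
        "Answer the question using only the numbered sources below.",
        "Cite sources by number, like [1].",
        "",
    ]
    for number, (source_id, text) in enumerate(hits, start=1):
        lines.append(f"[{number}] ({source_id}) {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


hits = [
    ("doc-7", "sBERT embeds each sentence once."),
    ("doc-2", "Cosine similarity compares two vectors."),
]
prompt = build_prompt("How does sBERT compare sentences?", hits)
print(prompt)
```

Because each passage keeps its source ID, you can show the same sources next to the generated answer.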

As with any GenAI or LLM feature, you should use fine-tuning to continually improve your model. Real-world queries will give you new sentences that are similar to existing text, so you can pass those pairs back to sBERT with a score indicating their similarity. When a user selects a result for a query, that’s a signal that the query and the result are similar. When a user doesn’t select a result, that’s a signal too. In a perfect model, the user would always select the first result. If they don’t, you may want to fine-tune on the query and the result they did select, with a higher similarity score, so that the model treats them as more similar.
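As a sketch of how that feedback loop might be wired up, the code below turns one search interaction into scored pairs and passes them to the classic `fit` training API in `sentence-transformers`. The scores of 0.9 and 0.2 are placeholder assumptions to tune for your own data, and `pairs_from_clicks` and `fine_tune` are hypothetical helpers. Newer releases of the package also provide a trainer-based API.

```python
def pairs_from_clicks(query, shown, clicked_id, clicked_score=0.9, skipped_score=0.2):
    """Turn one search interaction into (query, passage, score) triples.

    `shown` is the list of (doc_id, text) results the user saw. The selected
    result gets the high score; everything else gets the low one. Both scores
    are placeholder assumptions.
    """
    return [
        (query, text, clicked_score if doc_id == clicked_id else skipped_score)
        for doc_id, text in shown
    ]


def fine_tune(model, triples, epochs=1, batch_size=16):
    """Fine-tune a SentenceTransformer on scored pairs (requires the package)."""
    # Imported here so pairs_from_clicks works without the training dependencies.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, losses

    examples = [InputExample(texts=[q, p], label=float(s)) for q, p, s in triples]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.CosineSimilarityLoss(model)  # pulls cosine similarity toward the label
    model.fit(train_objectives=[(loader, loss)], epochs=epochs)
    return model


shown = [("a", "Reset your password"), ("b", "Update billing info")]
triples = pairs_from_clicks("forgot my login", shown, clicked_id="b")
print(triples)
```

In practice you would collect many interactions before calling `fine_tune`, since a handful of pairs is unlikely to move the model in a useful direction.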

Hopefully, this gives you a concrete example to use as you begin implementing a vector-based RAG system. This is certainly not the only method, but it relies on a model customized for this specific task. Unless you have good reasons for selecting more generalized tools, you may find the best results by combining several tools that are each designed for a very specific task.