It’s impossible to ignore how rapidly GenAI is advancing and changing the business and technology landscape. You may have an idea where GenAI could add value to your organization, but it's difficult to decide where to invest your resources.
Does it make sense for you to build your own GenAI model, buy a third-party solution out of the box, or use an open-source model as your foundation and tune it to your needs?
What kind of training or upskilling will your teams need to make use of the new technology? And how will you assess the ongoing performance and return on investment of your AI initiatives?
To start, you need to acknowledge that GenAI is frontier tech. Even major corporations like Microsoft have teams researching how software engineers build GenAI applications. The reality is there are a lot of unknowns. Because the technology and the landscape are so new, it’s difficult to determine how long a GenAI model will take to build or how much it will cost.
There’s also little public data available (yet!) about companies that have turned GenAI products into profitable businesses. Given this, the first step toward making this decision is embracing exploration. You need to be willing to invest resources in building unproven prototypes and to make a lot of mistakes.
In this article, we’ll run through the steps we recommend taking to build a simple “chat assistant”—a GenAI system powered by a large language model (LLM) that can take user queries as input and produce informative text or useful code as an output. We’ll also discuss the options and tradeoffs you’ll need to consider, whether you’re building your own model, buying one off the shelf, relying on open-source software, or using a SaaS solution through an API.
Below are four options, ranked from the most to the least costly and complex to implement.
1) Build your own model from scratch
The hardest and most expensive option up front, but potentially the most productive in the long term, and the one that offers you the greatest control over legal risks and privacy concerns. It requires expert personnel, specialty infrastructure, and custom data pipelines. For some sense of scale, check out Meta’s recent post on open sourcing its process for training GenAI.
2) Build on top of an open source model
With this approach, it’s easy to get something simple built, but it can also be difficult to measure whether the fine tuning you're doing adds improvements, as few baselines and qualitative tests exist. Depending on the model you choose, it may also be hard to determine full legal and privacy risks, since you may not know the data set it was initially trained on.
3) Buy a model
A great option if you don’t have the resources or team for options 1 and 2. Downsides are exposing your data to a third-party provider, spending money on a model you may not own or be able to use long term, and the carrying cost of a SaaS service if you opt to have the provider supply both the training and the integration/orchestration.
4) Work with an API
The simplest and initially cheapest option: build a chatbot wrapper that routes user queries to a big LLM provider and returns the output. It’s a great way to get started, but it leaves you with less control over your data and can become quite costly as user interactions scale. You’ll still have a lot of work to do on prompt engineering, but you can pay experts to help with this too.
Start by asking questions
Before you get into choosing (and spending money on) an AI model, it’s worth stepping back to ask three big questions:
- What is the business use case here? Do you intend to serve customers, partners, employees, or some mix of the above?
- What proprietary data will you bring to the mix? Do you need your copilot to have access to knowledge or code that isn’t publicly available? Or can you make do with third-party models trained on the open internet?
- What kind of skills and resources do you have internally? Working with GenAI will require employees to work on data science, machine learning, prompt engineering, vector databases, and more. Building your own model could also require access to specialized hardware.
The best way to start answering the questions above is to get a team of your engineers exploring an online playground. These are typically offered free of charge by companies like OpenAI, Google, IBM, Amazon, and many more. In these areas, you can experiment with a slew of existing models or see how your own model performs. It’s a virtual sandbox for getting familiar with an AI model’s capabilities.
You can test different models against one another to see what produces the best results for the prompt style you expect from users. Some might work best as chatbots, others as coding assistants. Evaluating the quality of results can be difficult, as GenAI systems are non-deterministic. Asked the same question ten times, they may give seven very similar answers with slight variations and three that are completely wrong.
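The same providers typically expose their models through an API as well, which makes side-by-side comparisons easy to script once you’ve outgrown the browser. Below is a minimal Python sketch that sends one prompt to two models several times so you can see how much the answers vary between runs; the model names are placeholders for whichever models you’re evaluating, and it assumes the `openai` package and an `OPENAI_API_KEY` environment variable.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Explain the difference between a process and a thread in two sentences."
MODELS = ["gpt-4o-mini", "gpt-3.5-turbo"]  # placeholders for the models you're comparing
RUNS = 5  # repeat the same prompt to see how much the answers vary

for model in MODELS:
    print(f"\n=== {model} ===")
    for run in range(RUNS):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0.7,  # non-zero temperature makes the run-to-run variation visible
        )
        print(f"[run {run + 1}] {response.choices[0].message.content}")
```

Even a rough script like this gives you transcripts you can share with the team and compare against whatever evaluation criteria matter for your use case.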
As you consider what model to choose, you’ll also have to weigh a number of other variables: the cost for training and for inference (the computation that happens each time a model takes an input query and outputs a response). There are also legal and privacy risks inherent to using GenAI: Do you know what data the model was trained on? Can you create guardrails to prevent it from producing toxic or illegal output? Finally, there is the process of operationalizing this system with your existing tech stack.
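On the guardrails question in particular, a common first line of defense is to run generated text through a moderation filter before it reaches users. Here’s a minimal sketch of that pattern using OpenAI’s moderation endpoint; the chat model name is a placeholder, and a real deployment would also want input-side filtering, logging, and human review.

```python
from openai import OpenAI

client = OpenAI()

def safe_reply(user_query: str) -> str:
    """Generate an answer, but withhold it if a moderation check flags it."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": user_query}],
    )
    answer = completion.choices[0].message.content

    # Screen the generated text before it reaches the user.
    moderation = client.moderations.create(input=answer)
    if moderation.results[0].flagged:
        return "Sorry, I can't share that response."
    return answer

print(safe_reply("Summarize our returns policy in one sentence."))
```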
When it makes sense to build
Building a model can be very expensive, but it has some real advantages. First, your model can specialize in your industry or domain. Bloomberg built a finance GPT, Intuit built one to help marketers using Mailchimp, and Sorcero built one to understand medical research. As mentioned earlier, firms like Meta and X are open sourcing a lot of information about how they build models, along with details on the weights and architecture of the neural networks themselves.
Building offers a significant advantage in some areas of legal and security risk. If you know exactly what data your model has been trained on, you can have confidence it won’t produce software code that infringes on someone else’s IP. Some providers, like IBM, solve this problem by making the training data available to customers and guaranteeing that it contains no copyrighted material.
Will Falcon worked in Facebook’s FAIR group as an AI researcher and is the creator of PyTorch Lightning, one of the most popular frameworks in the GenAI field. “I think like the research shows, you should train it from scratch if you have the data, but it's not realistic for most people,” he explained on the Stack Overflow podcast. “If you have a ton of data and if you can afford it, you should be pre-training.”
Of course, there is a flipside to that equation. If you train your AI model on company data, you need to ensure that it won’t be sharing private or confidential information with customers or employees. You’ll want to sanitize the data set it’s working with before you release it into the wild.
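What sanitization looks like depends on your data, but even a crude pass that strips obvious identifiers is better than nothing. The sketch below redacts email addresses and phone-number-like strings with regular expressions before documents enter a training set; in practice you’d likely add a dedicated PII-detection tool, so treat this as an illustration of the step rather than a complete solution.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace obvious personal identifiers before the text enters a training set."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

documents = [
    "Contact Jane at jane.doe@example.com or +1 (555) 010-0199 about the invoice.",
]
clean_documents = [redact(doc) for doc in documents]
print(clean_documents[0])
# Contact Jane at [EMAIL] or [PHONE] about the invoice.
```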
Also, don’t forget that training alone won’t make a model great. In a recent interview, OpenAI CEO Sam Altman said the post-training work done by humans was equally important, if not more so. This requires a lot of people and time, although there are providers that offer reinforcement learning from human feedback (RLHF) as a service.
Here’s when it makes sense to build a model from scratch:
Build on top of an open source model
Building from scratch is not your only option when it comes to AI. You can stand on the shoulders of giants by utilizing open source models such as Meta’s Llama 2 and fine tuning them on your data.
As Abid Al Awan writes in his article on fine tuning for Datacamp, “After the launch of the first version of LLaMA by Meta, there was a new arms race to build better large language models (LLMs) that could rival models like GPT-3.5 (ChatGPT). The open-source community rapidly released increasingly powerful models.”
While exciting, this avenue also presented challenges. As Awan points out, “Most open-source models carry restricted licensing, meaning they can only be used for research purposes. Secondly, only large companies or research institutes with sizable budgets could afford to fine-tune or train the models. Lastly, deploying and maintaining state-of-the-art large models was expensive.”
Over the subsequent months, a few changes have lessened these difficulties. “The new version of LLaMA models aims to address these issues. It features a commercial license, making it accessible to more organizations. Additionally, new methodologies now allow fine-tuning on consumer GPUs with limited memory.”
As you can see in the quote above, building on top of a model like Llama 2 has some big advantages in terms of acquiring state of the art capabilities without huge upfront costs, large dedicated teams, or long term maintenance. But there are drawbacks.
We don’t know the dataset Llama 2 was trained on. While the model is commercially licensed, there are open questions about using copyrighted text or proprietary code in your training data. There are providers, like IBM, that seek to address this issue by offering a model where the dataset is publicly available and free of copyright infringement.
It’s also important to remember that a “state-of-the-art” model means an AI system which has scored well on a series of relatively new and obscure academic benchmarks like MMLU. It does not mean that an open source model will have the same level of polish or professionalism in its responses as offerings like ChatGPT or Bard. These systems have been fine tuned with extensive RLHF—reinforcement learning from human feedback—in other words, actual humans evaluating the model’s responses to prompts and helping to guide it towards responses that are most useful and least toxic or inaccurate.
If you build on top of an open source model, you will still need to invest in RLHF or a post-processing step, like RAG, that seeks to improve the accuracy of your model’s output.
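As for what that fine tuning looks like in practice, the consumer-GPU methodologies Awan mentions are parameter-efficient techniques such as LoRA, which freeze the base model and train a small set of adapter weights on top of it. Here’s a rough sketch of that setup with the Hugging Face `transformers`, `peft`, and `datasets` libraries; the model name, hyperparameters, and toy dataset are illustrative placeholders, and you’d need to accept Meta’s license and bring your own data.

```python
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # gated: requires accepting Meta's license on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,  # half precision to fit consumer GPU memory
    device_map="auto",
)

# LoRA: freeze the base model and train small adapter matrices instead.
model = get_peft_model(model, LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice for Llama
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()  # typically well under 1% of the base model's weights

# A toy dataset standing in for your own domain documents.
texts = ["Question: What is our refund policy? Answer: Refunds are issued within 30 days."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```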
Here’s when it makes sense to build on top of an open source model:
When it makes sense to buy off the shelf
There are AI vendors that will help train a model for you. MosaicML lets you share your data and then handles the rest. Prosus, the corporate owner of Stack Overflow, has used Mosaic to build AI models. And Replit, which released a top-tier model for code generation, also used Mosaic. You bring the data and Mosaic handles the orchestration, efficiency, node failures, and infrastructure.
One nice thing about this approach is that it's easy to test and scale. In our conversation with Mosaic, they recommended a laddered approach. Here’s Mosaic’s chief scientist, Jonathan Frankle, answering the question of how big a customer’s first model should be:
"If you start to look at the scaling laws, the cost of training the next step up increases quadratically. So you can kind of climb the ladder. Start with the small one, see if that's useful to you. If you're getting good results, you're seeing some progress and you're seeing return on investment, try the bigger one and spend a little more. And if you're seeing return on investment in that, try the next bigger one and spend a little more.
I don't ever want a customer to come to me and say, “I'm going to spend $10 million on this model right off the bat.” What I'd much rather see is to take it one step at a time. Let's run into all the issues at the smallest possible scale we're going to run into them. There are always devils in the details with datasets and evaluation and all sorts of other fun stuff that is the real data science and machine learning problem. And keep going until you decide it's no longer worth it to you.
So far, I don't think we've had a customer who stopped yet. They keep finding value and going bigger. I'm sure that won't be true forever. There is some point at which it's no longer worth it, but stop when you stop seeing return on investment. And hopefully for both of us, it's a long way away because that means we're doing good business and you're getting something worthwhile."
Now remember, just like with open source models, something that performs really well on standardized tests and benchmarks might not be easy to integrate into your existing tech stack or be ready for release to public consumers. This last part is important, because putting a GenAI system into production is tough! Here is a nice summary of the challenge from PyTorch Lightning’s Will Falcon:
"Everyone wants to add GenAI to their offerings or utilize it to make their organization better. But a lot of people are disappointed. They saw what ChatGPT can do, but it doesn’t seem to work as well in their application.
It is very hard to scale these things. I've deployed both kinds of systems. Before I was in AI, I was a software engineer deploying regular web apps.
AI is terribly complicated. In web apps you have microservices, you do horizontal scaling, you beef up instances when you need to. In AI it doesn't work that way. AI has different patterns. Your code could work, but the model could still crash because there's a gradient off or maybe your math is wrong, maybe the data is wrong. So it's less deterministic than software. So I think that software engineers coming into it now are a little bit upset. They're like, “Oh my God, it's not translating!” And it probably won't, because it is a very different paradigm."
Here’s when it makes sense to buy a model off the shelf:
Connect to an API
This is by far the easiest way to get started with an LLM. You can build a simple system that forwards your users’ queries via an API to an LLM provider like OpenAI, Anthropic, or Google (Gemini). The service returns an output that you feed back to your users.
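To give a sense of how little code that wrapper needs, here’s a minimal sketch using OpenAI’s Python client; the model name is a placeholder, and a real service would add error handling, rate limiting, logging, and content filtering before facing users.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a helpful assistant for our product docs."}]

while True:
    user_query = input("You: ")
    history.append({"role": "user", "content": user_query})
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: use whichever provider and model you settle on
        messages=history,     # sending the history gives the model conversational context
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(f"Assistant: {answer}")
```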
In a setup like this, you’ll be paying per token—roughly the number of words sent to and from the model via API. You might wonder, how would this compare to our earlier options, where you train your own model?
The cost of a model you like doesn’t stop once it’s been built. You still have to host it and pay the inference cost—the cost for the model to take input and generate output back to the user. An API covers that hosting for you, but builds the cost into its usage pricing.
Here’s a useful thought experiment. In the example given here, it would cost about $23,000 a month to host a 180 billion parameter model on a popular cloud service. You could, of course, go with a much smaller model, but the larger you scale, the closer your capability gets to what you see in ChatGPT or Gemini.
If instead you rely on an API from a service like OpenAI, you pay by the token. As an example, one cent per 1000 tokens for input, and three cents per 1000 tokens for output. 1000 tokens is about 750 words.
Let’s assume the average user question has about 100 tokens and the average answer has about 500 tokens. When you are first starting out, the API is clearly cheaper: 1,000 user interactions a day works out to about $1 worth of input and $15 worth of output, or roughly $480 a month, well below $23,000.
On the other hand, let’s say your app goes viral. Suddenly you have 100,000 users and 1 million interactions per day. Now you’re talking about $1,000 a day in input costs and $15,000 a day in output. Your monthly API bill is approaching half a million dollars, roughly twenty times the hosting cost for your own model.
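If you want to play with these assumptions yourself, the back-of-the-envelope math fits in a few lines of Python. All of the prices and token counts below are the illustrative figures from this example, not current list prices.

```python
# Illustrative figures only; check your provider's current price list.
INPUT_PRICE_PER_1K = 0.01    # dollars per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.03   # dollars per 1,000 output tokens
INPUT_TOKENS = 100           # average tokens in a user question
OUTPUT_TOKENS = 500          # average tokens in a response
HOSTING_PER_MONTH = 23_000   # example cost to self-host a large model

def monthly_api_cost(interactions_per_day: int, days: int = 30) -> float:
    per_interaction = (INPUT_TOKENS / 1000) * INPUT_PRICE_PER_1K \
                      + (OUTPUT_TOKENS / 1000) * OUTPUT_PRICE_PER_1K
    return per_interaction * interactions_per_day * days

for volume in (1_000, 1_000_000):
    print(f"{volume:>9,} interactions/day -> ${monthly_api_cost(volume):>9,.0f} per month via the API")
print(f"Self-hosting in this example runs about ${HOSTING_PER_MONTH:,} per month, regardless of volume.")
```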
Here's when it makes sense to connect to an API:
Building your own service is definitely more costly upfront. In the long run, however, you may be able to better optimize costs as you scale.
You can’t take the dream out of the machine
Regardless of what path you take to build or buy your AI model, you’re going to need retrieval-augmented generation (RAG).
No matter the training or the fine tuning a model gets, you won’t eliminate the potential for the system to “hallucinate” facts. This propensity to make things up is part of LLMs’ DNA. Take it from Andrej Karpathy, part of the founding team at OpenAI who has served as director of AI at Tesla:
"I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
We direct their dreams with prompts. The prompts start the dream, and based on the LLM's hazy recollection of its training documents, most of the time the result goes someplace useful.
It's only when the dreams go into deemed factually incorrect territory that we label it a "hallucination". It looks like a bug, but it's just the LLM doing what it always does.
At the other end of the extreme consider a search engine. It takes the prompt and just returns one of the most similar "training documents" it has in its database, verbatim. You could say that this search engine has a "creativity problem"—it will never respond with something new. An LLM is 100% dreaming and has the hallucination problem. A search engine is 0% dreaming and has the creativity problem."
OK, so what we want from our GenAI system is something that has the conversational creativity of an LLM, but the factual accuracy of a search engine. To get there, you need to stop relying on LLMs as a source of truth. Instead, think of them as an incredible natural language interface for computer systems. They have a marvelous ability to understand and generate language, but you need to define for them what data is accurate and where it should source and cite its answers from. We may see LLMs behave differently in the future if, as Google recently showed off with Gemini 1.5, the context window for a conversation can be millions, even tens of millions of tokens long. But for now, you’ll want to make sure your system is retrieving its answers from a dataset you define and trust.
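To make that concrete, here’s a stripped-down sketch of the retrieval step: embed the documents you trust, find the ones closest to the user’s question, and hand them to the model as context. It uses OpenAI’s embeddings and chat endpoints for brevity; a production system would add a vector database, document chunking, and source citations, and the model names and example documents here are placeholders.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Placeholder documents standing in for the knowledge base you trust.
documents = [
    "Stack Overflow for Teams lets you build a private knowledge base of questions and answers.",
    "OverflowAI adds semantic search over a team's existing content.",
    "Retrieval-augmented generation grounds an LLM's answers in documents you supply.",
]

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)  # in production these would live in a vector database

def answer(question: str) -> str:
    query_vector = embed([question])[0]
    # Cosine similarity between the question and every document.
    scores = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    context = "\n".join(documents[i] for i in scores.argsort()[::-1][:2])  # top two matches
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": "Answer using only the provided context. "
                                          "If the answer isn't in it, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content

print(answer("What does retrieval-augmented generation do?"))
```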
How do you build a RAG system? You can read about how we did it at Stack Overflow as part of bringing semantic search to our public platform. You can also listen to a podcast discussing how we built a system like this as part of our OverflowAI initiative. We’ve also got more detail on RAG and a practical example of how to build it in our Industry Guide to AI.
If you want to learn more about optimizing your data to serve as the foundation for GenAI, check out this blog post or read more about building a knowledge base with Stack Overflow for Teams.