How to set goals for AI initiatives

If you start building AI without a clear purpose and a clear definition of what you hope to gain out of it, your hard work may be for nothing.

At this point, everyone has heard about how powerful GenAI is. “10x your developer productivity!” has been the refrain. New research arrives seemingly every day showing the amazing abilities of ever more powerful models. Companies brag about their own GenAI programs and how much value they add for customers.

If you are part of an organization that isn’t implementing GenAI, the FOMO might be hitting you pretty hard right now. You might be wondering if it’s time to start shopping for large language models (LLMs) and connecting them to your application. You might even be getting some serious pressure from your C-suite and customers: “We need GenAI now!”

Or maybe you’ve already started and are feeling a little lost. Engineers like to build projects, but they may have gotten out over their skis by building something without clearly defined goals and measures of success.

With all the hype and pressure around GenAI, we’re hearing in many of our conversations that everyone wants to take advantage of GenAI’s power, but often there are no concrete goals beyond implementing it.

Implementing GenAI features can be expensive and time-consuming—outputs are nondeterministic and may require a great deal of experimenting to get right. If you go into this process without a clear purpose and a clear definition of what you hope to gain out of it (other than the ability to paste “AI” into all your marketing materials), your hard work may be for nothing.

This guide will talk about what you should think through before building a GenAI program so that any investments you make will pay off.

Customer-facing or developer productivity?

Your first question should be whether you want to improve your process or your product. Improving your process means providing and supporting GenAI tools within your organization. Improving your product means integrating a model and the supporting software into your end product. You can—and many organizations do—implement GenAI projects that meet both goals, but the two will likely have separate goals and requirements, so in this section we’ll treat them as distinct projects.

Improving your process can be easier, as you may just need to adopt a CodeGen tool or use an existing chatbot product. If you’re looking to use proprietary data with GenAI, you’ll need to either train a custom model or implement retrieval-augmented generation (RAG) to both access and protect that data. We at Stack Overflow are working on a solution for this by adding GenAI-powered summarization to Stack Overflow for Teams. Either way, your goal will likely involve GenAI answering internal questions or helping developers understand your codebase.

Improving your product is certainly harder. Regardless of your goal, it will involve implementing an LLM, building and maintaining a data platform, training models, handling queries, and hiring the people who can support these engineering and data science tasks. There are plenty of possible features you could add—natural language interfaces, translation, content generation, summarization, semantic or cognitive search, etc.—so sorting out exactly what this feature should do is part of the goal-setting process.

Figuring out a feature of this magnitude is a large, cross-functional effort, so we recommend having an asynchronous tool that enables and preserves collaboration—like Stack Overflow for Teams. As a bonus, if you’re using internal information to create a customized chatbot, the Q&A format that Stack Overflow for Teams uses produces some of the best results.

Implementation considerations

In addition to the “customer” of your GenAI program, you need to factor in the quality goals of the program. GenAI in general and LLMs in particular produce nondeterministic outputs; that is, they will not always produce the same results for the same inputs. Because the models are built probabilistically—the next word is picked based on a dice roll, essentially, albeit a dice roll biased by the training data—you may get results that don’t make sense. LLMs may invent ideas, programming features, or facts. This volatility is essential to the creative process that LLMs engage in, but you’ll need to determine how much of it you’ll tolerate.
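That “biased dice roll” can be sketched in a few lines of Python. This is a toy illustration, not how any real model is implemented: the three-token vocabulary and its probabilities are made up. The point is that the temperature parameter rescales the next-token distribution, and turning it up is exactly the volatility knob described above.

```python
import math
import random

def sample_next_token(logprobs, temperature=1.0, rng=random):
    """Sample one token from a {token: log-probability} map.

    Temperature rescales the distribution: low values make the most
    likely token dominate (near-deterministic output), high values
    flatten the distribution (more volatile output).
    """
    tokens = list(logprobs)
    weights = [math.exp(lp / temperature) for lp in logprobs.values()]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution after "The capital of France is".
dist = {
    "Paris": math.log(0.90),
    "Lyon": math.log(0.07),
    "banana": math.log(0.03),
}

# Near-zero temperature: effectively greedy, the same answer every time.
greedy = {sample_next_token(dist, temperature=0.01) for _ in range(20)}

# High temperature: the same prompt yields different, sometimes
# nonsensical, continuations — the volatility you must budget for.
volatile = {sample_next_token(dist, temperature=5.0) for _ in range(200)}
```

Most hosted LLM APIs expose a temperature setting of this kind, so "how much volatility will we tolerate" often becomes a concrete configuration decision.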

Here are a few points to consider:

  • How much human assistance do you expect? GenAI applications with a human reviewing every output can tolerate more inaccuracies, but outputs sent directly to customers or acted on by autonomous agents may need greater accuracy and guardrails.
  • How much direct interaction will customers have with the inputs and outputs? We’ve seen examples recently where poorly considered chatbots agreed to sell cars for a dollar or gave customers the wrong information, leaving the company liable.
  • Will you be using proprietary information? If so, you’ll need to build in ways to limit responses to that information, whether through training or fine-tuning a model to use your data, using retrieval-augmented generation to cite sources, or using system prompts and/or long context windows to supplement any input prompts.
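The retrieval-augmented generation option mentioned above follows a simple core pattern: find the internal documents most relevant to a query, then inject them into the prompt so the model is instructed to answer only from that information. Here is a rough, dependency-free sketch; a production system would use embeddings and a vector index rather than word overlap, and the documents below are hypothetical.

```python
def retrieve(query, documents, k=1):
    """Rank documents by naive word overlap with the query.

    Stand-in for real retrieval: production RAG systems typically
    embed the query and documents and search a vector index.
    """
    q = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, documents):
    """Ground the model in retrieved context to limit its responses."""
    context = "\n".join(retrieve(query, documents))
    return (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical internal knowledge base.
docs = [
    "The staging cluster is redeployed every night at 02:00 UTC.",
    "Expense reports are due on the first Friday of each month.",
]

prompt = build_prompt("When is the staging cluster redeployed?", docs)
# `prompt` would then be sent to whichever LLM you've chosen.
```

Because only retrieved passages enter the prompt, this pattern both constrains answers to your proprietary data and gives you sources to cite back to the user.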

Quantifying goals

There are two types of metrics to consider: the quality and performance of the AI model and the business effects.

AI quality metrics could include:

  • Accuracy: Are the responses correct every time? Do they reflect a specific domain?
Measuring this: Use manual testing, or use other LLMs as evaluators. The HumanEval (https://github.com/openai/human-eval) benchmark is a widely used standard for code evaluation, though many others are available.
  • Reasoning: Do they provide complex chain-of-thought answers or is the output closer to summarization/semantic search?
Measuring this: Again, you can judge this manually or use automated benchmarks that test how your GenAI application answers a series of logic questions.
  • Speed/performance: Latency and response time make a significant difference to user experience, so investing here may improve customer satisfaction. Improving this quality will often depend on technology that supports the LLM, but not necessarily the LLM itself, like cloud infrastructure and CPU/GPU power. For mobile applications, you may have to sacrifice the other qualities to fit models on devices using quantization techniques.
Measuring this: Primarily, this will be the time delay between the query and the response.
  • Risk: Are you willing to accept some liability for LLM hallucinations or inaccuracies? Do you have a governance plan in place? Are you protected from potential copyright and bias issues?
Measuring this: This is tough to measure, but it is often the flip side of accuracy. There are highly specialized LLMs for evaluating risk, safety, and toxicity.
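A minimal evaluation harness can track several of these quality metrics at once. The sketch below scores accuracy by exact match against a small golden set and records per-query latency; `stub_model` and the test questions are placeholders for your actual LLM call, and in practice you would swap exact match for an LLM-as-judge or a benchmark like HumanEval.

```python
import time

def evaluate(model, golden_set):
    """Run a model over (prompt, expected) pairs; report accuracy and latency.

    `model` is any callable prompt -> answer. Exact-match scoring is a
    stand-in for more robust judging (human review, LLM evaluators).
    """
    correct, latencies = 0, []
    for prompt, expected in golden_set:
        start = time.perf_counter()
        answer = model(prompt)
        latencies.append(time.perf_counter() - start)
        correct += answer.strip().lower() == expected.strip().lower()
    n = len(golden_set)
    return {
        "accuracy": correct / n,
        "mean_latency_s": sum(latencies) / n,
        "p95_latency_s": sorted(latencies)[max(0, int(0.95 * n) - 1)],
    }

# Stub standing in for a real LLM call.
def stub_model(prompt):
    return "Paris" if "France" in prompt else "I don't know"

report = evaluate(stub_model, [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Freedonia?", "Marx"),
])
```

Running a harness like this on every model or prompt change turns "is it good enough?" into a number you can set a goal against.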

Business metrics could include:

  • Cost: How much are you willing to spend on computing infrastructure, specialized engineering roles, and software/SaaS/support?
  • ROI: Do you have a business plan to help pay for your investment or is this going to be a cost center? If you have a business plan, have you done the market research to see if your customers will be willing to pay you enough to make the project worth it?
  • Productivity gains: Are your developers delivering value faster? This can apply to both internal and external GenAI programs. There will usually be an initial slowdown as developers learn a new tool, but over time, does that slowdown disappear (or reverse)? For internal GenAI, does the speed at which your developers write new code make up for the hallucinations and bugs it may introduce?
  • Customer acquisition/retention: Are new customers signing up and existing customers renewing or upgrading? With any additional feature, you hope it will lead to greater interest in and satisfaction with your product. As GenAI can be a massive undertaking, if your goal is customer growth, make sure you have the research to back it up.

These aren’t the only attributes you can build metrics around. In fact, I’d wager that once you start thinking about your goals, you’ll come up with other qualities that are relevant to you. But you can’t pick them all, so determine which are the most important and make strategic tradeoffs.

Deliver value, not hype

While everyone might be excited about GenAI and its possibilities, your customers fundamentally expect you to improve their experience of your product. GenAI could very well do that, either by giving them a killer feature or letting your team deliver value faster. But you’re only going to accomplish that if you have a solid goal in place as to what you want GenAI to actually do for you and your customers.

Once you know where you want your GenAI program to go, we can help you get there. Check out our Industry Guide to AI.