The rise of GenAI and LLMs
The birth of the transformer
Neural networks have become the leading approach to modern AI. Prior to ChatGPT, they were recognized for breakthroughs in areas such as image recognition, natural language processing, and game playing. Until recently, however, they rarely created anything original; instead, they mastered a specific task or system.
GenAI ventured into new territory: neural network systems that take a user’s prompt and create something unique in response. With systems like DALL-E and Midjourney, users could input a prompt and the system would generate an image. The foundation was laid in 2017, when researchers at Google published a paper proposing a new neural network architecture: the transformer. This approach allowed networks to scale to much larger sizes and make far more efficient use of GPU compute.
The transformer opened the door to the large language model (LLM), a generative system trained on text to respond in text. In 2018, five years before ChatGPT, OpenAI released GPT-1, where GPT stands for Generative Pre-trained Transformer. When prompted, it could generate coherent sentences and even paragraphs, but it also made plenty of mistakes and often wandered off course. The subsequent releases of GPT-2 and GPT-3 made big waves in the world of data science and AI but attracted little mainstream attention.
The arrival of ChatGPT (roughly GPT-3.5) was another watershed moment. Something about the scale of the training, and the subsequent work to fine-tune the system through reinforcement learning from human feedback, produced a GenAI that was accurate, knowledgeable, and rational enough to capture the world’s imagination.
Gaining momentum
Fast forward to today, and neural networks have achieved a staggering scale. Systems like ChatGPT, Google’s Gemini, and Anthropic’s Claude are estimated to be trained on an enormous amount of text (more than 10 TB of internet data), and this continues to grow. This data is the training set, and AI companies use special-purpose compute clusters of tens of thousands of high-end GPUs to train their models on this raw material, a process that can take weeks or months and cost tens of millions of dollars.
What’s special about LLMs is that they generalize over much of recorded human knowledge, allowing them to engage in conversations about nearly any topic. These systems can now outperform most humans on many standardized tests, converse in dozens of languages, and write code. In the next few years, the foundation models behind GenAI systems will improve in their ability to reason and converse. They will also become multi-modal, as Google’s Gemini demonstrates, meaning a single model can both understand and create images, audio, video, and text. Work is also underway to embed these systems in robots, giving them something closer to human senses and perception of the world.
The subsequent chapters of this guide will concentrate on LLMs that work with text only, but it’s certainly worth thinking about the ways in which your organization might leverage a multi-modal model in the future.