Matthew Zeiler knows a thing or two about AI. He won the prestigious ImageNet competition in 2013, a seminal moment that helped jump-start the deep learning revolution. He studied with Geoff Hinton, winner of the Nobel Prize for his research on neural nets, and Jeff Dean, who has led AI research at Google since 2018.
Today, Zeiler is the CEO of Clarifai, which helps companies train and scale AI models. When asked what he would tell organizations still at the beginning of their AI journey, his advice was straightforward:
"We’ve seen that data is the biggest area that people get wrong and take the most time to get right. They kind of overestimate how good their data setup is today. We go to a large enterprise and they say 'We’ve got all the data. It’s high quality. It’s labeled. We know where all of it is.' And then you start like a proof of concept or production contract and they’re like, 'Actually, when we see it, there’s not that much of it,’ or they don’t even know where it is internally.
“Obviously data is the precursor to customizing any of the AI or even applying the AI over your data sets. The big suggestion I always have is get your data in order now because inevitably you’re going to be applying AI to it. And data is also kind of your gold mine within an organization. Just like we talked about with enterprise search, the efficiency of your company kind of depends on the quality of your data."
The landscape of artificial intelligence is evolving at an unprecedented pace, and data remains the central resource fueling its growth. As Zeiler notes, organizations often overestimate how ready their internal data is to power an AI project or initiative. While many leaders believe they have a wealth of high-quality, organized data, the reality is often quite different. For CIOs, CTOs, and engineering managers, understanding and preparing your organization’s data infrastructure can be the difference between falling behind and leading in an AI-driven future.
This article lays out the key steps you should take to make sure your data is prepared and organized to maximize the success of your AI initiatives.
1. Assess your current data quality and accessibility
First, conduct an honest inventory of your data. Many organizations assume their data is both comprehensive and accessible, only to find in the midst of an AI initiative that gaps exist in quality, structure, and labeling.
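As a deliberately simplified starting point, an audit like the sketch below can quantify how complete and how well-labeled a dataset actually is before a proof of concept begins. The record schema and the field names `text` and `label` are assumptions for illustration, not a standard:

```python
# Hypothetical first-pass data audit: for each dataset, measure how many
# records are complete and how many carry labels. Field names are
# illustrative assumptions.

def audit_dataset(records, required_fields=("text", "label")):
    """Return simple completeness and labeling stats for a list of dicts."""
    total = len(records)
    complete = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    labeled = sum(1 for r in records if r.get("label") not in (None, ""))
    return {
        "total": total,
        "complete_pct": 100 * complete / total if total else 0,
        "labeled_pct": 100 * labeled / total if total else 0,
    }

records = [
    {"text": "invoice #1041", "label": "finance"},
    {"text": "server outage report", "label": ""},   # unlabeled
    {"text": "", "label": "hr"},                     # missing text
]
print(audit_dataset(records))
```

Running a report like this per dataset turns "we think our data is in good shape" into concrete percentages the team can act on.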
2. Prioritize high-quality, relevant data over volume
In today’s AI ecosystem, bigger isn’t always better. As Zeiler notes, the value of data doesn’t lie in quantity alone; data quality is crucial for model accuracy and efficiency. The recent shift toward smaller, high-performing models, as seen with Meta’s Llama models and Microsoft’s Phi-1, underscores the benefits of training on refined, relevant data.
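A minimal sketch of what "quality over volume" can mean in practice: deduplicate records and drop entries too short to carry signal. The length threshold here is an illustrative assumption, not a recommended value:

```python
# Hedged sketch of a simple curation pass: normalize text, drop
# near-empty entries, and remove exact duplicates. Real pipelines
# typically add fuzzy deduplication and domain-specific filters.

def curate(records, min_length=20):
    seen = set()
    curated = []
    for text in records:
        normalized = " ".join(text.split()).lower()
        if len(normalized) < min_length:
            continue  # too short to carry useful signal
        if normalized in seen:
            continue  # exact duplicate
        seen.add(normalized)
        curated.append(text)
    return curated

raw = [
    "How do I reset my corporate VPN password?",
    "How do I reset my corporate VPN password?",  # duplicate
    "ok thanks",                                   # too short
]
print(curate(raw))
```

Even a pass this crude often shrinks a corpus substantially while improving what a model actually learns from it.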
3. Build a collaborative, up-to-date knowledge base
A universal challenge organizations face is capturing and maintaining institutional knowledge, especially as AI becomes increasingly integrated into enterprise operations. Without a unified knowledge base, valuable insights remain fragmented or lost across departments, leading to inefficiencies and duplicated work.
4. Develop data governance and compliance standards
The loss of public training data, often due to privacy concerns and legal restrictions, highlights the need for a strong governance framework around your internal data. As AI regulations tighten, maintaining compliant, well-documented data practices becomes crucial.
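One way to make governance concrete is a pre-training gate that rejects records missing required provenance or consent metadata. The field names below are assumptions for illustration; the actual set should come from your legal and compliance teams:

```python
# Illustrative governance check: before a dataset is used for training,
# verify each record carries the metadata your policy requires.
# REQUIRED_METADATA is a hypothetical policy, not a standard.

REQUIRED_METADATA = {"source", "collected_at", "consent", "retention_until"}

def governance_violations(records):
    """Return (index, missing_fields) pairs for non-compliant records."""
    violations = []
    for i, record in enumerate(records):
        missing = REQUIRED_METADATA - record.keys()
        if missing:
            violations.append((i, sorted(missing)))
    return violations

records = [
    {"source": "crm", "collected_at": "2024-01-05",
     "consent": True, "retention_until": "2026-01-05"},
    {"source": "email_archive", "collected_at": "2023-11-20"},  # incomplete
]
print(governance_violations(records))
```

Wiring a check like this into the data pipeline means non-compliant records are caught automatically rather than discovered during an audit.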
5. Establish a feedback loop with your users and stakeholders
A favorable GenAI future requires adaptability and a close understanding of evolving user needs. A feedback loop enables your organization to learn from user interactions, identify gaps in your AI’s performance, and continuously refine the data.
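A feedback loop can start very simply: record user ratings on model outputs and surface the prompts with the worst scores so the team knows where the underlying data needs work. The in-memory storage and 1–5 rating scale below are illustrative assumptions:

```python
# Minimal sketch of a feedback loop. A production system would persist
# ratings to a database and track model versions; this shows the shape.

from collections import defaultdict

class FeedbackLog:
    def __init__(self):
        self._ratings = defaultdict(list)  # prompt -> list of 1-5 scores

    def record(self, prompt, score):
        self._ratings[prompt].append(score)

    def worst_prompts(self, n=3):
        """Return the n prompts with the lowest average rating."""
        averages = {
            prompt: sum(scores) / len(scores)
            for prompt, scores in self._ratings.items()
        }
        return sorted(averages, key=averages.get)[:n]

log = FeedbackLog()
log.record("summarize Q3 report", 2)
log.record("summarize Q3 report", 1)
log.record("draft onboarding email", 5)
print(log.worst_prompts(1))
```

The low-rated prompts become a prioritized backlog: each one points at data that is missing, stale, or mislabeled.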
6. Mitigate the risk of synthetic data
Synthetic data has gained popularity as a means to supplement training data, but excessive reliance on it can lead to quality degradation or “generational loss.” This occurs when AI models, continuously trained on synthetic data, lose the fidelity of the original human-created data.
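One practical mitigation is to tag every record with its provenance and cap the share of synthetic examples in any training mix. The sketch below assumes an `origin` field on each record, and the 30% cap is an illustrative assumption, not a recommended value:

```python
# Hedged sketch of a synthetic-data cap: keep all human-created records
# and admit synthetic ones only up to a fixed fraction of the final mix.

def build_training_mix(records, max_synthetic_fraction=0.3):
    """Keep all human records; admit synthetic ones only up to the cap."""
    human = [r for r in records if r["origin"] == "human"]
    synthetic = [r for r in records if r["origin"] == "synthetic"]
    # budget such that synthetic / (human + synthetic) <= the cap
    budget = round(max_synthetic_fraction * len(human)
                   / (1 - max_synthetic_fraction))
    return human + synthetic[:budget]

records = (
    [{"id": i, "origin": "human"} for i in range(7)]
    + [{"id": i, "origin": "synthetic"} for i in range(7, 13)]
)
mix = build_training_mix(records)
print(len(mix))
```

Tracking provenance also lets you rebalance later: if model quality drifts, you can tighten the cap or retrain on the human-created subset alone.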
Laying the foundation for a data-driven future
As Matthew Zeiler says, data is both a foundational resource and a competitive asset. By investing in data quality, governance, and knowledge-sharing infrastructure now, engineering leaders, CIOs, and CTOs can help future-proof their organizations for a GenAI-driven era. Well-curated data and collaborative knowledge-sharing practices position organizations to leverage AI for innovation and efficiency.