Matthew Zeiler knows a thing or two about AI. He won the prestigious ImageNet competition in 2013, a seminal moment that helped jump-start the deep learning revolution. He studied with Geoff Hinton, winner of the Nobel Prize for his research on neural nets, and Jeff Dean, who has led AI research at Google since 2018.
Today, Zeiler is the CEO of Clarifai, which helps companies train and scale AI models. When asked what he would tell organizations still at the beginning of their AI journey, he offered some straightforward advice:
"We’ve seen that data is the biggest area that people get wrong and take the most time to get right. They kind of overestimate how good their data setup is today. We go to a large enterprise and they say 'We’ve got all the data. It’s high quality. It’s labeled. We know where all of it is.' And then you start like a proof of concept or production contract and they’re like, 'Actually, when we see it, there’s not that much of it,’ or they don’t even know where it is internally.
“Obviously data is the precursor to customizing any of the AI or even applying the AI over your data sets. The big suggestion I always have is get your data in order now because inevitably you’re going to be applying AI to it. And data is also kind of your gold mine within an organization. Just like we talked about with enterprise search, the efficiency of your company kind of depends on the quality of your data."
The landscape of artificial intelligence is evolving at an unprecedented pace, and data remains the central resource fueling its growth. As Zeiler says, organizations often overestimate how ready their internal data is to get an AI project or initiative up and running. While many leaders believe they have a wealth of high-quality, organized data, the reality is often quite different. For CIOs, CTOs, and engineering managers, understanding and preparing your organization’s data infrastructure can be the difference between falling behind and leading in an AI-driven future.
This article lays out the key steps you should take to make sure your data is prepared and organized to maximize the success of your AI initiatives.
1. Assess your current data quality and accessibility
First, conduct an honest inventory of your data. Many organizations assume their data is both comprehensive and accessible, only to find in the midst of an AI initiative that gaps exist in quality, structure, and labeling.
Actionable step: Perform a data audit
Start by auditing where your data is stored, how it’s labeled, and how accessible it is to your teams. A data audit can uncover which datasets are fragmented, outdated, or incomplete. Collaborate with data engineers, IT staff, and stakeholders to map out where key information resides and identify any silos.
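To make the audit concrete, here is a minimal first-pass sketch in Python. The storage paths, the staleness threshold, and the label-file naming convention are all illustrative assumptions; in practice you would point this at your actual data stores or query your warehouse’s catalog instead.

```python
import os
import time
from collections import defaultdict

# A minimal data-audit sketch: walk a set of storage roots, group files by
# top-level dataset directory, and flag datasets that look stale or unlabeled.
# The roots, the 180-day threshold, and the "label" naming convention are
# illustrative assumptions, not a standard.

STORAGE_ROOTS = ["/data/warehouse", "/data/exports"]  # hypothetical paths
STALE_AFTER_DAYS = 180

def audit(roots):
    summary = defaultdict(lambda: {"files": 0, "bytes": 0, "newest": 0.0, "has_labels": False})
    for root in roots:
        for dirpath, _, filenames in os.walk(root):
            dataset = os.path.relpath(dirpath, root).split(os.sep)[0]
            for name in filenames:
                stat = os.stat(os.path.join(dirpath, name))
                entry = summary[dataset]
                entry["files"] += 1
                entry["bytes"] += stat.st_size
                entry["newest"] = max(entry["newest"], stat.st_mtime)
                if "label" in name.lower():
                    entry["has_labels"] = True
    return summary

if __name__ == "__main__":
    now = time.time()
    for dataset, info in sorted(audit(STORAGE_ROOTS).items()):
        age_days = (now - info["newest"]) / 86400
        flags = []
        if age_days > STALE_AFTER_DAYS:
            flags.append(f"stale ({age_days:.0f} days)")
        if not info["has_labels"]:
            flags.append("no label files found")
        status = "; ".join(flags) or "ok"
        print(f"{dataset}: {info['files']} files, {info['bytes'] / 1e6:.1f} MB -- {status}")
```

Even a crude report like this turns “we think our data is in good shape” into a concrete list of datasets that need attention.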
2. Prioritize high-quality, relevant data over volume
In today’s AI ecosystem, bigger isn’t always better. As Zeiler mentioned, the value of data doesn’t lie in quantity alone. Data quality is crucial for model accuracy and efficiency. The recent shift toward smaller, high-performing models, as seen with Meta’s Llama models and Microsoft’s Phi-1, underscores the benefits of training on refined, relevant data.
Actionable step: Curate and clean existing data
Focus on curating datasets that truly represent your organization’s objectives and end-user needs. Invest in data-cleaning tools and processes that reduce noise, ensuring that your data is consistently high-quality. Leveraging human oversight to validate data, especially if you’re considering synthetic data, can prevent quality loss.
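As a starting point, a cleaning pass can be a few lines of pandas. The file names, column names, and minimum-length cutoff below are illustrative assumptions; the pattern itself (dedupe, drop empties, filter noise, normalize) is what matters.

```python
import pandas as pd

# A minimal cleaning pass over a hypothetical CSV of support tickets with
# "id" and "text" columns. Column names, paths, and the 20-character cutoff
# are assumptions for illustration.

df = pd.read_csv("support_tickets.csv")
before = len(df)

# Drop exact duplicates and rows missing the field the model will learn from.
df = df.drop_duplicates(subset=["text"]).dropna(subset=["text"])

# Filter obvious noise: trivially short entries that carry no signal.
df = df[df["text"].str.strip().str.len() >= 20]

# Normalize whitespace so near-duplicates are easier to catch downstream.
df["text"] = df["text"].str.strip().str.replace(r"\s+", " ", regex=True)

print(f"Kept {len(df)} of {before} rows after cleaning")
df.to_csv("support_tickets_clean.csv", index=False)
```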
3. Build a collaborative, up-to-date knowledge base
A universal challenge organizations face is capturing and maintaining institutional knowledge, especially as AI becomes increasingly integrated into enterprise operations. Without a unified knowledge base, valuable insights remain fragmented or lost across departments, leading to inefficiencies and duplicated work.
Actionable step: Create a centralized knowledge management system
Use a knowledge-sharing platform like Stack Internal to centralize, validate, and continuously update information across departments. Encourage employees to document workflows, challenges, and solutions, and implement a regular review process to keep content relevant. Additionally, structure your knowledge base with AI in mind by tagging and categorizing content so models can effectively access this information.
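One way to structure a knowledge base with AI in mind is to make tags and ownership first-class metadata on every document. The schema below is a hypothetical sketch, not a Stack Internal API; it simply shows the shape of record a retrieval layer can filter on before a model ever sees the text.

```python
from dataclasses import dataclass, field

# A sketch of AI-friendly knowledge-base structure: every document carries
# explicit tags, an accountable owner, and a review date. All fields and
# example content are illustrative assumptions.

@dataclass
class KnowledgeDoc:
    title: str
    body: str
    owner: str                      # team accountable for keeping it current
    tags: list[str] = field(default_factory=list)
    last_reviewed: str = ""         # ISO date of the last review cycle

docs = [
    KnowledgeDoc(
        title="Deploy checklist for the billing service",
        body="Run migrations before rollout...",
        owner="payments-team",
        tags=["deployment", "billing", "runbook"],
        last_reviewed="2024-11-01",
    ),
]

def find_by_tag(corpus: list[KnowledgeDoc], tag: str) -> list[KnowledgeDoc]:
    """Simple tag filter; a real system would combine this with semantic search."""
    return [d for d in corpus if tag in d.tags]

for doc in find_by_tag(docs, "runbook"):
    print(doc.title, "-", doc.owner, "- last reviewed", doc.last_reviewed)
```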
4. Develop data governance and compliance standards
The shrinking supply of public training data, as privacy concerns and legal restrictions put more of the open web off limits, highlights the need for a strong governance framework around your internal data. As AI regulations tighten, maintaining compliant, well-documented data practices becomes crucial.
Actionable step: Implement a data governance framework
Develop data access and usage policies that meet current regulations and best practices, ensuring data security and privacy. Assign data stewards within each department who can oversee adherence to these guidelines and conduct regular compliance checks. Document each stage of data processing to ensure traceability, accuracy, and transparency.
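Traceability doesn’t have to start with heavyweight tooling. Here is a hypothetical sketch of a provenance log in which every processing step records who did what, when, and a hash of the output; larger organizations would typically use a data catalog or lineage tool for the same purpose.

```python
import hashlib
import json
import time

# A lightweight provenance record: each processing step appends an entry so
# any dataset can be traced back through its transformations. The step names,
# actor format, and log path are illustrative assumptions.

LOG_PATH = "data_lineage.jsonl"

def record_step(dataset: str, step: str, actor: str, output_bytes: bytes) -> None:
    entry = {
        "dataset": dataset,
        "step": step,
        "actor": actor,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Example: log an anonymization pass performed by a named data steward.
record_step(
    dataset="customer_tickets",
    step="pii_redaction",
    actor="steward:jane.doe",
    output_bytes=b"...redacted dataset contents...",
)
```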
5. Establish a feedback loop with your users and stakeholders
Thriving in a GenAI future requires adaptability and a close understanding of evolving user needs. A feedback loop enables your organization to learn from user interactions, identify gaps in your AI’s performance, and continuously refine the underlying data.
Actionable step: Use feedback to guide data and model improvements
Deploy surveys, usability testing, and performance tracking to capture user feedback on AI-generated outputs. Analyze this data to identify common pain points or areas where model performance could improve. Regularly incorporating this feedback into your data curation and model training will ensure your AI systems remain relevant and effective.
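Structured feedback is easier to act on than free-form comments. This hypothetical sketch shows the basic aggregation step: collect ratings and failure categories on AI outputs, then surface the most common problems as candidates for the next data-curation pass. The schema and category labels are assumptions for illustration.

```python
from collections import Counter

# Aggregate structured feedback on AI-generated responses and surface the
# most common failure categories. Ratings run 1-5 in this sketch.

feedback = [
    {"response_id": "r1", "rating": 1, "category": "hallucination"},
    {"response_id": "r2", "rating": 5, "category": None},
    {"response_id": "r3", "rating": 2, "category": "outdated_answer"},
    {"response_id": "r4", "rating": 1, "category": "hallucination"},
]

low_rated = [f for f in feedback if f["rating"] <= 2]
pain_points = Counter(f["category"] for f in low_rated if f["category"])

print(f"{len(low_rated)} of {len(feedback)} responses rated poorly")
for category, count in pain_points.most_common():
    print(f"  {category}: {count}")  # candidates for targeted data fixes
```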
6. Mitigate the risk of synthetic data
Synthetic data has gained popularity as a means to supplement training data, but excessive reliance on it can lead to quality degradation or “generational loss.” This occurs when AI models, continuously trained on synthetic data, lose the fidelity of the original human-created data.
Actionable step: Balance synthetic data with real-world examples
If you’re using synthetic data, balance it with high-quality, human-generated data. Establish a clear threshold for how much synthetic data is permissible within any dataset. Encourage model validation with real-world data, especially for tasks involving critical decision-making or customer interactions.
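A threshold is only useful if something enforces it. This sketch shows one way to cap the synthetic share when assembling a training set; the 30% ceiling is an illustrative assumption rather than a recommendation, since the right number depends on your domain and validation results.

```python
import random

# Enforce a ceiling on the synthetic share of a training set. The 30% cap
# is an illustrative assumption, not a recommended value.

MAX_SYNTHETIC_RATIO = 0.30

def build_training_set(real: list, synthetic: list, seed: int = 42) -> list:
    """Mix real and synthetic examples without exceeding the synthetic cap."""
    # Largest synthetic count such that synthetic / (real + synthetic) <= cap.
    cap = int(len(real) * MAX_SYNTHETIC_RATIO / (1 - MAX_SYNTHETIC_RATIO))
    rng = random.Random(seed)
    sampled = rng.sample(synthetic, min(cap, len(synthetic)))
    mixed = real + sampled
    rng.shuffle(mixed)
    assert len(sampled) / len(mixed) <= MAX_SYNTHETIC_RATIO + 1e-9
    return mixed

real_examples = [f"real_{i}" for i in range(700)]
synthetic_examples = [f"synth_{i}" for i in range(900)]

train = build_training_set(real_examples, synthetic_examples)
share = sum(x.startswith("synth") for x in train) / len(train)
print(f"{len(train)} examples, synthetic share: {share:.0%}")
```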
Laying the foundation for a data-driven future
As Matthew Zeiler says, data is both a foundational resource and a competitive asset. By investing in data quality, governance, and knowledge-sharing infrastructure now, engineering leaders, CIOs, and CTOs can help future-proof their organizations for a GenAI-driven era. Well-curated data and collaborative knowledge-sharing practices help organizations leverage AI to drive innovation and efficiency.