Matthew Zeiler knows a thing or two about AI. He won the prestigious ImageNet competition in 2013, a seminal moment that helped jump-start the deep learning revolution. He studied with Geoff Hinton, winner of the Nobel Prize for his research on neural nets, and Jeff Dean, who has led AI research at Google since 2018.
Today, Zeiler is the CEO of Clarifai, which helps companies train and scale AI models. When asked what he would tell organizations still at the beginning of their AI journey, his advice was straightforward:
"We’ve seen that data is the biggest area that people get wrong and take the most time to get right. They kind of overestimate how good their data setup is today. We go to a large enterprise and they say 'We’ve got all the data. It’s high quality. It’s labeled. We know where all of it is.' And then you start like a proof of concept or production contract and they’re like, 'Actually, when we see it, there’s not that much of it,’ or they don’t even know where it is internally.
“Obviously data is the precursor to customizing any of the AI or even applying the AI over your data sets. The big suggestion I always have is get your data in order now because inevitably you’re going to be applying AI to it. And data is also kind of your gold mine within an organization. Just like we talked about with enterprise search, the efficiency of your company kind of depends on the quality of your data."
The landscape of artificial intelligence is evolving at an unprecedented pace, and data remains the central resource fueling its growth. As Zeiler notes, organizations often overestimate how ready their internal data is to power an AI project or initiative. While many leaders believe they have a wealth of high-quality, organized data, the reality is often quite different. For CIOs, CTOs, and engineering managers, understanding and preparing your organization’s data infrastructure can be the difference between falling behind and leading in an AI-driven future.
This article lays out the key steps you should take to make sure your data is prepared and organized to maximize the success of your AI initiatives.
1. Assess your current data quality and accessibility
First, conduct an honest inventory of your data. Many organizations assume their data is both comprehensive and accessible, only to find in the midst of an AI initiative that gaps exist in quality, structure, and labeling.
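As a deliberately simplified starting point, an audit like the sketch below can quantify how complete and how well-labeled a dataset actually is before a proof of concept begins. The record schema and the field names `text` and `label` are assumptions for illustration, not a standard:

```python
# Hypothetical first-pass data audit: for each dataset, measure how many
# records are complete and how many carry labels. Field names are
# illustrative assumptions.

def audit_dataset(records, required_fields=("text", "label")):
    """Return simple completeness and labeling stats for a list of dicts."""
    total = len(records)
    complete = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    labeled = sum(1 for r in records if r.get("label") not in (None, ""))
    return {
        "total": total,
        "complete_pct": 100 * complete / total if total else 0,
        "labeled_pct": 100 * labeled / total if total else 0,
    }

records = [
    {"text": "invoice #1041", "label": "finance"},
    {"text": "server outage report", "label": ""},   # unlabeled
    {"text": "", "label": "hr"},                     # missing text
]
print(audit_dataset(records))
```

Running a report like this per dataset turns "we think our data is in good shape" into concrete percentages the team can act on.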
2. Prioritize high-quality, relevant data over volume
In today’s AI ecosystem, bigger isn’t always better. As Zeiler notes, the value of data doesn’t lie in quantity alone; data quality is crucial for model accuracy and efficiency. The recent shift toward smaller, high-performing models, as seen with Meta’s Llama models and Microsoft’s Phi-1, underscores the benefits of training on refined, relevant data.
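A minimal sketch of what "quality over volume" can mean in practice: deduplicate records and drop entries too short to carry signal. The length threshold here is an illustrative assumption, not a recommended value:

```python
# Hedged sketch of a simple curation pass: normalize text, drop
# near-empty entries, and remove exact duplicates. Real pipelines
# typically add fuzzy deduplication and domain-specific filters.

def curate(records, min_length=20):
    seen = set()
    curated = []
    for text in records:
        normalized = " ".join(text.split()).lower()
        if len(normalized) < min_length:
            continue  # too short to carry useful signal
        if normalized in seen:
            continue  # exact duplicate
        seen.add(normalized)
        curated.append(text)
    return curated

raw = [
    "How do I reset my corporate VPN password?",
    "How do I reset my corporate VPN password?",  # duplicate
    "ok thanks",                                   # too short
]
print(curate(raw))
```

Even a pass this crude often shrinks a corpus substantially while improving what a model actually learns from it.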
3. Build a collaborative, up-to-date knowledge base
A universal challenge organizations face is capturing and maintaining institutional knowledge, especially as AI becomes increasingly integrated into enterprise operations. Without a unified knowledge base, valuable insights remain fragmented or lost across departments, leading to inefficiencies and duplicated work.
4. Develop data governance and compliance standards
The loss of public training data, often due to privacy concerns and legal restrictions, highlights the need for a strong governance framework around your internal data. As AI regulations tighten, maintaining compliant, well-documented data practices becomes crucial.
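One way to make governance concrete is a pre-training gate that rejects records missing required provenance or consent metadata. The field names below are assumptions for illustration; the actual set should come from your legal and compliance teams:

```python
# Illustrative governance check: before a dataset is used for training,
# verify each record carries the metadata your policy requires.
# REQUIRED_METADATA is a hypothetical policy, not a standard.

REQUIRED_METADATA = {"source", "collected_at", "consent", "retention_until"}

def governance_violations(records):
    """Return (index, missing_fields) pairs for non-compliant records."""
    violations = []
    for i, record in enumerate(records):
        missing = REQUIRED_METADATA - record.keys()
        if missing:
            violations.append((i, sorted(missing)))
    return violations

records = [
    {"source": "crm", "collected_at": "2024-01-05",
     "consent": True, "retention_until": "2026-01-05"},
    {"source": "email_archive", "collected_at": "2023-11-20"},  # incomplete
]
print(governance_violations(records))
```

Wiring a check like this into the data pipeline means non-compliant records are caught automatically rather than discovered during an audit.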
5. Establish a feedback loop with your users and stakeholders
A favorable GenAI future requires adaptability and a close understanding of evolving user needs. A feedback loop enables your organization to learn from user interactions, identify gaps in your AI’s performance, and continuously refine the data.
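A feedback loop can start very simply: record user ratings on model outputs and surface the prompts with the worst scores so the team knows where the underlying data needs work. The in-memory storage and 1–5 rating scale below are illustrative assumptions:

```python
# Minimal sketch of a feedback loop. A production system would persist
# ratings to a database and track model versions; this shows the shape.

from collections import defaultdict

class FeedbackLog:
    def __init__(self):
        self._ratings = defaultdict(list)  # prompt -> list of 1-5 scores

    def record(self, prompt, score):
        self._ratings[prompt].append(score)

    def worst_prompts(self, n=3):
        """Return the n prompts with the lowest average rating."""
        averages = {
            prompt: sum(scores) / len(scores)
            for prompt, scores in self._ratings.items()
        }
        return sorted(averages, key=averages.get)[:n]

log = FeedbackLog()
log.record("summarize Q3 report", 2)
log.record("summarize Q3 report", 1)
log.record("draft onboarding email", 5)
print(log.worst_prompts(1))
```

The low-rated prompts become a prioritized backlog: each one points at data that is missing, stale, or mislabeled.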
6. Mitigate the risk of synthetic data
Synthetic data has gained popularity as a means to supplement training data, but excessive reliance on it can lead to quality degradation or “generational loss.” This occurs when AI models, continuously trained on synthetic data, lose the fidelity of the original human-created data.
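One practical mitigation is to tag every record with its provenance and cap the share of synthetic examples in any training mix. The sketch below assumes an `origin` field on each record, and the 30% cap is an illustrative assumption, not a recommended value:

```python
# Hedged sketch of a synthetic-data cap: keep all human-created records
# and admit synthetic ones only up to a fixed fraction of the final mix.

def build_training_mix(records, max_synthetic_fraction=0.3):
    """Keep all human records; admit synthetic ones only up to the cap."""
    human = [r for r in records if r["origin"] == "human"]
    synthetic = [r for r in records if r["origin"] == "synthetic"]
    # budget such that synthetic / (human + synthetic) <= the cap
    budget = round(max_synthetic_fraction * len(human)
                   / (1 - max_synthetic_fraction))
    return human + synthetic[:budget]

records = (
    [{"id": i, "origin": "human"} for i in range(7)]
    + [{"id": i, "origin": "synthetic"} for i in range(7, 13)]
)
mix = build_training_mix(records)
print(len(mix))
```

Tracking provenance also lets you rebalance later: if model quality drifts, you can tighten the cap or retrain on the human-created subset alone.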
Laying the foundation for a data-driven future
As Matthew Zeiler says, data is both a foundational resource and a competitive asset. By investing in data quality, governance, and knowledge-sharing infrastructure now, engineering leaders, CIOs, and CTOs can help future-proof their organizations for a GenAI-driven era. Well-curated data and collaborative knowledge-sharing practices position organizations to leverage AI for innovation and efficiency.