Why should the public care about AI training data?

Photo by Eddi Aguirre / Unsplash

Artificial Intelligence technologies are being rolled out quickly in many parts of the world, from the digital assistants on our phones to the underlying systems that influence what news we see and what opportunities we encounter. The data used to train these systems shapes their performance and behavior.

These technologies can address social challenges when designed with care, but they can also reinforce existing inequalities or create new problems when they’re driven by unrepresentative, skewed or toxic data. 

AI systems often reflect, magnify and reproduce existing societal biases, resulting in discriminatory outcomes with outsized harms. These problems begin with the data used to train AI systems.

Public participation in creating AI training data is essential. Any data created in controlled, artificial contexts will struggle to represent the full, beautiful diversity of people. To serve the public interest, we need to engage the public in creating AI training data.

What is wrong with current approaches to data creation?

Today's AI training data is primarily collected, controlled, and utilized by a relatively small number of technology companies. The current approach to data creation often involves using vast quantities of web data that was never intended for that purpose. Given that more than 30% of the world’s population is not online, this compounds the problems of representation. Even widely used voice assistants support only a tiny fraction of the world's languages, compared with community efforts like Wikipedia, which spans 299 languages.

Legal issues abound in using the public internet as primary training data, from copyright infringement to consent to fair use debates. Social norms are hotly contested too: there is anxiety about worker displacement, meaningful consent and the devaluation of creative industries.

An AI ecosystem powered exclusively by market forces will always prioritise profitable applications while undervaluing critical work that could benefit society. This represents a fundamental misalignment between how AI training data is currently created and the broader public interest.

Why should public sector organisations and governments get involved in creating public AI data?

Public sector bodies have both the mandate and the opportunity to ensure AI development serves everyone. Unlike private companies, which are primarily focused on profit maximisation, governments are obliged to serve all their constituents and deliver sustainable public benefit.

Public alternatives to private initiatives have long coexisted across sectors—from transportation to broadcasting. These public options provide more choice, create market pressure for trustworthiness and innovation, distribute power more widely, and build more resilient economies. 

Governments are uniquely positioned to mobilize diverse citizen participation through their existing touchpoints with the public. By treating AI training data as a public good—similar to infrastructure, education, or research—governments can drive innovation that serves overlooked needs and communities.

The First Mandate for Completeness: Ensuring Diversity and Total Coverage

Governments have a unique responsibility to serve all citizens equally, making them ideally positioned to create AI training data with true population diversity. Artificial alternatives, like gig work, tend to overrepresent particular groups (say, young adults).

Governments generally have touchpoints with larger swathes of the population, including people of diverse ages, genders, ethnicities, abilities, locations, languages and socioeconomic backgrounds. This kind of organic diversity is rarely captured in commercially produced datasets.

Existing Touchpoints with Citizens

Governments already interact with citizens through numerous services across their lifespan, creating natural opportunities to collect diverse, representative data. Think of healthcare, education, social services, and citizens' advice. Public sector organizations can ethically gather high-quality data while improving service delivery. 

These interactions provide a foundation for creating comprehensive training datasets that reflect real-world diversity without requiring additional data collection infrastructure. For example, when being certified as a doctor, you could agree to be recorded reading a paragraph of a medical dictionary. When graduating from high school, you could be asked if you would donate a single essay to the public domain. This ‘many hands make light work’ approach to AI data creation, which centres public participation, would scale available training data whilst embedding consent.

By not just creating but publicly sharing high-quality public datasets, governments can enable smaller organizations, researchers, and startups to develop AI solutions that might otherwise be impossible due to data access limitations.

Data Reuse and Social Norms Around Giving and Community

Increasing the volume of authentic AI training data that reflects real-world diversity also requires setting expectations around data donation, reuse and recycling. Government initiatives can establish positive social norms around data sharing and reuse for public benefit. Programs like NIH's Bridge2AI tried to create ethical frameworks for data reuse, ensuring consent forms clearly explained to participants how their data would be used. By establishing transparent, ethical standards for public data collection and sharing, and being obsessed with communicating these to constituents, governments can build trust. 

Building STEM Skills 

Creating public datasets provides opportunities to develop AI skills in the wider public. Common Voice has shown that our contributors “learn by doing”: they learn about sample bias by recruiting participants, and about accent diversity by listening to clips. We imagine a world in which people can learn about AI by being actively brought in as co-creators, designers and shapers. Technical capacities in data science, machine learning, and AI development can be gained through hands-on experience.