Humans, Machines, Languages plenary address - Granada, June 2025
I had the pleasure of speaking at the HuMaLa conference in Granada, Spain last week. Granada was every bit as stunning as I remember, and the organisers were so kind and accommodating.
As the organisers have it, the theme was "Humanistic insights for human-machine language technologies: privacy, security, and wellbeing". This echoes the priorities of the EU's recently introduced AI Act: "human oversight, safety, privacy, transparency, non-discrimination and social and environmental wellbeing". In their words: "We hope to explore these timely topics from a range of humanistic perspectives, with a focus on human-machine language technologies."
I never write my talks verbatim; I just prepare slides. So the following isn't a word-for-word reproduction of the talk I gave, but rather a summary of what I covered.
Mozilla Common Voice:
Crowdsourcing a Hyper Multilingual Speech Corpus - Lessons from Community-Led Data
Mozilla Common Voice is a community project to crowdsource multilingual speech data. This massive volunteer-led effort has grown into the world's most linguistically diverse open speech dataset, demonstrating the potential of democratizing AI training data. In this talk, I'll also get real about some of the challenges we have faced, and lessons we've learned.
What is Common Voice?
Common Voice is a scalable, multilingual, open source platform that collects and validates text and voice contributions from volunteers around the world. The project addresses a critical gap in speech technology: most voice recognition systems are trained on data that represents only a fraction of the world's linguistic diversity. By crowdsourcing voice data from communities themselves, Common Voice aims to make speech technology more inclusive and representative of global linguistic diversity.
The platform operates on a simple but powerful premise - that tech should speak your language. Communities should have agency over the language data space, and quality speech datasets should be freely available to enable innovation across the entire spectrum of developers, from academic researchers to small startups to large corporations.
Remarkable Growth and Scale
The numbers tell a compelling story of rapid adoption and community engagement. Common Voice now supports over 300 languages and has attracted more than 750,000 contributors worldwide. The dataset has been downloaded over 5 million times, demonstrating significant demand for diverse, open speech data.
The growth trajectory has been particularly striking in recent years. Between 2020 and 2025, the platform saw a 426% increase in account holders, growing from 68,100 to 358,031. Even more dramatically, downloads increased by 1,657%, jumping from 38,500 to 676,377 over the same period. This explosive growth reflects both increased awareness of the importance of linguistic diversity in AI and the practical value that developers are finding in the datasets.
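For anyone who wants to check those percentages, here is a minimal sketch of the arithmetic. The function name `percent_increase` is my own, not part of any Common Voice tooling; the four figures are the ones quoted above.

```python
def percent_increase(before: float, after: float) -> float:
    """Return the percentage increase going from `before` to `after`."""
    return (after - before) / before * 100


# Figures quoted in the talk summary (2020 vs 2025).
account_growth = percent_increase(68_100, 358_031)
download_growth = percent_increase(38_500, 676_377)

print(f"Account holders: {account_growth:.0f}% increase")
print(f"Downloads: {download_growth:.0f}% increase")
```

Rounded to the nearest whole number, these come out at the 426% and 1,657% given above.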
How the Platform Works
Common Voice operates through two primary data collection modes, each designed to accommodate different community needs and language contexts.
Scripted Data Collection
The traditional approach follows a structured six-step process. First, someone requests that a new language be added to the platform. The website is then localized into that language, followed by sentence collection where communities gather text for people to read aloud. Once sufficient sentences are collected, Common Voice launches the site in that language, enabling voice contribution where people record themselves reading the collected sentences. Other community members then validate these voice clips for quality and accuracy. Finally, Mozilla releases the dataset every three months, making it freely available for download and use.
Spontaneous Speech Collection
Recognizing that many languages lack sufficient written text in the public domain, Common Voice also developed a spontaneous speech feature. This allows communities to start with a small corpus of prompts and then transcribe naturally spoken responses. Users see a question like "What do you want to do with your next day off?" and can respond naturally, with their speech then transcribed by other community members. This approach proves particularly valuable for oral languages or those with limited written resources.
A Diverse Community of Contributors
Common Voice is powered by people with a range of different skills and motivations. Language managers and localizers enable languages by adding text corpora and translating the platform interface. Corpora contributors participate in the core activities of speaking, listening, and writing sentences. Dataset consumers freely download the data for innovation and building applications. Open-source developers contribute to the platform itself or create alternative tools and interfaces.
What Motivates Participation
The motivations driving community participation are diverse and deeply personal. Many contributors are driven by principles of open data, public knowledge, and anti-monopolistic practices - they believe that fundamental linguistic resources should be publicly available rather than controlled by a few large corporations.
Language activism, heritage preservation, and cultural preservation represent another powerful motivator. For many communities, particularly those speaking endangered or minoritized languages, Common Voice represents a way to ensure their languages remain viable in the digital age. Contributors often speak of wanting to preserve their linguistic heritage for future generations.
Innovation and technology research motivations draw developers, researchers, and technologists who see the datasets as enabling new applications and research possibilities. Finally, many participants view voice technology as a means to an end - they're motivated by specific domain applications or social goals that speech technology can help address.
Real-World Applications and Impact
The datasets have enabled numerous innovative applications across different sectors and communities.
Kiazi Bora: Empowering Women in Tanzania
SEE Africa uses Common Voice to support gender equality through their Kiazi Bora voice app, which teaches women in rural Tanzania how to grow and sell a powerhouse potato. The app, powered by the Common Voice Kiswahili dataset, helps boost women's nutrition, earning potential, and status as more equal members of society. As Rahma Mkai, SEE Africa project lead, explains: "We will know our work is done when women have fair access to all matters pertaining to development; when their voices are heard and respected; and when their communities regard them as reliable key change agents."
MabelAI: Private Medical Translation
MabelAI Sweden addresses critical challenges in medical communication by using Common Voice for private, effective medical translations. The need for language translation in medical settings often creates delays and privacy issues. MabelAI's solution provides fully on-device translation processes that aren't saved or stored, addressing data security requirements in healthcare while enabling communication in remote areas without network connectivity. As founder Karolina Sjöberg Jabbar notes: "MabelAI is uniquely compatible with data security requirements in healthcare. It can also function without a network, allowing for use in fieldwork in remote areas."
Reality Defender: Combating Deepfakes
Reality Defender uses Common Voice to detect and debunk deepfakes, helping governments and organizations distinguish real audio from artificially generated content. As senior data engineer Zee Fryer explains: "Many audio models tend to be biased toward recognizing western, American voices. Common Voice is great because it contains different dialects, accents, a good balance of pitches and speaking speeds." Their tools for scrutinizing audio content are trained on Common Voice data, demonstrating how diverse training data improves the robustness of AI applications.
Platform Development Challenges and Solutions
Building a platform that truly centers linguistic diversity has required Mozilla to confront and solve numerous technical and social challenges.
Addressing Dialect and Variant Complexity
Languages often have several variants that make it difficult for users to validate sentences, read them, or validate audio-text pairs due to unfamiliar vocabulary or accents. Common Voice initially lacked a way to allow contributors to engage only with their specific variant through the data creation flow, making the experience challenging for macrolanguages or languages with significant internal diversity.
The solution involved implementing a variant tagging system where sentences can be tagged with specific variants, users can assign a variant to their profile, and contributors can choose to engage only with sentences and audio clips in their own variant. This ensures that, for example, speakers of North-Western Welsh only encounter and validate content in their specific variant of the language.
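To make the idea concrete, here is a rough sketch of how variant filtering could work. This is not the actual Common Voice implementation: the `Sentence` class, the `sentences_for_contributor` function, and the variant labels are all hypothetical, and I am assuming that untagged sentences are hidden when a contributor opts in to their own variant only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sentence:
    text: str
    variant: Optional[str] = None  # None means the sentence is untagged


def sentences_for_contributor(
    pool: List[Sentence],
    profile_variant: Optional[str],
    own_variant_only: bool = True,
) -> List[Sentence]:
    """Return the sentences a contributor should be shown.

    Assumption: a contributor with no variant on their profile, or one who
    has not opted in to variant-only mode, sees the whole pool.
    """
    if profile_variant is None or not own_variant_only:
        return list(pool)
    return [s for s in pool if s.variant == profile_variant]


# Hypothetical variant labels, for illustration only.
pool = [
    Sentence("sentence A", variant="north-western"),
    Sentence("sentence B", variant="southern"),
    Sentence("sentence C"),  # untagged
]

for sentence in sentences_for_contributor(pool, "north-western"):
    print(sentence.text)
```

The same filter would apply to audio clips awaiting validation, since each clip inherits the variant of the person who recorded it or of the sentence it reads.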
Reducing Data Requirements
Many languages have little to no text in the public domain. To collect 1000 hours of 5-second clips would require 720,000 sentences - an impossible threshold for communities without extensive written corpora like old books, Wikipedia articles, or large public domain texts.
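The 720,000 figure follows from simple arithmetic, sketched below. The function name `sentences_needed` is mine, and the sketch assumes every clip lasts exactly five seconds and every sentence is recorded only once.

```python
import math


def sentences_needed(target_hours: float, clip_seconds: float) -> int:
    """Number of unique sentences needed to fill `target_hours` of audio,
    assuming one recording per sentence and a fixed clip length."""
    total_seconds = target_hours * 3600
    return math.ceil(total_seconds / clip_seconds)


print(sentences_needed(1000, 5))
```

If each sentence could be recorded by several speakers, the requirement would shrink proportionally, which is part of why lowering the threshold matters so much for communities with little written text.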
Common Voice addressed this through a two-pronged approach. In 2021-2022, they introduced a banding system that moderated sentence collection requirements based on whether the language was classified as High, Medium, or Low Resource. In 2024, they built the Spontaneous Speech feature, allowing communities to start with a small corpus of prompts and then transcribe naturally spoken responses.
Streamlining Localization Requirements
The localization workload had grown substantially as the platform expanded, with users finding that translating the entire website took excessive time. The 824 strings required for full localization proved overwhelming for many communities, particularly those working on under-resourced languages lacking vocabulary for many technical phrases.
The solution reduced the localization requirement from 824 strings to 300 strings, and the overall platform from 1372 to 1149 strings, effectively decreasing the localization workload. Communities still have the option to localize further, but this change allows them to begin collecting data much sooner. For Spontaneous Speech, communities can even use a proxy language in the user interface - for instance, Spanish for Nahuatl.
Evolving Licensing Approaches
Some communities have become concerned about the suitability of CC0 (public domain) licensing for their context. They may wish to screen who accesses their data, control what it's used for, or ensure that financial benefits flow back to their communities more directly.
Mozilla is addressing this through several initiatives. They're partnering with GIZ, Maseno, GhanaNLP, and UNDP to pilot new licenses on Common Voice, including the Nwulite Obodo Open Data License (NOODL) for communities seeking more community-controlled approaches. They're also launching Mozilla Data Collective, a new platform with built-in governance features, which represents the next evolution of community-controlled data platforms.
Data Governance Framework
Common Voice operates according to a comprehensive data governance approach built on four key principles: privacy, security and transparency; community participation and decision making; mutual accountability; and value and recognition.
Privacy and Security
The platform implements pseudonymization of published voice data, provides clear pathways for profile and data deletion, maintains limited data collection via analytics, excludes users from site analytics if do-not-track is enabled, and makes the provision of additional information like email and age entirely optional.
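As an illustration of two of these measures, here is a minimal sketch of pseudonymizing a contributor identifier and honouring the Do Not Track header. This is not Mozilla's code: the function names, the keyed-hash approach, and the example key are my own assumptions about how such measures are commonly built.

```python
import hashlib
import hmac
from typing import Mapping


def pseudonymize(client_id: str, secret_key: bytes) -> str:
    """Replace a contributor identifier with a keyed hash (HMAC-SHA256).

    The same input always maps to the same pseudonym, so clips from one
    speaker stay grouped, but the original ID is not published.
    """
    return hmac.new(secret_key, client_id.encode("utf-8"), hashlib.sha256).hexdigest()


def analytics_allowed(headers: Mapping[str, str]) -> bool:
    """Return False when the request carries the `DNT: 1` header."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("dnt") != "1"


key = b"example-key-kept-out-of-the-dataset"  # hypothetical secret
print(pseudonymize("contributor-123", key))
print(analytics_allowed({"DNT": "1"}))
```

Note that pseudonymization is weaker than anonymization: a voice recording is itself identifying, which is one reason the deletion pathways and optional metadata matter as well.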
Community Participation
Mozilla positions itself as providing community-led self-service infrastructure where sovereign language speakers choose what data to collect. Mozilla doesn't choose the languages or design the datasets - communities choose the domains, speakers, and variants with minimal intervention. The platform roadmap is informed by ongoing engagement with the Common Voice community, and in-country community champions handle regional and language-specific work.
The level of community input varies based on the impact and generality of decisions. For low-impact, highly specific decisions like adjusting nomenclature for domain metadata, Mozilla consults primarily with self-selecting super-users. For medium-impact decisions like splitting Chinese language variants, they hold space for additional Common Voice communities and advisors to weigh in through research processes. For high-impact, general decisions like enabling the Māori language for contribution, they consult external sovereignty structures and implement "Do No Harm" policies.
Accountability and Recognition
For communities that consume the data, while CC0 is the most radically open license available, Mozilla enables people to raise concerns about data uses and engage in dialogue when needed. For communities that create the data, they maintain a three-step process for enforcing community guidelines: flag and educate, written warning, and account removal.
Recognition takes multiple forms, including community-level acknowledgment, support for academic credentials and grant applications, professional letters of recommendation and CV inclusions, online language community events, awards and competitions, and co-hosting community events.
Challenges in Governance Dialogue
Despite these frameworks, governance dialogue within the community remains challenging. The data ecosystem has fundamentally changed since Common Voice began, with LLM companies racing to collect training data from across the public internet, lobbying to abandon copyright constraints, and increasing scrutiny of extractive practices. Meanwhile, crude translation is often treated as adequate representation, despite community preferences for authentic linguistic diversity.
Community views on these issues vary dramatically. Some contributors ask "Why should I create data for megacorps to hoover up?" while others say "We don't want to be left behind - my grandchildren cannot speak with me - we want tools to help them learn." Some worry about voice cloning and safety, while others argue that "Restrictions just limit the good this can do." Some feel that "My people have been exploited for hundreds of years - we are done giving up our precious heritage," while others see this as "an investment in my children's future" that's "not for sale."
The Case for Open Licensing
The tension around licensing reflects deeper questions about how to balance openness with community control. Open licenses like CC0 drive innovation by enabling a wide variety of intended and unintended use cases. They open up the field and promote inclusivity by allowing low-resource actors like academia, startups, and marginalized speaker communities to participate. They push back on market dynamics by nudging higher-resource actors to diversify their products even when there's no commercial incentive, and they make consent clear for the entire pipeline, avoiding enforcement issues downstream.
However, open licensing also means unregulated access with no community control over potentially uncomfortable use cases. There's no mechanism forcing high-resource actors to contribute back to the project equitably, and if no profit is generated, there's no profit to share with communities.
Looking Forward: Mozilla Data Collective
Recognizing these challenges, Mozilla is developing the next phase of supporting the language data ecosystem through Mozilla Data Collective, a platform for ethical, community-led AI data that emphasizes Creation, Curation, and Choice.
For data contributors, the new platform will store, host, and list datasets with clear stewardship and robust documentation. Governance dashboards will put communities in control, allowing them to share data in line with their values and needs. Communities can create new value in existing data by opting into cross-walking, blending, or augmenting datasets.
For data consumers, the platform will provide a discoverable library with transparent preference signals, enable connections with communities to contextualize data and foster collaboration, and offer data supply chain transparency to enable compliance in the shifting legal landscape.
This evolution represents Mozilla's response to a market opportunity that encompasses governance beyond compliance, security and privacy protection, appropriateness of use and social contextualization, diversity and inclusion, quality and consistency, and labor consent and community benefit.
Lessons from Community-Led Data
The Common Voice experience offers several crucial lessons for community-led data initiatives. First, the importance of genuine community control - when communities have agency over their own data collection processes, they create more meaningful and sustainable datasets. Second, the need for flexible technical approaches that can accommodate the diversity of global languages and communities rather than imposing one-size-fits-all solutions.
Third, the value of iterative platform development based on community feedback. Many of Common Voice's most important features - variant support, reduced localization requirements, spontaneous speech - emerged from listening to community needs and constraints. Fourth, the recognition that governance is not a technical problem to be solved once, but an ongoing dialogue that must evolve with changing technological and social contexts.
Finally, the project demonstrates that while open licensing enables remarkable innovation and democratization, communities increasingly seek more nuanced approaches to data control that balance openness with self-determination. The challenge moving forward is developing platforms and frameworks that honor both the promise of open data and the legitimate desires of communities to maintain agency over their cultural and linguistic heritage.
Mozilla Common Voice has proven that community-led data collection can work at scale, creating valuable resources while centering the needs and voices of language communities themselves. As the initiative evolves through Mozilla Data Collective and beyond, it continues to offer a compelling alternative to extractive data practices, demonstrating that technology can be built with and for communities rather than simply extracting from them.
For more information about Mozilla Common Voice and Mozilla Data Collective, contact em@mozillafoundation.org