Inclusion, community and data creation: lessons from the island of Borneo
E.M. Lewis-Jong
Common Voice is a Mozilla community data project to enable technology that speaks your language.
In 2024, I worked with communities on a series of data creation activities across the island of Borneo (my partner's home!). These included languages like Fuzhou, Standard Malay, Bahasa Malay, Sabah Malay, Sabah Bisaya, Kelabit, Eastern Penan, Western Penan, Heng Hua, Central Melanau, Serian Bidayuh, Kenyah, and Sa’ban.
Challenges in creating datasets for low resource languages
In the first instance, nomenclature and definitions could be a barrier to communicating the intended ‘boundaries’ of data collection, for example to collect one linguist-defined variant and not another. Language communities might have a name for their language that varied from the (generally outsider-produced) reference materials such as Glottolog (Hammarström et al., 2024), and the linguistic variation that seemed sufficiently salient to a linguist to differentiate between variants might not be intuitive to communities themselves.
They might say, for example, that the more important distinction was the villages in which the language was spoken, or not acknowledge the variation at all. It was found in conversation with some language communities, such as Penan, that there was not universal agreement about whether linguist-defined variants were in fact mutually intelligible between speaker communities.
The dataset creation process focused on Spontaneous Speech. Speakers were served prompts in their language, e.g., “What is your favorite meal, and how do you make it?” and asked to respond in a few sentences, resulting in voice clip recordings. The recordings were then listened to and transcribed, resulting in the alignment of voice and script labels. The choice of Spontaneous Speech over Scripted Speech was for the most part due to the complete lack of public domain text for these languages. An additional benefit was capturing authentic sentences from first-language speakers, which created a valuable and unique cultural corpus for the communities themselves and other researchers.
It did however create a set of other challenges. One was the difficulty that communities had remaining within a single defined language and variant for multiple sentences. Many of these languages are always spoken in a translanguaging context. In the highly multi-cultural context of Malaysian Borneo, in which formal education is provided in Malay, Mandarin Chinese and English, and in which there are hundreds of ethnic groups with distinct languages and variants, it is rare for several sentences to contain only one language. Kenyah or Penan might be more commonly spoken with heavy use of loan-words from Malay, and Fuzhou and Heng Hua might be blended with Hokkien or Malaysian Mandarin.
Staying within a single register was also challenging when contributors spoke spontaneously outside that register's usual context. To collect Standard Malay, a formal variant found mostly in news-reading or examination contexts, it was found that using the more formal environment of university offices helped contributors.
Orthography provided a multi-layered challenge. Not all of the languages had a highly standardized orthography; for instance, several did not have an official dictionary. In other cases, there might be several competing orthographic systems. For Malaysian Fuzhou, Traditional Script Chinese might be used by the older generations, Simplified Script Chinese by the next generations, and then Hanyu Pinyin by the younger generations. In the case of many of the Indigenous languages, the language might be predominantly oral, and when used in text form (the most commonly cited use case being SMS messages), its written form might follow the writer’s phonetic intuition, generally using a Latin script.
Each of these phenomena made transcription challenging. The socioeconomic context of some of the languages meant that there were no professional transcribers. This challenge was met by seeking out speakers with the highest possible formal education level, even if they no longer spoke the language daily (a common theme), and pairing them with first-language users (often older generations who spoke the language with more regularity but might have less formal education).
Unlike the speech contributor group, which was deliberately diversified to capture the widest possible range of speaker profiles even within a small group, the transcribers were kept in small groups, or co-located, in order to generate some level of standardization for the purpose of model training.
*
My final thought? I would love to know how many engineers working with speech datasets have pondered these kinds of nuances. I have been surprised by how lightly some technology developers sit to the data. Language as it's really lived is a complex and deeply human affair, and understanding those complexities is essential to making sure machines can navigate them too.
If these sound like the kinds of computational linguistics and community data creation challenges you're interested in too, come say hi in the Common Voice Discord!