Companion Planting for AI

Photo by Victor Birai / Unsplash

As someone who spends a life-shortening amount of time glued to a computer, when I can escape the screen, I plant things.

In the first year I attempted to cultivate a little patch of earth in our newly-found rural corner of the UK, aphids annihilated my bean seeds, rabbits ravaged my carrots, and deer demolished my tulips. Grim-faced, I set up an elaborate system of netting and cages. I stopped just short of poison pellets. As a lifelong vegetarian I couldn’t quite get on board with the idea of extinguishing a creature’s life for the crime of consuming tasty vegetables. My deterrent efforts worked... only partially. They also incurred a penalty: when I legitimately wanted to pick something, navigating my avant-garde netting sculpture was a fairly comical endeavour.

My (more experienced, wiser) neighbour chuckled at my efforts. She suggested I plant marigolds with the beans, nasturtiums with the potatoes, and basil with the tomatoes. I was sceptical – but lo and behold, it worked. There are still nibbles on the spinach, but with this abundance has come protection. The diversity of plants has helped the garden, as a whole, thrive. This concept – companion planting – helps increase what’s available, which in turn reduces the load of wildlife consumption in any one spot, making the system, as a whole, more resilient.

…What does this have to do with AI?

The internet is presently the world’s most abundant source of training data. Companies have been crawling and scraping the internet with abandon, gathering and squeezing data out of the public web in a perversion of the concept of ‘harvest,’ straining and breaking the infrastructure on which we all rely. Creators, creatives, publishers, stewards and concerned citizens have challenged some of these practices with variable success.

My starting point is this. The internet is not a sustainable data source. It’s not representative (vast swathes of peoples, languages and identities aren’t represented on the internet), and it’s getting worse (the AI slop proliferation problem). Treating it as one is also (IMHO – I know others disagree) a violation of the concept of openness – which is not meant to be an excuse for land-grab extractivism, but a channel for communal sharing and gift.

There are several interesting initiatives to try and solve for equitable compensation for internet content, from co-operatives trying to build bargaining power, to better preference communication protocols, to hosting solutions offering web monetisation features. I am glad there are people trying to protect livelihoods and broker more consent-ful practice between creators and technologists. But I also understand and empathise with concerns about the knock-on consequences of some of these enabling tactics – blocking crawlers has other consequences, blockchain technologies can carry significant environmental cost, poisoning data creates liability issues. All of these are the ‘netting and pellets’ approach to protecting the crops. Likely to help somewhat, but not likely to suffice, and likely to leave collateral damage in their wake.

So, what are the ‘marigolds’ here?

As long as the internet is treated as the dominant source of training data, it will be a battleground. This is interesting, because internet data is of moderate quality, expensive to process, and getting less ‘organic’ by the day as AI slop accretes. What if we could unlock other data sources?

Expert estimates hold that the internet constitutes less than 1% of the world’s data. The rest is locked away in folders, files, on servers, devices, proprietary and private storage. The more I reflected on this number, the more intuitive it became to me. The other day we shot a video for Mozilla Data Collective. For that one minute of content, we filmed for an hour! For every photo we post on the website, we have dozens of out-takes. For every blog post I publish, how many memos, notes and drafts do I have? So. Very. Many.

Imagine this: that diverse, representative datasets were abundant. How could this protect the web from some of the more egregious current practices, bolster livelihoods and business models for under-threat industries, and be a lever for ensuring we get the inclusive tech we really want?

How do we pull that lever? We spoke to dozens of NLP communities, heritage archives, media organisations and creators, tech start-ups and product SMEs who had wonderful multilingual, multicultural, multimodal datasets. The biggest thing that stopped them from sharing datasets? Control. Knowing that they could share in ways that were aligned with their values and their goals. That they could use their datasets as a lever for change.

When we processed everything we had heard, it really boiled down to a dual framework of Data Agency & Fair Value Exchange. Data Agency is about meaningful choice and control: who gets to use your data, under what circumstances, and for what purpose? Fair Value Exchange is about you determining what constitutes the benefit you’d like to receive for the use of your datasets, be it compensation, recognition, sharing resulting tools, or data exchange. Mozilla Data Collective is the platform that makes that Data Agency & Fair Value Exchange possible: we are currently in Beta – but you can see our roadmap towards becoming fully featured over the next couple of months.

Once we started workshopping how we could support that – the legal, technical, and community infrastructure that makes that possible (and believe us, it is all three, there is no silver bullet here) – we found that actually a lot of people were happy to share their datasets. From Armenian Q&A dialogues, to Bulgarian TTS, to Nahuatl ASR, to Saraiki Literature and hundreds more datasets from organisations and individuals all over the world, created and curated by communities who care about an AI future that is flourishing and bio-diverse, and who want to help build it from a place of meaningful participation.

If you’re ready to help create a flourishing data garden, and you’ve got some marigolds, or basil or lemon verbena seeds to share, then we’re building infrastructure that lets you reap the benefits of what you’ve sown. Learn more about Mozilla Data Collective.

Copyright EMLJ. All rights reserved.

With gratitude to Kathy Reid, ANU, for her review and ever-wise input.