As industries and organisations increasingly adopt AI and ML technologies, high-quality data becomes more critical than ever. Today’s AI models, including openly licensed models that make their training data available, have historically been trained on the corpus of information that is widely available online.
Improving and building new models requires new sources of data, but the scarcity of high-quality, representative data publicly available on the internet creates a ‘data drought’ for developers looking to build the next generation of AI technologies. Platforms that enable and facilitate community-curated data sources offer a scalable and ethical solution, opening a path towards a more representative and democratic future for open source AI.
The costs and challenges of data production
Collecting, curating, and preparing data for use in AI/ML workflows is expensive. Accessing specialised data, especially in a way that respects the rights of the data contributor or creator, can inhibit or slow development. This has resulted in an AI ecosystem that relies heavily on unrepresentative data. Applications built on such data reproduce the biases found in their training sets and exacerbate systemic inequalities, limiting the usefulness and accuracy of the resulting systems, sometimes with catastrophic consequences.
Other approaches may generate diverse data, but violate copyright and social norms of consent. As a result, data sources are increasingly being siloed, which creates barriers for developing open source and public interest AI, and for those who want to build transparent and observable systems. Today, many communities are faced with a false dichotomy of participation within the AI ecosystem: hand over their data entirely, or be excluded.
The companies behind many popular commercial AI applications have faced scrutiny and legal action over their use of copyrighted data in training their models, while refusing to fairly compensate those whose work powers the systems generating them billions of dollars in revenue. Rethinking the role of data stewardship and facilitating stronger relationships between data creators and data seekers shifts power back to those whose contributions are the actual source of “intelligent” machines.

Exploring sustainable sourcing through communal data governance
Community-curated and community-controlled data sources are one potential solution to the data availability problem. There are existing proof points in crowdsourced open source projects such as Common Voice, Wikidata, and OpenStreetMap, all of which collect and share information from a wide community of contributors in service of creating high-quality resources for public use. Building new tools and platforms that facilitate the curation and digitisation of global, community-curated knowledge can expand on these ideas and shift our perspective towards a more sustainable supply chain for human-generated data.
Centering global communities as the authors, stewards, and domain experts ensures that data is authentic and representative, which builds trust and accountability. This collective intelligence will be vital for the next generation of artificially intelligent software, and is especially critical given the growing share of online content that is now synthetic and AI-generated. For open source and public AI developers, increased access to diverse data sources will be key to creating alternatives to proprietary for-profit systems.
Equity through access and education
Ensuring that there are sustainable, accessible, and diverse sources of data for independent developers creates the necessary conditions to foster invention and innovation, grounded in the principles of human-centered AI set forth by the European Commission. Community-owned data collectives will enable a wider audience to participate in the development and deployment of AI technologies, leading to more equitable and inclusive outcomes and more representative models. Projects that enable this, like Mozilla Data Collective, in turn unlock opportunities for new applications to be developed by underserved communities themselves, removing dependencies on extractive commercial providers and proprietary algorithms.
To build a future where AI truly represents global human interests, it is imperative that we invest in community data stewardship, accessible tools, and ethical data governance practices. Encouraging participatory design for AI systems at the dataset layer through community curation and data management tools can help solve the ongoing challenges related to inclusive representation, global perspectives, and ethical procurement of data sources for machine learning.
Ensuring that communities are educated, trained, and equipped with data literacy and governance skills, including through the forthcoming EU AI Skills Academy, will expand access to the domain-, language-, and culture-specific expertise necessary for next-generation AI innovation. These practices will push us beyond techno-centric solutionism and re-ground AI development in service of the people who power it.
About the Author
Liv Erickson is a computer scientist, creative technologist, and technology policy advocate who has been working on experiential computing technologies since 2010. As the Senior Product Lead for Mozilla Data Collective, she supports product development of a community-oriented data governance platform.