This is the first installment of a five-article series on where and why decentralization is needed in AI pipelines. Rather than covering a single AI task, we will look at a variety of AI/ML areas, from language processing and vision to robotics. Instead of starting with decentralization and looking for places to fit it into AI processes, we will look at where AI practice needs decentralization and focus on how that integration can be designed.

When talking about AI, especially its decentralization, people tend to think solely about the model development stage. This leads to an unbalanced focus on training itself and the compute supply problems that arise from it, while ignoring other areas whose importance may be equal to or greater than that of model development.

Overview of LLM Challenges. Designing LLMs relates to decisions taken before deployment. Behavioral challenges occur during deployment. Science challenges hinder academic progress.

In this series, we will have five articles covering different stages of the AI pipeline and the safety concerns that relate to all of those stages:

Data Collection: Quality, Copyrights & Ownership (you’re here!)
Model Building: Compute, Open Sourcing, Fine Tuning
RAG: Bringing in World Knowledge, Collective AI Memory
Safety: Manipulation, Deepfakes
Applications: AI - Human Interaction

Data Collection Overview

In the AI development process, data collection is the step in which relevant data is gathered and organized according to the identified goals and objectives. Of course, the type of data and how it is collected heavily depends on the specific use cases (you probably won’t be collecting a lot of images for a text-only model), but some general patterns emerge across all domains.

Data collection can include getting existing data from open or proprietary sources, generating new data specifically for the task at hand, augmenting low-quality or low-volume data with various techniques, and data labeling, among others.

Data collection and quality techniques

Data collection and quality challenges cannot be solved in a single phase; they require continuous attention throughout the entire process (Whang et al., 2022). For this reason, we will keep coming back to data issues in the following articles of the Towards Decentralized AI series. Still, most of the foundation of how data collection and decentralization relate to each other will be laid out here.

Deep learning challenges from a data-centric AI perspective

How Decentralization Can Solve Data Collection Problems

To explore the intersections of decentralization and data collection, we will first look at the current challenges that AI teams and developers face during the data collection process. One of the most widely discussed today is data volume: the amount of training data needed for generative models has recently grown sharply, leading to questions about limited data supply. The quality of that data is also a major factor, for obvious reasons. Storage issues, privacy controls, and copyright are other challenges that play key roles in AI pipelines, and all have possible solutions that lie within decentralization.

Data Quality

With the ever-growing need for larger training datasets, AI developers face a major challenge in both finding high-quality data and checking the quality of the data they collect. At current scales, it is becoming impossible to manually check the quality of a training dataset (even GPT-3 started from roughly 45TB of raw text before filtering), and relying solely on automated methods can let inaccuracies and poor-quality data slip through.

Refinitiv Artificial Intelligence / Machine Learning Global Study

Some major data quality problems stem from repeated, duplicate, or outdated data in what is collected. While these problems can be mitigated with model editing and retrieval augmentation methods, the edited versions and augmentation datasets still require robust data collection and quality check processes. The 2023 paper “Challenges and Applications of Large Language Models” (Jean Kaddour et al.) gives an example of outdated training data.

Community Curation & Review via Open Data Hubs

Over the years, open data hubs and data-sharing platforms have proved their usefulness in the data science and machine learning world. Hugging Face and Kaggle are prime examples, with Hugging Face more focused on training data and open models and Kaggle home to predictive modeling competitions and related datasets and notebooks. Beyond ML-focused platforms, human-curated knowledge sources like Wikipedia also play a crucial role in the open knowledge movement. All of these play a huge part in making sources available for training, fine-tuning, and retrieval.

By creating decentralized open data hubs, a large number of people can curate, review, and rate the data that will be used for AI tasks. This shifts the burden of ensuring dataset quality from relatively small teams to masses of people, while the contribution and rating processes run transparently without manipulation or intervention by any single party. With the right incentive structures, these decentralized data hubs and data curation and review activities can ensure the quality of data coming in at high velocity and volume.

These decentralization solutions can apply to newly emerging data hubs as well as existing ones.

Information Markets

Another type of structure is information markets, where a dynamic and competitive environment ensures the most up-to-date and high-quality data sources in critical areas. These types of markets can be set up using smart contracts to ensure fairness and provide the necessary incentives to the participating parties. People from a wide range of backgrounds, from domain experts to everyday internet users, can participate in these markets in different capacities, contributing, reviewing, and monetizing data.

Ocean Protocol has been working on this for a long time, creating markets where users share data not only for web3 use cases but also for AI and data science applications.

Data Volume

Despite the major advancements in LLMs, the amount of data used for training these AI models is still tiny compared to the amount of input a human receives:

I've made that point before:
- LLM: 1E13 tokens x 0.75 word/token x 2 bytes/token = 1E13 bytes.
- 4 year old child: 16k wake hours x 3600 s/hour x 1E6 optical nerve fibers x 2 eyes x 10 bytes/s = 1E15 bytes.

In 4 years, a child has seen 50 times more data than the biggest LLMs.… https://t.co/09atbzWsFP
— Yann LeCun (@ylecun) January 25, 2024
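The arithmetic in the quoted post can be multiplied out directly. The sketch below does only that: the variable names are ours, and the inputs are the post's own rough estimates rather than measurements.

```python
# Figures copied from the quoted post; all of them are rough estimates.
llm_tokens = 1e13
bytes_per_token = 0.75 * 2  # "0.75 word/token x 2 bytes", as written in the post
llm_bytes = llm_tokens * bytes_per_token

wake_hours = 16_000
seconds_per_hour = 3600
optic_fibers_per_eye = 1e6
eyes = 2
bytes_per_fiber_per_second = 10
child_bytes = (
    wake_hours * seconds_per_hour * optic_fibers_per_eye * eyes * bytes_per_fiber_per_second
)

ratio = child_bytes / llm_bytes
print(f"LLM:   {llm_bytes:.2e} bytes")
print(f"Child: {child_bytes:.2e} bytes")
print(f"Ratio: {ratio:.1f}x")
```

Without rounding, these inputs give about 1.5E13 bytes for the LLM and about 1.15E15 bytes for the child, a ratio in the tens, which is the same order of magnitude as the figure quoted in the post.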

When it comes to the sheer volume of data being collected, providing social and economic incentives to contributing teams and individuals can speed up the process, either by bringing in new types of data that are not widely available yet or by pushing more data sources to open up under new monetization frameworks. Right now, even though there are terabytes of data publicly available on the internet, they are concentrated in certain data formats (mostly text-based) and certain sectors and verticals.

Of course, volume itself is not meaningful if data quality is lacking. But when coupled with the curation and review processes described in the previous section, data collection can be accelerated much more confidently, knowing that newly added volume will always go through the necessary controls.

This also applies to synthetic data. The overcrowding of training datasets with machine-generated data, and the difficulty of telling it apart from human-generated data, pose a big challenge. Even so, relying on these iterated processes of collective review and feedback can make a great difference in the effect of synthetic data, making it a much more useful part of the data collection process instead of a bottleneck.

Data Storage and Maintenance

When we talk about these massive datasets in different shapes and sizes, another key aspect that teams and developers need to deal with is the storage and maintenance of the data. When you are collecting large datasets, you need a reliable storage method that ensures you will not lose, damage, or change the data unintentionally. The simplest risk comes from monthly subscription fees for data storage: the initial timeline of a project can limit how long the storage is maintained, even though the collected data may be needed for much longer and for other purposes. Similar issues can arise with on-site storage on dedicated hardware, as subsequent projects will also need that storage.

Another issue is the cost of storage. Because storage is billed on everything you hold, not just on what you added most recently, the monthly bill grows with the accumulated data: if you collect 5TB of data every month, you do not pay a flat fee; each month's bill is higher than the last. Of course, paying upfront for more than you need at a discounted rate is always an option, but then a portion of that storage always sits unused, which is inefficient both technically and financially.
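The growth pattern described above can be sketched in a few lines. The function name and the price of 20 (in any currency) per TB per month are illustrative assumptions, not figures from any storage provider.

```python
def monthly_bills(tb_added_per_month: float, price_per_tb_month: float, months: int) -> list[float]:
    """Return each month's bill when data accumulates and nothing is deleted."""
    bills = []
    stored_tb = 0.0
    for _ in range(months):
        stored_tb += tb_added_per_month  # new data lands on top of what is already stored
        bills.append(stored_tb * price_per_tb_month)
    return bills


# Hypothetical inputs: 5TB added per month at an assumed flat rate of 20 per TB-month.
bills = monthly_bills(tb_added_per_month=5, price_per_tb_month=20.0, months=12)
print("first month:", bills[0])
print("twelfth month:", bills[-1])
print("total for the year:", sum(bills))
```

Under these assumed numbers the twelfth bill is twelve times the first, and the yearly total is well above twelve times the first month's fee, which is the cumulative effect the paragraph describes.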

Since many general-purpose models have large overlaps in the data sources they use, such as Wikipedia pages, books, and crawled web page content, their developers collect and store mostly the same data in different storage solutions over and over, paying unnecessary fees that can add up to millions of dollars for something that has already been paid for. This lack of coordination and collaboration is a serious drag on the efficiency of the system.

What decentralization offers for the first issue, data loss, is clear. Just as you never lose your financial transaction history on a blockchain, you can make sure you never lose the data you have collected. By using a decentralized solution dedicated to permanent storage, such as Arweave, you mitigate the risk of losing data by opting out of time-limited storage options that only keep your data as long as you keep paying a monthly fee. There is also an opportunity to create a shared data lake that different projects can use without each collecting and storing the same data in different places. This significantly reduces the total cost and risk involved in data storage, while also making it much easier to build platforms where people collaborate on data collection and review, as mentioned in the data quality and volume sections.
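One common way to avoid storing the same data twice in a shared store is content addressing, where the storage key is derived from a hash of the data itself. The sketch below is a toy in-memory illustration of that idea; the `SharedStore` class is hypothetical and is not the API of Arweave or any other storage network.

```python
import hashlib


class SharedStore:
    """Toy in-memory stand-in for a shared, content-addressed data lake (hypothetical)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content, so identical data always maps to the same key.
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # a repeat upload of the same bytes stores nothing new
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = SharedStore()
key_a = store.put(b"Example encyclopedia page text")  # project A uploads a page
key_b = store.put(b"Example encyclopedia page text")  # project B uploads the same page
print(key_a == key_b, len(store))
```

Both projects end up with the same key and the store holds a single copy, so the second project can reuse the data instead of paying to store it again.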

Privacy and Personally Identifiable Information

The privacy and security of users are serious concerns for any data collection task, even more so with the large volumes of meaningful data we have been discussing. A key term here is PII, or personally identifiable information, which refers to any data that can be used to identify an individual, such as their name, ID or phone number, address, biometric credentials, or health history. PII is all over the internet. While using data that originally has a name attached to it, such as social media posts or blog entries, is fine in principle, it should first be stripped of any context that could expose PII. Regulatory measures such as the recently announced EU AI Act put a significant focus on privacy concerns like this.

Just like the data quality control problem, privacy and PII issues get significantly harder to solve with the high volume of data that AI developers need to manage. An important difference, though, is that some quality assurance solutions are not viable for privacy protection. You can create collaborative environments where thousands of people go over the collected datasets to rate their quality and usefulness, but outsourcing the privacy and PII controls of a dataset to a collaborative platform or a market is definitely not an option, since it would expose the sensitive data to the reviewers themselves.

Though it is still possible to clean PII out of collected datasets and ensure privacy standards are met, a complementary approach that prevents such sensitive data from entering the datasets in the first place can be much more effective. Blockchains enable various use cases for technologies like zero-knowledge proofs (ZKP) and decentralized identities (DID) that make the privacy-preserving approach the default mode of interaction for online activity.
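To make the "cleaning" side concrete, here is a minimal sketch of after-the-fact PII redaction using regular expressions. The two patterns and the `redact` helper are deliberately simplistic assumptions for illustration; they are not a complete or production-grade PII detector.

```python
import re

# Deliberately simple, illustrative patterns; real PII detection needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or +1 (555) 010-9999."
print(redact(sample))
```

Note that the name "Jane" survives the redaction. Pattern-based cleaning only catches what it has a pattern for, which is one reason keeping sensitive data out of the dataset upfront can be more effective than scrubbing it afterwards.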

Intellectual Property and Copyright

Another widely debated topic around data collection for AI is intellectual property rights and copyright infringement. There are many online platforms and media outlets that challenge AI companies on the use of material that is subject to copyright, saying that the training and use of these AI models create a competitive advantage over the rightful owners of the original content.

Copyright and IP issues have always been a tricky topic, and first the mass adoption of social media and now the growing usage of AI chatbots have made it even tougher to resolve conflicts in this area. Luckily, one of the best-known blockchain-related technologies, NFTs, has been focused on this problem ever since its inception, though the crazy prices for cartoon profile pictures overshadowed the discourse for a while.

NFTs make the ownership and use rights of creative and intellectual material clear, thanks to the transparency and immutability of on-chain actions. These tokens can be used to verify which material is subject to which terms, making both the data cleaning process and the handling of lawsuits easier. One example of how copyrights and licenses can be handled well is MonkeDAO’s SMB Gen3 collection. Each NFT carries a license in the token’s metadata that describes how the token holder can use the NFT, create derivatives, and what rights they have over it.

Proving the origin of an IP online has also become easier and more reliable through decentralization. With centralized social media and content platforms, there’s always a significant risk of losing access to your content or account or even the platform shutting down completely. This makes it very challenging to prove you were the first person to create/post that piece of content and risks your right to distribute and monetize it in the future. Decentralized alternatives like Zora enable creators to share their works in a way that makes the origin very clear, hence protecting their intellectual (or, mintellectual) property rights over it.

Risks and Challenges in Decentralization of Data Collection

While we went over many use cases where decentralization can help in solving data collection problems, there are still some risks and open problems involved with these methods, as is the case for any other solution.

One of the problems that decentralized solutions often introduce is the default anonymity of users. As previously mentioned, there are definitely cases where this anonymity is a strength in terms of privacy, but this doesn’t fully cancel out the problems that can arise from it. When there is a regulatory issue, for example one related to copyrighted or harmful content, the offenders being anonymous can create an even bigger issue and put the platform at risk. A now-famous example of anonymity being criminalized is Tornado Cash, where one of the developers was arrested.

A similar concern arises from permanently storing data on a decentralized network, since data uploaded from crowdsourced platforms can contain harmful content. Decentralized storage solutions already take this into account through self-censoring models.

The last issue is the tradeoff between volume and quality incentives: no matter how the platform is structured, there will always be people who upload large amounts of low-quality data and people who upload small amounts of high-quality data. Collaborative platforms and information markets need to balance these seemingly conflicting incentives to create an environment where quality checks help accelerate higher-volume contributions. This can be done through various points mechanisms in which scoring weights are adjusted based on the priorities of the platform, protecting overall data quality in the ecosystem.
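As a rough illustration of such a points mechanism, the sketch below scores contributions with two adjustable weights. The scoring rule, the weights, and the example contributors are all hypothetical; a real platform would design and tune its own formula.

```python
from dataclasses import dataclass


@dataclass
class Contribution:
    contributor: str
    volume_gb: float  # amount of data submitted
    quality: float    # average reviewer rating in [0, 1]


def points(c: Contribution, volume_weight: float, quality_weight: float) -> float:
    """Hypothetical rule: volume earns points only in proportion to its rated quality."""
    return c.volume_gb * c.quality * volume_weight + c.quality * quality_weight


bulk = Contribution("bulk_uploader", volume_gb=100.0, quality=0.2)
careful = Contribution("careful_curator", volume_gb=10.0, quality=0.9)

# A platform that prioritizes quality can raise quality_weight relative to volume_weight.
for volume_weight, quality_weight in [(1.0, 10.0), (0.1, 50.0)]:
    print(
        volume_weight,
        quality_weight,
        points(bulk, volume_weight, quality_weight),
        points(careful, volume_weight, quality_weight),
    )
```

With the first pair of weights the bulk uploader scores higher; with the second pair the careful curator does. Shifting the weights is how a platform in this sketch would express whether it currently values volume or quality more.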

Conclusion

While open problems remain in data collection, rapid developments in AI and the increasing adoption of AI applications will require new solutions that take existing approaches to the next level. The power of decentralization comes mainly from collective contribution, where transparency and accountability help high-quality data flow in at high volumes while preserving users’ privacy as well as creators’ rights.

As platforms that bring decentralization and AI data collection together mature, there will be more opportunities for better coordination and smoother data collection pipelines.

In the next article, we will talk about how decentralization can help further down the pipeline, improving the model-building and fine-tuning processes.

References

“Challenges and Applications of Large Language Models” (Jean Kaddour et al., 2023)
“Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective” (Steven Euijong Whang et al., 2022)
https://www.springboard.com/blog/data-science/machine-learning-gpt-3-open-ai/
“Insights from the Refinitiv 2019 Artificial Intelligence / Machine Learning Global Study”
https://twitter.com/ylecun/status/1750614681209983231
SMB Gen3 NFT License
https://zora.co/writings/mintellectual-property

Toward Decentralized AI, Part 1: Data Collection