The Organisation for Economic Co-operation and Development (OECD) has released a new report mapping the diverse mechanisms used to collect data for artificial intelligence (AI) training, underscoring their implications for privacy, competition, and governance. The paper provides a taxonomy of practices that shape the datasets behind modern AI systems, from chatbots to large language models (LLMs).
The OECD notes that while model architecture often dominates AI debates, the performance and trustworthiness of AI systems remain heavily dependent on the quality and provenance of training data. Developers typically combine multiple sources to build large and diverse datasets, but each method presents trade-offs for developers, data subjects, and regulators.
The taxonomy identifies two broad categories: data collected directly from individuals and organisations, and data obtained from third-party providers. Direct collection includes “provided” and “observed” data, such as information entered into chatbots or metadata captured during online interactions. Increasingly, this also extends to repurposed data from platforms that host user-generated content, such as social media posts or purchase histories. Another emerging approach is voluntary data donation, or “data altruism,” in which individuals or organisations contribute information for public interest research in fields like healthcare or urban planning.
Third-party mechanisms include commercial data licensing, where developers purchase access to datasets from publishers, marketplaces, or brokers. Companies such as OpenAI have already signed agreements with organisations including The Associated Press, Axel Springer, and Reddit to license content for AI training. Non-commercial practices, meanwhile, include open data initiatives—like ImageNet or Common Crawl—that have fueled breakthroughs in computer vision and natural language processing. However, one of the most widely used yet contentious practices is large-scale web scraping, which supplies much of the text data needed for LLMs. Regulators worldwide continue to debate the legality of scraping, particularly with respect to copyright, privacy, and security.
The report highlights mounting concerns about transparency, noting that most foundation model developers remain reluctant to disclose training datasets due to legal risks and competitive pressures. At the same time, experts warn of a potential “data crunch” by the early 2030s, as the supply of high-quality human-generated text data may be exhausted if current scaling trends continue. While some developers are turning to synthetic data and smaller, more efficient models, the OECD stresses that clear governance frameworks are critical for managing the balance between innovation and individual rights.
Privacy-enhancing technologies, such as secure processing environments and synthetic data generation, are emerging as tools to mitigate risks. Still, the OECD concludes that the complexity and variety of data collection mechanisms—and their overlapping legal, ethical, and economic implications—demand sustained international policy attention. As governments from the European Union to Japan roll out new AI laws, the report aims to give regulators and industry leaders a structured basis for addressing the data foundations of trustworthy AI.
Need Help?
If you’re wondering how these measures, or any other AI regulations and laws worldwide, could impact you and your business, don’t hesitate to reach out to BABL AI. Their Audit Experts can address your concerns and questions while offering valuable insights.