Introduction
AI presents copyright with a fundamentally different challenge from earlier technological disruptions. It differs from the era of digital piracy and peer-to-peer file-sharing.1 It differs from the proliferation of search engines and news aggregators, which reproduced text "snippets" of journalistic content in search results.2 It differs from questions about the copyrightability of databases, including phone directories and financial, scientific and geographic databases.3 Those earlier disruptions concerned the distribution of, and access to, existing content. Throughout these eras, the cost of reproducing digital content fell sharply, whilst the cost of producing content remained high. This asymmetry grounded the economics behind copyright law.
AI turns the economics of copyright on its head. AI content generation is beginning to replace, rather than merely redistribute, human creative output. While file-sharing redistributed existing music and films, and search engines provided access to existing web content, AI systems generate new content that may compete directly with human creators in their traditional markets.4
At the heart of generative AI is a growing tension: AI developers rely on access to datasets, but the copyright framework that supports this access was designed for an earlier era. At its simplest, copyright law automatically grants creators exclusive rights over the creative works they produce. As such, every image, sentence, or audio file on the Internet is likely protected by copyright. AI developers rely on large-scale collection of this data (through web scraping, copying and reproduction) to train their models. Such activity could interfere with copyright protection.
Web Scraping & Text and Data Mining (TDM)
Web scraping itself is not a new technology. It emerged in the 1990s and is the process of automatically extracting data from websites using bots, crawlers or web scrapers. 'Scraped' data is then saved and stored for later data processing and model training. The current legal landscape for web scraping involves website terms of use, access controls (through the robots exclusion standard, or robots.txt) and access restrictions (through application programming interfaces (APIs)). AI companies have relied heavily on this technique to amass the datasets necessary to train their AI models.
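To make the robots.txt mechanism concrete, the following is a minimal sketch in Python of how a well-behaved crawler might check a site's rules before fetching a page. It uses the standard library's `urllib.robotparser`. The robots.txt content, the crawler name `ExampleAIBot`, the helper names and the `example.com` URLs are all hypothetical and chosen only for illustration. Compliance with robots.txt is voluntary on the crawler's part, and the check says nothing about the copyright status of the page itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: it blocks one named AI crawler from the whole
# site and blocks every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    checks = [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/report"),
    ]
    for agent, url in checks:
        print(agent, url, may_fetch(rules, agent, url))
```

In this sketch the named crawler is refused everywhere, while any other crawler is refused only under `/private/`. A real crawler would typically load the file from the site (for example with `RobotFileParser.set_url` and `read`) rather than from an in-memory string.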
Most jurisdictions approach the issue as one of text and data mining, which usually involves a system of exceptions and limitations to copyright protection. How some exceptions, including the text and data mining ("TDM") exceptions, apply to AI developers remains largely unsettled. This is a critical issue for AI developers, as they need quality datasets to train AI models.
By comparison, in the US, AI developers may rely on the fair use doctrine, which, interpreted broadly, may permit the use of copyrighted works for training without prior authorisation. In the EU, the DSM Directive permits TDM unless rights holders opt out (Article 4(3)). The EU's AI Act also requires that training data be "sufficiently representative" and "complete" (Article 10(3)), while mandating compliance with copyright law (Article 53(1)(c)). On one view, this leaves AI developers having to satisfy a completeness requirement for their training data whilst also complying with rights holders' opt-outs.5
The issue is further complicated because there are currently no generally recognised standards or protocols for the machine-readable expression of a reservation of rights (opt-out) under Article 4 of the DSM Directive.6 Several attempts have been made, including an opt-out registry established by the Berlin-based initiative Spawning.7 The registry claims to host opt-out information for "1.5 Billion URLs (and counting)".8
By contrast, an audit of the LAION-5B dataset revealed that fewer than 1 in 1,000 images contained licence metadata, and fewer than 7 in 100,000 images were allowed for unrestricted use.9 The authors found that, at present, opt-out regimes strike a better balance between rightsholder interests and AI innovation than reliance on obtaining licences.10 This would change as opt-out rates grow and alternatives such as data markets, mandatory licensing or safe harbour provisions are developed.11 What can be inferred is that the number of opt-outs is currently very small, perhaps because there is no generally accepted standard for rights holders to opt out.
One study considered the importance of adopting a dynamic approach to regulatory decision-making.12 It considered that while generous fair use (using data for AI training without compensating the creator) benefited all parties when abundant training data existed, it could hurt creators and consumers where such data was scarce. Similarly, stronger AI-copyrightability (AI content enjoying more copyright protection) could hinder AI development and reduce social welfare. The analysis highlighted the complex interplay between these two copyright issues.
A different study found that less restrictive copyright regimes correlated with greater AI research output, code contributions, patents and start-up formations.13 This effect on innovation is an important consideration for policymakers deciding what level of copyright protection to afford. The central challenge is to ensure high-quality training data while providing appropriate compensation to rightsholders.14
Looking forward
The current state of research focuses on two main areas:
- Training data copyright: Whether using copyrighted works to train AI models violates copyright laws
- AI-generated works: The copyright status of works created by AI systems without human authorship or originality
Copyright issues begin at the model pre-training stage (data collection, web scraping and data pre-processing), with potential infringement of copyright holders' rights. Web scraping and illegal acquisition of copyrighted content from pirate libraries (such as Books3 and LibGen)15 play a role here. Further along the model training pipeline, following model evaluation and deployment, AI model outputs raise further copyright issues, including whether AI-generated outputs are themselves protected by copyright. The 'copyrightability' of AI-generated outputs is an emerging area because, depending on the degree of interaction, such outputs may be generated purely by AI or with the assistance of a human operator. In the latter case, there may be justification for copyright protection of the parts that involve the requisite degree of human authorship and originality. Other issues, such as memorisation/regurgitation and algorithmic licensing, are areas in which rights holders are seeking compensation for copyright infringement in the outputs of an AI model.
Footnotes
1. Brett Danaher, Michael D Smith and Rahul Telang, Piracy Landscape Study: Analysis of Existing and Emerging Research Relevant to Intellectual Property Rights (IPR) Enforcement of Commercial-Scale Piracy (USPTO Economic Working Paper No 2020-02, United States Patent and Trademark Office, April 2020).
2. Axel Metzger, 'Crisis of Journalism – Solutions to Copyright Law' (2025) 8 JuristenZeitung 329.
3. P B Hugenholtz, 'Something Completely Different: Europe's Sui Generis Database Right' in Susy Frankel and Daniel Gervais (eds), The Internet and the Emerging Importance of New Forms of Intellectual Property (Wolters Kluwer, 2016) 205.
4. Bertin Martens, 'Economic arguments in favour of reducing copyright protection for generative AI inputs and outputs' (Working Paper 09/2024, Bruegel, September 2024), 2. Available at: LINK ↗ Accessed 30 June 2025.
5. Viktor Shcherbakov, Irvin Dalaud and Christian Peukert, 'AI needs better data than the law allows' (Research Paper, 14 March 2025), 1. Available at: LINK ↗ Accessed 30 June 2025.
6. Paul Keller and Zuzanna Warso, 'Defining best practices for opting out of ML training' (Policy Brief No 5, Open Future, 28 September 2023), 7. Available at: LINK ↗ Accessed 30 June 2025.
7. Ibid, 8.
8. See LINK ↗ Accessed 30 June 2025. For further discussion, see Shcherbakov et al. (n 5), 18.
9. Shcherbakov et al. (n 5), 1.
10. Shcherbakov et al. (n 5), 2.
11. Shcherbakov et al. (n 5), 2.
12. S Alex Yang and Angela Huyue Zhang, 'Generative AI and Copyright: A Dynamic Perspective' (Research Paper, 4 February 2024). Available at: LINK ↗ Accessed 30 June 2025.
13. Christian Peukert, 'Copyright and the Dynamics of Innovation in Artificial Intelligence' (Conference Paper, Proceedings of the 58th Hawaii International Conference on System Sciences, 2025).
14. Shcherbakov et al. (n 5), 18.
15. OECD Artificial Intelligence Papers, Intellectual Property Issues in AI Trained on Scraped Data, February 2025, No. 33, 20.
