The question that has dominated the legal world for months is whether training Generative Artificial Intelligence (AI) models on existing data is legal. This issue has sharply divided opinion in both technological and legal spheres.
On one side, Fair Use advocates assert that AI training is akin to human learning and therefore entirely legitimate.
Surprisingly, this view is also supported by tech giants like Google and OpenAI, companies that produce proprietary software and are usually very protective of their intellectual property.
On the other side, most content creators—whether textual, visual, or aural—have gradually realized that generative AI relies heavily on scraping and appropriating data from the global digital user base and countless artists worldwide. As a result, authors and their economic rights holders argue that companies should not only be held accountable for these actions but also adopt more ethical practices, regardless of forthcoming regulations. This concern appears to be largely ignored by the tech industry.
Developer companies often hide behind the mantra of industrial secrecy, claiming that disclosing their methods would be improper and contrary to their interests. Unfortunately, this global problem is not adequately addressed in the European Copyright Directive, which includes a controversial exception for text and data mining (in Italy, Articles 70-ter and 70-quater of the Copyright Law). We await the implementation of the AI Act, which promises new rules, including transparency about the data used in AI training. It provides that providers of such models should publish detailed summaries of the content used for training.
In the United States, President Biden’s Executive Order on Artificial Intelligence seeks to better manage and regulate the issue. Against this backdrop, Democratic Representative Adam Schiff has proposed the “Generative AI Copyright Disclosure Act,” a bill that would require companies to disclose the copyrighted works used in their training datasets. This legislative move comes as figures like Mira Murati, CTO of OpenAI, express uncertainty about the data used in training generative AI models.
The Fragility of the Sanctions System
The core issue, as repeatedly highlighted, is clear: the economics of Generative AI allow developer companies to absorb the challenges and fines imposed by national authorities, as seen in France with the Antitrust Authority and potentially in Italy with the Privacy Authority. Given OpenAI’s current valuation of $90 billion, penalties and legal proceedings are manageable costs. It seems that large AI developers are willing to base their business models on legally dubious practices. National governments, fearing that they will fall behind in the technological race, are more inclined to explore development potential than to enforce difficult regulations.
In Europe, this regulatory barrier appears weak. The historical and global significance of these events, from both an industrial and a labor perspective, is immense. Just as Perestroika gave rise to an economic oligarchy despite strict regulations, the current framework may prove insufficient to control similar dynamics in technology and AI.
Legal Actions and Fair Use in Commercial Contexts
Major companies like The New York Times, Universal Music Group and Getty Images are suing AI developers such as OpenAI and Stability AI for copyright infringement. These tech companies deny any wrongdoing. The key legal question is whether storing billions of data points for economic profit can be considered Fair Use under U.S. doctrine.
In the U.S., technology companies rely on the broad exemptions provided by Fair Use, which historically allowed practices like Google Books’ digital library under the concept of “transformative” use. However, a recent Supreme Court ruling (Andy Warhol Foundation v. Goldsmith, 2023) emphasized that, under Fair Use’s “purpose and character” factor, the commercial nature of a use weighs against a finding of Fair Use, particularly when the new use serves substantially the same purpose as the original.
Towards Data Licensing
As legal and ethical debates intensify, it becomes evident that a potential solution lies in “Data Licensing.” AIs need continuous updates from new human content, prompting many rights holders to seek agreements ensuring a steady supply of fresh material. OpenAI has already secured about a dozen such agreements and plans to expand these partnerships. Major media groups, like Rupert Murdoch’s News Corp, are pursuing similar negotiations, anticipating that agreements will prevail over litigation in the long run. Platforms like Shutterstock, Reddit, and Tumblr are licensing their archives to AI companies, thus facilitating access to vast data repositories.
The Importance of Proprietary Content in AI Training
An alternative to this scenario is training AIs exclusively on proprietary data and text, but it is currently challenging to imagine how such tools could compete with the best general-purpose tools. Ultimately, navigating these issues through strategic partnerships and ongoing innovation in original and proprietary content will be essential.
This path is crucial for a future where legal frameworks must rapidly evolve to keep pace with technological advancements.
The original article was published on Agenda Digitale on April 22, 2024.