OpenAI/Anthropic: Copyright lawsuit explained

Lisa Ernst · 15.11.2025 · Technology · 10 min

Who pays the price when systems like ChatGPT or Claude generate texts, code, or lyrics seemingly without effort? A German court ordered OpenAI to pay damages for using protected song lyrics, while Anthropic reached a settlement with book authors. These proceedings revolve around the fundamental question of whether copyrighted works can be mass-copied for AI training, and under what conditions. At the same time, AI providers are entering into license agreements with media houses to access "clean" training data. The lawsuits are a stress test for the business model of generative AI, with consequences for publishers, creatives, developers, users, and the prices of the services.

AI and copyright law

Copyright lawsuits against OpenAI and Anthropic concern generative AI systems that learn patterns from large amounts of texts, images, music, or code to create new content. During training, data is technically copied, for example, by downloading books from shadow libraries, crawling websites, or ingesting licensed archives. OpenAI operates ChatGPT and the underlying language models, which are trained with web data, licensed content, and other sources. Anthropic develops Claude, a competing model that performs similar tasks and is based on large text corpora.

Legally, copyright, which governs the reproduction and use of works, collides with the argument that AI training falls under existing limitations for automated text and data analysis. Examples include "Text and Data Mining" in the EU or "Fair Use" in the US. Fair Use is a flexible US doctrine that allows the use of copyrighted works without permission for purposes such as criticism, research, or transformative uses. In Europe, there are narrowly defined exceptions and special text and data mining rules, the application of which depends on the license and purpose of use.

At OpenAI, conflicts began with lawsuits from individual authors over the use of their books for training. In 2023, 17 authors joined together under the Authors Guild to file a class-action lawsuit accusing OpenAI of copying their texts to train models that can generate stylistically similar content. A federal court in New York upheld key parts of the copyright allegations against OpenAI and Microsoft. At the end of 2023, the New York Times sued OpenAI and Microsoft, alleging that millions of its articles had been used for training without a license. The newspaper is demanding the destruction of all models based on its content and potentially billions in damages. This case is considered a significant confrontation between media and AI.

Eight US newspapers belonging to the Alden Global Capital group also sued OpenAI and Microsoft. In 2025, the digital publisher Ziff Davis filed a lawsuit. In total, over a dozen lawsuits are pending against OpenAI. There is also a dispute regarding the programming assistant GitHub Copilot, which is based on OpenAI technology, as developers accuse Microsoft, GitHub, and OpenAI of mass-incorporating licensed open-source repositories into training and violating license terms. A large part of the claims were dismissed in 2024, but some points may still be pursued.

In Europe, a Munich court ruled in the fall of 2025 that ChatGPT violated German copyright law because protected song lyrics from GEMA members were used without permission. OpenAI must pay damages. The court indicated that not only the output but also the copying for training can fall under protection rights.

At Anthropic, the conflict culminated in a class-action lawsuit by book authors who accused the company of using approximately seven million books, including pirated copies, for training. A US federal judge initially ruled that training on legally acquired book copies could be considered fair use under certain circumstances, but left it open how to proceed with illegally obtained copies. In September 2025, Anthropic reached a settlement of around $1.5 billion to resolve the class-action lawsuit. The settlement amount shows that copyright risks for AI providers can reach existential proportions.

Image: OpenAI is at the center of copyright debates because its AI models are trained with large amounts of data. (Source: de.marketscreener.com)

Other AI companies such as Meta and Stability AI have won initial judgments where courts classified training on certain book or image corpora as permissible use or dismissed lawsuits for formal reasons. The legal situation remains unclear.

Economic aspects

Many creatives, publishers, and associations are taking action against OpenAI and Anthropic because a new market is at stake. Anyone training a language model needs enormous amounts of data: texts from newspapers, books, forums, code repositories, song lyrics. For a long time, this data was "scraped" from the open web, often on the assumption that publicly accessible content could be used for analysis purposes. The lawsuits challenge this practice and thus raise the question of whether AI companies will have to systematically pay licensing fees in the future.

At the same time, an ecosystem of data brokers and specialized "Dataset Providers" is growing, offering curated, legally secured datasets for AI training. Companies like Rightsify or vAIsual have joined an industry association that advocates for stricter rules, transparency, and compensation, and sees itself as a counterpoint to "wild" web scraping.

Image: The competition between OpenAI and Anthropic significantly shapes the economic landscape of the AI industry. (Source: aicamp.so)

For media houses, the goal is to prevent AI models from reading and summarizing their content and competing with their own platforms without compensation. However, many publishers also recognize that they possess reliable data streams that can be licensed, and they are entering into collaborations. OpenAI has agreements with Axel Springer, News Corp, the Financial Times, Le Monde, and the Associated Press to license content for training and output. Reddit earns part of its revenue through license deals with Google and OpenAI for access to forum posts.

The "hidden costs" of AI are gaining weight in political discourse. Training runs of large models consume enormous amounts of energy and water. If billions are also spent on licenses for books, articles, or music, the question arises whether today's subscription prices for AI services are cost-covering or are being cross-financed by investors and hidden subsidies.

Regulation and politics

At the regulatory level, the framework is shifting: The EU AI Act obliges providers of "General Purpose AI" like ChatGPT and Claude to be more transparent about training data and to adhere more strictly to copyright law. This marks a transition from voluntary disclosure to mandatory obligations, which can conflict with the protection of trade secrets.

It is established that both OpenAI and Anthropic have processed large amounts of copyrighted content when training their models. Lawsuits and court documents detail which data sources and shadow libraries were presumably used. Initial courts are sending conflicting signals: A Munich court considers the use of protected song lyrics to be a violation of German copyright law, while US judges in proceedings against Meta and Anthropic lean towards fair use assessments, at least for legally acquired copies.

AI models can "memorize" a portion of their training data, meaning they can reproduce longer passages almost verbatim. This is more of a technical side effect, but it matters for sensitive content and confidential data. Studies put the memorized share in the single-digit percentage range.

It remains unclear whether courts will ultimately consider the mere training on copyrighted works permissible as long as the models do not later return extensive, identical passages and the use is considered sufficiently "transformative." Legal analyses emphasize that much depends on the specific design of the models, market impact, and relevant limitation regulations.

It is also unclear how high future licensing costs will settle and whether they will primarily favor large players with deep pockets. Data licensing packages with media houses are sometimes in the range of several hundred million dollars over several years, while open-source models continue to rely heavily on freely available data.

The claim that AI training is "theft" in every case because a model necessarily stores complete works and can output them one-to-one is false or misleading. Technical analyses show that models primarily learn patterns and statistical correlations. The real problems are the small but relevant share of memorized content and the question of whether the initial copying of training data was permitted. This puts simple comparisons like "the AI is a large-format copier" into perspective.

Also misleading is the notion that a single judgment, such as one in the New York Times case against OpenAI, could "ban" or shut down all generative AI. Step-by-step adjustments are more realistic: more licenses, stricter transparency requirements, possibly new compensation models for training data, and additional technical protective measures.

Image: A diagram showing different AI models and their affiliation with closed source or open source. (Source: user-added)

Author associations such as the Authors Guild emphasize that without effective compensation systems, the livelihoods of many writers are endangered and that AI companies are building multi-billion dollar businesses on "stolen" books. They demand clear rules under which works may either not be used without consent, or at least collective remuneration flows to collecting societies.

Large media houses are divided: The New York Times and individual US regional newspapers rely on lawsuits to secure bargaining power. Other publishers, such as Axel Springer, News Corp, or the Financial Times, have opted for extensive license agreements and see AI as an additional distribution channel and revenue source.

OpenAI and Anthropic emphasize that they comply with the law, respect creators, and are increasingly relying on licensed or legally clarified data. At the same time, they argue that a strict licensing regime for every single training use would limit the development of AI to a few corporations.

Civil society organizations like the Electronic Frontier Foundation warn against overextending copyright law. If research, open-source projects, or smaller companies no longer have fair-use-like leeway, innovation could end up in the hands of a few major players. On the other hand, creative associations and some legal experts call for new, AI-specific remuneration mechanisms and protective rights that go beyond classic usage scenarios.

Practical implications

As a creative, developer, or company using AI, you increasingly have choices. If you produce content, you can consciously license your works to platforms that adhere to transparent AI compensation models, for example, through collecting societies or dataset providers. At the same time, you can use technical protection measures, from robots.txt configurations to special "noai" meta tags, which are increasingly respected by major providers.
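To see what such protection measures do in practice, the following minimal Python sketch checks a robots.txt file and an HTML page for AI-related opt-outs. It is an illustration under stated assumptions, not a definitive implementation: the crawler names `GPTBot` and `ClaudeBot`, the sample robots.txt, the example URL, and the helper names are all chosen for this example, and the "noai" value is an informal convention that crawlers are not obliged to honor.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks two AI crawlers and allows everyone else.
# The user-agent tokens are assumptions; check each provider's documentation.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the subset of agents that may not fetch url."""
    return [a for a in agents if not is_allowed(robots_txt, a, url)]


class RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noai(html: str) -> bool:
    """Return True if the page declares the informal "noai" directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noai" in parser.directives
```

Calling `blocked_agents(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "SomeBrowser"], "https://example.com/lyrics/song.html")` lists the crawlers this file turns away. Both mechanisms are requests rather than technical barriers, so they only work if the crawler chooses to respect them.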

If you use AI services professionally, it is worth looking at the contract terms. Many providers now allow you to disable training use for certain data or offer separate "Enterprise" environments without reusing your input. Companies with sensitive or copyright-valuable content, in particular, should actively explore such options.

To assess headlines, simple verification steps help: Check whether an article links to specific court documents, whether figures are verifiably substantiated, and whether legal risks are clearly separated from pure speculation. Original sources that supplement media reports, such as CourtListener, Justia, or the courts' own publications, are helpful.

Behind the scenes, a technical shift is underway: Providers like Cloudflare now offer standard blocks for AI crawlers and are experimenting with "Pay per Crawl" models, where AI companies have to pay for access to content. This can strengthen the bargaining position of content platforms, but it can also lead to some content being visible only to paying AI providers.

Future perspectives

Despite the many lawsuits, judgments, and settlements, important questions remain open. The central doctrinal question is whether the mere copying of large amounts of copyrighted works for training constitutes an independent use that requires a license, or whether it is closer to an act of analysis covered by limitations such as Text and Data Mining or Fair Use.

It is also unclear in how much detail AI providers will have to disclose their training data in the future. The EU AI Act requires a "summary" of the content used, but expert papers discuss whether this means more than broad categories and example sources, and how far the protection of trade secrets extends.

Another unresolved issue is international fragmentation: While US courts rely heavily on fair use arguments, European courts tend to apply stricter copyright doctrine and specific TDM rules. This could lead to AI models being trained or deployed differently depending on the region, with corresponding consequences for competitiveness and access to powerful systems.

Finally, it is still unclear what legal weight new technical standards will carry, from AI-specific robots.txt extensions to content signals like "ai-train" or "ai-input", and whether courts will one day interpret them as instruments of explicit consent or objection.
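As an illustration of how such content signals could be read by a machine, here is a minimal Python sketch. It assumes a `Content-Signal:` line inside robots.txt with comma-separated `name=yes|no` pairs, which is modeled on publicly proposed conventions and may differ from whatever syntax is eventually standardized. The function name and the sample file are hypothetical.

```python
# Hypothetical robots.txt carrying content signals. The "Content-Signal"
# line format is an assumption for this sketch, not a settled standard.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
"""


def parse_content_signals(robots_txt: str) -> dict[str, bool]:
    """Map each declared signal name to True (yes) or False (no)."""
    signals: dict[str, bool] = {}
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "content-signal":
            continue
        for item in value.split(","):
            name, eq, flag = item.partition("=")
            flag = flag.strip().lower()
            # Ignore malformed pairs instead of guessing their meaning.
            if eq and flag in ("yes", "no"):
                signals[name.strip().lower()] = flag == "yes"
    return signals


signals = parse_content_signals(EXAMPLE_ROBOTS_TXT)
```

In this sample, "ai-train" is declared as "no", while "ai-input" is not mentioned at all. Whether such silence counts as consent, objection, or neither is exactly the kind of question the text above describes as open.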

The copyright lawsuits against OpenAI and Anthropic are not a sudden break but the visible symptom of a deeper structural change: creators, media, and platforms are struggling to define how their works will be used and compensated in the age of generative AI, and AI companies must learn that "just pulling everything from the internet" is neither politically nor legally sustainable in the long term. For you, this means: It is worth seeing training data not as an abstract mass, but as what it is, namely the work of millions of people. The more clearly we define rules, compensation paths, and technical protection options, the more likely it is that AI systems will emerge that are powerful and, at the same time, respect the rights of those on whose shoulders they stand.
