By Rachel Buckland ’26
In late December 2023, The New York Times Company (“The Times”) sued OpenAI and Microsoft, claiming that the companies illegally used The Times’s intellectual property while developing AI tools such as ChatGPT. If it goes to trial, this case could have far-reaching effects on both the future of AI tools and the media industry. But to assess whether The Times has a strong copyright infringement claim, one must first understand the underlying technology and how it uses The Times’s content.
The Technology
AI models, such as OpenAI’s ChatGPT and Microsoft’s Copilot, rely on large language models (“LLMs”), a type of generative AI that “remembers” the patterns and structures of language in order to translate, generate, or predict text. An LLM is trained on a massive set of text data, including books, articles, and websites (such as Wikipedia). Training is an iterative process in which the model adjusts its parameter values until it can correctly predict the next piece of text. Understandably, the quality of the text used during training has a significant impact on the resulting LLM’s capabilities. OpenAI admitted that “datasets we view as higher quality are sampled more frequently” during training, including content from The Times. As training datasets grow, so does the likelihood of memorization, a phenomenon in which the LLM regurgitates large portions of its training documents when prompted. More common than these memorizations are hallucinations, which are coherent but factually incorrect LLM outputs.
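For a concrete, if drastically simplified, picture of what “predicting the next piece of text” means, the following Python sketch trains a toy next-token predictor by counting which word tends to follow which. It is an illustration only: real LLMs learn billions of neural-network parameters by gradient descent rather than keeping count tables, and the miniature corpus here is invented.

```python
from collections import Counter, defaultdict

# Toy "training corpus". Real LLMs train on trillions of words
# drawn from books, articles, and websites.
corpus = (
    "the cat sat on the mat . "
    "the cat ate the fish . "
    "the dog sat on the rug ."
).split()

# "Training": tally how often each token follows each preceding token.
# A real LLM instead adjusts billions of parameters by gradient descent,
# but the objective is the same: predict what comes next.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent next token observed during training."""
    return follow_counts[token].most_common(1)[0][0]

def generate(start, length=8):
    """Greedily chain predictions, a crude analogue of text generation."""
    out = [start]
    for _ in range(length):
        out.append(predict_next(out[-1]))
    return " ".join(out)

print(predict_next("the"))  # -> 'cat', the most common continuation
print(generate("the"))      # reproduces training-text fragments verbatim
```

Even this toy model hints at the memorization problem: because its entire “knowledge” consists of the training text, its output reproduces fragments of that text word for word.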
The Complaint
At the heart of the complaint is the allegation that OpenAI and Microsoft rely on high-quality journalism to train their LLMs and have not compensated The Times for the use of that content. The Times argues that this uncompensated use of its intellectual property jeopardizes its ability to continue producing quality independent journalism. Undoubtedly, The Times’s content was used in training the defendants’ LLMs, especially considering OpenAI’s recent licensing partnerships with media companies such as Axel Springer and The Associated Press. The Times strengthens its argument by alleging specific instances of infringing memorization. To generate these memorization outputs, 100 examples of which are included in the complaint, The Times entered an article’s URL along with a prompt containing a short snippet from the article. In response, the LLM output copies of the articles themselves, despite the articles being protected by a paywall. Because The Times’s articles are copyrighted, this reproduction may violate 17 U.S.C. § 106. The Times also claims that the LLMs closely summarize its articles, mimic its expressive style, and, in the form of hallucinations, wrongly attribute false information to it, all of which diminish the value of The Times’s intellectual property and damage its reputation.
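The complaint does not disclose its exact testing harness, but a probe of the kind it describes might look like the following sketch, which assumes the OpenAI Python SDK. The placeholder snippet, the model name, and the crude word-overlap comparison are all illustrative assumptions, not The Times’s actual methodology.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Placeholders: per the complaint, prompts included an article's URL
# and a short snippet of its opening text.
snippet = "The opening sentences of a paywalled article go here..."
original_continuation = "The article's actual next passage goes here..."

response = client.chat.completions.create(
    model="gpt-4",  # illustrative; the complaint's exhibits tested OpenAI models
    messages=[{"role": "user", "content": f"Continue this text: {snippet}"}],
)
output = response.choices[0].message.content or ""

# Crude similarity signal: the fraction of the original passage's distinct
# words that also appear in the model's continuation. A real analysis
# would look for long verbatim spans instead.
original_words = set(original_continuation.lower().split())
overlap = len(original_words & set(output.lower().split())) / max(len(original_words), 1)
print(f"Word overlap with the original passage: {overlap:.0%}")
```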
OpenAI responded to the complaint by explaining that including URLs in prompts violates OpenAI’s terms of service, and that it is constantly working to make its products more resistant to this “type of misuse.” Given the security and intellectual property concerns surrounding memorization, whether OpenAI and Microsoft can demonstrate that memorization of training data is a flaw rather than a feature could prove determinative.
What’s Next?
It is likely that this case will settle out of court, given the uncertain legal precedent and the disparity in economic bargaining power between The Times and its Big Tech opponents. In cases such as Sony Corp. of America v. Universal City Studios, Inc. and Authors Guild v. Google, Inc., courts have found that reproductions of copyrighted media may be transformative and therefore fair use. Both precedents, however, involved a change in the form of the media (the digitization of books, the tape recording of broadcast content). If the case goes to trial, we could see new standards set that change the course of AI development, or we could see courts adhere to previous norms, with economically devastating effects for the media industry.
Take-home Definitions:
LLM: large language model – a model that predicts the words likely to follow a given string of text, based on the potentially billions of examples used to train it.
Training: the process of teaching an AI model to perform a certain task, such as producing natural language text in a variety of styles, by exposing it to large volumes of data.
Memorization: a behavior in which, given the right prompt, an LLM repeats large portions of the material it was trained on. This behavior indicates that the LLM’s parameters encode retrievable copies of many of those training works.
Hallucinations: coherent but factually incorrect or misleading LLM outputs.