OpenAI is once again in the spotlight over allegations of training its AI models on copyrighted material without proper authorization. This time, a new report from the AI Disclosures Project claims that OpenAI’s GPT-4o model was likely trained on paywalled books from O’Reilly Media — despite no licensing agreement being in place.
AI models work by learning from vast amounts of data, including books, films, and online content, to generate text or images based on prompts. They don’t create from scratch but instead predict what comes next based on patterns they’ve previously seen. While AI companies have started to use synthetic data to fuel training, real-world content remains central due to its richness and variety. But using copyrighted material without consent raises serious ethical and legal concerns.
The AI Disclosures Project — a nonprofit launched in 2024 by O’Reilly Media founder Tim O’Reilly and economist Ilan Strauss — has now suggested that OpenAI’s GPT-4o has been trained on content it shouldn’t have had access to. The report, co-authored by O’Reilly, Strauss, and AI researcher Sruly Rosenblat, highlights how GPT-4o appears to “recognize” large volumes of content from O’Reilly Media’s paid, non-public catalog — more so than older models like GPT-3.5 Turbo.
GPT-4o Shows Strong Familiarity With O’Reilly’s Paywalled Content
According to the paper, GPT-4o showed a significant increase in recognition of O’Reilly book content compared to GPT-3.5 Turbo. The researchers used 13,962 paragraph excerpts from 34 different O’Reilly titles to assess whether the AI models had prior exposure to the material. They applied a technique called DE-COP, introduced in a 2024 paper on detecting copyrighted content in language models’ training data, which helps determine whether an AI model has likely been trained on specific copyrighted text.
The method works by presenting a model with a verbatim human-written passage alongside AI-generated paraphrases and asking it, multiple-choice style, to pick out the original. If the model identifies the verbatim text substantially more often than random guessing would allow, that is a strong signal it has already seen the content during training.
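The core of this quiz-based test can be sketched in a few lines. The helper names below (`build_quiz`, `guess_rate`) and the stub passages are illustrative, not from the DE-COP paper; a real audit would generate paraphrases with another model and have the LLM under test answer each multiple-choice item.

```python
import random

def build_quiz(original, paraphrases):
    """Build one DE-COP-style multiple-choice item: the verbatim passage
    hidden among paraphrased distractors, with options shuffled."""
    options = [original] + list(paraphrases)
    random.shuffle(options)
    answer = options.index(original)  # index of the verbatim passage
    return options, answer

def guess_rate(model_pick, quizzes):
    """Fraction of quizzes where the model picks the verbatim text.
    Accuracy well above chance (1 / number of options) suggests the
    passage appeared in the model's training data."""
    correct = sum(1 for opts, ans in quizzes if model_pick(opts) == ans)
    return correct / len(quizzes)

# Toy corpus of 100 items, each with one original and three paraphrases.
quizzes = [
    build_quiz(f"original passage {i}",
               [f"paraphrase {i}a", f"paraphrase {i}b", f"paraphrase {i}c"])
    for i in range(100)
]

# Stand-in for the model under audit: here, an oracle that always finds
# the original, so the rate is 1.0 versus a 0.25 chance baseline.
oracle = lambda opts: next(i for i, o in enumerate(opts)
                           if o.startswith("original"))
rate = guess_rate(oracle, quizzes)
```

In the study's setting, `model_pick` would wrap an API call to GPT-4o or GPT-3.5 Turbo, and the comparison of their rates against the chance baseline is what drives the paper's conclusions.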
The results? GPT-4o had a significantly higher rate of recognition for paywalled O’Reilly book excerpts — especially those published before the model’s last known training cutoff — than any of OpenAI’s previous models.
No Clear Smoking Gun — But Serious Questions Remain
Despite the suggestive results, the researchers are cautious about jumping to conclusions. They admit their findings don’t amount to definitive proof. One possibility, they suggest, is that users may have copied and pasted content from these books into ChatGPT, which could have exposed the model to such materials indirectly.
Further complicating the issue is the fact that the study didn’t include newer OpenAI models like GPT-4.5 or reasoning-optimized versions such as o3-mini and o1. It’s unclear whether those models were also exposed to the same unlicensed content or whether the finding is specific to GPT-4o.
Still, the timing and nature of the findings are noteworthy, especially considering OpenAI’s ongoing legal challenges over how it sources training data. The company is currently facing lawsuits in the U.S. for allegedly infringing on copyright laws while building its AI systems.
OpenAI’s Content Strategy Has Faced Growing Scrutiny
OpenAI has made significant efforts to acquire high-quality data — legally and otherwise. It’s struck licensing agreements with news publishers, stock media platforms, and even social media networks. The company has also offered opt-out tools for content creators, though critics argue these mechanisms are flawed or insufficient.
At the same time, OpenAI has advocated for more relaxed legal frameworks around training AI on copyrighted data. The company has even recruited domain experts, including journalists and scientists, to help fine-tune its models’ capabilities. This broader trend has seen many AI firms seeking out specialized knowledge sources, whether licensed or not, to keep their systems competitive.
O’Reilly’s Accusation Highlights a Growing Tension in AI Development
If OpenAI did, in fact, use proprietary O’Reilly content without permission, it could add fuel to the ongoing debate over data usage ethics in the AI industry. Tim O’Reilly, who both leads O’Reilly Media and co-founded the watchdog group behind the report, is in a unique position to challenge OpenAI’s approach — both as a publisher and a technologist.
While OpenAI declined to comment on the findings, the report’s implications are already echoing across the tech and publishing industries. As the legal landscape around AI training data continues to evolve, this case could become a pivotal example of the challenges AI companies face when relying on third-party content to power their models.