Common Crawl supplies paywalled content to AI companies despite publisher objections

Nonprofit organization Common Crawl provides major AI companies with access to millions of paywalled news articles while claiming compliance with publisher removal requests, an investigation reveals.

A nonprofit organization has been systematically supplying paywalled news articles to major AI companies for training large language models, according to an investigation published November 4, 2025, by The Atlantic's Alex Reisner. Common Crawl maintains archives containing millions of articles from major news organizations that readers typically must pay to access, enabling AI developers including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon to train their models on premium journalism without compensation to publishers.

The organization operates by scraping billions of webpages to build a massive archive of internet content measured in petabytes. While Common Crawl states on its website that it scrapes "freely available content" without going behind "paywalls," its scraper circumvents paywall mechanisms used by news publishers. On many news websites, users can briefly see full article text before browser code executes to check subscription status. Common Crawl's scraper never executes that code, capturing complete articles instead.
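The mechanism can be illustrated with a short sketch. The example below is purely illustrative and assumes a hypothetical article URL; it shows that a plain HTTP fetch receives whatever HTML the publisher serves, and because no JavaScript ever runs, any client-side subscription check never fires.

```python
# Illustrative sketch only: hypothetical URL, not Common Crawl's actual crawler.
# A plain HTTP client receives the article HTML as served. Because nothing here
# executes JavaScript, any client-side code that would later check subscription
# status and hide the text never runs.
import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://news.example.com/2025/some-article",  # hypothetical article URL
    headers={"User-Agent": "example-research-bot/1.0"},
    timeout=30,
)
soup = BeautifulSoup(response.text, "html.parser")

# If the publisher ships the full article body in the initial HTML (common for
# search-engine indexing), it is readable here even though a browser would
# overlay a paywall after the subscription check runs.
article_text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
print(article_text[:500])
```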

This technical workaround has resulted in Common Crawl's archives containing millions of articles from The Economist, Los Angeles Times, Wall Street Journal, New York Times, New Yorker, Harper's, and The Atlantic, according to Reisner's research. The archives have appeared in training data of thousands of AI models, with former Mozilla researcher Stefan Baack stating that "generative AI in its current form would probably not be possible without Common Crawl."

Publisher removal requests ignored

Common Crawl's executive director Rich Skrenta has publicly argued that AI models should access anything on the internet. "The robots are people too," Skrenta told Reisner, suggesting they should be allowed to "read the books" for free. Multiple news publishers requested that Common Crawl remove their articles to prevent AI training use. Common Crawl claims it complies with these requests, but research shows otherwise.

The New York Times sent a notice to Common Crawl in July 2023 requesting removal of previously scraped Times content. In its lawsuit against OpenAI, the Times noted that Common Crawl includes "at least 16 million unique records of content" from Times websites. In November 2023, Times spokesperson Charlie Stadtlander told Business Insider that the organization "simply asked that our content be removed, and were pleased that Common Crawl complied."

Reisner's investigation found many Times articles still present in the archives. When informed of this finding, Stadtlander stated: "Our understanding from them is that they have deleted the majority of the Times's content, and continue to work on full removal."

The Danish Rights Alliance (DRA), representing publishers and rights-holders in Denmark, described a similar interaction. Thomas Heldrup, the organization's head of content protection and enforcement, showed Reisner a redacted email exchange beginning in July 2024 requesting member content removal. In December 2024, more than six months after the initial request, Common Crawl's attorney wrote: "I confirm that Common Crawl has initiated work to remove your members' content from the data archive. Presently, approximately 50% of this content has been removed."

Other publishers received similar messages from Common Crawl, with some told after multiple follow-up emails that removal was 50 percent, 70 percent, and then 80 percent complete. By examining the petabytes of data, Reisner found large quantities of articles from the Times, DRA members, and other publishers still present in Common Crawl's archives. The file storage system logs the modification time of every file, and none of the content files in Common Crawl's archives appears to have been modified since 2016, suggesting no content has been removed in at least nine years.
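One way to spot-check that claim, assuming the archives are served from Common Crawl's public S3 bucket (an assumption about tooling, not a detail Reisner confirms), is to list a few archive files and read their last-modified timestamps.

```python
# Sketch under the assumption that the archives sit in the public
# "commoncrawl" S3 bucket; the prefix below is an example crawl, and this is
# not necessarily how Reisner performed his analysis.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(
    Bucket="commoncrawl",                  # Common Crawl's public bucket
    Prefix="crawl-data/CC-MAIN-2016-30/",  # one of the older crawls
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["LastModified"].isoformat(), obj["Key"])
# If no listed file shows a LastModified date after 2016, those files have not
# been rewritten since then, which is consistent with the reported finding.
```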

Technical barriers to deletion

Skrenta initially told Reisner that removal requests are "a pain in the ass" but insisted the foundation complies. In a second conversation, Skrenta was more forthcoming, saying Common Crawl is "making an earnest effort" to remove content but that the file format storing its archives is meant "to be immutable. You can't delete anything from it." Skrenta did not answer questions about where the 50, 70, and 80 percent removal figures originate.

The nonprofit appears to be concealing this reality from website visitors. A search function on Common Crawl's website, the only nontechnical tool for viewing archive contents, returns misleading results for certain domains. A search for nytimes.com in any crawl from 2013 through 2022 shows a "no captures" result, despite articles from NYTimes.com existing in most of these crawls.

Reisner discovered more than 1,000 other domains producing incorrect "no captures" results for several crawls. Most belong to publishers including BBC, Reuters, New Yorker, Wired, Financial Times, Washington Post, and The Atlantic. According to Reisner's research and Common Crawl's own disclosures, companies behind each publication have sent legal requests to the nonprofit. At least one publisher told Reisner it used this search tool and concluded its content had been removed from Common Crawl's archives.
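Common Crawl also publishes a CDX index API for its crawls. A minimal query like the sketch below (the crawl ID is just an example) returns raw capture records for a domain, which researchers can compare against what the website's search tool reports.

```python
# Minimal query against Common Crawl's public CDX index API. The crawl ID is
# an example; any published crawl ID can be substituted. Comparing the raw
# index output with the website's search results is one way to check for
# discrepancies like the "no captures" behavior described above.
import requests

CRAWL_ID = "CC-MAIN-2018-05"  # example crawl identifier
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL_ID}-index",
    params={"url": "nytimes.com/*", "output": "json", "limit": "5"},
    timeout=60,
)
print(resp.status_code)
for line in resp.text.splitlines():
    print(line)
```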

Financial relationships with AI industry

Common Crawl received 15 years of near-exclusive financial support from the Elbaz Family Foundation Trust. In 2023, it received $250,000 donations from both OpenAI and Anthropic, along with contributions from other organizations involved in AI development. Skrenta told Reisner that running Common Crawl costs "millions of dollars."

When training AI models, developers such as OpenAI and Google typically filter Common Crawl's archives to remove unwanted material including racism, profanity, and low-quality prose. Each developer employs its own filtering strategy, leading to a proliferation of Common Crawl-based training datasets: C4 (created by Google), FineWeb, DCLM, and more than 50 others. These datasets have been downloaded tens of millions of times from Hugging Face and other sources.
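The filtering step is typically heuristic. The snippet below is an illustrative stand-in, not the actual C4 or FineWeb pipeline, showing the kind of rule-based pass these datasets apply: drop very short documents and documents containing blocked terms.

```python
# Illustrative stand-in for the kind of heuristic filtering applied when
# turning raw Common Crawl text into a training dataset; this is not the
# actual C4 or FineWeb pipeline.
BLOCKED_TERMS = {"spamword1", "spamword2"}  # placeholder for a real blocklist

def keep_document(text: str, min_words: int = 50) -> bool:
    """Keep documents that are long enough and contain no blocked terms."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    return not any(term in words for term in BLOCKED_TERMS)

documents = [
    "too short to keep",
    ("a longer, well-formed paragraph of ordinary prose " * 10).strip(),
]
kept = [doc for doc in documents if keep_document(doc)]
print(f"kept {len(kept)} of {len(documents)} documents")
```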

Common Crawl doesn't only supply raw text. The organization has been helping assemble and distribute AI training datasets itself. Its developers have co-authored multiple papers about large language model training data curation. They sometimes appear at conferences showing AI developers how to use Common Crawl for training. Common Crawl even hosts several AI training datasets derived from its crawls, including one for Nvidia. In its paper on the dataset, Nvidia thanks certain Common Crawl developers for their advice.

Industry response to web scraping

The marketing community faces mounting challenges from unauthorized AI training data collection. Over 35% of the world's top 1,000 websites now block OpenAI's GPTBot web crawler, a seven-fold increase from August 2023, when only 5% blocked the crawler. Common Crawl's CCBot has become the scraper most widely blocked by the top 1,000 websites in the past year, surpassing even OpenAI's GPTBot.
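Blocking of this kind is usually expressed in a site's robots.txt. A short check like the sketch below (placeholder domain) reads a site's robots.txt and reports whether the crawlers named above are disallowed.

```python
# Reads a site's robots.txt and reports whether common AI training crawlers
# are allowed to fetch the homepage. The domain is a placeholder; robots.txt
# rules are advisory and only affect future crawling, not existing archives.
from urllib import robotparser

SITE = "https://www.example.com"  # placeholder domain
rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

for bot in ("GPTBot", "CCBot", "ClaudeBot"):
    status = "allowed" if rp.can_fetch(bot, f"{SITE}/") else "blocked"
    print(f"{bot}: {status}")
```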

More than 80 media executives gathered in New York during the week of July 30, 2025, under the IAB Tech Lab banner to address what many consider an existential threat to digital publishing. Mediavine Chief Revenue Officer Amanda Martin joined representatives from Google, Meta, and numerous other industry leaders in confronting AI companies that scrape publisher content without consent or compensation. Notably absent from the gathering were OpenAI, Anthropic, and Perplexity.

Cloudflare research released August 29, 2025, revealed stark imbalances between how much content AI platforms crawl for training purposes versus traffic they refer back to publishers. Anthropic crawled 38,000 pages for every referred page visit in July 2025, while OpenAI maintained a ratio of 1,091 crawls per referral. Training-related crawling now drives nearly 80% of all AI bot activity, representing an increase from 72% documented one year earlier.

However, blocking only prevents future content from being scraped. It doesn't affect webpages Common Crawl has already collected and stored in its archives, which the foundation adds to every few weeks with crawls containing 1 billion to 4 billion webpages. The foundation has been publishing these regular installments since 2013.

In 2020, OpenAI used Common Crawl's archives to train GPT-3. OpenAI claimed the program could generate "news articles which human evaluators have difficulty distinguishing from articles written by humans." In 2022, an iteration on that model, GPT-3.5, became the basis for ChatGPT, initiating the generative AI boom. Many AI companies are now using publishers' articles to train models that summarize and paraphrase news, deploying those models in ways that reduce traffic to original sources.

A federal court dismissed Raw Story's lawsuit against OpenAI on November 10, 2024, citing lack of standing under the Digital Millennium Copyright Act. Raw Story Media and AlterNet Media collectively published over 400,000 articles allegedly scraped and included in OpenAI's training datasets WebText, WebText2, and Common Crawl. The court's decision suggests that simply having content included in AI training datasets, without specific instances of harmful use, may not be sufficient for legal standing.

IAB Europe released technical standards in September 2025 requiring AI platforms to compensate publishers for content ingestion. According to IAB Europe Data Analyst Dimitris Beis, the framework addresses "a paradigm of publisher remuneration for content ingestion" through three core mechanisms: content access controls, discovery protocols, and monetization APIs. The framework emerges from documented traffic disruptions, with referrals from AI platforms increasing 357% year-over-year in June 2025.

Cloudflare launched a pay-per-crawl service on July 2, 2025, allowing content creators to charge AI crawlers for access. The service affects major AI crawlers including CCBot (Common Crawl), ChatGPT-User (OpenAI), ClaudeBot (Anthropic), and GPTBot (OpenAI). Publishers control three distinct options for each crawler: allow free access, charge at configured domain-wide pricing, or block access entirely.

Fair use arguments and robot rights

AI companies have argued that using copyrighted material constitutes fair use. Skrenta has been framing the issue in terms of robot rights, sending a letter in 2023 urging the U.S. Copyright Office not "to hinder the development of intelligent machines." The letter included two illustrations of robots reading books.

When Reisner asked about publishers excluding themselves from what Skrenta called "Search 2.0," referring to generative AI products now widely used to find information online, Skrenta stated: "You shouldn't have put your content on the internet if you didn't want it to be on the internet."

Former Mozilla researcher Stefan Baack pointed out in his 2024 report that Common Crawl could require attribution whenever its scraped content is used. This would help publishers track the use of their work, including when it appears in the training data of AI models that are not supposed to have access to it. Attribution is a common requirement for open datasets and would cost Common Crawl nothing.

When asked if he had considered this suggestion, Skrenta told Reisner he had read Baack's report but didn't plan on implementing the recommendation because it wasn't Common Crawl's responsibility. "We can't police that whole thing," Skrenta said. "It's not our job. We're just a bunch of dusty bookshelves."

Implications for marketing professionals

The developments create significant implications for the marketing community. TikTok emerged as the most scraped website in 2025, jumping from outside the top 10 with 321% traffic growth, according to research released September 9, 2025, by web scraping company Decodo. Video and social media platforms now represent 38% of all scraping activity, reflecting demand for multimodal AI training data.

Meta's leaked scraping operations revealed systematic data collection from approximately 6 million unique websites, according to documents published by Drop Site News on August 6, 2025. The comprehensive operation encompasses roughly 100,000 of the internet's most-trafficked domains, demonstrating the scope of modern data collection efforts for AI training.

Publishers already struggle with identity challenges, with 84% unable to identify more than 25% of their website visitors according to Wunderkind research. Traditional reliance on search traffic for building audience relationships through newsletter subscriptions, social media follows, and direct website bookmarks becomes compromised as AI features provide answers without directing users to source websites.

Reddit filed a lawsuit against Anthropic on June 4, 2025, alleging the AI company violated contractual agreements and engaged in unfair business practices by using Reddit content without authorization to train its Claude chatbot. The 28-page complaint seeks damages and injunctive relief for what Reddit characterizes as "commercial exploitation" of user-generated content valued at tens of billions of dollars.

The case documents reveal that Anthropic researchers, including CEO Dario Amodei, acknowledged using "Reddit comments" as training data to improve AI model performance. Reddit alleges this unauthorized use violates explicit terms in its User Agreement prohibiting commercial exploitation without written consent.

Throughout his conversation with Reisner, Skrenta expressed little respect for how original reporting works. He downplayed the importance of any particular newspaper or magazine, telling Reisner that The Atlantic is not a crucial part of the internet. "Whatever you're saying, other people are saying too, on other sites," Skrenta said.

Skrenta did express tremendous reverence for Common Crawl's archive, viewing it as a record of civilization's achievements. He told Reisner he wants to "put it on a crystal cube and stick it on the moon" so that "if the Earth blows up," aliens might be able to reconstruct human history. "The Economist and The Atlantic will not be on that cube," Skrenta told Reisner. "Your article will not be on that cube. This article."

Summary

Who: Common Crawl, a nonprofit organization founded by Gil Elbaz and directed by Rich Skrenta, supplies archived web content to major AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon. Major news organizations including The New York Times, The Economist, Los Angeles Times, Wall Street Journal, The New Yorker, Harper's, The Atlantic, BBC, Reuters, Wired, Financial Times, and Washington Post have their paywalled content included in the archives.

What: Common Crawl maintains petabyte-scale archives containing millions of paywalled news articles that AI companies use to train large language models. The organization's scraper bypasses paywall mechanisms by never executing the browser code that checks subscription status. Despite claiming compliance with publisher removal requests, the investigation found no content files modified since 2016 and misleading search results concealing the presence of publisher content. The organization received $250,000 donations each from OpenAI and Anthropic in 2023 and helps assemble AI training datasets.

When: Common Crawl has been scraping webpages since the early 2010s and publishing regular crawls since 2013. The New York Times requested content removal in July 2023. The Danish Rights Alliance initiated removal requests in July 2024. Alex Reisner published his investigation on November 4, 2025. OpenAI used Common Crawl archives to train GPT-3 in 2020, which led to ChatGPT's launch in 2022. No content files in the archives appear modified since 2016.

Where: Common Crawl operates by scraping billions of webpages globally to build archives measured in petabytes. The archived content has appeared in training data of thousands of AI models. Common Crawl adds new crawls to its archive every few weeks, each containing 1 billion to 4 billion webpages. The organization hosts AI training datasets including one for Nvidia and makes archives freely available for download through platforms including Hugging Face.

Why: This matters for the marketing community because publishers face declining traffic and revenue as AI platforms consume content at unprecedented scale while providing minimal referrals. Training-related crawling now drives 79% of all AI bot activity, with Anthropic crawling 38,000 pages per referral, according to August 2025 Cloudflare data. Publishers also struggle with identity: 84% are unable to identify more than 25% of their website visitors. Traditional audience-building strategies built on search traffic become compromised as AI features provide answers without directing users to source websites. The developments represent an existential threat to digital publishing business models, with referrals from AI platforms increasing 357% year-over-year while publishers receive no compensation for content used in AI training and inference. More than 80 media executives convened in July 2025 to address these challenges, and IAB Tech Lab and IAB Europe have established technical standards and frameworks for AI publisher compensation.