Perplexity denies training AI models as Cloudflare documents stealth crawlers

Perplexity denied training models after Reddit's October 22 lawsuit, while Cloudflare documented the company using stealth crawlers to evade protections.

Luis Rijo

Oct 24, 2025 • 17 min read

Reddit mascot with trap illustrating forensic test post that caught Perplexity AI scraping content

Reddit filed a federal lawsuit on October 22, 2025, naming Perplexity AI and three data-scraping companies for allegedly circumventing technological controls to access platform content. The 41-page complaint filed in the United States District Court for the Southern District of New York targets SerpApi LLC, Oxylabs UAB, AWMProxy, and Perplexity AI, Inc., seeking damages and injunctive relief under the Digital Millennium Copyright Act.

Perplexity responded the same day, claiming it operates as an "application-layer company" that does not train AI models on content. "Perplexity, as an application-layer company, does not train AI models on content. Never has," the company stated in its response. The AI search company characterized Reddit's lawsuit as "a sad example of what happens when public data becomes a big part of a public company's business model."

Subscribe PPC Land newsletter ✉️ for similar stories like this one. Receive the news every day in your inbox. Free of ads. 10 USD per year.

The tension centers on contrasting technical documentation. According to the lawsuit, Reddit employed digital tracking techniques similar to marked bills used in theft investigations. The platform created a test post that could only be crawled by Google's search engine and was not accessible anywhere else on the internet. Within hours, queries to Perplexity's answer engine produced the contents of that test post.

"The only way that Perplexity could have obtained that Reddit content and then used it in its 'answer engine' is if it and/or its Co-Defendants scraped Google SERPs for that Reddit content," the complaint states. This forensic evidence represents Reddit's central method for documenting unauthorized access patterns.


Cloudflare documented systematic access patterns

Cloudflare's August 4, 2025 technical report presents findings that raise questions about Perplexity's characterization of its operations. The web infrastructure company observed what it described as "stealth crawling behavior" after receiving complaints from customers who had explicitly blocked Perplexity through robots.txt files and web application firewall rules.

The report documented specific technical behaviors. "Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website's preferences," Cloudflare reported. The infrastructure company documented Perplexity using a generic browser user agent intended to impersonate Google Chrome on macOS when the declared crawler faced blocking.
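To illustrate why a generic browser user agent is hard to block, the sketch below checks a User-Agent header for a declared crawler token. The token list and both sample strings are assumptions for illustration, not taken from Cloudflare's report, and real bot detection relies on far more than this header.

```python
from typing import Optional

# Assumed product tokens for a declared crawler (illustrative only).
DECLARED_TOKENS = ("PerplexityBot", "Perplexity-User")


def declared_crawler(user_agent: str, tokens=DECLARED_TOKENS) -> Optional[str]:
    """Return the declared crawler token found in the header, or None."""
    lowered = user_agent.lower()
    for token in tokens:
        if token.lower() in lowered:
            return token
    return None


# Hypothetical header a declared crawler might send.
DECLARED_UA = "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://example.com/bot)"
# A typical Chrome-on-macOS header, indistinguishable from a human visitor
# when the User-Agent string is the only signal inspected.
GENERIC_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

print(declared_crawler(DECLARED_UA))
print(declared_crawler(GENERIC_UA))
```

A rule keyed on the declared token matches the first header and misses the second, which is why header-based blocks alone cannot stop a crawler that changes its identity.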

Cloudflare's testing methodology involved creating multiple brand-new domains that had not been indexed by any search engine or made publicly accessible. These domains implemented robots.txt files with directives to stop all automated access. When researchers queried Perplexity AI with questions about these restricted domains, Perplexity provided detailed information regarding the exact content hosted on each domain.
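A disallow-all robots.txt of the kind described can be checked with Python's standard library. This is a minimal sketch: the domain and agent names are placeholders, and the parser only reports what a compliant crawler should do. Nothing in robots.txt technically prevents a request.

```python
from urllib.robotparser import RobotFileParser

# A policy that asks every automated client to stay away from every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and the agent names are placeholders for illustration.
for agent in ("PerplexityBot", "SomeOtherBot"):
    print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

Under this policy every agent is refused, so any content from such a domain that surfaces in an answer engine implies a fetch that ignored the directive or came through an intermediary.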

The report documented traffic volume showing 20-25 million daily requests from Perplexity's declared crawler and an additional 3-6 million daily requests from the undeclared crawler using a generic Chrome user agent. Both crawlers attempted to access content contrary to web crawling norms outlined in RFC 9309.

Cloudflare CEO Matthew Prince described Perplexity's practices as resembling "North Korean hackers" rather than a reputable AI company. The company de-listed Perplexity as a verified bot and added heuristics to managed rules that block stealth crawling behavior.


Reddit's forensic trap catches Perplexity

Reddit employed its own tracking methodology to examine whether Perplexity was accessing content through unauthorized means. The platform created what it described as a "test post" that could only be crawled by Google's search engine and was not otherwise accessible anywhere on the internet. Within hours, queries to Perplexity's answer engine produced the contents of that test post, according to the complaint.
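The "marked bills" idea resembles a honeytoken: content seeded with a unique string that can only have leaked through a known path. The complaint does not describe how Reddit built its test post, so the sketch below is a generic illustration with hypothetical function names.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Create a unique, hard-to-guess marker to embed in a test post."""
    return f"{prefix}-{secrets.token_hex(8)}"


def contains_canary(text: str, canary: str) -> bool:
    """Return True if the marker appears in text, ignoring case."""
    return canary.lower() in text.lower()


canary = make_canary()
test_post = f"Unremarkable test content. Reference code: {canary}"

# A hypothetical answer-engine response would be checked the same way.
print(contains_canary(test_post, canary))
print(contains_canary("An unrelated answer.", canary))
```

Because the marker is random, a match in a third party's output is strong evidence that the output derives from the seeded post rather than from coincidence.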

The timing underscores the dramatic shift in Perplexity's citation patterns. Reddit sent a cease-and-desist letter to Perplexity in May 2024, demanding the company stop scraping Reddit data. At that time, Perplexity told Reddit it was not using Reddit content to train AI models and would respect Reddit's robots.txt directives.

After Reddit sent its cease-and-desist letter, the volume of Reddit data cited by Perplexity increased forty-fold. The increase was so dramatic that outside observers hypothesized Perplexity had entered a licensing deal with Reddit. No such agreement exists, according to the complaint.

Perplexity claims it "respects robots.txt directives" and "only crawls content in compliance with robots.txt," according to the company's help documentation. Reddit argues the forensic test contradicts this characterization, because the content should have been inaccessible if Perplexity relied only on declared crawlers that respect robots.txt files.

Data scraping companies face trafficking allegations

Reddit's complaint describes systematic circumvention of technological controls by three data-scraping service providers. According to information obtained through a subpoena issued to Google, SerpApi, Oxylabs, and AWMProxy accessed nearly three billion Google search engine results pages containing Reddit content during a two-week period in July 2025.

SerpApi accessed 784 million Google SERPs with Reddit data from July 1 to 6, 2025, and 1.06 billion from July 7 to 13, 2025, according to the complaint. Oxylabs accessed 333 million and 448 million in those respective periods, while AWMProxy accessed 218 million and 264 million.

The complaint alleges these companies circumvent Google's SearchGuard system, which is designed to prevent automated systems from accessing search results while allowing individual users access. SearchGuard prevents unauthorized access by imposing barrier challenges that cannot be solved by automated systems unless they take affirmative actions to circumvent the system.

SerpApi explicitly advertises its ability to bypass these restrictions. The company's website features a page titled "How to Scrape Google Search Results" that notes its service "bypass[es]" restrictions. The company markets features called "Ludicrous Speed" and "Ludicrous Speed Max" that use server resources to automatically create numerous parallel requests to circumvent Google's technological controls.

SerpApi CEO Julien Khaleghy has described the company's circumvention methods as including a tool for "creating fake browsers using a multitude of IP addresses that Google sees as normal users," according to reporting by The Information in August 2025.

Oxylabs similarly promotes its circumvention capabilities. The company's website states that "Oxylabs' Google Search API is constructed to bypass the technical challenges Google has implemented." The company offers over 62,000 IP addresses located in New York alone and advertises proxies from various locations including Bronx, Brooklyn, Buffalo, Manhattan, and Queens.

AWMProxy, previously operated by a Russian entity in connection with a cybercriminal botnet shut down in 2021, has apparently resumed operation. The company sells access to proxy services that conceal location and identity, according to the complaint.

Perplexity's business relationship with SerpApi

Perplexity publicly lists itself as a customer of SerpApi on the scraping company's website. The complaint notes that SerpApi's website features a page titled "They trust us" that includes Perplexity's logo alongside other major companies including Nvidia, Meta, Shopify, KPMG, Morgan Stanley, and the United Nations.

The relationship matters because it establishes a direct connection between Perplexity and a company that explicitly advertises its ability to circumvent technological controls. SerpApi markets its tool as providing a way to scrape Google web searches "at scale" for use in training large language models or other AI models. The company says it can run more than 100,000 searches per hour.

Perplexity acknowledged Reddit's importance as a data source in an August 2025 blog post, stating that "Reddit has emerged as the most cited domain across AI models globally" based on analysis of ChatGPT, Google AI Overviews, and Perplexity citations.

The complaint describes Perplexity's business model as utilizing content from original sources such as Reddit and placing that data into a massive retrieval-augmented generation database, which Perplexity then combines with another company's large language model to repackage original source data into answers to user queries.

Technical measures protecting Reddit content

Reddit has implemented automated anti-scraping systems and teams dedicated to detecting and protecting against unauthorized access. The platform has made significant investments in various anti-scraping tooling and systems that detect and prevent unauthorized access by scrapers and bots.

Reddit employs industry-standard automation-prevention techniques, including registered user-identification limits, IP-rate limits, captcha bot protection, and anomaly-detection tools. The platform's current robots.txt file provides clear notice that it prohibits all unapproved web crawlers from accessing any part of Reddit's website.
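As an illustration of one technique on that list, the sketch below implements a per-IP sliding-window rate limit. It is a generic example, not Reddit's implementation; the class name, limits, and addresses are assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request times

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10.0)
# 192.0.2.1 is a documentation-only address used as a placeholder.
print([limiter.allow("192.0.2.1", t) for t in (0.0, 1.0, 2.0)])
```

A limit keyed on the source address counts each address separately, which is why scrapers that rotate through large proxy pools can stay under per-IP thresholds.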

Reddit's User Agreement, which people must accept before accessing the platform, prohibits "scraping the Services without Reddit's prior written consent." Reddit prohibits bulk automated access except by separate agreement through defined application programming interfaces.

Google likewise implements technological control measures to prevent unauthorized access to its products and services. Google operates a search engine that responds with search engine results pages to human user queries, and it prohibits unauthorized automated access to those pages.

Google's terms of service prohibit "using automated means to access content from any of our services in violation of the machine-readable instructions on our web pages (for example, robots.txt files that disallow crawling, training, or other activities)." Google utilizes SearchGuard, which is designed to prevent automated systems from accessing and obtaining wholesale search results while allowing individual users access to search results.

Reddit's licensing agreements and business model

Reddit has entered into partnership agreements with a select list of AI companies willing to abide by policies that protect intellectual property, privacy, and other rights of Reddit and Reddit users. On February 22, 2024, Reddit and Google announced a partnership that enabled programmatic access by Google to Reddit content for use in Google's products and services.

The complaint states that Reddit depends on contributions of its many communities who care about how their content will be treated. "When they post their copyrighted content to Reddit, Redditors place great trust in Reddit; they expect Reddit to respect their wishes and take its role as stewards of the content seriously," according to the complaint.

Reddit's business is damaged by unauthorized access because defendants deprive Reddit of the ability to control what entities have access to its data, how that data is used, and whether Reddit data is handled in a manner consistent with its Privacy Policy, Public Content Policy, and User Agreement. Reddit has lost licensing fees or other commercial opportunities it would obtain from arrangements with companies that might instead use similar techniques as defendants to gain unauthorized access.

Reddit is forced to invest significant resources into hardware, software, and personnel to improve its technical security systems, surveillance, and anti-scraping efforts to prevent unauthorized accessing and use of Reddit data.

Perplexity's response characterizes lawsuit as negotiating tactic

Perplexity's response characterized the lawsuit as a negotiating tactic related to Reddit's training data partnerships with Google and OpenAI. "So, why sue Perplexity? Our guess: it's about a show of force in Reddit's training data negotiations with Google and OpenAI. (Perplexity doesn't train foundation models!)" the company stated.

The response addressed Reddit's licensing discussions. "Reddit told the press we ignored them when they asked about licensing. Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content. Never has. So it is impossible for us to sign a license agreement to do so," Perplexity stated.

Perplexity described Reddit's business approach as strong-arm tactics. "A year ago, after explaining this, Reddit insisted we pay anyway, despite lawfully accessing Reddit data. Bowing to strong arm tactics just isn't how we do business," according to the response.

The company emphasized its citation practices. "What does Perplexity actually do with Reddit content? We summarize Reddit discussions, and we cite Reddit threads in answers, just like people share links to posts here all the time," Perplexity stated. The company said it invented citations in AI so users can verify the accuracy of AI-generated answers and follow citations to learn more.

Perplexity characterized Reddit's position as contrary to an open internet. "Reddit changed its mind this week on whether they want Perplexity users to find your public content on their journeys of learning. Reddit thinks that's their right. But it is the opposite of an open internet," the response stated.

Legal framework under Digital Millennium Copyright Act

The complaint brings six counts against the defendants. Count I alleges all four defendants violated 17 U.S.C. § 1201(a)(1)(A) by circumventing technological measures that effectively control access to copyrighted works. This section prohibits circumventing a technological measure that effectively controls access to a copyrighted work.

Counts II and III target SerpApi and Oxylabs specifically for trafficking in technology designed to circumvent technological measures under 17 U.S.C. § 1201(a)(2) and § 1201(b). These sections prohibit manufacturing, importing, offering to the public, providing, or trafficking in any technology that is primarily designed to circumvent technological measures.

Count IV alleges unfair competition against all defendants. The complaint states defendants have misappropriated Reddit's labor, skill, expenditures, and goodwill, and have displayed bad faith in doing so. By circumventing technological control measures to access and scrape Reddit data on an unauthorized and automated basis, defendants have brazenly disregarded technological control measures in which Reddit and Google have invested considerable resources.

Count V alleges unjust enrichment against all defendants. The complaint states defendants have been unjustly enriched at Reddit's expense, and equity and good conscience militate against defendants retaining the benefits they have obtained. Through circumventing technological control measures, defendants have gained access to and scraped Reddit data on an unauthorized and automated basis.

Count VI alleges civil conspiracy against SerpApi and Perplexity AI. The complaint alleges these defendants entered into one or more contracts or business agreements for the purpose of circumventing technological control measures to gain unauthorized access to Reddit data. Both companies committed overt acts in furtherance of their agreement, including selling or providing circumvention services, paying for such services, and conducting the circumvention.

Marketing community implications

This case matters for the marketing community because it highlights the complexity around how AI companies describe their operations publicly versus their technical implementations. Marketing professionals increasingly rely on AI-powered tools for competitive intelligence, content research, and market analysis. The legal and technical standards established in cases like this will determine what data access methods remain viable.

The developments occur alongside fundamental shifts in search behavior patterns. Zero-click searches increased from 56 percent to 69 percent since AI Overviews launched, while ChatGPT referrals grew 25x over recent periods. These changes directly affect how marketing professionals access information and how platforms monetize content through licensing agreements versus other access arrangements.

Over 35 percent of top websites now block AI crawlers, representing a seven-fold increase from August 2023. The trajectory indicates growing resistance to uncompensated content usage as publishers seek sustainable business models in an environment where AI systems extract value without proportional traffic returns.

The case presents questions about technical practices in the AI industry. For marketing professionals evaluating which tools and platforms to integrate into workflows, the technical documentation from companies like Cloudflare and the legal claims from platforms like Reddit provide perspectives on how different AI companies operate beyond their public statements.

Crawl-to-refer ratios exposed fundamental economic tensions between content creators and AI companies. Perplexity's ratio climbed from 54 crawls per referral in January to 195 by July 2025, indicating heavier data collection without proportional traffic returns. Anthropic maintained ratios as high as 286,930 crawls per referral in January before improving to 38,065 by July.
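The ratio is simply crawl requests divided by referral visits. The sketch below uses only the figures quoted above to show how much each ratio changed between January and July 2025. The function names are illustrative.

```python
def crawl_to_refer_ratio(crawls: int, referrals: int) -> float:
    """Crawl requests made per referral visit sent back to publishers."""
    if referrals <= 0:
        raise ValueError("referrals must be positive")
    return crawls / referrals


def fold_change(before: float, after: float) -> float:
    """How many times larger (or smaller) the later ratio is."""
    return after / before


# Ratios quoted in the article: (January, July 2025).
perplexity = (54, 195)
anthropic = (286_930, 38_065)

print(round(fold_change(*perplexity), 2))  # about 3.61, a worsening ratio
print(round(fold_change(*anthropic), 2))   # about 0.13, an improving ratio
```

On these figures Perplexity's ratio rose roughly 3.6-fold, while Anthropic's fell to about 13 percent of its January level yet remained far higher in absolute terms.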

The lawsuit emerges as legal battles intensify across the AI industry over training data rights. Reddit filed a previous complaint against Anthropic on June 4, 2025, for breach of contract, unjust enrichment, trespass to chattels, and unfair competition. Multiple content platforms have implemented restrictions on AI crawler access, reflecting broader tensions between content creators and AI companies over data usage and compensation.

Reddit's relationship with AI companies

Reddit has positioned itself as a valuable data source for AI companies while establishing licensing requirements. The platform describes itself as one of the largest repositories of human conversation in existence, with over 100 million unique users engaging in discussions daily across hundreds of thousands of interest-based communities.

Reddit's vast corpus of human-generated content is widely seen as invaluable to AI companies because Reddit data provides real-time access to evolving and dynamic topics. "Reddit data and information constantly grows and regenerates as users come and interact with their communities and each other in genuine and authentic ways," according to the complaint.

The platform has become a "top-cited source" of data for AI companies. One AI executive characterized Reddit's body of data as "like manna from heaven," according to reporting by The Wall Street Journal in December 2024.

Reddit granted Google exclusive search access through a multimillion-dollar deal that effectively makes Google the only search engine capable of indexing and displaying recent Reddit content. The arrangement carries significant implications for internet search and AI development.

Reddit published a Public Content Policy describing how it protects and licenses public user content. This policy explains what happens when users create, submit, and make content publicly available on the Reddit platform, and how and why Reddit protects and licenses public content and related data. Reddit published this policy in response to "more and more entities using unauthorized access (for example, by scraping or using data brokers) or misusing authorized access to collect public data in bulk, especially with the rise of use cases like generative AI."

Industry context and enforcement mechanisms

The case reflects broader industry developments around AI data access and content creator control. Cloudflare introduced tools to block AI scrapers in June 2024, giving publishers a way to protect content from being used to train large language models without permission.

Cloudflare expanded its enforcement capabilities with Robotcop in December 2024, a network-level enforcement system for robots.txt policies. The system integrates with Cloudflare's AI Audit dashboard to provide both visibility and enforcement capabilities, transforming robots.txt rules into Web Application Firewall rules that can be deployed across Cloudflare's network.
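The general idea of turning robots.txt directives into enforceable rules can be sketched as follows. This is a simplified illustration, not Cloudflare's implementation: it handles only User-agent and Disallow lines, ignores Allow precedence and wildcards, and renders rules in a made-up, human-readable syntax.

```python
def robots_to_block_rules(robots_txt: str) -> list[tuple[str, str]]:
    """Extract (user_agent, path_prefix) pairs from Disallow directives."""
    rules: list[tuple[str, str]] = []
    agents: list[str] = []
    in_rules = False  # True once a group's rule lines have started
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a new group begins
                agents, in_rules = [], False
            agents.append(value)
        elif field == "disallow":
            in_rules = True
            if value:  # an empty Disallow blocks nothing
                rules.extend((agent, value) for agent in agents)
        elif field == "allow":
            in_rules = True
    return rules


def render_rule(agent: str, path: str) -> str:
    """Render one rule in a hypothetical, human-readable rule syntax."""
    if agent == "*":
        return f'block if path starts with "{path}"'
    return f'block if user agent contains "{agent}" and path starts with "{path}"'


SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

for agent, path in robots_to_block_rules(SAMPLE):
    print(render_rule(agent, path))
```

The difference from robots.txt itself is where the decision is made: a firewall rule is evaluated by the server's network edge on every request, instead of depending on the crawler to read and honor the file.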

Cloudflare introduced AI Crawl Control expansion in August 2025, enabling customizable HTTP 402 "Payment Required" responses for content creators seeking AI crawler monetization. The technical framework leverages HTTP 402 status codes to signal payment requirements. When AI crawlers request content, they receive either successful access or a 402 Payment Required response containing pricing information and contact details.
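A minimal sketch of that decision logic appears below. The crawler tokens, header names, price, and contact address are all hypothetical placeholders and do not represent Cloudflare's actual protocol. Only the 402 status code itself is standard HTTP.

```python
from http import HTTPStatus

# Illustrative crawler tokens; a real deployment would use verified bot data.
AI_CRAWLER_TOKENS = ("GPTBot", "PerplexityBot")


def respond(user_agent: str, has_paid: bool) -> tuple[int, dict[str, str]]:
    """Return (status_code, headers) for a request to protected content."""
    lowered = user_agent.lower()
    is_ai_crawler = any(token.lower() in lowered for token in AI_CRAWLER_TOKENS)
    if is_ai_crawler and not has_paid:
        return HTTPStatus.PAYMENT_REQUIRED.value, {
            "Content-Type": "application/json",
            # Hypothetical headers carrying pricing and contact details.
            "X-Example-Price-USD": "0.01",
            "X-Example-Contact": "licensing@example.com",
        }
    return HTTPStatus.OK.value, {"Content-Type": "text/html"}


print(respond("GPTBot/1.0", has_paid=False)[0])
print(respond("GPTBot/1.0", has_paid=True)[0])
print(respond("Mozilla/5.0", has_paid=False)[0])
```

Human visitors and paying crawlers receive the content as usual, while an unpaid crawler gets a machine-readable signal about terms instead of a flat refusal.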

The enforcement mechanisms address fundamental changes in content monetization. Traditional search engines provide referral traffic in exchange for crawling access, but AI crawlers extract substantial content volumes while providing minimal referral traffic to publishers, creating unsustainable economic relationships for content creators dependent on advertising revenue.

Relief sought and legal precedents

Reddit seeks injunctive relief enjoining defendants from accessing or using Reddit's website, servers, systems, and any data contained therein for the purpose of unlawful data scraping. The complaint requests orders preventing defendants from accessing or using Google's website, servers, systems, and any data contained therein for the purpose of unlawful scraping of Reddit data.

The complaint seeks orders preventing defendants from developing or distributing any technology or product used for unauthorized circumvention of technological control measures and scraping of Reddit data. Reddit requests orders preventing defendants from using any Reddit data previously obtained through circumvention of technological control measures, and from selling or offering for sale any Reddit data previously obtained through circumvention.

Reddit seeks actual, statutory, or other compensatory damages as determined by a jury. The complaint requests disgorgement of defendants' ill-gotten gains from unauthorized commercial exploitation of Reddit's data. Reddit seeks pre-judgment and post-judgment interest, costs of this action including attorneys' fees, and such other legal and equitable relief as the court deems just and proper.

The legal framework builds on precedents around web scraping and data access. In hiQ Labs v. LinkedIn, the Ninth Circuit held in 2019 that scraping publicly accessible data likely did not violate the Computer Fraud and Abuse Act. However, the Digital Millennium Copyright Act provides specific protections for technological measures that effectively control access to copyrighted works.

Congress enacted the DMCA to prevent exactly what defendants are doing: circumventing or bypassing technological measures that effectively control access to copyrighted works. Each of the defendants is profiting by evading technological control measures to access Reddit data it knows it does not have permission to access or use, according to the complaint.


Timeline

August 2023: Only 5 percent of top websites blocked GPTBot web crawler from OpenAI
February 22, 2024: Reddit and Google announced partnership enabling programmatic access to Reddit content
May 2024: Reddit sent cease-and-desist letter to Perplexity demanding it stop scraping Reddit data
June 2024: Perplexity faced accusations of scraping websites that had explicitly indicated they did not want to be crawled
June 29, 2024: Cloudflare introduced feature to block AI scrapers
July 3, 2024: Cloudflare revealed extensive AI bot activity across the internet
July 24, 2024: Reddit's exclusive search deal with Google raised concerns
August 3, 2024: Over 35 percent of top websites blocked AI crawlers
September 23, 2024: Cloudflare launched AI Audit tools to give website owners control over AI scraping
December 10, 2024: Cloudflare introduced Robotcop network-level enforcement system for robots.txt policies
June 4, 2025: Reddit filed complaint against Anthropic for breach of contract, unjust enrichment, trespass to chattels, and unfair competition
July 1-13, 2025: SerpApi, Oxylabs, and AWMProxy accessed nearly 3 billion Google SERPs containing Reddit content during two-week period
August 4, 2025: Cloudflare documented Perplexity using stealth crawlers to evade website no-crawl directives
August 28, 2025: Cloudflare announced AI Crawl Control expansion with HTTP 402 payment protocol
September 1, 2025: Cloudflare released crawl-to-refer analysis data
October 22, 2025: Reddit filed 41-page federal lawsuit against SerpApi, Oxylabs, AWMProxy, and Perplexity AI in United States District Court for the Southern District of New York


Summary

Who: Reddit, Inc. filed a lawsuit against SerpApi LLC, Oxylabs UAB, AWMProxy, and Perplexity AI, Inc. Reddit is a social media platform with over 100 million daily active users and hundreds of thousands of interest-based communities. SerpApi is a Texas limited liability company providing web-scraping tools. Oxylabs is a Lithuanian company offering proxy services for web-scraping. AWMProxy is a web domain selling proxy services, previously operated by a Russian entity in connection with a cybercriminal botnet shut down in 2021. Perplexity AI is a Delaware corporation operating an AI-powered answer engine.

What: Reddit filed a 41-page complaint alleging defendants circumvented technological control measures to access and scrape Reddit content without authorization. The complaint alleges violations of the Digital Millennium Copyright Act, unfair competition, unjust enrichment, and civil conspiracy. Reddit created a forensic test post that could only be crawled by Google's search engine. Within hours, Perplexity's answer engine produced the contents of that test post, demonstrating unauthorized access through circumvention of technological controls. According to information obtained through a subpoena issued to Google, the three data-scraping companies accessed nearly three billion Google search engine results pages containing Reddit content during a two-week period in July 2025. Perplexity responded by stating it operates as an application-layer company that does not train AI models on content and characterized the lawsuit as a negotiating tactic in Reddit's training data partnerships.

When: Reddit filed the lawsuit on October 22, 2025, in the United States District Court for the Southern District of New York. The complaint documents defendant activities including a two-week period in July 2025 when data-scraping companies accessed nearly three billion Google SERPs containing Reddit content. Reddit sent a cease-and-desist letter to Perplexity in May 2024. After Reddit sent its cease-and-desist letter, the volume of Reddit data cited by Perplexity increased forty-fold. Cloudflare published its technical report documenting Perplexity's stealth crawling behavior on August 4, 2025.

Where: The lawsuit was filed in the United States District Court for the Southern District of New York. The complaint alleges defendants unlawfully accessed and obtained Reddit data on computers located in New York and by using proxy servers located in New York. SerpApi has its principal place of business in Austin, Texas. Oxylabs has its principal place of business in Vilnius, Lithuania. Perplexity has its principal place of business in San Francisco, California, and maintains an office at 299 Park Ave., New York, New York. The complaint names Reddit's principal place of business at 303 2nd Street, South Tower, 5th Floor, San Francisco, California, with an office at One World Trade Center in New York, New York.

Why: This matters for the marketing community because it highlights questions about how AI companies describe their operations publicly versus their technical implementations. Perplexity's emphasis on not training foundation models addresses one aspect of the dispute, while the lawsuit centers on unauthorized access and circumvention of technological controls. The technical documentation from Cloudflare describes crawling behavior that appears at odds with Perplexity's characterization of its operations as simple content summarization. Marketing professionals increasingly rely on AI-powered tools for competitive intelligence and content research. The legal and technical standards established in this case will determine what data access methods remain viable. The case presents perspectives on technical practices in the AI industry. For marketing professionals evaluating which tools to integrate into workflows, the technical documentation from infrastructure companies and legal claims from platforms provide additional context beyond public statements. The case could establish important precedents for AI data access rights as zero-click searches increased from 56 percent to 69 percent, fundamentally shifting content discovery patterns and monetization strategies.