US publishers tell Common Crawl to stop scraping and delete archive

Common Crawl website. Picture: Shutterstock/IB Photography

Digital news publishers in the US have raised “significant legal concerns” over the scraping of their content by Common Crawl Foundation.

Trade body Digital Content Next (DCN), which represents many major US publishers, has sent a cease and desist letter via its lawyer to the web archive creator.

They called on Common Crawl to immediately stop “scraping, retaining, or sharing copyrighted, paywalled, subscriber-only, or otherwise protected content from DCN member companies in its datasets”.

They also requested that publisher content already in the Common Crawl datasets be removed.

Since 2008 Common Crawl has scraped billions of pages on the internet each month to create a free archive for the public, and the archive is often cited in academic research.

The database has been widely used to train major AI models, proving controversial because it gave them access to swathes of publisher articles including, allegedly, paywalled content.

Its CCBot is now one of the AI scrapers most often blocked by news websites, many of which do not see the value exchange in allowing their content to be crawled.

Common Crawl accused of potentially ‘inaccurate or misleading’ statements to publishers

Common Crawl publishes a registry of all the website owners that have asked to opt out of being scraped, including major news publishers such as the BBC, The Guardian, the Financial Times, The Washington Post, News Corp, DMG Media, Advance Publications, Associated Press, Le Monde, Reuters and Hearst Newspapers. More than 900 news websites are included under an entry submitted by US trade association News/Media Alliance.

The DCN legal letter, seen by Press Gazette, shared concerns about whether Common Crawl is complying with opt-out instructions and whether it is removing content that had previously been scraped when instructed to do so.

“For example, DCN understands that Common Crawl has in some instances confirmed that it was complying with such instructions only to claim later, after significant delays, that the costs needed to address technical challenges prevented it from doing so,” the letter said.

DCN’s lawyers are looking at whether statements made by Common Crawl such as these “may have been inaccurate or misleading, thus potentially constituting legally actionable fraudulent or negligent misrepresentations”.

The copyright lawsuit filed by The New York Times against ChatGPT creator OpenAI at the end of 2023 cited Common Crawl as making up 60% of the training mix for the GPT-3 model. Common Crawl has since agreed to remove NYT content from its archives, and has confirmed a separate request from publishers represented by the Danish Rights Alliance. But The Atlantic reported in November that content from both was still available.

Common Crawl executive director Rich Skrenta denied “lying to publishers” following The Atlantic’s reporting, saying: “When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset.”

He added: “No one at Common Crawl has ever claimed this work was instantaneous or complete; rather, we have been open about its complexity and ongoing nature.”

Skrenta also denied that CCBot goes “behind paywalls” to scrape websites.

He declined to comment specifically in response to the DCN legal letter.

Common Crawl ‘flagrantly infringed’ copyrighted publisher content

The DCN letter claimed Common Crawl has “flagrantly infringed” copyrighted content by creating and distributing its datasets and by sharing them with AI companies knowing that they “are actively engaged in the reproduction of that protected content”.

The letter also argued that “copyright law is not an opt-out regime” so the system was working the wrong way round.

It said: “Common Crawl has undermined copyright owners’ right to control the use of their content by creating and distributing datasets that DCN understands to contain substantial volumes of original, protected content created by DCN members at significant cost.

“Such conduct would be legally problematic in and of itself. But Common Crawl has exacerbated this misappropriation by actively marketing its datasets ‘for free’ to for-profit entities for commercial purposes, such as developing AI tools or training AI large language models.

“In other words, Common Crawl is not only creating datasets containing digital content creators’ and owners’ original, protected content without permission or compensation, but is knowingly using its datasets to help for-profit AI companies develop competing or substitutive products and services.”

DCN chief executive Jason Kint said in a blog post that the legal notice “challenges a growing assumption that content created through substantial investment can be collected, stored, repurposed, and monetised simply because it is technically accessible”.

Skrenta told a Common Crawl forum on Monday that the body has been “contributing to the development of open standards for expressing content preferences and improving transparency across the AI ecosystem”. This includes taking part in a working group on standardising how website owners can signal whether they want to be scraped for AI model development.

Skrenta said: “As AI systems become more dependent on web-scale data, we continue to advocate for mechanisms that give publishers, creators, and communities greater visibility into how their content is used.”

But in November Skrenta told The Atlantic of publisher content: “You shouldn’t have put your content on the internet if you didn’t want it to be on the internet.”

Common Crawl, which was founded by US tech entrepreneur Gil Elbaz, is primarily funded by the Elbaz Family Foundation Trust but has also received donations from the likes of OpenAI and Anthropic.

A 2024 paper from the Mozilla Foundation made the case that Common Crawl was a crucial ingredient in the rise of generative AI models.

It said: “Generative AI in its current form would probably not be possible without Common Crawl, given that the vast majority of data used to train the original model behind OpenAI’s ChatGPT, the generative AI product that set off the current hype, came from it. The same is true for many models published since then.”

