Intellectual Property (IP), Artificial Intelligence (AI), Open Movements (OM): AI crawlers

Thursday, July 3, 2025

Cloudflare Sidesteps Copyright Issues, Blocking AI Scrapers By Default; Forbes, July 2, 2025

Emma Woollacott, Forbes; Cloudflare Sidesteps Copyright Issues, Blocking AI Scrapers By Default

"IT service management company Cloudflare is striking back on behalf of content creators, blocking AI scrapers by default.

Web scrapers are bots that crawl the internet, collecting and cataloguing content of all types, and are used by AI firms to collect material that can be used to train their models.

Now, though, Cloudflare is allowing website owners to choose if they want AI crawlers to access their content, and decide how the AI companies can use it. They can opt to allow crawlers for certain purposes—search, for example—but block others. AI companies will have to obtain explicit permission from a website before scraping."
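One long-standing way for a site to express per-crawler choices like those described above is a robots.txt file. The excerpt does not say how Cloudflare enforces its setting, so the following is only a minimal sketch of the general allow-some, block-others idea and not Cloudflare's mechanism. The crawler names ExampleSearchBot and ExampleAITrainingBot, the example.com URL, and the helper is_allowed are all hypothetical; robots.txt is also a voluntary convention that a crawler can ignore.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: admit a search crawler, refuse an AI-training crawler.
# Both user-agent names are invented for illustration.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleAITrainingBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("ExampleSearchBot", "ExampleAITrainingBot"):
        verdict = is_allowed(ROBOTS_TXT, bot, "https://example.com/article")
        print(f"{bot}: {'allowed' if verdict else 'blocked'}")
```

A well-behaved crawler would run a check like this before each request. A crawler that is not listed at all is treated as allowed by this parser, which is one reason a block-by-default policy has to be enforced somewhere other than robots.txt.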

Thursday, July 25, 2024

Data Owners Are Increasingly Blocking AI Companies From Using Their IP; PetaPixel, July 22, 2024

Matt Growcoot, PetaPixel; Data Owners Are Increasingly Blocking AI Companies From Using Their IP

"Training data for generative AI models like Midjourney and ChatGPT is beginning to dry up, according to a new study.

The world of artificial intelligence moves fast. While court cases attempt to decide whether using copyrighted text, images, and video to train AI models is “fair use”, as tech companies argue, those same firms are already running out of new data to harvest.

As generative AI has proliferated and become well-known, there has been a well-documented backlash and many have taken action by denying access to their online data — including photographers.

An MIT research group led the study which looked at 14,000 web domains that are included in three major AI training data sets.

The study, published by the Data Provenance Initiative, discovered an “emerging crisis in consent” as online publishers pull up the drawbridge by not giving permission to AI crawlers.

The researchers looked at the C4, RefinedWeb, and Dolma data sets and found that five percent of all the data is now restricted. But that number jumps to 25 percent when looking at the highest-quality sources. Generative AI needs a good caliber of data to produce good models."