Showing posts with label Data Provenance Initiative. Show all posts

Thursday, July 25, 2024

Data Owners Are Increasingly Blocking AI Companies From Using Their IP; PetaPixel, July 22, 2024

MATT GROWCOOT, PetaPixel; Data Owners Are Increasingly Blocking AI Companies From Using Their IP

"Training data for generative AI models like Midjourney and ChatGPT is beginning to dry up, according to a new study.

The world of artificial intelligence moves fast. While court cases attempt to decide whether using copyrighted text, images, and video to train AI models is “fair use”, as tech companies argue, those same firms are already running out of new data to harvest. 

As generative AI has proliferated and become well-known, there has been a well-documented backlash and many have taken action by denying access to their online data — including photographers.

An MIT research group led the study which looked at 14,000 web domains that are included in three major AI training data sets. 

The study, published by the Data Provenance Initiative, discovered an “emerging crisis in consent” as online publishers pull up the drawbridge by not giving permission to AI crawlers. 

The researchers looked at the C4, RefinedWeb, and Dolma data sets and found that five percent of all the data is now restricted. But that number jumps to 25 percent when looking at the highest-quality sources. Generative AI needs a good caliber of data to produce good models."

Tuesday, July 23, 2024

The Data That Powers A.I. Is Disappearing Fast; The New York Times, July 19, 2024

Kevin Roose, The New York Times; The Data That Powers A.I. Is Disappearing Fast

"For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt."
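
For readers curious what a robots.txt restriction looks like in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt contents, the example.com URL, the "SomeOtherBot" name, and the helper name can_crawl are all illustrative assumptions, not taken from the study or from either article; GPTBot is used only as an example of an AI crawler's user-agent string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption for illustration): it asks one
# AI crawler to stay out of the whole site and leaves every other
# crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    for agent in ("GPTBot", "SomeOtherBot"):
        verdict = "allowed" if can_crawl(ROBOTS_TXT, agent, page) else "blocked"
        print(agent, verdict)
```

Note that robots.txt is a request rather than an enforcement mechanism: it only works when a crawler chooses to read the file and honor it.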