Intellectual Property (IP), Artificial Intelligence (AI), Open Movements (OM) : data owners

Friday, December 27, 2024

The AI revolution is running out of data. What can researchers do?; Nature, December 11, 2024

Nicola Jones, Nature; The AI revolution is running out of data. What can researchers do?

"A prominent study1 made headlines this year by putting a number on this problem: researchers at Epoch AI, a virtual research institute, projected that, by around 2028, the typical size of data set used to train an AI model will reach the same size as the total estimated stock of public online text. In other words, AI is likely to run out of training data in about four years’ time (see ‘Running out of data’). At the same time, data owners — such as newspaper publishers — are starting to crack down on how their content can be used, tightening access even more. That’s causing a crisis in the size of the ‘data commons’, says Shayne Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge who leads the Data Provenance Initiative, a grass-roots organization that conducts audits of AI data sets...

Several lawsuits are now under way attempting to win compensation for the providers of data being used in AI training. In December 2023, The New York Times sued OpenAI and its partner Microsoft for copyright infringement; in April this year, eight newspapers owned by Alden Global Capital in New York City jointly filed a similar lawsuit. The counterargument is that an AI should be allowed to read and learn from online content in the same way as a person, and that this constitutes fair use of the material. OpenAI has said publicly that it thinks The New York Times lawsuit is “without merit”.

If courts uphold the idea that content providers deserve financial compensation, it will make it harder for both AI developers and researchers to get what they need — including academics, who don’t have deep pockets. “Academics will be most hit by these deals,” says Longpre. “There are many, very pro-social, pro-democratic benefits of having an open web,” he adds."

Thursday, July 25, 2024

Data Owners Are Increasingly Blocking AI Companies From Using Their IP; PetaPixel, July 22, 2024

MATT GROWCOOT, PetaPixel; Data Owners Are Increasingly Blocking AI Companies From Using Their IP

"Training data for generative AI models like Midjourney and ChatGPT is beginning to dry up, according to a new study.

The world of artificial intelligence moves fast. While court cases attempt to decide whether using copyrighted text, images, and video to train AI models is “fair use”, as tech companies argue, those same firms are already running out of new data to harvest.

As generative AI has proliferated and become well-known, there has been a well-documented backlash and many have taken action by denying access to their online data — including photographers.

An MIT research group led the study which looked at 14,000 web domains that are included in three major AI training data sets.

The study, published by the Data Provenance System, discovered an “emerging crisis in consent” as online publishers pull up the drawbridge by not giving permission to AI crawlers.

The researchers looked at the C4, RefineWeb, and Dolma data sets and found that five percent of all the data is now restricted. But that number jumps to 25 percent when looking at the highest-quality sources. Generative AI needs a good caliber of data to produce good models."