Showing posts with label AI training data sets.

Tuesday, July 23, 2024

The Data That Powers A.I. Is Disappearing Fast; The New York Times, July 19, 2024

Kevin Roose, The New York Times; The Data That Powers A.I. Is Disappearing Fast

"For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt."
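As a rough illustration of the robots.txt mechanism the excerpt describes, the sketch below uses Python's standard `urllib.robotparser` to check whether a given crawler may fetch a page. The robots.txt content, the `is_allowed` helper, the `SomeOtherBot` name, and the example.com URL are assumptions made up for this illustration; they are not taken from the study or the article. `GPTBot` stands in here only as an example of a crawler name a site might choose to block.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Note that the protocol is advisory: the file only states the site owner's preference, and it is up to each crawler to read and honor it.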

Sunday, November 5, 2023

Artists may “poison” AI models before Copyright Office can issue guidance; Ars Technica, November 3, 2023

Ars Technica; Artists may “poison” AI models before Copyright Office can issue guidance

"Rather than rely on opting out of future AI training data sets—or, as OpenAI recommends, blocking AI makers' web crawlers from accessing and scraping their sites in the future—artists are figuring out how to manipulate their images to block AI models from correctly interpreting their content."