Showing posts with label AI training data sets.

Sunday, December 8, 2024

There’s No Longer Any Doubt That Hollywood Writing Is Powering AI; The Atlantic, November 18, 2024

Alex Reisner, The Atlantic; There’s No Longer Any Doubt That Hollywood Writing Is Powering AI

"Editor’s note: This analysis is part of The Atlantic’s investigation into the OpenSubtitles data set. You can access the search tool directly hereFind The Atlantic's search tool for books used to train AI here.

For as long as generative-AI chatbots have been on the internet, Hollywood writers have wondered if their work has been used to train them. The chatbots are remarkably fluent with movie references, and companies seem to be training them on all available sources. One screenwriter recently told me he’s seen generative AI reproduce close imitations of The Godfather and the 1980s TV show Alf, but he had no way to prove that a program had been trained on such material.

I can now say with absolute confidence that many AI systems have been trained on TV and film writers’ work. Not just on The Godfather and Alf, but on more than 53,000 other movies and 85,000 other TV episodes: Dialogue from all of it is included in an AI-training data set that has been used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, and other companies. I recently downloaded this data set, which I saw referenced in papers about the development of various large language models (or LLMs). It includes writing from every film nominated for Best Picture from 1950 to 2016, at least 616 episodes of The Simpsons, 170 episodes of Seinfeld, 45 episodes of Twin Peaks, and every episode of The Wire, The Sopranos, and Breaking Bad. It even includes prewritten “live” dialogue from Golden Globes and Academy Awards broadcasts."

Tuesday, July 23, 2024

The Data That Powers A.I. Is Disappearing Fast; The New York Times, July 19, 2024

Kevin Roose, The New York Times; The Data That Powers A.I. Is Disappearing Fast

"For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt."
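The Robots Exclusion Protocol mentioned in the article works through a plain-text robots.txt file that names crawlers and the paths they may or may not fetch. Below is a minimal sketch of how such rules are written and checked, using Python's standard-library parser; the specific rules and the choice of crawler names here are illustrative examples, not taken from the study.

```python
# Minimal sketch of the Robots Exclusion Protocol, using Python's
# standard-library parser. The robots.txt rules below are an illustrative
# example of a publisher opting out of AI crawling while still allowing
# a general-purpose search crawler.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks can_fetch() before requesting a page.
for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/articles/some-story")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that robots.txt is a voluntary convention: the file only signals the site owner's wishes, and enforcement depends on crawlers choosing to honor it.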

Sunday, November 5, 2023

Artists may “poison” AI models before Copyright Office can issue guidance; Ars Technica, November 3, 2023

Ars Technica; Artists may “poison” AI models before Copyright Office can issue guidance

"Rather than rely on opting out of future AI training data sets—or, as OpenAI recommends, blocking AI makers' web crawlers from accessing and scraping their sites in the future—artists are figuring out how to manipulate their images to block AI models from correctly interpreting their content."