Intellectual Property (IP), Artificial Intelligence (AI), Open Movements (OM) : web archiving

Thursday, February 1, 2024

The economy and ethics of AI training data; Marketplace.org, January 31, 2024

Matt Levin, Marketplace.org; The economy and ethics of AI training data

"Maybe the only industry hotter than artificial intelligence right now? AI litigation.

Just a sampling: Writer Michael Chabon is suing Meta. Getty Images is suing Stability AI. And both The New York Times and The Authors Guild have filed separate lawsuits against OpenAI and Microsoft.

At the heart of these cases is the allegation that tech companies illegally used copyrighted works as part of their AI training data.

For text-focused generative AI, there’s a good chance that some of that training data originated from one massive archive: Common Crawl.

“Common Crawl is the copy of the internet. It’s a 17-year archive of the internet. We make this freely available to researchers, academics and companies,” said Rich Skrenta, who heads the nonprofit Common Crawl Foundation."