Intellectual Property (IP), Artificial Intelligence (AI), Open Movements (OM): AI training

Sunday, December 8, 2024

There’s No Longer Any Doubt That Hollywood Writing Is Powering AI; The Atlantic, November 18, 2024

Alex Reisner, The Atlantic; There’s No Longer Any Doubt That Hollywood Writing Is Powering AI

"Editor’s note: This analysis is part of The Atlantic’s investigation into the OpenSubtitles data set. You can access the search tool directly here. Find The Atlantic's search tool for books used to train AI here.

For as long as generative-AI chatbots have been on the internet, Hollywood writers have wondered if their work has been used to train them. The chatbots are remarkably fluent with movie references, and companies seem to be training them on all available sources. One screenwriter recently told me he’s seen generative AI reproduce close imitations of The Godfather and the 1980s TV show Alf, but he had no way to prove that a program had been trained on such material.

I can now say with absolute confidence that many AI systems have been trained on TV and film writers’ work. Not just on The Godfather and Alf, but on more than 53,000 other movies and 85,000 other TV episodes: Dialogue from all of it is included in an AI-training data set that has been used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, and other companies. I recently downloaded this data set, which I saw referenced in papers about the development of various large language models (or LLMs). It includes writing from every film nominated for Best Picture from 1950 to 2016, at least 616 episodes of The Simpsons, 170 episodes of Seinfeld, 45 episodes of Twin Peaks, and every episode of The Wire, The Sopranos, and Breaking Bad. It even includes prewritten “live” dialogue from Golden Globes and Academy Awards broadcasts."

Tuesday, November 5, 2024

The Heart of the Matter: Copyright, AI Training, and LLMs; SSRN, November 1, 2024

Daniel J. Gervais, Vanderbilt University - Law School

Noam Shemtov, Queen Mary University of London, Centre for Commercial Law Studies

Haralambos Marmanis, Copyright Clearance Center

Catherine Zaller Rowland, Copyright Clearance Center

SSRN; The Heart of the Matter: Copyright, AI Training, and LLMs

"Abstract

This article explores the intricate intersection of copyright law and large language models (LLMs), a cutting-edge artificial intelligence technology that has rapidly gained prominence. The authors provide a comprehensive analysis of the copyright implications arising from the training, fine-tuning, and use of LLMs, which often involve the ingestion of vast amounts of copyrighted material. The paper begins by elucidating the technical aspects of LLMs, including tokenization, word embeddings, and the various stages of LLM development. This technical foundation is crucial for understanding the subsequent legal analysis. The authors then delve into the copyright law aspects, examining potential infringement issues related to both inputs and outputs of LLMs. A comparative legal analysis is presented, focusing on the United States, European Union, United Kingdom, Japan, Singapore, and Switzerland. The article scrutinizes relevant copyright exceptions and limitations in these jurisdictions, including fair use in the US and text and data mining exceptions in the EU. The authors highlight the uncertainties and challenges in applying these legal concepts to LLMs, particularly in light of recent court decisions and legislative developments. The paper also addresses the potential impact of the EU's AI Act on copyright considerations, including its extraterritorial effects. Furthermore, it explores the concept of "making available" in the context of LLMs and its implications for copyright infringement. Recognizing the legal uncertainties and the need for a balanced approach that fosters both innovation and copyright protection, the authors propose licensing as a key solution. They advocate for a combination of direct and collective licensing models to provide a practical framework for the responsible use of copyrighted materials in AI systems.

This article offers valuable insights for legal scholars, policymakers, and industry professionals grappling with the copyright challenges posed by LLMs. It contributes to the ongoing dialogue on adapting copyright law to technological advancements while maintaining its fundamental purpose of incentivizing creativity and innovation."

Wednesday, June 7, 2023

Japan Declares AI Training Data Fair Game and ‘Will Not Enforce Copyright’; PetaPixel, June 5, 2023

MATT GROWCOOT, PetaPixel; Japan Declares AI Training Data Fair Game and ‘Will Not Enforce Copyright’

"In the first such declaration of its kind, Japan has seemingly asserted that it will not enforce copyrights when it comes to training generative artificial intelligence (AI) programs.

Japan’s minister of education, culture, sports, science, and technology recently said that it is possible to take content from any source and use it for “information analysis.”

According to a Japanese political website, Liberal Democrat minister Keiko Nagaoka clearly stated at a committee meeting that AI companies can use whatever data they want to train generative AI programs."

Monday, January 16, 2023

DeviantArt, Midjourney Face Lawsuit for Using 'Billions of Copyrighted' Images in AI Art; CBR, January 15, 2023

BRIAN CRONIN, CBR; DeviantArt, Midjourney Face Lawsuit for Using 'Billions of Copyrighted' Images in AI Art

"A lawsuit on behalf of a group of plaintiff artists has been filed in the United States District Court for the Northern District of California against three companies: Stability AI, DeviantArt, and Midjourney, over the alleged infringement of the copyright of the artists in the creation of so-called "artificial intelligence" art."