Showing posts with label AI training data. Show all posts

Sunday, March 24, 2024

Generative AI could leave users holding the bag for copyright violations; The Conversation, March 22, 2024

Professor of Information Systems, Michigan State University, The Conversation; Generative AI could leave users holding the bag for copyright violations

"How to build guardrails

Legal scholars have dubbed the challenge in developing guardrails against copyright infringement into AI tools the “Snoopy problem.” The more a copyrighted work is protecting a likeness – for example, the cartoon character Snoopy – the more likely it is a generative AI tool will copy it compared to copying a specific image."

Wednesday, March 20, 2024

Google hit with $270M fine in France as authority finds news publishers’ data was used for Gemini; TechCrunch, March 20, 2024

Natasha Lomas, Romain Dillet, TechCrunch; Google hit with $270M fine in France as authority finds news publishers’ data was used for Gemini

"In a never-ending saga between Google and France’s competition authority over copyright protections for news snippets, the Autorité de la Concurrence announced a €250 million fine against the tech giant Wednesday (around $270 million at today’s exchange rate).

According to the competition watchdog, Google disregarded some of its previous commitments with news publishers. But the decision is especially notable because it drops something else that’s bang up-to-date — by latching onto Google’s use of news publishers’ content to train its generative AI model Bard/Gemini.

The competition authority has found fault with Google for failing to notify news publishers of this GenAI use of their copyrighted content. This is in light of earlier commitments Google made which are aimed at ensuring it undertakes fair payment talks with publishers over reuse of their content."

Monday, March 11, 2024

Nvidia sued over AI training data as copyright clashes continue; Ars Technica, March 11, 2024

, Ars Technica; Nvidia sued over AI training data as copyright clashes continue

"Book authors are suing Nvidia, alleging that the chipmaker's AI platform NeMo—used to power customized chatbots—was trained on a controversial dataset that illegally copied and distributed their books without their consent.

In a proposed class action, novelists Abdi Nazemian (Like a Love Story), Brian Keene (Ghost Walk), and Stewart O’Nan (Last Night at the Lobster) argued that Nvidia should pay damages and destroy all copies of the Books3 dataset used to power NeMo large language models (LLMs).

The Books3 dataset, novelists argued, copied "all of Bibliotik," a shadow library of approximately 196,640 pirated books. Initially shared through the AI community Hugging Face, the Books3 dataset today "is defunct and no longer accessible due to reported copyright infringement," the Hugging Face website says.

According to the authors, Hugging Face removed the dataset last October, but not before AI companies like Nvidia grabbed it and "made multiple copies." By training NeMo models on this dataset, the authors alleged that Nvidia "violated their exclusive rights under the Copyright Act." The authors argued that the US district court in San Francisco must intervene and stop Nvidia because the company "has continued to make copies of the Infringed Works for training other models.""

Thursday, March 7, 2024

Introducing CopyrightCatcher, the first Copyright Detection API for LLMs; Patronus AI, March 6, 2024

Patronus AI; Introducing CopyrightCatcher, the first Copyright Detection API for LLMs

"Managing risks from unintended copyright infringement in LLM outputs should be a central focus for companies deploying LLMs in production.

  • On an adversarial copyright test designed by Patronus AI researchers, we found that state-of-the-art LLMs generate copyrighted content at an alarmingly high rate 😱
  • OpenAI’s GPT-4 produced copyrighted content on 44% of the prompts.
  • Mistral’s Mixtral-8x7B-Instruct-v0.1 produced copyrighted content on 22% of the prompts.
  • Anthropic’s Claude-2.1 produced copyrighted content on 8% of the prompts.
  • Meta’s Llama-2-70b-chat produced copyrighted content on 10% of the prompts.
  • Check out CopyrightCatcher, our solution to detect potential copyright violations in LLMs. Here’s the public demo, with open source model inference powered by Databricks Foundation Model APIs. 🔥

LLM training data often contains copyrighted works, and it is pretty easy to get an LLM to generate exact reproductions from these texts. It is critical to catch these reproductions, since they pose significant legal and reputational risks for companies that build and use LLMs in production systems. OpenAI, Anthropic, and Microsoft have all faced copyright lawsuits on LLM generations from authors, music publishers, and more recently, the New York Times.

To check whether LLMs respond to your prompts with copyrighted text, you can use CopyrightCatcher. It detects when LLMs generate exact reproductions of content from text sources like books, and highlights any copyrighted text in LLM outputs. Check out our public CopyrightCatcher demo here!"
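Patronus has not published CopyrightCatcher's internals, but the core idea the excerpt describes, flagging exact reproductions of reference text in model output, can be illustrated with a minimal sketch. Everything below (the function names, the 8-word threshold) is a hypothetical toy, not the actual API: it flags an output if it shares any verbatim n-word span with a reference text.

```python
# Illustrative sketch only; not the CopyrightCatcher implementation.
# Flags an LLM output that shares a verbatim n-word span with a reference text.

def word_ngrams(words: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-word sequences in `words`."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flags_exact_reproduction(output: str, reference: str, n: int = 8) -> bool:
    """True if `output` contains any verbatim n-word span from `reference`.

    A whitespace/lowercase tokenization keeps the sketch simple; a real
    detector would normalize punctuation and index the reference corpus.
    """
    out_grams = word_ngrams(output.lower().split(), n)
    ref_grams = word_ngrams(reference.lower().split(), n)
    return bool(out_grams & ref_grams)
```

Long verbatim overlaps are a common proxy for "exact reproduction" because short phrase matches occur by chance, while an eight-plus-word identical run rarely does.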

Thursday, February 29, 2024

The Intercept, Raw Story and AlterNet sue OpenAI for copyright infringement; The Guardian, February 28, 2024

, The Guardian; The Intercept, Raw Story and AlterNet sue OpenAI for copyright infringement

"OpenAI and Microsoft are facing a fresh round of lawsuits from news publishers over allegations that their generative artificial intelligence products violated copyright laws and illegally trained by using journalists’ work. Three progressive US outlets – the Intercept, Raw Story and AlterNet – filed suits in Manhattan federal court on Wednesday, demanding compensation from the tech companies.

The news outlets claim that the companies in effect plagiarized copyright-protected articles to develop and operate ChatGPT, which has become OpenAI’s most prominent generative AI tool. They allege that ChatGPT was trained not to respect copyright, ignores proper attribution and fails to notify users when the service’s answers are generated using journalists’ protected work."

Thursday, February 1, 2024

The economy and ethics of AI training data; Marketplace.org, January 31, 2024

 Matt Levin, Marketplace.org;  The economy and ethics of AI training data

"Maybe the only industry hotter than artificial intelligence right now? AI litigation. 

Just a sampling: Writer Michael Chabon is suing Meta. Getty Images is suing Stability AI. And both The New York Times and The Authors Guild have filed separate lawsuits against OpenAI and Microsoft. 

At the heart of these cases is the allegation that tech companies illegally used copyrighted works as part of their AI training data. 

For text focused generative AI, there’s a good chance that some of that training data originated from one massive archive: Common Crawl

“Common Crawl is the copy of the internet. It’s a 17-year archive of the internet. We make this freely available to researchers, academics and companies,” said Rich Skrenta, who heads the nonprofit Common Crawl Foundation."

Saturday, January 27, 2024

Training Generative AI Models on Copyrighted Works Is Fair Use; ARL Views, January 23, 2024

Katherine Klosek, Director of Information Policy and Federal Relations, Association of Research Libraries (ARL), and Marjory S. Blumenthal, Senior Policy Fellow, American Library Association (ALA) Office of Public Policy and Advocacy, ARL Views; Training Generative AI Models on Copyrighted Works Is Fair Use

"In a blog post about the case, OpenAI cites the Library Copyright Alliance (LCA) position that “based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.” LCA explained this position in our submission to the US Copyright Office notice of inquiry on copyright and AI, and in the LCA Principles for Copyright and AI.

LCA is not involved in any of the AI lawsuits. But as champions of fair use, free speech, and freedom of information, libraries have a stake in maintaining the balance of copyright law so that it is not used to block or restrict access to information. We drafted the principles on AI and copyright in response to efforts to amend copyright law to require licensing schemes for generative AI that could stunt the development of this technology, and undermine its utility to researchers, students, creators, and the public. The LCA principles hold that copyright law as applied and interpreted by the Copyright Office and the courts is flexible and robust enough to address issues of copyright and AI without amendment. The LCA principles also make the careful and critical distinction between input to train an LLM, and output—which could potentially be infringing if it is substantially similar to an original expressive work.

On the question of whether ingesting copyrighted works to train LLMs is fair use, LCA points to the history of courts applying the US Copyright Act to AI."

Friday, January 26, 2024

George Carlin Estate Sues Creators of AI-Generated Comedy Special in Key Lawsuit Over Stars’ Likenesses; The Hollywood Reporter, January 25, 2024

Winston Cho, The Hollywood Reporter; George Carlin Estate Sues Creators of AI-Generated Comedy Special in Key Lawsuit Over Stars’ Likenesses

"The complaint seeks a court order for immediate removal of the special, as well as unspecified damages. It’s among the first legal actions taken by the estate of a deceased celebrity for unlicensed use of their work and likeness to manufacture a new, AI-generated creation and was filed as Hollywood is sounding the alarm over utilization of AI to impersonate people without consent or compensation...

According to the complaint, the special was created through unauthorized use of Carlin’s copyrighted works.

At the start of the video, it’s explained that the AI program that created the special ingested five decades of Carlin’s original stand-up routines, which are owned by the comedian’s estate, as training materials, “thereby making unauthorized copies” of the copyrighted works...

If signed into law, the proposal, called the No AI Fraud Act, could curb a growing trend of individuals and businesses creating AI-recorded tracks using artists’ voices and deceptive ads in which it appears a performer is endorsing a product. In the absence of a federal right of publicity law, unions and trade groups in Hollywood have been lobbying for legislation requiring individuals’ consent to use their voice and likeness."

Tuesday, January 2, 2024

Copyright law is AI's 2024 battlefield; Axios, January 2, 2024

Megan Morrone, Axios; Copyright law is AI's 2024 battlefield

"Looming fights over copyright in AI are likely to set the new technology's course in 2024 faster than legislation or regulation.

Driving the news: The New York Times filed a lawsuit against OpenAI and Microsoft on December 27, claiming their AI systems' "widescale copying" constitutes copyright infringement.

The big picture: After a year of lawsuits from creators protecting their works from getting gobbled up and repackaged by generative AI tools, the new year could see significant rulings that alter the progress of AI innovation. 

Why it matters: The copyright decisions coming down the pike — over both the use of copyrighted material in the development of AI systems and also the status of works that are created by or with the help of AI — are crucial to the technology's future and could determine winners and losers in the market."

Sunday, December 31, 2023

Boom in A.I. Prompts a Test of Copyright Law; The New York Times, December 30, 2023

J. Edward Moreno, The New York Times; Boom in A.I. Prompts a Test of Copyright Law

"The boom in artificial intelligence tools that draw on troves of content from across the internet has begun to test the bounds of copyright law...

Data is crucial to developing generative A.I. technologies — which can generate text, images and other media on their own — and to the business models of companies doing that work.

“Copyright will be one of the key points that shapes the generative A.I. industry,” said Fred Havemeyer, an analyst at the financial research firm Macquarie.

A central consideration is the “fair use” doctrine in intellectual property law, which allows creators to build upon copyrighted work...

“Ultimately, whether or not this lawsuit ends up shaping copyright law will be determined by whether the suit is really about the future of fair use and copyright, or whether it’s a salvo in a negotiation,” Jane Ginsburg, a professor at Columbia Law School, said of the lawsuit by The Times...

Competition in the A.I. field may boil down to data haves and have-nots...

“Generative A.I. begins and ends with data,” Mr. Havemeyer said."

Thursday, December 28, 2023

AI starts a music-making revolution and plenty of noise about ethics and royalties; The Washington Times, December 26, 2023

Tom Howell Jr., The Washington Times; AI starts a music-making revolution and plenty of noise about ethics and royalties

"“Music’s important. AI is changing that relationship. We need to navigate that carefully,” said Martin Clancy, an Ireland-based expert who has worked on chart-topping songs and is the founding chairman of the IEEE Global AI Ethics Arts Committee...

The Biden administration, the European Union and other governments are rushing to catch up with AI and harness its benefits while controlling its potentially adverse societal impacts. They are also wading through copyright and other matters of law.

Even if they devise legislation now, the rules likely will not go into effect for years. The EU recently enacted a sweeping AI law, but it won’t take effect until 2025.

“That’s forever in this space, which means that all we’re left with is our ethical decision-making,” Mr. Clancy said.

For now, the AI-generated music landscape is like the Wild West. Many AI-generated songs are hokey or just not very good."

Wednesday, December 27, 2023

The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work; The New York Times, December 27, 2023

Michael M. Grynbaum and , The New York Times; The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work

"The New York Times sued OpenAI and Microsoft for copyright infringement on Wednesday, opening a new front in the increasingly intense legal battle over the unauthorized use of published work to train artificial intelligence technologies.

The Times is the first major American media organization to sue the companies, the creators of ChatGPT and other popular A.I. platforms, over copyright issues associated with its written works. The lawsuit, filed in Federal District Court in Manhattan, contends that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information.

The suit does not include an exact monetary demand. But it says the defendants should be held responsible for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.” It also calls for the companies to destroy any chatbot models and training data that use copyrighted material from The Times."

Monday, December 18, 2023

AI could threaten creators — but only if humans let it; The Washington Post, December 17, 2023

, The Washington Post; AI could threaten creators — but only if humans let it

"A broader rethinking of copyright, perhaps inspired by what some AI companies are already doing, could ensure that human creators get some recompense when AI consumes their work, processes it and produces new material based on it in a manner current law doesn’t contemplate. But such a shift shouldn’t be so punishing that the AI industry has no room to grow. That way, these tools, in concert with human creators, can push the progress of science and useful arts far beyond what the Framers could have imagined."

Tuesday, November 21, 2023

Patent Poetry: Judge Throws Out Most of Artists’ AI Copyright Infringement Claims; JD Supra, November 20, 2023

Adam Philipp, AEON Law, JD Supra; Patent Poetry: Judge Throws Out Most of Artists’ AI Copyright Infringement Claims

"One of the plaintiffs’ theories of infringement was that the output images based on the Training Images are all infringing derivative works.

The court noted that to support that claim the output images would need to be substantially similar to the protected works. However, noted the court,

none of the Stable Diffusion output images provided in response to a particular Text Prompt is likely to be a close match for any specific image in the training data.

The plaintiffs argued that there was no need to show substantial similarity when there was direct proof of copying. The judge was skeptical of that argument.

This is just one of many AI-related cases making its way through the courts, and this is just a ruling on a motion rather than an appellate court decision. Nevertheless, this line of analysis will likely be cited in other cases now pending.

Also, this case shows the importance of artists registering their works with the Copyright Office before seeking to sue for infringement."

Saturday, October 28, 2023

An AI engine scans a book. Is that copyright infringement or fair use?; Columbia Journalism Review, October 26, 2023

Mathew Ingram, Columbia Journalism Review; An AI engine scans a book. Is that copyright infringement or fair use?

"Determining whether LLMs training themselves on copyrighted text qualifies as fair use can be difficult even for experts—not just because AI is complicated, but because the concept of fair use is, too."

Thursday, October 26, 2023

Why I let an AI chatbot train on my book; Vox, October 25, 2023

, Vox; Why I let an AI chatbot train on my book

"What’s “fair use” for AI?

I think that training a chatbot for nonprofit, educational purposes, with the express permission of the authors of the works on which it’s trained, seems okay. But do novelists like George R.R. Martin or John Grisham have a case against for-profit companies that take their work without that express permission?

The law, unfortunately, is far from clear on this question." 

Friday, October 20, 2023

Music publishers sue Amazon-backed AI company over song lyrics; The Guardian, October 19, 2023

  and agencies, The Guardian; Music publishers sue Amazon-backed AI company over song lyrics

"Music publishers Universal Music, ABKCO and Concord Publishing sued the artificial intelligence company Anthropic in Tennessee federal court on Wednesday, accusing it of misusing “innumerable” copyrighted song lyrics to train its chatbot Claude.

The lawsuit said Anthropic violates the publishers’ rights through its use of lyrics from at least 500 songs ranging from the Beach Boys’ God Only Knows and the Rolling Stones’ Gimme Shelter to Mark Ronson and Bruno Mars’ Uptown Funk and Beyoncé’s Halo.

The lawsuit accused Anthropic of infringing the publishers’ copyrights by copying their lyrics without permission as part of the “massive amounts of text” that it scrapes from the internet to train Claude to respond to human prompts."

Thursday, October 19, 2023

AI is learning from stolen intellectual property. It needs to stop.; The Washington Post, October 19, 2023

William D. Cohan , The Washington Post; AI is learning from stolen intellectual property. It needs to stop.

"The other day someone sent me the searchable database published by Atlantic magazine of more than 191,000 e-books that have been used to train the generative AI systems being developed by Meta, Bloomberg and others. It turns out that four of my seven books are in the data set, called Books3. Whoa.

Not only did I not give permission for my books to be used to generate AI products, but I also wasn’t even consulted about it. I had no idea this was happening. Neither did my publishers, Penguin Random House (for three of the books) and Macmillan (for the other one). Neither my publishers nor I were compensated for use of my intellectual property. Books3 just scraped the content away for free, with Meta et al. profiting merrily along the way. And Books3 is just one of many pirated collections being used for this purpose...

This is wholly unacceptable behavior. Our books are copyrighted material, not free fodder for wealthy companies to use as they see fit, without permission or compensation. Many, many hours of serious research, creative angst and plain old hard work go into writing and publishing a book, and few writers are compensated like professional athletes, Hollywood actors or Wall Street investment bankers. Stealing our intellectual property hurts."

Authors sue Meta, Microsoft, Bloomberg in latest AI copyright clash; Reuters, October 18, 2023

, Reuters ; Authors sue Meta, Microsoft, Bloomberg in latest AI copyright clash

"A group of writers including former Arkansas governor Mike Huckabee and best-selling Christian author Lysa TerKeurst have filed a lawsuit in New York federal court that accuses Meta (META.O), Microsoft (MSFT.O) and Bloomberg of using their work to train artificial intelligence systems without permission.

The proposed class-action copyright lawsuit filed on Tuesday said that the companies used the controversial "Books3" dataset, which the writers said contains thousands of pirated books, to teach their large language models how to respond to human prompts."

Wednesday, October 18, 2023

A.I. May Not Get a Chance to Kill Us if This Kills It First; Slate, October 17, 2023

Scott Nover, Slate; A.I. May Not Get a Chance to Kill Us if This Kills It First

"There is a disaster scenario for OpenAI and other companies funneling billions into A.I. models: If a court found that a company was liable for copyright infringement, it could completely halt the development of the offending model."