Showing posts with label AI training data.

Thursday, February 20, 2025

AI and Copyright: Expanding Copyright Hurts Everyone—Here’s What to Do Instead; Electronic Frontier Foundation (EFF), February 19, 2025

TORI NOBLE, Electronic Frontier Foundation (EFF); AI and Copyright: Expanding Copyright Hurts Everyone—Here’s What to Do Instead


[Kip Currier: No, not everyone. Not requiring Big Tech to figure out a way to fairly license or get permission to use the copyrighted works of creators unjustly advantages these deep-pocketed corporations. It also inequitably disadvantages the economic and creative interests of the human beings who labor to create copyrightable content -- authors, songwriters, visual artists, and many others.

The tell is that many of these same Big Tech companies are only too willing to file copyright infringement lawsuits against anyone whom they allege is infringing their AI content to create competing products and services.]


[Excerpt]


"Threats to Socially Valuable Research and Innovation 

Requiring researchers to license fair uses of AI training data could make socially valuable research based on machine learning (ML) and even text and data mining (TDM) prohibitively complicated and expensive, if not impossible. Researchers have relied on fair use to conduct TDM research for a decade, leading to important advancements in myriad fields. However, licensing the vast quantity of works that high-quality TDM research requires is frequently cost-prohibitive and practically infeasible.  

Fair use protects ML and TDM research for good reason. Without fair use, copyright would hinder important scientific advancements that benefit all of us. Empirical studies back this up: research using TDM methodologies are more common in countries that protect TDM research from copyright control; in countries that don’t, copyright restrictions stymie beneficial research. It’s easy to see why: it would be impossible to identify and negotiate with millions of different copyright owners to analyze, say, text from the internet."

Monday, February 17, 2025

Copyright battles loom over artists and AI; Financial Times, February 16, 2025

Louise Lucas, Financial Times; Copyright battles loom over artists and AI

"Artists are the latest creative industry to gripe about the exploitative nature of artificial intelligence. More than 3,000 have written to protest against plans by Christie’s to auction art created using AI."

Sunday, February 16, 2025

Court filings show Meta paused efforts to license books for AI training; TechCrunch, February 14, 2025

Kyle Wiggers, TechCrunch; Court filings show Meta paused efforts to license books for AI training

"According to one transcript, Sy Choudhury, who leads Meta’s AI partnership initiatives, said that Meta’s outreach to various publishers was met with “very slow uptake in engagement and interest.”

“I don’t recall the entire list, but I remember we had made a long list from initially scouring the Internet of top publishers, et cetera,” Choudhury said, per the transcript, “and we didn’t get contact and feedback from — from a lot of our cold call outreaches to try to establish contact.”

Choudhury added, “There were a few, like, that did, you know, engage, but not many.”

According to the court transcripts, Meta paused certain AI-related book licensing efforts in early April 2023 after encountering “timing” and other logistical setbacks. Choudhury said some publishers, in particular fiction book publishers, turned out to not in fact have the rights to the content that Meta was considering licensing, per a transcript.

“I’d like to point out that the — in the fiction category, we quickly learned from the business development team that most of the publishers we were talking to, they themselves were representing that they did not have, actually, the rights to license the data to us,” Choudhury said. “And so it would take a long time to engage with all their authors.”"

Friday, February 14, 2025

AI companies flaunt their theft. News media has to fight back – so we're suing. | Opinion; USA Today, February 13, 2025

 Danielle Coffey, USA Today; AI companies flaunt their theft. News media has to fight back – so we're suing. | Opinion

"Danielle Coffey is president & CEO of the News/Media Alliance, which represents 2,000 news and magazine media outlets worldwide...

This is not an anti-AI lawsuit or an effort to turn back the clock. We love technology. We use it in our businesses. Artificial intelligence will help us better serve our customers, but only if it respects intellectual property. That’s the remedy we’re seeking in court.

When it suits them, the AI companies assert similar claims to ours. Meta's lawsuit accused Bright Data of scraping data in violation of its terms of use. And Sam Altman of OpenAI has complained that DeepSeek illegally copied its algorithms.

Good actors, responsible technologies and potential legislation offer some hope for improving the situation. But what is urgently needed is what every market needs: reinforcement of legal protections against theft."

Thursday, February 13, 2025

News publishers sue Cohere for copyright and trademark infringement; Axios, February 13, 2025

"More than a dozen major U.S. news organizations on Thursday said they were suing Cohere, an enterprise AI company, claiming the tech startup illegally repurposed their work and did so in a way that tarnished their brands.

Why it matters: The lawsuit represents the first official legal action against an AI company organized by the News Media Alliance — the largest news media trade group in the U.S...

  • The NMA members participating in the lawsuit include Advance Local Media, Condé Nast, The Atlantic, Forbes Media, The Guardian, Business Insider, The Los Angeles Times, McClatchy Media Company, Newsday, Plain Dealer Publishing Company, Politico, The Republican Company, Toronto Star Newspapers, and Vox Media.

Between the lines: The complaint was filed shortly after the U.S. Copyright Office changed its copyright registration processes to make them faster for digital publishers.

  • Previously, the process by which digital publishers had to file for copyright protections for individual works was extremely cumbersome, limiting their ability to seek protection. 

Because of those changes, Coffey explained, NMA and the publishers who are suing Cohere were able to identify thousands of specific examples of Cohere verbatim copying their copyright-protected works."

Wednesday, February 12, 2025

Court: Training AI Model Based on Copyrighted Data Is Not Fair Use as a Matter of Law; The National Law Review, February 11, 2025

Joseph A. Meckes and Joseph Grasser of Squire Patton Boggs (US) LLP - Global IP and Technology Law Blog, The National Law Review; Court: Training AI Model Based on Copyrighted Data Is Not Fair Use as a Matter of Law

"In what may turn out to be an influential decision, Judge Stephanos Bibas ruled as a matter of law in Thomson Reuters v. Ross Intelligence that creating short summaries of law to train Ross Intelligence’s artificial intelligence legal research application not only infringes Thomson Reuters’ copyrights as a matter of law but that the copying is not fair use. Judge Bibas had previously ruled that infringement and fair use were issues for the jury but changed his mind: “A smart man knows when he is right; a wise man knows when he is wrong.”

At issue in the case was whether Ross Intelligence directly infringed Thomson Reuters’ copyrights in its case law headnotes that are organized by Westlaw’s proprietary Key Number system. Thomson Reuters contended that Ross Intelligence’s contractor copied those headnotes to create “Bulk Memos.” Ross Intelligence used the Bulk Memos to train its competitive AI-powered legal research tool. Judge Bibas ruled that (i) the West headnotes were sufficiently original and creative to be copyrightable, and (ii) some of the Bulk Memos used by Ross were so similar that they infringed as a matter of law...

In other words, even if a work is selected entirely from the public domain, the simple act of selection is enough to give rise to copyright protection."

Monday, February 10, 2025

Meta staff torrented nearly 82TB of pirated books for AI training — court records reveal copyright violations; Tom's Hardware, February 9, 2025

Tom's Hardware; Meta staff torrented nearly 82TB of pirated books for AI training — court records reveal copyright violations

"Facebook parent-company Meta is currently fighting a class action lawsuit alleging copyright infringement and unfair competition, among others, with regards to how it trained LLaMA. According to an X (formerly Twitter) post by vx-underground, court records reveal that the social media company used pirated torrents to download 81.7TB of data from shadow libraries including Anna’s Archive, Z-Library, and LibGen. It then used this information to train its AI models.

The evidence, in the form of written communication, shows the researchers’ concerns about Meta’s use of pirated materials. One senior AI researcher said way back in October 2022, “I don’t think we should use pirated material. I really need to draw a line here.” While another one said, “Using pirated material should be beyond our ethical threshold,” then they added, “SciHub, ResearchGate, LibGen are basically like PirateBay or something like that, they are distributing content that is protected by copyright and they’re infringing it.”"

Tuesday, January 28, 2025

Elton John backs Paul McCartney in criticising proposed overhaul to UK copyright system; The Guardian, January 27, 2025

The Guardian; Elton John backs Paul McCartney in criticising proposed overhaul to UK copyright system

"Elton John has backed Paul McCartney in criticising a proposed overhaul of the UK copyright system, and has called for new rules to prevent tech companies from riding “roughshod over the traditional copyright laws that protect artists’ livelihoods”.

John has backed proposed amendments to the data (use and access) bill that would extend existing copyright protections, when it goes before a vote in the House of Lords on Tuesday.

The government is also consulting on an overhaul of copyright laws that would result in artists having to opt out of letting AI companies train their models using their work, rather than an opt-in model...

John told the Sunday Times that he felt “wheels are in motion to allow AI companies to ride roughshod over the traditional copyright laws that protect artists’ livelihoods. This will allow global big tech companies to gain free and easy access to artists’ work in order to train their artificial intelligence and create competing music. This will dilute and threaten young artists’ earnings even further. The musician community rejects it wholeheartedly.”

He said that “challenging financial situations” and increased touring costs made it “harder than ever for new and emerging musicians to make the finances of the industry stack up to sustain a fledgling career”, and added that the UK’s place on the world stage as “a leader in arts and popular culture is under serious jeopardy” without robust copyright protection.

“It is the absolute bedrock of artistic prosperity, and the country’s future success in the creative industries depends on it.”

The government consultation runs until 25 February and will explore how to improve trust between the creative and AI sectors, and how creators can license and get paid for use of their material."

Saturday, January 25, 2025

Paul McCartney: Don't let AI rip off artists; BBC, January 25, 2025

Laura Kuenssberg, BBC; Paul McCartney: Don't let AI rip off artists

"Sir Paul McCartney has told the BBC proposed changes to copyright law could allow "rip off" technology that might make it impossible for musicians and artists to make a living.

The government is considering an overhaul of the law that would allow AI developers to use creators' content on the internet to help develop their models, unless the rights holders opt out.

In a rare interview for Sunday with Laura Kuenssberg, Sir Paul said "when we were kids in Liverpool, we found a job that we loved, but it also paid the bills", warning the proposals could remove the incentive for writers and artists and result in a "loss of creativity". 

The government said it aimed to deliver legal certainty through a copyright regime that provided creators with "real control" and transparency."

Sunday, January 19, 2025

Congress Must Change Copyright Law for AI | Opinion; Newsweek, January 16, 2025

Assistant Professor of Business Law, Georgia College and State University, Newsweek; Congress Must Change Copyright Law for AI | Opinion

"Luckily, the Constitution points the way forward. In Article I, Section 8, Congress is explicitly empowered "to promote the Progress of Science" through copyright law. That is to say, the power to create copyrights isn't just about protecting content creators, it's also about advancing human knowledge and innovation.

When the Founders gave Congress this power, they couldn't have imagined artificial intelligence, but they clearly understood that intellectual property laws would need to evolve to promote scientific progress. Congress therefore not only has the authority to adapt copyright law for the AI age, it has the duty to ensure our intellectual property framework promotes rather than hinders technological progress.

Consider what's at risk with inaction...

While American companies are struggling with copyright constraints, China is racing ahead with AI development, unencumbered by such concerns. The Chinese Communist Party has made it clear that they view AI supremacy as a key strategic goal, and they're not going to let intellectual property rights stand in their way.

The choice before us is clear, we can either reform our copyright laws to enable responsible AI development at home or we can watch as the future of AI is shaped by authoritarian powers abroad. The cost of inaction isn't just measured in lost innovation or economic opportunity, it is measured in our diminishing ability to ensure AI develops in alignment with democratic values and a respect for human rights.

The ideal solution here isn't to abandon copyright protection entirely, but to craft a careful exemption for AI training. This could even include provisions for compensating content creators through a mandated licensing framework or revenue-sharing system, ensuring that AI companies can access the data they need while creators can still benefit from and be credited for their work's use in training these models.

Critics will argue that this represents a taking from creators for the benefit of tech companies, but this misses the broader picture. The benefits of AI development flow not just to tech companies but to society as a whole. We should recognize that allowing AI models to learn from human knowledge serves a crucial public good, one we're at risk of losing if Congress doesn't act."

Saturday, January 18, 2025

News organizations sue OpenAI over copyright infringement claims; Jurist.org, January 16, 2025

Jurist.org; News organizations sue OpenAI over copyright infringement claims

"The case centers on allegations that OpenAI unlawfully utilized copyrighted content from various publishers, including The New York Times, to train its generative AI models and the hearing could determine whether OpenAI will face trial.

The plaintiffs claim that ChatGPT’s ability to generate human-like responses stems from the unauthorized use of their work without permission or compensation to develop their large language models (LLMs). OpenAI and its financial backer Microsoft argue that its use of data is protected under the fair use doctrine, which allows limited use of copyrighted material without permission for purposes such as commentary, criticism or education.

Additionally, OpenAI’s legal team asserts that The New York Times has not demonstrated actual harm resulting from their practices and that its use of the copyrighted material is transformative as it does not replicate the content verbatim. On the other hand, the plaintiffs are arguing copyright infringement because OpenAI removed identifiable information such as author bylines and publication details when using the content. They also contend that the LLMs absorb and reproduce expressions from the training data without genuine understanding."

Thursday, January 16, 2025

In AI copyright case, Zuckerberg turns to YouTube for his defense; TechCrunch, January 15, 2025

TechCrunch; In AI copyright case, Zuckerberg turns to YouTube for his defense

"Meta CEO Mark Zuckerberg appears to have used YouTube’s battle to remove pirated content to defend his own company’s use of a data set containing copyrighted e-books, reveals newly released snippets of a deposition he gave late last year.

The deposition, which was part of a complaint submitted to the court by plaintiffs’ attorneys, is related to the AI copyright case Kadrey v. Meta. It’s one of many such cases winding through the U.S. court system that’s pitting AI companies against authors and other IP holders. For the most part, the defendants in these cases – AI companies – claim that training on copyrighted content is “fair use.” Many copyright holders disagree."

Wednesday, January 15, 2025

'The New York Times' takes OpenAI to court. ChatGPT's future could be on the line; NPR, January 14, 2025

NPR; 'The New York Times' takes OpenAI to court. ChatGPT's future could be on the line

"A group of news organizations, led by The New York Times, took ChatGPT maker OpenAI to federal court on Tuesday in a hearing that could determine whether the tech company has to face the publishers in a high-profile copyright infringement trial.

Three publishers' lawsuits against OpenAI and its financial backer Microsoft have been merged into one case. Leading each of the three combined cases are the Times, The New York Daily News and the Center for Investigative Reporting.

Other publishers, like the Associated Press, News Corp. and Vox Media, have reached content-sharing deals with OpenAI, but the three litigants in this case are taking the opposite path: going on the offensive."

Monday, January 6, 2025

OpenAI holds off on promise to creators, fails to protect intellectual property; The American Bazaar, January 3, 2025

 Vishnu Kamal, The American Bazaar; OpenAI holds off on promise to creators, fails to protect intellectual property

"OpenAI may yet again be in hot water as it seems that the tech giant may be reneging on its earlier assurances. Reportedly, in May, OpenAI said it was developing a tool to let creators specify how they want their works to be included in—or excluded from—its AI training data. But seven months later, this feature has yet to see the light of day.

Called Media Manager, the tool would “identify copyrighted text, images, audio, and video,” OpenAI said at the time, to reflect creators’ preferences “across multiple sources.” It was intended to stave off some of the company’s fiercest critics, and potentially shield OpenAI from IP-related legal challenges...

OpenAI has faced various legal challenges related to its AI technologies and operations. One major issue involves the privacy and data usage of its language models, which are trained on large datasets that may include publicly available or copyrighted material. This raises concerns over privacy violations and intellectual property rights, especially regarding whether the data used for training was obtained with proper consent.

Additionally, there are questions about the ownership of content generated by OpenAI’s models. If an AI produces a work based on copyrighted data, it is tricky to determine who owns the rights—whether it’s OpenAI, the user who prompted the AI, or the creators of the original data.

Another concern is the liability for harmful content produced by AI. If an AI generates misleading or defamatory information, legal responsibility could fall on OpenAI."

Friday, January 3, 2025

U.S. Copyright Office to Begin Issuing Further AI Guidance in January 2025; The National Law Review, January 2, 2025

John Hines of The Sedona Conference, The National Law Review; U.S. Copyright Office to Begin Issuing Further AI Guidance in January 2025

"Parts 2 and 3, which have not yet been released, will be of heightened interest to content creators and to individuals and businesses involved in developing and deploying AI technologies. Ultimate regulatory and legislative determinations could materially recalibrate the scope of ownership and protection afforded to works of authorship, and the stakes are extremely high...

Part 2 of the report, which the Copyright Office expects to publish “after the New Year Holiday,” will address the copyrightability of AI-generated works, and more specifically, how the nature and degree of such use affects copyrightability and registrability. Current law is clear that to be copyrightable, a work must be created by a human. E.g., Thaler v. Perlmutter, 678 F.Supp. 140 (D.DC 2023), on appeal. However assistive tools are used in virtually all creation, from pencils to cameras to photo-editing software programs. In the context of registrability, the Copyright Office offered the following distinction in its March 2023 guidance: “[W]hether the ‘work’ is basically one of human authorship, with the computer [or other device] merely being an assisting instrument, or whether the traditional elements of authorship in the work (literary, artistic, or musical expression or elements of selection, arrangement, etc.) were actually conceived and executed not by man but by a machine.” In Part 2, the Copyright Office will have an additional opportunity to explore these and related issues – this time with the advantage of the many comments offered through the Notice of Inquiry process.

Part 3 of the report, which the Copyright Office anticipates releasing “in the first quarter of 2025,” will focus on issues associated with training data. AI models, depending on their size and scope, may train on millions of documents—many of which are copyrighted or copyrightable— acquired from the Internet or through acquisition of various robust databases. Users of “trained” AI technologies will typically input written prompts to generate written content or images, depending on the model (Sora is now available to generate video). The output is essentially a prediction based on a correlation of values in the model (extracted from the training data) and values that are derived from the user prompts.

Numerous lawsuits, perhaps most notably the case that The New York Times filed against Microsoft and OpenAI, have alleged that the use of data to train AI models constitutes copyright infringement. In many cases there may be little question of copying in the course of uploading data to train the models. Among a variety of issues, a core common issue will be whether the use of the data for training purposes is fair use. Content creators, of course, point to the fact that they have built their livelihoods and/or businesses around their creations and that they should be compensated for what is a violation of their exclusive rights."

Tuesday, December 31, 2024

Column: A Faulkner classic and Popeye enter the public domain while copyright only gets more confusing; Los Angeles Times, December 31, 2024

Michael Hiltzik, Los Angeles Times; Column: A Faulkner classic and Popeye enter the public domain while copyright only gets more confusing

"The annual flow of copyrighted works into the public domain underscores how the progressive lengthening of copyright protection is counter to the public interest—indeed, to the interests of creative artists. The initial U.S. copyright act, passed in 1790, provided for a term of 28 years including a 14-year renewal. In 1909, that was extended to 56 years including a 28-year renewal.

In 1976, the term was changed to the creator’s life plus 50 years. In 1998, Congress passed the Copyright Term Extension Act, which is known as the Sonny Bono Act after its chief promoter on Capitol Hill. That law extended the basic term to life plus 70 years; works for hire (in which a third party owns the rights to a creative work), pseudonymous and anonymous works were protected for 95 years from first publication or 120 years from creation, whichever is shorter.

Along the way, Congress extended copyright protection from written works to movies, recordings, performances and ultimately to almost all works, both published and unpublished.

Once a work enters the public domain, Jenkins observes, “community theaters can screen the films. Youth orchestras can perform the music publicly, without paying licensing fees. Online repositories such as the Internet Archive, HathiTrust, Google Books and the New York Public Library can make works fully available online. This helps enable both access to and preservation of cultural materials that might otherwise be lost to history.”"

Anthropic Agrees to Enforce Copyright Guardrails on New AI Tools; Bloomberg Law, December 30, 2024

 Annelise Levy, Bloomberg Law; Anthropic Agrees to Enforce Copyright Guardrails on New AI Tools

"Anthropic PBC must apply guardrails to prevent its future AI tools from producing infringing copyrighted content, according to a Monday agreement reached with music publishers suing the company for infringing protected song lyrics. 

Eight music publishers—including Universal Music Corp. and Concord Music Group—and Anthropic filed a stipulation partly resolving the publishers’ preliminary injunction motion in the US District Court for the Northern District of California. The publishers’ request that Anthropic refrain from using unauthorized copies of lyrics to train future AI models remains pending."

Monday, December 30, 2024

Key IP Issues for the Next President and Congress to Tackle: AI and Patent Subject Matter Eligibility; IP Watchdog, December 29, 2024

RYAN J. MALLOY, IP Watchdog; Key IP Issues for the Next President and Congress to Tackle: AI and Patent Subject Matter Eligibility

"The debates surrounding the 2024 election focused on “hot button” issues like abortion, immigration, and transgender rights. But several important IP issues also loom over the next administration and Congress. These issues include AI-generated deepfakes, the use of copyrighted works for AI training, the patentability of AI-assisted inventions, and patent subject matter eligibility more generally. We might see President Trump and the 119th Congress tackle some or all of these issues in the next term."

Sunday, December 29, 2024

AI's assault on our intellectual property must be stopped; Financial Times, December 21, 2024

Kate Mosse, Financial Times; AI's assault on our intellectual property must be stopped

"Imagine my dismay, therefore, to discover that those 15 years of dreaming, researching, planning, writing, rewriting, editing, visiting libraries and archives, translating Occitan texts, hunting down original 13th-century documents, becoming an expert in Catharism, apparently counts for nothing. Labyrinth is just one of several of my novels that have been scraped by Meta's large language model. This has been done without my consent, without remuneration, without even notification. This is theft...

AI companies present creators as being against change. We are not. Every artist I know is already engaging with AI in one way or another. But a distinction needs to be made between AI that can be used in brilliant ways -- for example, medical diagnosis -- and the foundations of AI models, where companies are essentially stealing creatives' work for their own profit. We should not forget that the AI companies rely on creators to build their models. Without strong copyright law that ensures creators can earn a living, AI companies will lack the high-quality material that is essential for their future growth."

Friday, December 27, 2024

The AI Boom May Be Too Good to Be True; Wall Street Journal, December 26, 2024

Josh Harlan, Wall Street Journal; The AI Boom May Be Too Good to Be True

 "Investors rushing to capitalize on artificial intelligence have focused on the technology—the capabilities of new models, the potential of generative tools, and the scale of processing power to sustain it all. What too many ignore is the evolving legal structure surrounding the technology, which will ultimately shape the economics of AI. The core question is: Who controls the value that AI produces? The answer depends on whether AI companies must compensate rights holders for using their data to train AI models and whether AI creations can themselves enjoy copyright or patent protections.

The current landscape of AI law is rife with uncertainty...How these cases are decided will determine whether AI developers can harvest publicly available data or must license the content used to train their models."