
Wednesday, February 11, 2026

Adam Schiff And John Curtis Introduce Bill To Require Tech To Disclose Copyrighted Works Used In AI Training Models; Deadline, February 10, 2026

Ted Johnson, Deadline; Adam Schiff And John Curtis Introduce Bill To Require Tech To Disclose Copyrighted Works Used In AI Training Models

"Sen. Adam Schiff (D-CA) and Sen. John Curtis (R-UT) are introducing a bill that touches on one of the hottest Hollywood-tech debates in the development of AI: The use of copyrighted works in training models.

The Copyright Labeling and Ethical AI Reporting Act would require companies to file a notice with the Register of Copyrights detailing the copyrighted works used to train datasets for an AI model. The notice would have to be filed before a new model is publicly released, and would apply retroactively to models already available to consumers.

The Copyright Office also would be required to establish a public database of the notices filed. There also would be civil penalties for failure to disclose the works used."

Friday, February 6, 2026

Publishers Strike Back Against Google in Infringement Suit; Publishers Weekly, February 6, 2026

Jim Milliot, Publishers Weekly; Publishers Strike Back Against Google in Infringement Suit

"The Association of American Publishers continued its fight this week to allow two of its members, Hachette Book Group and Cengage, to join a class action copyright infringement lawsuit against Google and its generative AI product Gemini. The lawsuit was first brought by a group of illustrators and writers in 2023.

In mid-January the AAP filed its first motion to allow the two publishers to take part in the lawsuit that is now before Judge Eumi K. Lee in the U.S. District Court for the Northern District of California. Earlier this week the AAP filed its reply to Google’s motion asking the court to block AAP’s request.

At the core of Google’s argument is the notion that the publishers should have asked to intervene sooner, as well as the assertion that publishers have no interest in the case because they don’t own authors’ works.

In its response, AAP argues that it was only when the case reached class certification that the publishers’ interests became clear. The new filing also rebuts Google’s other claim that publishers don’t own any rights.

“Google’s professed misunderstanding of ownership exemplifies exactly the kind of value that Proposed Intervenors bring to the case,” the AAP stated, arguing that both HBG and Cengage own certain rights to the works in question and that “scores” of other publishers will be impacted by the litigation."

Thursday, February 5, 2026

‘In the end, you feel blank’: India’s female workers watching hours of abusive content to train AI; The Guardian, February 5, 2026

Anuj Behal, The Guardian; ‘In the end, you feel blank’: India’s female workers watching hours of abusive content to train AI


[Kip Currier: The largely unaddressed plight of content moderators became more real for me after reading this haunting 9/9/24 piece in the Washington Post, "I quit my job as a content moderator. I can never go back to who I was before."

As mentioned in the graphic article's byline, content moderator Alberto Cuadra spoke with journalist Beatrix Lockwood. Maya Scarpa's illustrations poignantly give life to Alberto Cuadra's first-hand experiences and ongoing impacts from the content moderation he performed for an unnamed tech company. I talk about Cuadra's experiences and the ethical issues of content moderation, social media, and AI in my Ethics, Information, and Technology book.]


[Excerpt]

"Murmu, 26, is a content moderator for a global technology company, logging on from her village in India’s Jharkhand state. Her job is to classify images, videos and text that have been flagged by automated systems as possible violations of the platform’s rules.

On an average day, she views up to 800 videos and images, making judgments that train algorithms to recognise violence, abuse and harm.

This work sits at the core of machine learning’s recent breakthroughs, which rest on the fact that AI is only as good as the data it is trained on. In India, this labour is increasingly performed by women, who are part of a workforce often described as “ghost workers”.

“The first few months, I couldn’t sleep,” she says. “I would close my eyes and still see the screen loading.” Images followed her into her dreams: of fatal accidents, of losing family members, of sexual violence she could not stop or escape. On those nights, she says, her mother would wake and sit with her...

“In terms of risk,” she says, “content moderation belongs in the category of dangerous work, comparable to any lethal industry.”

Studies indicate content moderation triggers lasting cognitive and emotional strain, often resulting in behavioural changes such as heightened vigilance. Workers report intrusive thoughts, anxiety and sleep disturbances.

A study of content moderators published last December, which included workers in India, identified traumatic stress as the most pronounced psychological risk. The study found that even where workplace interventions and support mechanisms existed, significant levels of secondary trauma persisted."

Friday, January 30, 2026

The $1.5 Billion Reckoning: AI Copyright and the 2026 Regulatory Minefield; JD Supra, January 27, 2026

Rob Robinson, JD Supra; The $1.5 Billion Reckoning: AI Copyright and the 2026 Regulatory Minefield

"In the silent digital halls of early 2026, the era of “ask for forgiveness later” has finally hit a $1.5 billion brick wall. As legal frameworks in Brussels and New Delhi solidify, the wild west of AI training data is being partitioned into clearly marked zones of liability and license. For those who manage information, secure data, or navigate the murky waters of eDiscovery, this landscape is no longer a theoretical debate—it is an active regulatory battlefield where every byte of training data carries a price tag."

Music publishers sue Anthropic for $3B over ‘flagrant piracy’ of 20,000 works; TechCrunch, January 29, 2026

Amanda Silberling, TechCrunch; Music publishers sue Anthropic for $3B over ‘flagrant piracy’ of 20,000 works 

"A cohort of music publishers led by Concord Music Group and Universal Music Group are suing Anthropic, saying the company illegally downloaded more than 20,000 copyrighted songs, including sheet music, song lyrics, and musical compositions.

The publishers said in a statement on Wednesday that the damages could amount to more than $3 billion, which would be one of the largest non-class action copyright cases filed in U.S. history.

This lawsuit was filed by the same legal team from the Bartz v. Anthropic case, in which a group of fiction and nonfiction authors similarly accused the AI company of using their copyrighted works to train products like Claude."

Tuesday, January 27, 2026

YouTubers sue Snap for alleged copyright infringement in training its AI models; TechCrunch, January 26, 2026

Sarah Perez, TechCrunch; YouTubers sue Snap for alleged copyright infringement in training its AI models

"A group of YouTubers who are suing tech giants for scraping their videos without permission to train AI models has now added Snap to their list of defendants. The plaintiffs — internet content creators behind a trio of YouTube channels with roughly 6.2 million collective subscribers — allege that Snap has trained its AI systems on their video content for use in AI features like the app’s “Imagine Lens,” which allows users to edit images using text prompts.

The plaintiffs earlier filed similar lawsuits against Nvidia, Meta, and ByteDance over similar matters.

In the newly filed proposed class action suit, filed on Friday in the U.S. District Court for the Central District of California, the YouTubers specifically call out Snap for its use of a large-scale, video-language dataset known as HD-VILA-100M, and others that were designed for only academic and research purposes. To use these datasets for commercial purposes, the plaintiffs claim Snap circumvented YouTube’s technological restrictions, terms of service, and licensing limitations, which prohibit commercial use."

Monday, January 26, 2026

Search Engines, AI, And The Long Fight Over Fair Use; Electronic Frontier Foundation (EFF), January 23, 2026

Joe Mullin, Electronic Frontier Foundation (EFF); Search Engines, AI, And The Long Fight Over Fair Use

"We're taking part in Copyright Week, a series of actions and discussions supporting key principles that should guide copyright policy. Every day this week, various groups are taking on different elements of copyright law and policy, and addressing what's at stake, and what we need to do to make sure that copyright promotes creativity and innovation.

Long before generative AI, copyright holders warned that new technologies for reading and analyzing information would destroy creativity. Internet search engines, they argued, were infringement machines—tools that copied copyrighted works at scale without permission. As they had with earlier information technologies like the photocopier and the VCR, copyright owners sued.

Courts disagreed. They recognized that copying works in order to understand, index, and locate information is a classic fair use—and a necessary condition for a free and open internet.

Today, the same argument is being recycled against AI. At issue is whether copyright owners should be allowed to control how others analyze, reuse, and build on existing works."

Sunday, January 25, 2026

How researchers got AI to quote copyrighted books word for word; Le Monde, January 24, 2026

Le Monde; How researchers got AI to quote copyrighted books word for word

"Where does artificial intelligence acquire its knowledge? From an enormous trove of texts used for training. These typically include vast numbers of articles from Wikipedia, but also a wide range of other writings, such as the massive Books3 dataset, which aggregates nearly 200,000 books without the authors' permission. Some proponents of conversational AI present these training datasets as a form of "universal knowledge" that transcends copyright law, adding that, protected or not, AIs do not memorize these works verbatim and only store fragmented information.

This argument has been challenged by a series of studies, the latest of which, published in early January by researchers at Stanford University and Yale University, is particularly revealing. Ahmed Ahmed and his coauthors managed to prompt four mainstream AI programs, disconnected from the internet to ensure no new information was retrieved, to recite entire pages from books."

Friday, January 23, 2026

Actors And Musicians Help Launch “Stealing Isn’t Innovation” Campaign To Protest Big Tech’s Use Of Copyrighted Works In AI Models; Deadline, January 22, 2026

Ted Johnson, Deadline; Actors And Musicians Help Launch “Stealing Isn’t Innovation” Campaign To Protest Big Tech’s Use Of Copyrighted Works In AI Models

"A long list of musicians, content creators and actors are among those who have signed on to a new campaign to protest tech giants’ use of copyrighted works in their AI models.

The list of signees includes actors like Scarlett Johansson and Cate Blanchett, music groups like REM and authors like Brad Meltzer. 

The “Stealing Isn’t Innovation” campaign is being led by the Human Artistry Campaign. It states that, rather than “respect and protect” the creative community, “some of the biggest tech companies, many backed by private equity and other funders, are using American creators’ work to build AI platforms without authorization or regard for copyright law.”"

Copyright Law Set to Govern AI Under Trump’s Executive Order; Bloomberg Law, January 23, 2026

Michael McLaughlin, Bloomberg Law; Copyright Law Set to Govern AI Under Trump’s Executive Order


[Kip Currier: I posted this Bloomberg Law article excerpt to the Canvas site for the graduate students in my Intellectual Property and Open Movements course this term, along with the following note:

Copyright law is the potential giant-slayer vis-a-vis AI tech companies that have used copyrighted works as AI training data, without permission or compensation.

Information professionals who have IP acumen (e.g. copyright law and fair use familiarity) will have vital advantages on the job market and in their organizations.]


[Excerpt] 

"The legal landscape for artificial intelligence is entering a period of rapid consolidation. With President Donald Trump’s executive order in December 2025 establishing a national AI framework, the era of conflicting state-level rules may be drawing to a close.

But this doesn’t signal a reduction in AI-related legal risk. It marks the beginning of a different kind of scrutiny—one centered not on regulatory innovation but on the most powerful legal instrument already available to federal courts: copyright law.

The lesson emerging from recent AI litigation, most prominently Bartz v. Anthropic PBC, is that the greatest potential liability to AI developers doesn’t come from what their models generate. It comes from how those models were trained, and from the provenance of the content used in that training.

As the federal government asserts primacy over AI governance, the decisive question will be whether developers can demonstrate that their training corpora were acquired lawfully, licensed appropriately (unless in the public domain), and documented thoroughly."

Sunday, January 18, 2026

Publishers seek to join lawsuit against Google over AI training; Reuters, January 15, 2026

Reuters; Publishers seek to join lawsuit against Google over AI training

"Publishers Hachette Book Group and Cengage Group asked a California federal court on Thursday for permission to intervene in a proposed class action lawsuit against Google over the alleged misuse of copyrighted material used to train its artificial intelligence systems.

The publishers said in their proposed complaint that the tech company "engaged in one of the most prolific infringements of copyrighted materials in history" to build its AI capabilities, copying content from Hachette books and Cengage textbooks without permission...

The lawsuit currently involves groups of visual artists and authors who sued Google for allegedly misusing their work to train its generative AI systems. The case is one of many high-stakes lawsuits brought by artists, authors, music labels and other copyright owners against tech companies over their AI training."


Friday, January 16, 2026

AI’s Memorization Crisis: Large language models don’t “learn”—they copy. And that could change everything for the tech industry; The Atlantic, January 9, 2026

Alex Reisner, The Atlantic; AI’s Memorization Crisis: Large language models don’t “learn”—they copy. And that could change everything for the tech industry

"On tuesday, researchers at Stanford and Yale revealed something that AI companies would prefer to keep hidden. Four popular large language models—OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—have stored large portions of some of the books they’ve been trained on, and can reproduce long excerpts from those books."

Extracting books from production language models; Cornell University, January 6, 2026

Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang, Cornell University; Extracting books from production language models

"Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g, nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs."

Wednesday, January 14, 2026

Britain seeks 'reset' in copyright battle between AI and creators; Reuters, January 13, 2026

Reuters; Britain seeks 'reset' in copyright battle between AI and creators

"British technology minister Liz Kendall said on Tuesday the government was seeking a "reset" on plans to overhaul copyright rules to accommodate artificial intelligence, pledging to protect creators while unlocking AI's economic potential.

Creative industries worldwide are grappling with legal and ethical challenges posed by AI systems that generate original content after being trained on popular works, often without compensating the original creators."

 

Tuesday, January 13, 2026

‘Clock Is Ticking’ For Creators On AI Content Copyright Claims, Experts Warn; Forbes, January 9, 2026

Rob Salkowitz, Forbes; ‘Clock Is Ticking’ For Creators On AI Content Copyright Claims, Experts Warn

"Despite this string of successes, creators like BT caution that content owners need to move quickly to secure any kind of terms. “A lot of artists have their heads in the sand with respect to AI,” he said. “The fact is, if they don’t come to some kind of agreement, they may end up with nothing.”

The concern is that AI models are increasingly being trained on synthetic data: that is, on the output of AI systems, rather than on content attributable to any individual creator or rights owner. Gartner estimates that 75% of AI training data in 2026 will be synthetic. That number could hit 100% by 2030. Once the tech companies no longer need human-produced content, they will stop paying for it.

“The quality of outputs from AI systems has been improving dramatically, which means that it is possible to train on synthetic data without risking model collapse,” said Dr. Daniela Braga, founder and CEO of the data training firm Defined.ai, in a separate interview at CES. “The window is definitely closing for individual rights owners to secure favorable terms.”

Other experts suggest that these claims may be overstated.

Braga says the best way creators can protect themselves is to do business with ethical companies willing to provide compensation for high-quality human-produced content and represent the superior value of that content to their customers. As models grow in capabilities, the need will shift from sheer volume of data to data that is appropriately tagged and annotated to fit easily into specific use cases.

There remain some profound questions around the sustainability of AI from a business standpoint, with demand for services among enterprise and consumers lagging the massive, and massively expensive, build-out of capacity. For some artists opposed to generative AI in its entirety, there may be the temptation to wait it out until the bubble bursts. After all, these artists created their work to be enjoyed by humans, not to be consumed in bulk by machines threatening their livelihoods. In light of those objections, the prospect of a meager payout might seem unappealing."

Thursday, January 8, 2026

OpenAI Must Turn Over 20 Million ChatGPT Logs, Judge Affirms; Bloomberg Law, January 5, 2026

Bloomberg Law; OpenAI Must Turn Over 20 Million ChatGPT Logs, Judge Affirms

"OpenAI Inc. will have to turn over 20 million anonymized ChatGPT logs in a consolidated AI copyright case after it failed to convince a federal judge to throw out a magistrate judge’s order the company said insufficiently weighed privacy concerns.

Magistrate Judge Ona T. Wang sufficiently considered privacy concerns against the material’s relevance to the ongoing litigation in her discovery ruling in favor of news organization plaintiffs in five lawsuits, District Judge Sidney H. Stein said in an order Monday. Stein rejected OpenAI’s arguments it should be allowed to run a search of the 20 million-log sample and produce conversations implicating the plaintiffs’ works, saying no case law requires the court to order the least burdensome discovery possible."

Monday, January 5, 2026

AI copyright battles enter pivotal year as US courts weigh fair use; Reuters, January 5, 2026

Reuters; AI copyright battles enter pivotal year as US courts weigh fair use

"The sprawling legal fight over tech companies' vast copying of copyrighted material to train their artificial intelligence systems could be entering a decisive phase in 2026.

After a string of fresh lawsuits and a landmark settlement in 2025, the new year promises to bring a wave of rulings that could define how U.S. copyright law applies to generative AI. At stake is whether companies like OpenAI, Google and Meta can rely on the legal doctrine of fair use to shield themselves from liability – or if they must reimburse copyright holders, which could cost billions."

Monday, December 22, 2025

OpenAI, Anthropic, xAI Hit With Copyright Suit from Writers; Bloomberg Law, December 22, 2025

Annelise Levy, Bloomberg Law; OpenAI, Anthropic, xAI Hit With Copyright Suit from Writers

"Writers including Pulitzer Prize-winning journalist John Carreyrou filed a copyright lawsuit accusing six AI giants of using pirated copies of their books to train large language models.

The complaint, filed Monday in the US District Court for the Northern District of California, claims Anthropic PBC, Google LLC, OpenAI Inc., Meta Platforms Inc., xAI Corp., and Perplexity AI Inc. committed a “deliberate act of theft.”

It is the first copyright lawsuit against xAI over its training process, and the first suit brought by authors against Perplexity...

Carreyrou is among the authors who opted out of a $1.5 billion class-action settlement with Anthropic."