
Thursday, August 14, 2025

Japan’s largest newspaper, Yomiuri Shimbun, sues AI startup Perplexity for copyright violations; NiemanLab, August 11, 2025

Andrew Deck, NiemanLab; Japan’s largest newspaper, Yomiuri Shimbun, sues AI startup Perplexity for copyright violations

"The Yomiuri Shimbun, Japan’s largest newspaper by circulation, has sued the generative AI startup Perplexity for copyright infringement. The lawsuit, filed in Tokyo District Court on August 7, marks the first copyright challenge by a major Japanese news publisher against an AI company.

The filing claims that Perplexity accessed 119,467 articles on Yomiuri’s site between February and June of this year, based on an analysis of its company server logs. Yomiuri alleges the scraping has been used by Perplexity to reproduce the newspaper’s copyrighted articles in responses to user queries without authorization.

In particular, the suit claims Perplexity has violated its “right of reproduction” and its “right to transmit to the public,” two tenets of Japanese law that give copyright holders control over the copying and distribution of their work. The suit seeks nearly $15 million in damages and demands that Perplexity stop reproducing its articles...

Japan’s copyright law allows AI developers to train models on copyrighted material without permission. This leeway is a direct result of a 2018 amendment to Japan’s Copyright Act, meant to encourage AI development in the country’s tech sector. The law does not, however, allow for wholesale reproduction of those works, or for AI developers to distribute copies in a way that will “unreasonably prejudice the interests of the copyright owner.”"

Wednesday, August 13, 2025

Judge rejects Anthropic bid to appeal copyright ruling, postpone trial; Reuters, August 12, 2025

Reuters; Judge rejects Anthropic bid to appeal copyright ruling, postpone trial

"A federal judge in California has denied a request from Anthropic to immediately appeal a ruling that could place the artificial intelligence company on the hook for billions of dollars in damages for allegedly pirating authors' copyrighted books.

U.S. District Judge William Alsup said on Monday that Anthropic must wait until after a scheduled December jury trial to appeal his decision that the company is not shielded from liability for pirating millions of books to train its AI-powered chatbot Claude."

Monday, August 11, 2025

Boston Public Library aims to increase access to a vast historic archive using AI; NPR, August 11, 2025

NPR; Boston Public Library aims to increase access to a vast historic archive using AI

"Boston Public Library, one of the oldest and largest public library systems in the country, is launching a project this summer with OpenAI and Harvard Law School to make its trove of historically significant government documents more accessible to the public.

The documents date back to the early 1800s and include oral histories, congressional reports and surveys of different industries and communities...

Currently, members of the public who want to access these documents must show up in person. The project will enhance the metadata of each document and will enable users to search and cross-reference entire texts from anywhere in the world. 

Chapel said Boston Public Library plans to digitize 5,000 documents by the end of the year, and if all goes well, grow the project from there...

Harvard University said it could help. Researchers at the Harvard Law School Library's Institutional Data Initiative are working with libraries, museums and archives on a number of fronts, including training new AI models to help libraries enhance the searchability of their collections. 

AI companies help fund these efforts, and in return get to train their large language models on high-quality materials that are out of copyright and therefore less likely to lead to lawsuits. (Microsoft and OpenAI are among the many AI players targeted by recent copyright infringement lawsuits, in which plaintiffs such as authors claim the companies stole their works without permission.)"

Saturday, August 9, 2025

News Corp CEO Robert Thomson slams AI firms for stealing copyrighted material like Trump’s ‘Art of the Deal’; New York Post, August 6, 2025

Ariel Zilber, New York Post; News Corp CEO Robert Thomson slams AI firms for stealing copyrighted material like Trump’s ‘Art of the Deal’

"The media executive said the voracious appetite of the AI firms to train their bots on proprietary content without paying for it risks eroding America’s edge over rival nations.

“Much is made of the competition with China, but America’s advantage is ingenuity and creativity, not bits and bytes, not watts but wit,” he said.

“To undermine that comparative advantage by stripping away IP rights is to vandalize our virtuosity.”"

AI industry horrified to face largest copyright class action ever certified; Ars Technica, August 8, 2025

Ashley Belanger, Ars Technica; AI industry horrified to face largest copyright class action ever certified

"AI industry groups are urging an appeals court to block what they say is the largest copyright class action ever certified. They've warned that a single lawsuit raised by three authors over Anthropic's AI training now threatens to "financially ruin" the entire AI industry if up to 7 million claimants end up joining the litigation and forcing a settlement.

Last week, Anthropic petitioned to appeal the class certification, urging the court to weigh questions that the district court judge, William Alsup, seemingly did not. Alsup allegedly failed to conduct a "rigorous analysis" of the potential class and instead based his judgment on his "50 years" of experience, Anthropic said.

If the appeals court denies the petition, Anthropic argued, the emerging company may be doomed. As Anthropic argued, it now "faces hundreds of billions of dollars in potential damages liability at trial in four months" based on a class certification rushed at "warp speed" that involves "up to seven million potential claimants, whose works span a century of publishing history," each possibly triggering a $150,000 fine.

Confronted with such extreme potential damages, Anthropic may lose its rights to raise valid defenses of its AI training, deciding it would be more prudent to settle, the company argued. And that could set an alarming precedent, considering all the other lawsuits generative AI (GenAI) companies face over training on copyrighted materials, Anthropic argued."

Tuesday, August 5, 2025

Robot Art Riles Artists; ABA Litigation Section, June 25, 2025

James Michael Miller, ABA Litigation Section; Robot Art Riles Artists

"Visual artists have survived a motion to dismiss their class claims brought against generative artificial intelligence (AI) companies related to the companies’ use of the artists’ visual works without consent. The plaintiffs claimed that the defendants’ text-to-image AI products were trained in part on their copyrighted works. ABA Litigation Section leaders agree that the case sets up a showdown between copyright interests and the “democratization” of art through AI."

Wednesday, July 30, 2025

Insuring Intellectual Property – Examining AI and Fair Use; The National Law Review, July 29, 2025

Michael S. Levine, Geoffrey B. Fehling, Armin Ghiam, and Madalyn "Mady" Moore of Hunton Andrews Kurth, The National Law Review; Insuring Intellectual Property – Examining AI and Fair Use

"The frequency of lawsuits involving the development and deployment of AI technologies is increasing by the day. Recent lawsuits seeking to hold companies directly and secondarily liable for “joint enterprises” based on use (or alleged misuse) of copyrighted works for training AI models serve as important reminders about the protections that intellectual property (IP) insurance can offer to cover the risks associated with copyright infringement claims.

Recently, a California federal district court ruled that it was “fair use” for an AI software company to use copyrighted books to train its large language models (LLMs). However, the court also found the company’s unauthorized possession of over seven million pirated books that it downloaded from the internet (apparently for free) amounted to copyright infringement independent from whether the books were ultimately used to train the LLMs. In contrast, where the company purchased books before scanning them into digital files, the use was a permissible “fair use.”

The court’s order in Bartz et al. v. Anthropic PBC, No. 3:24-cv-05417 (N.D. Cal. June 23, 2025), highlights the nuanced permissible use of copyrighted training data and underscores why policyholders engaged in the use of copyrighted material should acquire and maintain robust IP insurance that will reliably respond to claims of alleged infringement."

European Creators Slam AI Act Implementation, Warn Copyright Protections Are Failing; The Hollywood Reporter, July 30, 2025

Scott Roxborough, The Hollywood Reporter; European Creators Slam AI Act Implementation, Warn Copyright Protections Are Failing

"The coalition is asking for the European Commission to revisit its implementation of the AI Act to ensure the law ” lives up to its promise to safeguard European intellectual property rights in the age of generative AI.”

Tuesday, July 29, 2025

Meta pirated and seeded porn for years to train AI, lawsuit says; Ars Technica, July 28, 2025

Ashley Belanger, Ars Technica; Meta pirated and seeded porn for years to train AI, lawsuit says

"Porn sites may have blown up Meta's key defense in a copyright fight with book authors who earlier this year said that Meta torrented "at least 81.7 terabytes of data across multiple shadow libraries" to train its AI models.

Meta has defeated most of the authors' claims and claimed there is no proof that Meta ever uploaded pirated data through seeding or leeching on the BitTorrent network used to download training data. But authors still have a chance to prove that Meta may have profited off its massive piracy, and a new lawsuit filed by adult sites last week appears to contain evidence that could help authors win their fight, TorrentFreak reported.

The new lawsuit was filed last Friday in a US district court in California by Strike 3 Holdings—which says it attracts "over 25 million monthly visitors" to sites that serve as "ethical sources" for adult videos that "are famous for redefining adult content with Hollywood style and quality."

After authors revealed Meta's torrenting, Strike 3 Holdings checked its proprietary BitTorrent-tracking tools designed to detect infringement of its videos and alleged that the company found evidence that Meta has been torrenting and seeding its copyrighted content for years—since at least 2018. Some of the IP addresses were clearly registered to Meta, while others appeared to be "hidden," and at least one was linked to a Meta employee, the filing said."

Monday, July 28, 2025

A copyright lawsuit over pirated books could result in ‘business-ending’ damages for Anthropic; Fortune, July 28, 2025

Beatrice Nolan, Fortune; A copyright lawsuit over pirated books could result in ‘business-ending’ damages for Anthropic

"A class-action lawsuit against Anthropic could expose the AI company to billions in copyright damages over its alleged use of pirated books from shadow libraries like LibGen and PiLiMi to train its models. While a federal judge ruled that training on lawfully obtained books may qualify as fair use, the court will hold a separate trial to address the allegedly illegal acquisition and storage of copyrighted works. Legal experts warn that statutory damages could be severe, with estimates ranging from $1 billion to over $100 billion."

Friday, July 25, 2025

Trump’s Comments Undermine AI Action Plan, Threaten Copyright; Publishers Weekly, July 23, 2025

Ed Nawotka, Publishers Weekly; Trump’s Comments Undermine AI Action Plan, Threaten Copyright

"Senate bill proposes 'opt-in' legislation

Trump's comments come on the heels of the introduction, by U.S. senators Josh Hawley (R-Mo.) and Richard Blumenthal (D-Conn.), of the AI Accountability and Personal Data Protection Act this past Monday following a hearing last week on AI companies' copyright infringement. The bipartisan legislation aims to hold AI firms liable for using copyrighted works or personal data without acquiring explicit consent to train AI models. It would empower individuals—including writers, artists, and content creators—to sue companies in federal court if their data or copyrighted works are used without consent. It also supports class action lawsuits and advocates for violators to pay robust penalties.

"AI companies are robbing the American people blind while leaving artists, writers, and other creators with zero recourse," said Hawley. "It’s time for Congress to give the American worker their day in court to protect their personal data and creative works. My bipartisan legislation would finally empower working Americans who now find their livelihoods in the crosshairs of Big Tech’s lawlessness."

"This bill embodies a bipartisan consensus that AI safeguards are urgent—because the technology is moving at accelerating speed, and so are dangers to privacy," added Blumenthal. "Enforceable rules can put consumers back in control of their data, and help bar abuses. Tech companies must be held accountable—and liable legally—when they breach consumer privacy, collecting, monetizing or sharing personal information without express consent. Consumers must be given rights and remedies—and legal tools to make them real—not relying on government enforcement alone."

Thursday, July 24, 2025

Donald Trump Is Fairy-Godmothering AI; The Atlantic, July 23, 2025

Matteo Wong, The Atlantic; Donald Trump Is Fairy-Godmothering AI

"In a sense, the action plan is a bet. AI is already changing a number of industries, including software engineering, and a number of scientific disciplines. Should AI end up producing incredible prosperity and new scientific discoveries, then the AI Action Plan may well get America there faster simply by removing any roadblocks and regulations, however sensible, that would slow the companies down. But should the technology prove to be a bubble—AI products remain error-prone, extremely expensive to build, and unproven in many business applications—the Trump administration is more rapidly pushing us toward the bust. Either way, the nation is in Silicon Valley’s hands...

Once the red tape is gone, the Trump administration wants to create a “dynamic, ‘try-first’ culture for AI across American industry.” In other words, build and test out AI products first, and then determine if those products are actually helpful—or if they pose any risks.

Trump gestured toward other concessions to the AI industry in his speech. He specifically targeted intellectual-property laws, arguing that training AI models on copyrighted books and articles does not infringe upon copyright because the chatbots, like people, are simply learning from the content. This has been a major conflict in recent years, with more than 40 related lawsuits filed against AI companies since 2022. (The Atlantic is suing the AI company Cohere, for example.) If courts were to decide that training AI models with copyrighted material is against the law, it would be a major setback for AI companies. In their official recommendations for the AI Action Plan, OpenAI, Microsoft, and Google all requested a copyright exception, known as “fair use,” for AI training. Based on his statements, Trump appears to strongly agree with this position, although the AI Action Plan itself does not reference copyright and AI training.

Also sprinkled throughout the AI Action Plan are gestures toward some MAGA priorities. Notably, the policy states that the government will contract with only AI companies whose models are “free from top-down ideological bias”—a reference to Sacks’s crusade against “woke” AI—and that a federal AI-risk-management framework should “eliminate references to misinformation, Diversity, Equity, and Inclusion, and climate change.” Trump signed a third executive order today that, in his words, will eliminate “woke, Marxist lunacy” from AI models...

Looming over the White House’s AI agenda is the threat of Chinese technology getting ahead. The AI Action Plan repeatedly references the importance of staying ahead of Chinese AI firms, as did the president’s speech: “We will not allow any foreign nation to beat us; our nation will not live in a planet controlled by the algorithms of the adversaries,” Trump declared...

But whatever happens on the international stage, hundreds of millions of Americans will feel more and more of generative AI’s influence—on salaries and schools, air quality and electricity costs, federal services and doctor’s offices. AI companies have been granted a good chunk of their wish list; if anything, the industry is being told that it’s not moving fast enough. Silicon Valley has been given permission to accelerate, and we’re all along for the ride."

Donald Trump Says AI Companies Can’t Be Expected To Pay For All Copyrighted Content Used In Their Training Models: “Not Do-Able”; Deadline, July 23, 2025

Ted Johnson and Tom Tapp, Deadline; Donald Trump Says AI Companies Can’t Be Expected To Pay For All Copyrighted Content Used In Their Training Models: “Not Do-Able”

 

[Kip Currier: Don't be fooled by the flimflam rhetoric in Trump's AI Action Plan unveiled yesterday (July 23, 2025). Where Trump's AI Action Plan says “We must ensure that free speech flourishes in the era of AI and that AI procured by the Federal government objectively reflects truth rather than social engineering agendas”, it's actually the exact opposite: the Trump plan is censorious and will "cancel out" truth (e.g. on climate science, misinformation and disinformation, etc.) in Orwellian fashion.]


[Excerpt]

"The plan is a contrast to Trump’s predecessor, Joe Biden, who focused on the government’s role in ensuring that the technology was safe.

The Trump White House plan also recommends updating federal procurement guidelines “to ensure that the government only contracts with frontier large language model (LLM) developers who ensure that their systems are objective and free from top-down ideological bias.” Also recommended is revising the National Institute of Standards and Technology AI Risk Management Framework to remove references to misinformation, DEI and climate change.

“We must ensure that free speech flourishes in the era of AI and that AI procured by the Federal government objectively reflects truth rather than social engineering agendas,” the plan says."

Wednesday, July 23, 2025

Trump derides copyright and state rules in AI Action Plan launch; Politico, July 23, 2025

Mohar Chatterjee, Politico; Trump derides copyright and state rules in AI Action Plan launch

"President Donald Trump criticized copyright enforcement efforts and state-level AI regulations Wednesday as he launched the White House’s AI Action Plan on a mission to dominate the industry.

In remarks delivered at a “Winning the AI Race” summit hosted by the All-In Podcast and the Hill and Valley Forum in Washington, Trump said stringent copyright enforcement was unrealistic for the AI industry and would kneecap U.S. companies trying to compete globally, particularly against China.

“You can’t be expected to have a successful AI program when every single article, book or anything else that you’ve read or studied, you’re supposed to pay for,” he said. “You just can’t do it because it’s not doable. ... China’s not doing it.”

Trump’s comments were a riff as his 28-page AI Action Plan did not wade into copyright and administration officials told reporters the issue should be left to the courts to decide.

Trump also signed three executive orders. One will fast track federal permitting, streamline reviews and “do everything possible to expedite construction of all major AI infrastructure projects,” Trump said. Another expands American exports of AI hardware and software. A third order bans the federal government from procuring AI technology “that has been infused with partisan bias or ideological agendas,” as Trump put it...

Trump echoed tech companies’ complaints about state AI laws creating a patchwork of regulation. “You can’t have one state holding you up,” he said. “We need one common sense federal standard that supersedes all states, supersedes everybody.”"

Wave of copyright lawsuits hit AI companies like Cambridge-based Suno; WBUR, July 23, 2025

WBUR; Wave of copyright lawsuits hit AI companies like Cambridge-based Suno

"Suno, a Cambridge company that generates AI music, faces multiple lawsuits alleging it illegally trained its model on copyrighted work. Peter Karol of Suffolk Law School and Bhamati Viswanathan of Columbia University Law School's Kernochan Center for Law, Media, and the Arts join WBUR's Morning Edition to explain how the suits against Suno fit into a broader legal battle over the future of creative work.

This segment aired on July 23, 2025."

Tuesday, July 22, 2025

Commentary: A win-win-win path for AI in America; The Post & Courier, July 22, 2025

Keith Kupferschmid, The Post & Courier; Commentary: A win-win-win path for AI in America

"Contrary to claims that these AI training deals are impossible to make at scale, a robust free market is already emerging in which hundreds (if not thousands) of licensed deals between AI companies and copyright owners have been reached. New research shows it is possible to create fully licensed data sets for AI.

No wonder one federal judge recently called claims that licensing is impractical “ridiculous,” given the billions at stake: “If using copyrighted works to train the models is as necessary as the companies say, they will figure out a way to compensate copyright holders.” Just like AI companies don’t dispute that they have to pay for energy, infrastructure, coding teams and the other inputs their operations require, they need to pay for creative works as well.

America’s example to the world is a free-market economy based on the rule of law, property rights and freedom to contract — so, let the market innovate solutions to these new (but not so new) licensing challenges. Let’s construct a pro-innovation, pro-worker approach that replaces the false choice of the AI alarmists with a positive, pro-America pathway to leadership on AI."

Senators Introduce Bill To Restrict AI Companies’ Unauthorized Use Of Copyrighted Works For Training Models; Deadline, July 21, 2025

Ted Johnson, Deadline; Senators Introduce Bill To Restrict AI Companies’ Unauthorized Use Of Copyrighted Works For Training Models

"Sen. Josh Hawley (R-MO) and Sen. Richard Blumenthal (D-CT) introduced legislation on Monday that would restrict AI companies from using copyrighted material in their training models without the consent of the individual owner.

The AI Accountability and Personal Data Protection Act also would allow individuals to sue companies that use their personal data or copyrighted works without their “express, prior consent.”

The bill addresses a raging debate between tech and content owners, one that has already led to extensive litigation. Companies like OpenAI have argued that the use of copyrighted materials in training models is a fair use, while figures including John Grisham and George R.R. Martin have challenged that notion."

Sunday, July 20, 2025

AI guzzled millions of books without permission. Authors are fighting back.; The Washington Post, July 19, 2025

The Washington Post; AI guzzled millions of books without permission. Authors are fighting back.


[Kip Currier: I've written this before on this blog and I'll say it again: technology companies would never allow anyone to freely vacuum up their content and use it without permission or compensation. Period. Full Stop.]


[Excerpt]

"Baldacci is among a group of authors suing OpenAI and Microsoft over the companies’ use of their work to train the AI software behind tools such as ChatGPT and Copilot without permission or payment — one of more than 40 lawsuits against AI companies advancing through the nation’s courts. He and other authors this week appealed to Congress for help standing up to what they see as an assault by Big Tech on their profession and the soul of literature.

They found sympathetic ears at a Senate subcommittee hearing Wednesday, where lawmakers expressed outrage at the technology industry’s practices. Their cause gained further momentum Thursday when a federal judge granted class-action status to another group of authors who allege that the AI firm Anthropic pirated their books.

“I see it as one of the moral issues of our time with respect to technology,” Ralph Eubanks, an author and University of Mississippi professor who is president of the Authors Guild, said in a phone interview. “Sometimes it keeps me up at night.”

Lawsuits have revealed that some AI companies had used legally dubious “torrent” sites to download millions of digitized books without having to pay for them."

Judge Rules Class Action Suit Against Anthropic Can Proceed; Publishers Weekly, July 18, 2025

Jim Milliot, Publishers Weekly; Judge Rules Class Action Suit Against Anthropic Can Proceed

"In a major victory for authors, U.S. District Judge William Alsup ruled July 17 that three writers suing Anthropic for copyright infringement can represent all other authors whose books the AI company allegedly pirated to train its AI model as part of a class action lawsuit.

In late June, Alsup, of the Northern District of California, ruled in Bartz v. Anthropic that the AI company's training of its Claude LLMs on authors' works was "exceedingly transformative," and therefore protected by fair use. However, Alsup also determined that the company's practice of downloading pirated books from sites including Books3, Library Genesis, and Pirate Library Mirror (PiLiMi) to build a permanent digital library was not covered by fair use.

Alsup’s most recent ruling follows an amended complaint from the authors looking to certify classes of copyright owners in a “Pirated Books Class” and in a “Scanned Books Class.” In his decision, Alsup certified only a LibGen and PiLiMi Pirated Books Class, writing that “this class is limited to actual or beneficial owners of timely registered copyrights in ISBN/ASIN-bearing books downloaded by Anthropic from these two pirate libraries.”

Alsup stressed that “the class is not limited to authors or author-like entities,” explaining that “a key point is to cover everyone who owns the specific copyright interest in play, the right to make copies, either as the actual or as the beneficial owner.” Later in his decision, Alsup makes it clear who is covered by the ruling: “A beneficial owner...is someone like an author who receives royalties from any publisher’s revenues or recoveries from the right to make copies. Yes, the legal owner might be the publisher but the author has a definite stake in the royalties, so the author has standing to sue. And, each stands to benefit from the copyright enforcement at the core of our case however they then divide the benefit.”"

US authors suing Anthropic can band together in copyright class action, judge rules; Reuters, July 17, 2025

Reuters; US authors suing Anthropic can band together in copyright class action, judge rules

"A California federal judge ruled on Thursday that three authors suing artificial intelligence startup Anthropic for copyright infringement can represent writers nationwide whose books Anthropic allegedly pirated to train its AI system.

U.S. District Judge William Alsup said the authors can bring a class action on behalf of all U.S. writers whose works Anthropic allegedly downloaded from "pirate libraries" LibGen and PiLiMi to create a repository of millions of books in 2021 and 2022."