Showing posts with label Large Language Models (LLMs). Show all posts

Friday, September 6, 2024

A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism; Arxiv, 2024

Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico (AWS AI Labs; UC Santa Barbara; Amazon)

brianjt@amazon.com, arXiv; A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

"Abstract

We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web."

Thursday, March 7, 2024

Researchers tested leading AI models for copyright infringement using popular books, and GPT-4 performed worst; CNBC, March 6, 2024

Hayden Field, CNBC; Researchers tested leading AI models for copyright infringement using popular books, and GPT-4 performed worst

"The company, founded by ex-Meta researchers, specializes in evaluation and testing for large language models — the technology behind generative AI products.

Alongside the release of its new tool, CopyrightCatcher, Patronus AI released results of an adversarial test meant to showcase how often four leading AI models respond to user queries using copyrighted text.

The four models it tested were OpenAI’s GPT-4, Anthropic’s Claude 2, Meta’s Llama 2 and Mistral AI’s Mixtral.

“We pretty much found copyrighted content across the board, across all models that we evaluated, whether it’s open source or closed source,” Rebecca Qian, Patronus AI’s cofounder and CTO, who previously worked on responsible AI research at Meta, told CNBC in an interview.

Qian added, “Perhaps what was surprising is that we found that OpenAI’s GPT-4, which is arguably the most powerful model that’s being used by a lot of companies and also individual developers, produced copyrighted content on 44% of prompts that we constructed.”"

Monday, December 18, 2023

AI could threaten creators — but only if humans let it; The Washington Post, December 17, 2023

The Washington Post; AI could threaten creators — but only if humans let it

"A broader rethinking of copyright, perhaps inspired by what some AI companies are already doing, could ensure that human creators get some recompense when AI consumes their work, processes it and produces new material based on it in a manner current law doesn’t contemplate. But such a shift shouldn’t be so punishing that the AI industry has no room to grow. That way, these tools, in concert with human creators, can push the progress of science and useful arts far beyond what the Framers could have imagined."

Saturday, October 28, 2023

An AI engine scans a book. Is that copyright infringement or fair use?; Columbia Journalism Review, October 26, 2023

Mathew Ingram, Columbia Journalism Review; An AI engine scans a book. Is that copyright infringement or fair use?

"Determining whether LLMs training themselves on copyrighted text qualifies as fair use can be difficult even for experts—not just because AI is complicated, but because the concept of fair use is, too."

Thursday, October 26, 2023

Why I let an AI chatbot train on my book; Vox, October 25, 2023

Vox; Why I let an AI chatbot train on my book

"What’s “fair use” for AI?

I think that training a chatbot for nonprofit, educational purposes, with the express permission of the authors of the works on which it’s trained, seems okay. But do novelists like George R.R. Martin or John Grisham have a case against for-profit companies that take their work without that express permission?

The law, unfortunately, is far from clear on this question." 

Tuesday, July 25, 2023

The Generative AI Battle Has a Fundamental Flaw; Wired, July 25, 2023

Wired; The Generative AI Battle Has a Fundamental Flaw

"At the core of these cases, explains Sag, is the same general theory: that LLMs “copied” authors’ protected works. Yet, as Sag explained in testimony to a US Senate subcommittee hearing earlier this month, models like GPT-3.5 and GPT-4 do not “copy” work in the traditional sense. Digest would be a more appropriate verb—digesting training data to carry out their function: predicting the best next word in a sequence. “Rather than thinking of an LLM as copying the training data like a scribe in a monastery,” Sag said in his Senate testimony, “it makes more sense to think of it as learning from the training data like a student.”...

Ultimately, though, the technology is not going away, and copyright can only remedy some of its consequences. As Stephanie Bell, a research fellow at the nonprofit Partnership on AI, notes, setting a precedent where creative works can be treated like uncredited data is “very concerning.” To fully address a problem like this, the regulations AI needs aren't yet on the books."

Saturday, July 15, 2023

‘Not for Machines to Harvest’: Data Revolts Break Out Against A.I.; The New York Times, July 15, 2023

Sheera Frenkel, The New York Times; ‘Not for Machines to Harvest’: Data Revolts Break Out Against A.I.

"At the heart of the rebellions is a newfound understanding that online information — stories, artwork, news articles, message board posts and photos — may have significant untapped value.

The new wave of A.I. — known as “generative A.I.” for the text, images and other content it generates — is built atop complex systems such as large language models, which are capable of producing humanlike prose. These models are trained on hoards of all kinds of data so they can answer people’s questions, mimic writing styles or churn out comedy and poetry...

“What’s happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, the founder and chief executive of Nomic, an A.I. company...

“The data rebellion that we’re seeing across the country is society’s way of pushing back against this idea that Big Tech is simply entitled to take any and all information from any source whatsoever, and make it their own,” said Ryan Clarkson, the founder of Clarkson...

Eric Goldman, a professor at Santa Clara University School of Law, said the lawsuit’s arguments were expansive and unlikely to be accepted by the court. But the wave of litigation is just beginning, he said, with a “second and third wave” coming that would define A.I.’s future."