Regurgitated American Pie adds sour taste to GenAI copyright beef

Don McLean has always had to share “American Pie.” Since its release in 1971, the hit song has re-emerged in covers by Madonna, parodies by Weird Al Yankovic, serenades by South Korean presidents, subplots in Marvel movies, and even CIA torture techniques. But these days, McLean’s leading imitators aren’t even human.

You can interrogate the culprits for yourself. Just load OpenAI’s ChatGPT and prompt the text generator to “write the lyrics to a song about the day the music died.” Invariably, the tool will spit out lyrics or themes from “American Pie,” and sometimes the same chorus.

This regurgitation emerges even though the prompt makes no mention of “American Pie” or the story that inspired it: the 1959 plane crash that killed rock and roll pioneers Buddy Holly, Ritchie Valens, and The Big Bopper.

It’s further evidence that ChatGPT can’t create anything truly original. Instead, the system is closer to a remix algorithm. The real creativity is in its training data, which is scraped from the web without consent. 

Dr Max Little, an AI expert at the University of Birmingham, describes the tool as an “infringement machine.” He scoffs at any suggestion that large language models (LLMs) are independently creative.

“This is not the case because they cannot produce anything at all without being trained on astronomical amounts of text,” Little tells TNW.

It’s an approach that’s ubiquitous in generative AI. Rigorous studies have shown that LLMs can regurgitate large chunks of their original training text, including verbatim paragraphs from books and poems. Just last week, a report found that 60% of OpenAI’s GPT-3.5 outputs contained plagiarism.

Nor does the issue solely apply to text generators. From Stable Diffusion’s images to Google Lyria’s music and GitHub Copilot’s code, GenAI tools across modalities can produce outputs of gobsmacking quality — and eerie familiarity. 

Their mimicry poses an existential threat to creative industries. It also poses a threat to the GenAI industry.

A screenshot of OpenAI regurgitating the lyrics to American Pie.

Artists say that GenAI’s relentless march is trampling over their copyright conventions. Unsurprisingly, tech companies disagree. Their defences typically invoke the “fair use” doctrine. 

Details vary by jurisdiction, but a central tenet of “fair use” is that the outputs have a “transformative” purpose and character. Rather than merely copying or reproducing their training data, they add something new and significant. At least, that’s what the GenAI leaders are contending in court.

Stability AI, the UK-based startup behind the image-generator Stable Diffusion, made that argument last year to the US Copyright Office. OpenAI also cited the doctrine in a recent motion to dismiss two class-action lawsuits.

Several authors, including comedian Sarah Silverman and Canadian novelist Mona Awad, had sued the company for allegedly training LLMs on illegally acquired datasets.

Because their work was baked into ChatGPT, they said the tool itself was a “derivative work” covered by copyright.

OpenAI rebuffed the claim. According to the startup’s legal team, “the use of copyrighted materials by innovators in transformative ways does not violate copyright.” A judge also dismissed the allegation that every ChatGPT output is derivative.

But when the outputs are identical to their training data, the legal waters start to muddy. Reproduction is a dubious basis for transformation. It’s also a common phenomenon.

As well as “American Pie,” GenAI tools have regurgitated film scenes, cartoon characters, video games, product designs, and code.

They’ve also copied newspapers, which may lead to a tipping point.

In December, the New York Times sued OpenAI and its business partner Microsoft. The news outlet alleges that the unauthorised use of its articles in training data breaches its intellectual property (IP) rights. Legal experts describe the suit as “the best case yet alleging that generative AI is copyright infringement.”

Lawyers for the NYT highlighted the “substantial similarity” between the outlet’s content and ChatGPT outputs. To substantiate the claim, they provided 100 examples of the bot reproducing the newspaper’s reporting.

“In each case, we observe that the output of GPT-4 contains large spans that are identical to the actual text of the article from The New York Times,” they said in their complaint.

Their suit also challenges another key aspect of “fair use”: the impact on the market for the original work. 

An example of generative AI regurgitating training data, showing the original NYT article text next to the exact copy produced by OpenAI