The New York Times lawsuit against OpenAI could have major implications for the development of machine intelligence


In 1954, the Guardian’s science correspondent reported on “electronic brains”, which had a type of memory that could allow them to retrieve information, such as airline seat allocations, in seconds.

Today the idea of computers storing information is so common that we don’t even think about what words like “memory” mean. In the 1950s, however, this language was new to most people, and the idea of an “electronic brain” was heavy with possibility.

In 2024, your microwave has more computing power than anything that was called a brain in the 1950s, but the world of artificial intelligence is creating new challenges for language – and for lawyers. Last month, the New York Times newspaper filed a lawsuit against OpenAI and Microsoft, owners of the popular AI-based text generation tool ChatGPT, over their alleged use of Times articles in the data they use to train (improve) and test their systems.

The Times claims that OpenAI has violated its copyright by using its journalism as part of the process of creating ChatGPT. In doing so, the lawsuit claims, OpenAI has created a competing product that threatens the newspaper’s business. OpenAI’s response so far has been very cautious, but a key principle set out in a statement released by the company is that its use of online data falls under the so-called “fair use” principle. This is because, OpenAI argues, it transforms the work into something new in the process – the text generated by ChatGPT.

At the heart of this question is the issue of data use. What data are companies like OpenAI entitled to use, and what do concepts like “transformation” really mean in these contexts? Questions like these, concerning the data on which we train AI systems, or models, such as ChatGPT, remain an intense academic battleground. The law often lags behind industry behaviour.

If you’ve used AI to answer emails or summarize work for you, you might see ChatGPT as an end that justifies the means. However, perhaps we should be concerned if the only way to achieve this is to exempt certain corporate entities from laws that apply to everyone else.

This could not only change the nature of the debate about copyright laws like this, but it could change the way societies structure their legal systems.


Read more: ChatGPT: what the law says about who owns the copyright of AI-generated content


Basic questions

Cases like these can raise difficult questions about the future of legal systems, but they can also call into question the future of AI models themselves. The New York Times believes that ChatGPT threatens the newspaper’s long-term existence. On this point, OpenAI says in its statement that it is collaborating with news organizations to provide new opportunities in journalism. It says the company’s goals are to “support a healthy news ecosystem” and to “be a good partner”.

Even if we believe that AI systems are an essential part of our society’s future, it seems like a bad idea to destroy the data sources they were trained on in the first place. This is a concern shared by creative organizations such as the New York Times, authors such as George RR Martin, and the online encyclopedia Wikipedia.

Advocates of large-scale data collection – such as that used to power large language models (LLMs), the technology behind AI chatbots such as ChatGPT – argue that AI systems “transform” the data they are trained on by “learning” from their data sets and then creating something new.


Effectively, what they mean is that researchers provide data written by humans and ask these systems to guess the next words in a sentence, as if they were dealing with a real question from a user. By hiding and then revealing these answers, researchers can provide a binary “yes” or “no” signal that helps push AI systems towards accurate predictions. It is for this reason that LLMs need such vast bodies of written text.
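To get a rough feel for this “guess the next word” objective, here is a toy sketch in Python. It builds a simple word-following table from a made-up snippet of text and scores its guesses against the words that actually come next. Real LLMs use neural networks trained on billions of documents, so this illustrates only the objective, not how ChatGPT or any particular system is built; the example text and function names are invented for the sketch.

# Toy illustration of the "guess the next word" idea behind LLM training.
# A bigram table stands in for the model; real systems use neural networks.
from collections import Counter, defaultdict

training_text = "the cat sat on the mat and the cat slept on the mat".split()

# Count which word tends to follow each word in the training data.
next_word_counts = defaultdict(Counter)
for current, following in zip(training_text, training_text[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the model's most likely next word, or None if the word is unseen."""
    counts = next_word_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

# "Hide and reveal": compare each guess with the word the text really contains next.
# This yes/no signal is what nudges a real model towards better predictions.
held_out = "the cat sat on the mat".split()
correct = sum(
    predict_next(current) == following
    for current, following in zip(held_out, held_out[1:])
)
print(f"correct next-word guesses: {correct} / {len(held_out) - 1}")

The point of the sketch is simply that the “learning” signal comes from comparing guesses with human-written text, which is why the quality and quantity of that text matters so much to the companies building these systems.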

If we copied articles from the New York Times website and charged people for access, most people would agree that this would be “systematic theft on a mass scale” (as the newspaper’s lawsuit puts it). But improving the accuracy of an AI by using data to guide it, as described above, is more complicated than this.

Firms like OpenAI do not store their training data, and so they argue that the New York Times articles fed into the dataset are not actually being reused. A counter-argument to this defense of AI, however, is that there is evidence that systems such as ChatGPT can “regurgitate” verbatim passages from their training data. OpenAI says this is a “rare bug”.

However, it suggests that these systems do store and remember some of the data they are trained on – unintentionally – and can recall it verbatim when prompted in specific ways. This would bypass any paywalls a for-profit publication might put in place to protect its intellectual property.
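As a rough illustration of what “verbatim recall” means in practice, the short Python sketch below looks for long word sequences that a generated passage shares with a source text. The texts and the function name are invented for the example, and researchers who study memorisation in real systems use far more careful methods; this only shows the basic idea of checking for word-for-word overlap.

# Rough check for verbatim overlap: report long word sequences that appear
# in both a source text and a generated text. Overlapping windows are listed
# separately; this is a sketch, not a rigorous memorisation test.
def shared_passages(source, generated, min_words=8):
    """Return word sequences of at least min_words found in both texts."""
    source_words = source.split()
    generated_joined = " ".join(generated.split())
    matches = []
    for start in range(len(source_words) - min_words + 1):
        passage = " ".join(source_words[start:start + min_words])
        if passage in generated_joined and passage not in matches:
            matches.append(passage)
    return matches

article = ("the quick brown fox jumps over the lazy dog "
           "while the sun sets slowly behind the distant hills")
output = ("in one test the model wrote that the quick brown fox jumps "
          "over the lazy dog before ending its reply")
print(shared_passages(article, output))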

Use of language

But the thing likely to have the most long-term impact on the way we legislate in situations like this is our use of language. Most AI researchers will tell you that “learning” is a loaded and imprecise word for describing what AI is actually doing.

The question must be asked whether the law as it currently stands is sufficient to protect and support people as society undergoes a massive shift into the age of AI. Whether something builds on an existing copyrighted work in a way that makes it different from the original is known as “transformative use”, and it is a defense invoked by OpenAI.

However, these laws were designed to encourage people to remix, reassemble and experiment with work that had already been released into the outside world. The same laws were not really designed to protect multi-billion dollar technology products that operate at a speed and scale many orders of magnitude greater than any human writer could ever aspire to.

The problem with many of the defenses of large-scale data collection and use is that they rely on strange uses of the English language. We say that AI “learns”, that it “understands”, that it can “think”. However, these are analogies, not precise technical language.

Just as in 1954, when people looked at the modern equivalent of a broken calculator and called it a “brain”, we are today using old language to tackle entirely new concepts. Whatever we call it, systems like ChatGPT don’t work like our brains, and AI systems don’t play the same role in society that humans play.

Just as we had to develop new words and a new common understanding of technology to make sense of computers in the 1950s, we may need to develop new language and new laws to help protect our society in the 2020s.

This article from The Conversation is republished under a Creative Commons license. Read the original article.


Mike Cook does not work for, consult with, own shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.
