AI ‘gold rush’ for chatbot training data could run out of human-written text

Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter – the tens of trillions of words people have written and shared online.

A new study released Thursday by research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models by roughly the turn of the decade, sometime between 2026 and 2032.

Comparing it to a “literal gold rush” that depletes finite natural resources, Tamay Besiroglu, an author of the study, said the field of AI could face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing.

In the short term, tech companies like ChatGPT maker OpenAI and Google are racing to secure, and sometimes pay for, high-quality data sources to train their large AI language models – for example, by signing deals to tap the steady flow of sentences coming out of Reddit forums and news media outlets.

In the longer term, there will not be enough new blogs, news articles and social media commentary to sustain the current trajectory of AI development, putting pressure on companies to tap into sensitive data now considered private – such as emails or text messages – or to rely on less reliable “synthetic data” spit out by the chatbots themselves.

“There is a serious bottleneck here,” Besiroglu said. “If you start hitting those constraints about how much data you have, then you can’t really scale up your models efficiently anymore. And scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output.”

The researchers first made their projections two years ago – shortly before ChatGPT’s debut – in a working paper that forecast a more imminent 2026 cutoff for high-quality text data. A lot has changed since then, including new techniques that have enabled AI researchers to make better use of the data they already have and sometimes “overtrain” on the same sources multiple times.

But there are limits, and after further research, Epoch now foresees running out of public text data sometime in the next two to eight years.

The team’s latest study is peer-reviewed and is to be presented at the International Conference on Machine Learning in Vienna, Austria this summer. Epoch is a nonprofit institute hosted by San Francisco-based Rethink Priorities and funded by advocates of effective altruism—a philanthropic movement that has put money into mitigating AI’s worst risks.

Besiroglu said AI researchers realized more than a decade ago that aggressively expanding two key ingredients – computing power and vast stores of internet data – could significantly improve the performance of AI systems.

The amount of text data fed into AI language models has been growing about 2.5 times per year, while computing has grown about 4 times per year, according to the Epoch study. Facebook’s parent company Meta Platforms recently claimed that the largest version of its upcoming Llama 3 model – which has not yet been released – has been trained on up to 15 trillion tokens, each of which can represent a piece of a word.
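
To see why that growth rate matters, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration, not the Epoch study's method, which weighs many more factors. The 15 trillion token starting point and the 2.5 times yearly growth come from the figures above; the 300 trillion token stock of usable public text is a hypothetical placeholder chosen for the example.

```python
import math


def years_until_exhausted(current_tokens, stock_tokens, growth_per_year):
    """Years of compound growth before a single training run needs the whole stock."""
    if growth_per_year <= 1:
        raise ValueError("growth_per_year must be greater than 1")
    if current_tokens >= stock_tokens:
        return 0.0
    # Solve current * growth**n = stock for n.
    return math.log(stock_tokens / current_tokens) / math.log(growth_per_year)


CURRENT_TOKENS = 15e12         # from the article: tokens Meta cites for Llama 3
GROWTH_PER_YEAR = 2.5          # from the article: yearly growth in training text
ASSUMED_STOCK_TOKENS = 300e12  # hypothetical stock of public text, for illustration

years = years_until_exhausted(CURRENT_TOKENS, ASSUMED_STOCK_TOKENS, GROWTH_PER_YEAR)
print(f"Roughly {years:.1f} years until training sets match the assumed stock")
```

With these inputs the sketch prints a figure of roughly 3.3 years. A larger or smaller assumed stock moves the answer by only a year or two, because each year multiplies demand by 2.5.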

But how much the data bottleneck is worth worrying about is debatable.

“I think it’s important to keep in mind that we don’t necessarily need to train larger and larger models,” said Nicolas Papernot, an assistant professor of computer engineering at the University of Toronto and a researcher at the Vector Institute for Artificial Intelligence.

Papernot, who was not involved in the Epoch study, said that more skilled AI systems can also come from training models that are more specialized for specific tasks. But he has concerns about training generative AI systems on the same outputs they produce, leading to degraded performance known as “model collapse”.

Training on AI-generated data is “like what happens when you make a photocopy of a piece of paper and then make a photocopy of the photocopy. You lose some of the information,” Papernot said. Not only that, but Papernot’s research also found that it can further encode the mistakes, bias and inequity already baked into the information ecosystem.

If real human-crafted sentences remain a critical source of AI data, those who are the stewards of the most sought-after troves – websites like Reddit and Wikipedia, as well as news and book publishers – have been forced to think hard about how they are being used.

“Maybe you don’t lop off the tops of every mountain,” jokes Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, which runs Wikipedia. “It’s an interesting problem right now that we’re having natural resource conversations about human-created data. I shouldn’t be laughing about it, but I think it’s kind of cool.”

Although some have tried to close off their data from AI training – often after it has already been taken without compensation – Wikipedia has placed few restrictions on how AI companies use its volunteer-written entries. Still, Deckelmann said she hopes there will continue to be incentives for people to keep contributing, especially as a flood of cheap and automatically generated “garbage content” begins to pollute the internet.

AI companies should be “concerned about how human-generated content continues to exist and continues to be accessible,” she said.

From the point of view of AI developers, the Epoch study says that paying millions of people to generate the text that AI models will need is unlikely to be an economical way to drive better technical performance.

As OpenAI begins training the next generation of its GPT large language models, CEO Sam Altman told an audience at a United Nations event last month that the company had already experimented with “generating lots of synthetic data” for training.

“I think what you need is high-quality data. There is low-quality synthetic data. There is low-quality human data,” Altman said. But he also expressed reservations about relying too heavily on synthetic data over other technical methods to improve AI models.

“It would be something very strange if the best way to train a model was to generate a quadrillion tokens of synthetic data and feed them back in,” Altman said. “Somehow that seems inefficient.”

—————

The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to some of AP’s text archives.
