Human differences in judgment pose problems for AI

Many people understand the concept of bias on some intuitive level. In society, and in artificial intelligence systems, race and gender biases are well documented.

If society could somehow remove bias, would all problems go away? The late Nobel laureate Daniel Kahneman, who was a central figure in the field of behavioral economics, argued in his last book that bias is only one side of the coin. Errors in judgment can be attributed to two sources: bias and noise.

Both bias and noise play important roles in fields such as law, medicine and financial forecasting, where human judgment is central. In our work as computer and information scientists, my colleagues and I discovered that noise also plays a role in AI.

Statistical noise

Noise in this context means variation in how people judge the same problem or situation. The problem of noise is more pervasive than it might first appear. A landmark body of work, dating back to the Great Depression, found that different judges gave different sentences for similar cases.

Worryingly, sentencing in court cases could depend on things like the temperature and whether the local football team won. Such factors contribute, at least in part, to the perception that the justice system is not only biased but sometimes arbitrary.

Other examples: Insurance adjusters may give different estimates for similar claims, which introduces noise in their judgments. All kinds of competitions, from wine tasting to local beauty pageants to college admissions, are likely to involve noise.

Noise in the data

On the surface, it doesn’t seem like noise could affect the performance of AI systems. After all, weather or football teams don’t interfere with the machines, so why would they make judgments that vary according to circumstances? On the other hand, researchers know that bias affects AI, because it is reflected in the data the AI is trained on.

For the new generation of AI models like ChatGPT, the gold standard is human performance on general intelligence problems such as common sense. ChatGPT and its counterparts are benchmarked against human-labeled common sense datasets.

Simply put, researchers and developers can ask the machine a common sense question and compare its answer to human responses: “If I put a heavy rock on a paper table, will it fall off? Yes or No.” If there is high agreement between the two (at best, perfect agreement), the machine is approaching human-level common sense, according to the test.
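As an illustration, here is a minimal sketch in Python of how such a comparison might be scored. The questions, answers and model_answer function are made-up placeholders, not the actual benchmark data or any real system’s API.

```python
# Minimal sketch of benchmarking a model against human "gold" answers.
# The questions, answers and model_answer function below are illustrative
# placeholders, not the actual benchmark data or any real model's API.

questions = [
    "If I put a heavy rock on a paper table, will it fall off?",
    "Can a goldfish drive a car?",
]
human_answers = ["Yes", "No"]  # one agreed-upon human label per question

def model_answer(question: str) -> str:
    """Stand-in for querying an AI system; returns 'Yes' or 'No'."""
    return "Yes" if "rock" in question else "No"

matches = sum(model_answer(q) == gold for q, gold in zip(questions, human_answers))
agreement = matches / len(questions)
print(f"Agreement with human labels: {agreement:.0%}")
```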

So where would noise come in? The common sense question above seems simple, and most people would probably agree on its answer, but there are many questions where there is more disagreement or uncertainty: “Is the following sentence plausible or implausible? My dog plays volleyball.” In other words, there is room for noise. It’s not surprising that interesting, non-trivial questions would have some noise.

But the issue is that most AI tests do not account for this noise. Intuitively, questions whose human responses tend to agree with one another should be weighted more heavily than questions where the responses vary, in other words, where there is noise. Researchers don’t yet know whether or how to weigh AI responses in that situation, but acknowledging the problem is the first step.
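One way such a weighting could work, sketched here with hypothetical per-question agreement rates rather than any method from the paper, is to give each question credit in proportion to how strongly the human labelers agreed on it:

```python
# Sketch: weight each question's score by how strongly humans agreed on it.
# The (model_correct, human_agreement) values below are hypothetical.

results = [
    (True, 0.95),   # near-unanimous human label, model matched it
    (False, 0.55),  # humans split almost evenly, model "wrong"
    (True, 0.80),
]

plain_accuracy = sum(correct for correct, _ in results) / len(results)

weighted_accuracy = (
    sum(agree for correct, agree in results if correct)
    / sum(agree for _, agree in results)
)

print(f"Unweighted accuracy: {plain_accuracy:.2f}")
print(f"Agreement-weighted accuracy: {weighted_accuracy:.2f}")
```

Under this scheme, being “wrong” on a question that humans themselves split on costs less than being wrong on a question with a near-unanimous answer.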

Noise in the machine

Theory aside, the question remains whether these concerns are merely hypothetical or whether noise actually shows up in real common sense tests. The best way to prove or disprove the presence of noise is to take an existing test and have multiple people independently label its questions, meaning provide answers. By measuring disagreement among those people, researchers can determine how much noise is in the test.
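As a rough sketch of what such a measurement could look like, with invented labels rather than data from our study, one can compute for each question how far the labelers are from unanimity:

```python
from collections import Counter

# Sketch: quantify disagreement among independent human labelers.
# The labels below are invented for illustration, not data from the study.

labels_per_question = [
    ["Yes", "Yes", "Yes", "Yes", "Yes"],   # unanimous -> no noise
    ["Yes", "No", "Yes", "No", "Yes"],     # split -> noisy question
    ["No", "No", "Yes", "No", "No"],
]

def disagreement(labels):
    """Fraction of labelers who did not pick the majority answer."""
    counts = Counter(labels)
    majority = counts.most_common(1)[0][1]
    return 1 - majority / len(labels)

for i, labels in enumerate(labels_per_question, start=1):
    print(f"Question {i}: disagreement = {disagreement(labels):.2f}")
```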

Measuring this disagreement is complex, involving significant statistics and mathematics. Besides, who is to say how common sense should be defined? How do you know the human judges are motivated enough to think about the questions? These issues lie at the intersection of good experimental design and statistics. Robustness is key: A single result, test or set of human labels is unlikely to convince anyone. As a pragmatic matter, human labeling is expensive. Perhaps for these reasons, no study of potential noise in AI tests had been conducted.

To address this gap, my colleagues and I designed such a study and published our findings in Nature Scientific Reports, showing that noise is unavoidable even in the realm of common sense. Because the setting in which judgments are obtained can matter, we conducted two types of studies. One type involved paid workers from Amazon Mechanical Turk, while the other involved a smaller-scale labeling exercise in two labs at the University of Southern California and Rensselaer Polytechnic Institute.

You can think of the former as a realistic online setting, reflecting how many AI tests are actually labeled before they are released for training and evaluation. The latter is more of an extreme, guaranteeing high quality but at much smaller scale. The question we set out to answer was whether noise is truly unavoidable or whether it can be chalked up to poor quality control.

The results were sobering. In both settings, even on common sense questions that might have been expected to achieve high, even universal, agreement, we found a non-trivial amount of noise. The noise was high enough that we estimated that between 4% and 10% of system performance could be attributed to noise.

To emphasize what this means, suppose I built an AI system that scored 85% on a test, and you built one that scored 91%. Your system seems to be much better than mine. But if there is noise in the human labels that were used to score the responses, we can no longer be sure the 6-percentage-point improvement means much. For all we know, there may be no real improvement at all.
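Here is a toy simulation, not taken from the paper and with entirely made-up numbers, of why that is. If a modest fraction of the gold labels are themselves essentially coin flips, the measured score of a single system can plausibly wander over a range wide enough to contain both 85% and 91%.

```python
import random

# Toy simulation: how uncertain "gold" labels blur a measured score.
# All numbers are arbitrary illustrations, not results from the paper.

random.seed(0)
N = 200                 # questions in the hypothetical test
NOISY = 20              # assume 10% of gold labels are essentially coin flips
STABLE_CORRECT = 166    # questions answered correctly among the 180 stable ones

def measured_score() -> float:
    """Score the system against one plausible re-draw of the noisy labels."""
    lucky = sum(random.random() < 0.5 for _ in range(NOISY))
    return (STABLE_CORRECT + lucky) / N

scores = [measured_score() for _ in range(1000)]
print(f"Measured score ranges from {min(scores):.0%} to {max(scores):.0%}")
```

Under these assumptions the same system averages about 88% but can land anywhere from roughly the mid-80s to the low 90s depending on how the uncertain labels happen to fall.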

On AI leaderboards, where large language models like the one that powers ChatGPT are compared, performance differences between competing systems are much narrower, typically less than 1%. As we show in the paper, conventional statistics do not really come to the rescue in disentangling the effects of noise from those of real performance improvements.

Noise audits

What is the way forward? Returning to Kahneman’s book, he proposed the concept of “noise auditing” to quantify and ultimately mitigate noise as much as possible. At the very least, AI researchers need to estimate the potential impact of noise.

Auditing AI systems for bias is commonplace, so we believe the concept of noise auditing should naturally follow. We hope that this study, as well as others like it, leads to the adoption of noise audits.

This article is republished from The Conversation, a non-profit, independent news organization that brings you facts and analysis to help you make sense of our complex world.

It was written by: Mayank Kejriwal, University of Southern California.

Mayank Kejriwal receives funding from DARPA.
