Microsoft recently launched a new version of all its software, adding an artificial intelligence (AI) assistant that can perform various tasks for you. Copilot can summarize spoken conversations in online Teams meetings, present arguments for or against a particular point based on those discussions and answer some of your emails. It can even write computer code.
This rapidly developing technology seems to bring us closer to a future where AI makes our lives easier and takes away all the boring and repetitive things we humans have to do.
But while these advances are significant and useful, we must be careful when using large language models (LLMs). Despite how intuitive they are to use, they still require skill to be used effectively, reliably and safely.
Large language models
LLMs, a type of “deep learning” neural network, are designed to understand user intent by analyzing the likelihood of different responses to the prompt provided. So, when someone enters a prompt, the LLM examines the text and determines the most likely answer.
ChatGPT, a prominent example of an LLM, can provide answers to prompts on a wide range of topics. But despite its seemingly knowledgeable answers, ChatGPT does not actually possess knowledge. Its answers are simply the most likely outcomes based on the prompt it was given.
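As a rough illustration of that idea, here is a toy Python sketch of how a model might score and pick the most likely continuation of a prompt. The probability table and function below are invented for illustration only; real LLMs use learned weights over enormous vocabularies, not hand-written lookups.

```python
# Toy illustration: a language model assigns probabilities to possible
# continuations of the text so far, then picks the most likely one.
# The probability table below is invented for illustration only.

next_word_probs = {
    "The capital of France is": {"Paris": 0.92, "Lyon": 0.05, "purple": 0.03},
    "2 + 2 equals": {"4": 0.95, "5": 0.03, "fish": 0.02},
}

def most_likely_continuation(prompt: str) -> str:
    """Return the highest-probability continuation for a known prompt."""
    candidates = next_word_probs.get(prompt, {})
    if not candidates:
        return "<unknown>"
    # The model does not "know" the answer; it only ranks likelihoods.
    return max(candidates, key=candidates.get)

print(most_likely_continuation("The capital of France is"))  # Paris
```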
When people give ChatGPT, Copilot and other LLMs detailed descriptions of the tasks they want to complete, these models can excel at providing high-quality answers. These might be text, images or computer code.
But, as humans, we often push the boundaries of what technology can do and what it was originally designed for. Therefore, we start using these systems to do the legwork that we should have done ourselves.
Why over-reliance on AI could be a problem
Despite their seemingly intelligent answers, we cannot trust LLMs to be accurate or reliable. We must carefully evaluate and verify their outputs, making sure our original prompts are reflected in the answers we are given.
To effectively verify and validate LLM outputs, we need to have a strong understanding of the subject. Without expertise, we cannot provide the necessary quality assurance.
This becomes particularly important in situations where we are using LLMs to fill gaps in our own knowledge. Here, our lack of knowledge may mean we are unable to determine whether the output is correct or not. This situation can arise in text generation and in coding.
Using AI to attend meetings and summarize the discussions has obvious reliability risks. Although the meeting record is based on a transcript, the meeting notes are generated in the same way as other texts from LLMs. They are still based on language patterns and the probabilities of what was said, so they need verification before they can be acted upon.
They also suffer from interpretation problems caused by homophones, words that sound the same but have different meanings. People are good at working out what is meant in cases like this because of the context of the conversation.
But AI is not good at grasping context and does not understand nuance. So expecting it to construct arguments from a potentially flawed transcript creates even more problems.
Verification is even more difficult when we use AI to generate computer code. Testing code with test data is the only reliable way to validate its functionality. While this shows that the code works as intended, it does not guarantee that its behavior matches real-world expectations.
Suppose we use generative AI to create code for a sentiment analysis tool. The goal is to analyze product reviews and categorize sentiments as positive, neutral or negative. We can test the functionality of the system and validate that the code functions correctly – that it is sound from a technical programming point of view.
However, imagine we deploy such software in the real world and it starts classifying sarcastic product reviews as positive. The sentiment analysis system lacks the contextual knowledge needed to understand that the sarcasm is expressing negative feedback, not positive.
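To see how this can slip past testing, consider a toy Python sketch of a hypothetical keyword-based classifier. The word lists and reviews below are invented, and real sentiment tools are far more sophisticated, but the gap between “the tests pass” and “the output matches real-world meaning” is the same.

```python
# Hypothetical keyword-based sentiment classifier, for illustration only.
POSITIVE = {"great", "love", "excellent", "fantastic"}
NEGATIVE = {"broken", "terrible", "refund", "awful"}

def classify(review: str) -> str:
    """Label a review positive, negative or neutral by counting keywords."""
    words = set(review.lower().replace(",", " ").replace(".", " ").split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Straightforward unit tests pass: the code is "technically correct".
assert classify("This blender is excellent, I love it") == "positive"
assert classify("Terrible build quality, I want a refund") == "negative"

# But a sarcastic review is happily labelled positive.
print(classify("Oh great, it broke after one day. Fantastic."))  # positive
```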
Expertise is required to verify that code output matches the desired results in nuanced situations like this.
Non-programmers will lack knowledge of the software engineering principles used to ensure code is correct, such as planning, methodology, testing and documentation. Programming is a complex discipline, and software engineering emerged as a field to manage software quality.
There is a significant risk, as my own research has shown, that critical steps in the software design process will be overlooked or skipped by non-experts, resulting in code of unknown quality.
Validation and verification
LLMs like ChatGPT and Copilot are powerful tools that we can all benefit from. But we must be careful not to blindly trust the outputs we are given.
We are right at the beginning of a great revolution based on this technology. AI offers endless possibilities, but its outputs need to be shaped, checked and verified. And right now, humans are the only ones who can do this.
This article from The Conversation is republished under a Creative Commons license. Read the original article.
Simon Thorne does not work for, consult with, hold shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.