Machine learning has pushed the boundaries in a variety of areas, including personalized medicine, self-driving cars and customized ads. However, research has shown that these systems can memorize features of the data they were trained on, raising privacy concerns.
In statistics and machine learning, the goal is to learn from past data to make new predictions or inferences about future data. To achieve this goal, the statistician or machine learning expert selects a model to capture the suspected patterns in the data. A model applies a simplifying structure to the data, making it possible to learn patterns and make predictions.
Complex machine learning models have inherent advantages and disadvantages. On the plus side, they can learn much more complex patterns and work with richer datasets for tasks like image recognition and predicting how a particular person will respond to treatment.
However, complex models also run the risk of overfitting the data. This means that they make accurate predictions on the data they were trained on but begin to learn aspects of the data that are not directly relevant to the task at hand. The result is a model that does not generalize, meaning it performs poorly on new data that is similar to, but not exactly the same as, the training data.
While there are techniques to combat the predictive errors caused by overfitting, there are also privacy concerns because so much about the training data can be learned from an overfit model.
How machine learning algorithms make inferences
Each model has a certain number of parameters. A parameter is an element of a model that can be changed. Each parameter has a value, or setting, that the model derives from the training data. Parameters can be thought of as the various knobs that can be turned to affect the performance of the algorithm. While a straight-line pattern has only two parameters, the slope and the intercept, machine learning models have many more. For example, the language model GPT-3 has 175 billion.
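As a small illustration, consider the straight-line case. The sketch below uses the NumPy library and made-up numbers; fitting the line amounts to choosing values for just two parameters.

```python
import numpy as np

# Made-up training data: hours of exercise per week vs. a health score.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a straight line y = slope * x + intercept.
# The two "knobs" here are the slope and the intercept.
slope, intercept = np.polyfit(x, y, deg=1)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
# A prediction for new data reuses the same two parameter values.
print("predicted score for x = 6:", slope * 6 + intercept)
```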
In order to select the parameter values, machine learning methods use training data, with the goal of minimizing the predictive error on that data. For example, if the goal is to predict whether a person would respond well to a certain medical treatment based on their medical history, the machine learning model would make predictions on data where the model's developers know whether someone responded well or poorly. The model is rewarded for correct predictions and penalized for incorrect predictions, which instructs the algorithm to adjust its parameters – that is, turn some of the “knobs” – and try again.
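Here is a minimal sketch of that reward-and-adjust loop, applied to the line-fitting example above. It uses plain gradient descent with an assumed learning rate and step count; it is an illustration of the idea, not any particular production system.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = 0.0, 0.0   # start with arbitrary knob settings
learning_rate = 0.01

for step in range(5000):
    predictions = slope * x + intercept
    errors = predictions - y   # how far off each prediction is
    # Nudge each parameter in the direction that reduces the average squared error.
    slope -= learning_rate * 2 * np.mean(errors * x)
    intercept -= learning_rate * 2 * np.mean(errors)

print(f"learned slope = {slope:.2f}, intercept = {intercept:.2f}")
```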
To avoid overfitting the training data, machine learning models are also checked against a validation dataset. The validation dataset is a separate dataset that is not used in the training process. By checking the machine learning model’s performance on this validation dataset, developers can ensure that the model is able to generalize its learning beyond the training data, avoiding overfitting.
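The sketch below shows what that check can look like in practice, assuming the scikit-learn library and a small synthetic dataset. A large gap between training and validation accuracy is a warning sign of overfitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 200 examples with 5 features and a yes/no label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out part of the data; the model never trains on it.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

print("training accuracy:  ", model.score(X_train, y_train))
print("validation accuracy:", model.score(X_val, y_val))
```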
Although this process ensures good performance of the machine learning model, it does not directly prevent the machine learning model from remembering information in the training data.
Privacy concerns
Because machine learning models have such a large number of parameters, they can memorize some of the data they were trained on. In fact, this is a widespread phenomenon, and users can extract the memorized data from the machine learning model by using queries tailored to retrieve it.
If the training data contains sensitive information, such as medical or genomic data, the privacy of the people whose data was used to train the model could be compromised. Recent research has shown that it is actually necessary for machine learning models to memorize aspects of the training data in order to get the best performance in solving certain problems. This shows that there may be a fundamental trade-off between the performance of a machine learning method and privacy.
Machine learning models also make it possible to predict sensitive information using seemingly nonsensitive data. For example, Target was able to predict which customers were likely to become pregnant by analyzing the purchasing habits of customers who signed up for Target’s baby registry. Once the model was trained on this dataset, it was able to send pregnancy-related ads to customers it suspected were pregnant because they purchased items such as supplements or unscented lotions.
Is privacy protection even possible?
Although many methods have been proposed to reduce memorization by machine learning methods, most have been largely ineffective. Currently, the most promising solution to this problem is to guarantee a mathematical limit on the privacy risk.
Differential privacy is the state-of-the-art method of formal privacy protection. Differential privacy requires that a machine learning model does not change much if the data of any one individual in the training dataset is changed. Differential privacy methods achieve this guarantee by introducing additional randomness into the learning algorithm that “covers up” the contribution of any particular individual. Once a method is protected with differential privacy, no possible attack can violate that privacy guarantee.
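As a simplified illustration of how added randomness covers up an individual’s contribution – not the full machinery used to train models – here is the classic Laplace mechanism for releasing a count. The epsilon value is an assumed privacy budget chosen for the example.

```python
import numpy as np

def private_count(records, predicate, epsilon=1.0):
    """Release how many records satisfy a predicate, with epsilon-differential privacy.

    Changing any one person's record changes the true count by at most 1,
    so Laplace noise with scale 1/epsilon masks any individual's contribution.
    """
    true_count = sum(predicate(r) for r in records)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical medical records: 1 = responded well to treatment, 0 = did not.
records = [1, 0, 1, 1, 0, 1, 0, 1]
print(private_count(records, lambda r: r == 1, epsilon=0.5))
```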
Even if a machine learning model is trained using differential privacy, however, that does not prevent it from making sensitive inferences like those in the Target example. To prevent these privacy breaches, all data transmitted to the organization must be protected before it is sent. This approach is called local differential privacy, and Apple and Google have implemented it.
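One common local technique is randomized response, sketched below with assumed coin-flip probabilities. Each person randomizes their own yes/no answer before it ever leaves their device, yet the organization can still estimate the overall rate of “yes” answers.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Flip a coin: heads, report the truth; tails, report a second random coin flip."""
    if random.random() < 0.5:
        return true_answer
    return random.random() < 0.5

# Simulate 10,000 people, 30% of whom truly answer "yes".
reports = [randomized_response(random.random() < 0.3) for _ in range(10_000)]
observed = sum(reports) / len(reports)

# Invert the known randomization: observed rate = 0.25 + 0.5 * true rate.
print("estimated true rate:", (observed - 0.25) / 0.5)
```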
Because differential privacy limits how much the machine learning model can depend on any one individual’s data, it prevents memorization. Unfortunately, it also limits the performance of machine learning methods. Because of this trade-off, there are criticisms of the usefulness of differential privacy, as it often results in a significant drop in performance.
Going forward
Because of the tension between inferential learning and privacy, there is ultimately a societal question of which matters more in which contexts. When the data does not contain sensitive information, it is easy to recommend using the most powerful machine learning methods available.
However, when working with sensitive data, it is important to weigh the consequences of privacy leakage, and it may be necessary to sacrifice some machine learning performance in order to protect the privacy of the people whose data trained the model.
This article is republished from The Conversation, a non-profit, independent news organization that brings you facts and analysis to help you make sense of our complex world.
It was written by: Jordan Awan, Purdue University.
Jordan Awan receives funding from the National Science Foundation and the National Institutes of Health. He also serves as a privacy consultant for the federal nonprofit, MITRE.