Machine Learning for Science: Reports of algorithmic discrimination have increased recently. Ostensibly objective, algorithm-based systems have made decisions that disadvantage individuals and reveal themselves to be unfair. Examples include facial recognition programs that simply fail to recognize people of color, or programs that pre-sort job applications and favor men’s resumes over those of women. “Fix it!” society then demands of developers and researchers. But is it really that simple?
Ulrike von Luxburg: Just press a button and the algorithm becomes fair – it’s not that easy. Machine learning isn’t just an algorithm that I apply. Instead, machine learning is a very long pipeline.
What do you mean by that?
Well, for a start, a bias could already be attributable to the data – who collected it and how it’s labeled. Or it could emerge from the definition of the groups, that is, from defining whom I must be fair to. The algorithm only comes after that. And the selected concept of fairness must be followed along this entire pipeline.
Could we go through this pipeline – these different steps – together? At the start, there’s the data: the data used to train the algorithm, the material from which it learns to make decisions.
“Data used in machine learning are often gathered for a completely different purpose than to train algorithms.”
And it’s right at this point that the first bias comes in. Consider many of the publicly discussed applications that place people of color at a disadvantage: there were far too few images of people of color in the data. By now, I believe, it has become clear to everyone that a facial recognition system needs people with different skin colors to be well represented in its data set. Or consider the case of resumes. If in the past mostly men were hired, then of course that’s what is represented in the data, and a system trained on this data will imitate this behavior. And data used in machine learning are often gathered for a completely different purpose than training algorithms. So a key question is: Where does the data come from and who selected it? Was it gathered for a particular purpose or did it simply come from the Internet? And who evaluates and labels it?
Labeling – assigning categories – is often done by “crowdworkers”: freelancers who are more or less on call on internet platforms to take on small jobs.
And that’s how, for example, tools for evaluating a person’s attractiveness come into existence with the help of machine learning. The labeling is typically done by 25-year-old men – most crowdworkers are young males – who assess how attractive the people in the photographs are. A data set like that is biased from the start and primarily reflects the preferences of the crowdworkers involved.
Let’s continue to the next step, the question of who you want or should be fair to.
“For every special application, I must first consider what ‘fair’ means in that context.”
First and foremost, there’s the definition of fairness. Which groups do I want to be fair to? Women compared to men? Blacks compared to whites? I need to “tell” an algorithm in advance if I want to make it fair. And the more groups I name, the harder it gets. And then there’s the concept of fairness in and of itself. For every application, I must first consider what “fair” means in that context. There are a few standard definitions. One, for example, is “demographic parity”. You could say, for example, that a university, when admitting students, should reflect the ratio of men to women in the general population, so half of those admitted should be women and the other half, men.
Yet in doing that, you’re looking at an absolute quantity, not at the qualifications of individual applicants. Another fairness concept is “equalized odds”, or “equal opportunity”, meaning everyone has the same chances. If, for instance, we stick to the case of university admissions, that would mean that applicants of equal ability should be admitted to the course of study no matter whether they are women or men, black or white. The big problem is: you can’t meet all the conceptions of fairness simultaneously. You have to decide in favor of one or the other.
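The tension between the two notions described above can be made concrete with a small sketch. Everything here – the function names, the toy applicants, and the group labels – is invented for illustration, not taken from the interview:

```python
# Sketch: measuring two fairness notions on toy admission decisions.
# All data and names here are illustrative.

def demographic_parity_gap(decisions, groups):
    """Difference in admission rates between the two groups (0 = parity)."""
    rate = {}
    for g in set(groups):
        selected = [d for d, gr in zip(decisions, groups) if gr == g]
        rate[g] = sum(selected) / len(selected)
    a, b = sorted(rate)
    return abs(rate[a] - rate[b])

def equal_opportunity_gap(decisions, groups, qualified):
    """Difference in admission rates among *qualified* applicants only."""
    rate = {}
    for g in set(groups):
        sel = [d for d, gr, q in zip(decisions, groups, qualified)
               if gr == g and q]
        rate[g] = sum(sel) / len(sel)
    a, b = sorted(rate)
    return abs(rate[a] - rate[b])

# Toy example: four applicants per group, two admitted from each,
# but the groups differ in how many applicants are qualified.
groups    = ["w", "w", "w", "w", "m", "m", "m", "m"]
qualified = [1,   1,   1,   0,   1,   0,   0,   0]
decisions = [1,   1,   0,   0,   1,   1,   0,   0]

print(demographic_parity_gap(decisions, groups))              # 0.0: parity holds
print(equal_opportunity_gap(decisions, groups, qualified))    # nonzero gap
```

The toy numbers show the conflict directly: admitting two applicants per group satisfies demographic parity exactly, yet qualified applicants in the two groups end up with different admission chances, so equal opportunity is violated.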
That all sounds as if you can actually adjust the processes of machine learning for fairness as long as you’re clear about the concept of fairness. What’s the catch?
At the very moment I want to establish fairness, other things go out the window. If I want more fairness, then the accuracy of the predictions, for example, drops.
What does that mean, exactly?
It probably sounds a bit abstract, so let’s take the granting of credit as an example. If just as many whites as blacks, or the same number of men and women, are to be extended credit, then it could be that I’m granting loans to people who might not be able to pay the money back. But at some point, the money has to be repaid. The bank, the customers, or society as a whole has to come up with the money that’s been lost. Meaning, it costs something. And then you face a very concrete question: How much is fairness worth to us?
After the question of gathering data and the definition of the concept of fairness, now we’re getting to the algorithm.
The algorithm is aiming at two targets: on the one hand, it should be fair; on the other, accurate. To continue with our example: despite the defined fairness criteria, the algorithm should, if at all possible, select the credit applicants who will pay back their loans. Now I need to resolve this trade-off – to weigh fairness against the actual objective of the algorithm. There’s a tuning knob here, too: How much fairness and how much accuracy do I want? As a bank, for example, I can decide to give ten percent of my loans to the needy, or I can limit it to just five percent. Depending on how I decide, fairness increases or decreases. At the same time, accuracy – which depends on this as well – and with it, ultimately, the resulting costs, go up and down.
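The tuning-knob idea can be sketched in a few lines. The scores, the repayment outcomes, and the quota mechanism below are all invented for illustration; a real system would use a trained model’s risk scores rather than hand-written numbers:

```python
# Sketch of the fairness/accuracy trade-off: a quota parameter reserves a
# share of loans for group B, the rest go to the highest-scoring applicants.
# All scores and repayment outcomes are toy values, invented for illustration.

def evaluate(quota, scores_a, scores_b, repay_a, repay_b, n_loans=4):
    """Grant n_loans loans in total; reserve `quota` of them for group B.
    Returns (accuracy, fairness_gap)."""
    k_b = round(quota * n_loans)
    k_a = n_loans - k_b
    # Pick the top-scoring applicants within each group.
    pick_a = sorted(range(len(scores_a)), key=lambda i: -scores_a[i])[:k_a]
    pick_b = sorted(range(len(scores_b)), key=lambda i: -scores_b[i])[:k_b]
    repaid = sum(repay_a[i] for i in pick_a) + sum(repay_b[i] for i in pick_b)
    accuracy = repaid / n_loans
    # Demographic-parity gap: difference in loan rates between the groups.
    fairness_gap = abs(k_a / len(scores_a) - k_b / len(scores_b))
    return accuracy, fairness_gap

scores_a = [0.9, 0.8, 0.7, 0.6]; repay_a = [1, 1, 1, 1]
scores_b = [0.5, 0.4, 0.3, 0.2]; repay_b = [1, 0, 0, 0]

for quota in (0.0, 0.25, 0.5):
    acc, gap = evaluate(quota, scores_a, scores_b, repay_a, repay_b)
    print(f"quota={quota:.2f}  accuracy={acc:.2f}  fairness_gap={gap:.2f}")
```

Turning the quota up shrinks the fairness gap between the groups, but beyond a point the share of repaid loans drops – the accuracy cost of fairness that the interview describes.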
Let’s assume that, as a university, I decide to use a start-up’s algorithm to automate student selection in order to save personnel and costs. Then I’d also like to know whether this algorithm will make a reasonably fair selection. But how algorithms are constructed is usually a trade secret; companies rarely reveal it.
That’s a question I find quite exciting. How can a government attempt to certify something like that? If you look into the future, there are lots of start-ups bringing out algorithms, and they want to be able to say they do their job well. And they’d like to have, for instance, something like a certificate from a testing authority on their website that says: “Federal Data Protection Office tested and certified fair.” Or at least: “As fair as possible.” But what would something like that look like? How would you define some type of minimum standard that could later be tested without revealing the secrets of the algorithm? My colleagues and I discuss this often, but we don’t yet have a solution.
How should, in your opinion, a society, or the government, position itself as long as there isn’t a testing authority for algorithms? Is the only possible option to declare sensitive areas – in which discriminatory decisions would have far-reaching consequences – off-limits?
I believe there are actually areas in which I wouldn’t want such a system for ethical reasons. If it’s about decisions that deeply impact somebody’s life – such as whether someone goes to prison or whether a child is removed from the care of its parents – this responsibility cannot simply be delegated to an algorithm.
You could argue that algorithm-based systems wouldn’t have to make the final decision. They could also work like an “assist” system that only makes suggestions to us.
We hear that argument again and again, but in practice it often just doesn’t work. A judge who’s pressed for time anyway won’t want to consistently rule against the suggestion of the assist system. The tendency would always be to follow the system’s recommendations. Yet there are other areas where I’d say that systems that work with machine learning can actually do good. Medicine is a typical example: an assist system that suggests possible diagnoses or drugs to be taken. There I’d say that when it’s well done, the benefits outweigh the harms. There I see potential, in the near future in any case.
“It could be that machine learning systems are in some cases better or fairer than humans.”
In general, people will have to get accustomed to the thought that these systems are not perfect, and this fact has to be dealt with. But it could be that they are in some cases better or fairer than humans. Because one thing is clear: even human decision-makers aren’t always fair and have biases that influence their decisions. The difference is perhaps that we now have methods in hand that can be used to judge the fairness or accuracy of an algorithm – and also the fairness or accuracy of human decision-makers. The comparisons between the two could, depending on the application, sometimes be in favor of the humans and at other times, in favor of the machines.
Interview: Theresa Authaler
Translation into English: Taryn Toro