It is increasingly important that AI-based decision support systems (which leverage machine learning methods) be able to explain the reasoning behind their decisions, recommendations, predictions, or actions. Explainability is not just a nice-to-have property; it is becoming a serious matter of public debate. ‘Explainability’ and ‘establishing trust’ in machine learning models remain key challenges, not only for sensitive domains like healthcare but now for everyday consumer-facing products as well. In some countries, compliance laws will require ‘a right to explanation’ [1], which means end-users can ask for an explanation of a decision that affects their lives (especially decisions made in the criminal justice system). This also means that using black-box methods (such as deep neural networks) in predictive and prescriptive settings won’t be feasible if we can’t explain the decisions or predictions they make.
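To make the idea of "explaining a decision" concrete, here is a minimal sketch using scikit-learn on synthetic data, with hypothetical loan-style feature names chosen only for illustration. For a simple interpretable model like logistic regression, each feature's contribution to a single decision is just its coefficient times the feature value, which is exactly the kind of answer a "right to explanation" request is after; a deep network offers no such direct decomposition.

```python
# Minimal sketch: a per-decision explanation from an interpretable model.
# Data and feature names are synthetic/hypothetical, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["income", "debt_ratio", "late_payments"]

# Synthetic training data standing in for historical decisions.
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.5, -2.0, -1.0]) + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Explain a single applicant's decision: contribution = coefficient * feature value.
applicant = np.array([0.2, 1.4, 0.8])
contributions = model.coef_[0] * applicant
decision = model.predict(applicant.reshape(1, -1))[0]

print("decision:", "approve" if decision == 1 else "deny")
for name, c in sorted(zip(feature_names, contributions), key=lambda t: t[1]):
    print(f"  {name:>14}: {c:+.2f}")
```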

One of the major challenges the field faces is defining utility functions in a way that aligns with our human values; in other words, how do we encode human values into a utility function in a fair and unbiased manner? Another challenge is incentives: we operate in a digital economy where we have collectively decided that we want things to be free. In this scenario, attention becomes the new currency, and competing for it becomes a zero-sum game. Peter Norvig explains it succinctly:

"We've built this society in this infrastructure where we say we have a marketplace for attention. And we've decided as a society that we like things that are free. And so we want all apps on our phone to be free, and that means they're all competing for your attention. And then, eventually, they make some money, some way through ads, or in-game sales, or whatever. But they can only win by defeating all the other apps, by stealing your attention. And we built a marketplace where it seems like they're working against you rather than working with you. And I'd like to find a way where we can change the playing field, so, you feel more like what these things are on my side: "Yes, they have let me have some fun in the short term but they're also helping me in the long term rather than competing against me."

Returning to the utility problem, we are interested in programming AI agents and solutions to achieve what we humans want, i.e., defining the “right” behavior for an AI system. The objectives we define for our AI/ML systems need to be sufficiently close to our actual goals, and it turns out this is a challenging problem. Peter Norvig, in a podcast interview, was asked how the 4th edition of the famous AI textbook (Artificial Intelligence: A Modern Approach) differs from the third edition. He explains that in the first three editions, AI was defined as maximizing utility: given a utility function, the book spends many chapters on clever techniques for optimizing it. Now we are entering a phase where optimizing is the easy part, and deciding what the utility function should be is the hard part.

Here is a visual example of the kind of undesirable solution you can get when the objective is mis-specified:

Why is that? Well, as Stuart Russell puts it: "A system that is optimizing a function of n variables, where the objective depends on a subset of size k < n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want. A highly capable decision maker – especially one connected through the Internet to all the world's information and billions of screens and most of our infrastructure – can have an irreversible impact on humanity."
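To see this failure mode in a toy setting, here is a minimal sketch in which an optimizer is scored only on a proxy objective ("engagement"), while a variable we actually care about ("wellbeing") appears nowhere in the objective. The variable names, numbers, and the "world model" constraint are all hypothetical, chosen just for illustration; the point is that the solver happily drives the unconstrained-by-the-objective variable to an extreme value.

```python
# Toy sketch of optimizing a mis-specified objective (hypothetical numbers).
import numpy as np
from scipy.optimize import minimize

def proxy_objective(x):
    notifications, wellbeing = x
    # The objective only "sees" engagement, which grows with notification volume.
    engagement = 3.0 * notifications
    return -engagement  # minimize the negative => maximize engagement

# Assumed world model: wellbeing falls as notification volume rises.
def wellbeing_dynamics(x):
    notifications, wellbeing = x
    return wellbeing - (10.0 - 2.0 * notifications)  # equality constraint

result = minimize(
    proxy_objective,
    x0=np.array([1.0, 8.0]),
    bounds=[(0.0, 20.0), (None, None)],  # notifications capped at 20/day
    constraints=[{"type": "eq", "fun": wellbeing_dynamics}],
)

notifications, wellbeing = result.x
print(f"notifications/day: {notifications:.1f}, wellbeing score: {wellbeing:.1f}")
# The solver pushes notifications to the cap and wellbeing to an extreme low,
# because nothing in the objective penalizes that outcome.
```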

In practice, this means that before deployment, extensive testing of system properties like trust, validation, verification, and explainability would be required. But how does all of this relate to the AI alignment problem? We should recognize that it is difficult to convert our values (for example, around fairness in the justice system) into precise objectives, and that it is hard to simulate, at training time, the effects an ML system will have on the world once deployed.

One promising approach to this problem is to optimize complex objectives by first learning them. The idea is that learning our values from data is easier than manually crafting a utility function that truly captures them. An example of such an effort is the field of inverse reinforcement learning (IRL), which observes expert human behavior in a particular domain and tries to infer what the human is “trying to do,” converting that into an objective that can later be used to train our systems.
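As a very rough illustration of the idea (not any particular IRL algorithm from the literature), here is a sketch in which we observe a simulated "expert" choosing among a few actions, assume the expert is softmax-rational with respect to an unknown linear reward over hand-picked features, and recover the reward weights by maximum likelihood. The features, numbers, and setup are all hypothetical; the recovered weights stand in for the inferred objective a system could later be trained against.

```python
# Toy "infer what the human is trying to do" sketch (hypothetical features).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Each action is described by two features, e.g. (task progress, risk to user).
features = np.array([[1.0, 0.0],   # safe but slow
                     [0.8, 0.2],   # balanced
                     [1.5, 1.0]])  # fast but risky

true_w = np.array([1.0, -2.0])     # the expert values progress, dislikes risk

def action_probs(w):
    logits = features @ w
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Simulate expert demonstrations (in practice these would be observed, not simulated).
demos = rng.choice(len(features), size=500, p=action_probs(true_w))

def neg_log_likelihood(w):
    return -np.log(action_probs(w)[demos]).sum()

w_hat = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print("inferred reward weights:", np.round(w_hat, 2))  # close to the true [1.0, -2.0]
```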

Overall, it's important to keep in mind that it is fairly challenging to encode human values in advance. These challenges will only get harder to resolve as we build smarter and smarter systems and approach AGI. Stuart Russell explains it brilliantly, although his points are more applicable to future AGI systems:

"The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem: 1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down. 2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task."

If one day some artificial intelligence reaches super-human capabilities, IRL-style approaches will become important tools for understanding what humans want and, hopefully, for working towards those goals. There is also a whole philosophical side to this, where we can ask ourselves: given unlimited time and perfect knowledge about (ideal) human behavior, can we find any reasonable approximation to "what a human wants"? We will leave that question for discussion in another post and end with an old radio talk by Alan Turing.

'Can digital computers think?' - Alan Turing

Read more:

The AI Alignment Problem: Why It’s Hard, and Where to Start - Machine Intelligence Research Institute
A talk by Eliezer Yudkowsky given at Stanford University on May 5, 2016, for the Symbolic Systems Distinguished Speaker series.

Interpretability in Machine Learning: An Overview
A broad overview of the sub-field of machine learning interpretability: conceptual frameworks, existing research, and future directions.

References

[1] Bryce Goodman and Seth Flaxman, “European Union Regulations on Algorithmic Decision Making and a ‘Right to Explanation’”