🦾 OpenAI's breakthrough in understanding AI's black box (so we can build safe AI)

OpenAI has found a way to solve part of the AI alignment problem, so that we can understand AI and build it safely.

Mathias Sundin 25.May.2023 4 min read


Peeking into the black box

A problem with creating safe AI is that we don't know exactly how an AI arrives at the results it presents. If I ask ChatGPT to write something, nobody can say exactly how it arrived at its answer. The work takes place in what is usually called a black box.

So far, the solution has been to manually look at which neurons in the neural network are activated. This is of course an extremely time-consuming task and not practically possible on any large scale.

OpenAI therefore used GPT-4 to test whether it can understand what is going on under the hood. And yes, it could.

In a paper, they show how it can trace which neurons are activated and why. Further down in the text is a more detailed explanation of OpenAI's paper.

A common mistake, leading to pessimism

One reason many worry about how we are going to (continue to) create safe AI is that they don't know how it will be done.

One of the central people in this debate is Eliezer Yudkowsky. He is among the people in the world who have thought the most about these issues. The black box is one of the problems he has pointed out, and he reacted with surprise to the result from OpenAI.

When you don't see possible solutions, it's easy to get worried and scared and want to pause, stop or slow down the development.

This is a very common mistake that pessimists make. They don't trust that we humans can solve problems in the future, just because we haven't solved them yet. As a result, people who might not be pessimistic at all become so. (There is a reason we call pessimists naive.)

They might then, like Paul Ehrlich in the 1960s, believe that hundreds of millions of people will starve to death. But then we solve the problems and instead hundreds of millions of people leave extreme poverty.

Use AI to understand AI

When it comes to solving future problems with AI, we have a new tool to help us: AI. This is what I wrote a few weeks ago:

Should we then ignore possible problems and just keep going? Of course not. But we should use the best tools available. Many of these tools are now within the AI field.

If we pause progress, we will have worse tools and a harder time solving problems. At the same time, we miss all the enormous advantages and opportunities that are created.

Instead of pausing AI development, we should put more resources, in the form of money, brainpower and computing capacity, into accelerating AI safety work.

The results from OpenAI's paper

Language models are computer programs that can generate or understand natural language, such as English or French. They are often based on neural networks, which consist of many interconnected units called neurons that can process information and learn from data.

Neurons in language models

They are organized in layers, and each layer performs a different function, such as encoding the meaning of words or generating the next word in a sentence.
They can detect a specific pattern in text, such as a word, a phrase, a topic, or a grammatical function, and activate when they encounter it.
They can influence what the model says next by sending signals to neurons in the next layer or to the output layer, which determines the probability of each possible word.
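
To make the idea of a neuron that detects a pattern and nudges the next word more concrete, here is a minimal toy sketch in Python. It is not how a real language model is built: the three input features, the weights, and the three-word vocabulary are all made up for illustration, and real models have vastly more neurons, usually with smoother activation functions than the simple ReLU used here.

```python
import math
from typing import List


def relu(x: float) -> float:
    """Pass positive values through and clamp negative values to zero."""
    return max(0.0, x)


def neuron_activation(inputs: List[float], weights: List[float], bias: float) -> float:
    """One toy neuron: a weighted sum of its inputs plus a bias, then ReLU."""
    return relu(sum(i * w for i, w in zip(inputs, weights)) + bias)


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical input features: [mentions_superhero, mentions_cooking, is_question]
features = [1.0, 0.0, 0.0]

# A made-up "superhero neuron" that responds to the first feature.
superhero_neuron = neuron_activation(features, weights=[2.0, -1.0, 0.0], bias=-0.5)

# Made-up output weights: how much this neuron pushes each candidate next word.
vocab = ["Avengers", "recipe", "the"]
output_weights = [1.5, -1.0, 0.0]
logits = [superhero_neuron * w for w in output_weights]

next_word_probs = dict(zip(vocab, softmax(logits)))
most_likely = max(next_word_probs, key=next_word_probs.get)
```

Because the neuron is active on the superhero input, the word it is positively connected to becomes the most probable next word in this toy setup.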

Three steps

The paper proposes a technique consisting of three steps:

Show neuron activations to GPT-4 and ask it what causes them.
Simulate neuron activations using GPT-4, conditioned on the explanation.
Score the explanation by comparing the simulated and real activations.

Step 1: Explain the neuron's activations with the help of GPT-4

This step involves showing GPT-4 a text input together with the corresponding activations of a neuron (in the paper, the neurons come from the older GPT-2 model) and asking it to write a natural language explanation of what causes the neuron to activate. For example, given a text input about Marvel movies and characters, and a neuron that strongly activates on it, GPT-4 could explain that the neuron is sensitive to language related to Marvel comics, movies, and characters, as well as other superhero-themed content.

The goal of this step is to generate a concise and intuitive description of the neuron's function that can easily be understood by humans.
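
As a rough sketch of what step 1 could look like in code, the Python below pairs each token with a neuron activation and turns the pairs into a prompt for an explainer model. The function names, the prompt wording, and the choice to rescale activations to whole numbers from 0 to 10 are assumptions made for this illustration, not OpenAI's actual code, and no model is called here.

```python
from typing import List


def discretize(activations: List[float], max_level: int = 10) -> List[int]:
    """Rescale activations to integers from 0 to max_level.

    Negative values are treated as 0 (an assumption of this sketch).
    """
    if not activations:
        return []
    peak = max(activations)
    if peak <= 0:
        return [0] * len(activations)
    return [round(max_level * max(a, 0.0) / peak) for a in activations]


def build_explanation_prompt(tokens: List[str], activations: List[float]) -> str:
    """Build a hypothetical prompt asking a model to explain one neuron."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    levels = discretize(activations)
    rows = "\n".join(f"{token}\t{level}" for token, level in zip(tokens, levels))
    return (
        "Below are tokens and how strongly one neuron activated on each (0-10).\n"
        f"{rows}\n"
        "In one sentence, explain what this neuron responds to."
    )


# Made-up tokens and activations for illustration only.
tokens = ["Spider-Man", "is", "a", "popular", "Marvel", "hero"]
activations = [4.0, 3.0, 0.0, -0.2, 5.0, 2.5]
prompt = build_explanation_prompt(tokens, activations)
```

The resulting `prompt` string is what would be sent to the explainer model; the answer it returns is the explanation used in the following steps.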

Step 2: Simulate activations with the help of GPT-4, conditioned on the explanation

This step involves giving GPT-4 only the explanation from step 1, not the real activations, and asking it to predict how strongly the neuron would activate on each part of a text.

For example, given the explanation that the neuron is sensitive to language related to Marvel comics, movies, and characters, GPT-4 should predict high activations on "Spider-Man" and "Marvel" in the sentence "Spider-Man is one of the most popular superheroes in the Marvel universe", and low activations on unrelated text.

The goal of this step is to test how well the explanation captures the neuron's behavior: if the explanation is good, the predicted activations should resemble the real ones.
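
The "simulate activations" idea in this step can be sketched as a function that sees only the explanation and guesses an activation for every token. In the paper that role is played by GPT-4; the `keyword_simulator` below is a deliberately crude, hypothetical stand-in that just checks whether a token appears in the explanation, so the example can run without any model.

```python
from typing import Callable, List

# A simulator takes an explanation and a list of tokens and returns
# one predicted activation per token.
Simulator = Callable[[str, List[str]], List[float]]


def keyword_simulator(explanation: str, tokens: List[str]) -> List[float]:
    """Toy stand-in for a language model acting as the simulator.

    Predicts 10.0 for tokens that literally appear in the explanation
    and 0.0 for everything else.
    """
    keywords = {word.strip(".,").lower() for word in explanation.split()}
    return [10.0 if token.strip().lower() in keywords else 0.0 for token in tokens]


def simulate(simulator: Simulator, explanation: str, tokens: List[str]) -> List[float]:
    """Run a simulator and check it returned one prediction per token."""
    predicted = simulator(explanation, tokens)
    if len(predicted) != len(tokens):
        raise ValueError("simulator must return one activation per token")
    return predicted


# Made-up explanation and text for illustration only.
explanation = "Marvel superheroes such as Spider-Man and Avengers"
tokens = ["The", "Avengers", "are", "Marvel", "heroes"]
simulated = simulate(keyword_simulator, explanation, tokens)
```

The important design point is that the simulator never sees the real activations. That is what makes the comparison in the next step a fair test of the explanation.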

Step 3: Score the explanation by comparing the simulated and real activations

This step involves comparing the neuron's real activations on a text with the activations that GPT-4 simulated for the same text in step 2.

The comparison is made by calculating a correlation coefficient between the two sets of activations, which ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation). The correlation coefficient is used as a measure of how well the explanation matches the neuron's behavior. A high correlation suggests that the explanation captures what the neuron actually does, while a low correlation suggests that the explanation is incorrect or incomplete.

The goal of this step is to quantify how interpretable the neuron is and to provide a feedback signal to improve the explanation.
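
The correlation score described in this step can be sketched in a few lines of plain Python. This is the standard Pearson correlation coefficient; the activation values are made up, and returning 0.0 when one of the series is constant is a choice made for this sketch (the coefficient is mathematically undefined in that case).

```python
import math
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no information about the other
    return covariance / math.sqrt(var_x * var_y)


# Made-up values: real activations of a neuron and simulated ones for the same tokens.
real = [0.1, 4.2, 0.0, 5.0, 0.3]
simulated = [0.0, 10.0, 0.0, 10.0, 0.0]
score = pearson(real, simulated)
```

Note that the two series do not need to be on the same scale: the score only measures whether the simulated activations rise and fall together with the real ones.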

Mathias Sundin
The Angry Optimist
