The algorithms that win at games like Go or Dota 2 use reinforcement learning to train multilayer neural networks. The animal brain also uses reinforcement learning, via dopamine. But research shows that the human brain performs two types of reinforcement learning, one on top of the other. This new theory describes a technique called Learning to Learn, also known as meta-reinforcement learning, which may benefit machine learning algorithms.
The Standard Model of Learning
Dopamine is the neurotransmitter associated with the feeling of desire and motivation.
Neurons release dopamine when a reward for an action is surprising. For example, when a dog receives a treat unexpectedly, dopamine is released in the brain. The reverse is also true. When the brain predicts a reward and the animal does not get it, then a dip in dopamine occurs. Simply put, dopamine serves as a way for the brain to learn through reinforcement learning.
Scientists describe these dopamine fluctuations as signaling a reward prediction error. There is a burst of dopamine when things are better than expected and a dip when things are worse. Dozens of studies show that the burst of dopamine, when it reaches the striatum, adjusts the strength of synaptic connections. How does that drive behavior? When you execute an action in a particular situation and an unexpected reward occurs, you strengthen the association between that situation and that action. Intuition says that if you do something and are pleasantly surprised, then you should do that thing more often in the future. And if you do something and are unpleasantly surprised, then you should do it less often.
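The prediction-error idea above can be sketched in a few lines of Python. This is a minimal illustration, not a model of real neurons: the function name `update_association`, the learning rate, and the reward values are all assumptions chosen for the example. The association strength moves toward the reward in proportion to how surprising the reward was.

```python
def update_association(strength, reward, learning_rate=0.1):
    """Return (new_strength, prediction_error) for one trial."""
    # Positive error: outcome better than expected (dopamine burst).
    # Negative error: outcome worse than expected (dopamine dip).
    prediction_error = reward - strength
    new_strength = strength + learning_rate * prediction_error
    return new_strength, prediction_error


strength = 0.0  # the reward is initially unexpected
errors = []
for _ in range(50):
    strength, error = update_association(strength, reward=1.0)
    errors.append(error)

# Withhold the reward once it has become expected.
_, omission_error = update_association(strength, reward=0.0)

print(f"first error: {errors[0]:.2f}, last error: {errors[-1]:.4f}")
print(f"error when the expected reward is omitted: {omission_error:.2f}")
```

In this sketch the error starts large, shrinks as the reward becomes expected, and turns negative when an expected reward is withheld, which mirrors the burst and dip described above.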
Inside people’s brains, dopamine levels increase when there is a difference between the predicted reward and the actual reward for a task. But dopamine also rises when the brain predicts that a reward is about to happen. So, it tricks people’s brains into doing work even if the reward does not come. For example, when you train a dog to come to you when you blow a whistle, dopamine is what drives the synaptic change. You teach your dog to come when called by rewarding him, for instance with a treat, when he does what you want. After a while, you no longer need to reward the dog because his brain releases dopamine in expectation of the treat. Dopamine is part of what is known as model-free reinforcement learning.
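One way to picture how the dopamine response can move from the treat to the whistle is a temporal-difference-style sketch. The source does not name that algorithm, so treat this as an assumption. The sketch also assumes the cue arrives without warning, so the predicted value just before the cue is zero. The name `run_trial` and all numbers are hypothetical.

```python
def run_trial(cue_value, reward=1.0, learning_rate=0.2):
    """Return (new_cue_value, error_at_cue, error_at_reward) for one trial."""
    # The cue arrives without warning, so the value just before it is 0.
    error_at_cue = cue_value - 0.0
    # At reward time, the outcome is compared with what the cue predicted.
    error_at_reward = reward - cue_value
    new_cue_value = cue_value + learning_rate * error_at_reward
    return new_cue_value, error_at_cue, error_at_reward


cue_value = 0.0
history = []
for _ in range(40):
    cue_value, at_cue, at_reward = run_trial(cue_value)
    history.append((at_cue, at_reward))

print("first trial (cue, reward):", history[0])
print("last trial  (cue, reward):", tuple(round(x, 3) for x in history[-1]))
```

Early on, the surprise sits at the reward. After training, it sits at the cue, and the reward itself produces almost no error.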
Model-Free versus Model-Based Reinforcement Learning
But that is not the only system in people’s brains benefiting from reinforcement learning. The prefrontal cortex, the part of the cortex that is at the very front of the brain, also uses reinforcement learning rewards in its activities, or dynamics.
The prefrontal cortex, together with the rest of the brain, has two circuits that create what is called Learning to Learn. Model-free learning occurs via dopamine, and it acts on top of a model-based learning circuit in the prefrontal cortex.
One way to describe the difference between model-free and model-based reinforcement learning is that the latter uses a model of the task, meaning an internal representation of task contingencies. If I do this, then this will happen, or if I do that, then the other thing will happen. Model-free learning, however, does not do that. It only responds to the strengthening or weakening of stimulus-response associations. Model-free learning does not know what is going to happen next and simply reacts to what is happening now. That is why a dog can learn, with dopamine, how to come when called even if you stop giving it treats. It had no model of the event but learned that the stimulus, like whistling, is a good thing.
If the dopamine learning mechanism is model-free, then it should not reflect something called inferred value. The following experiment helps explain this concept.
A monkey looks at a central fixed point and sees targets to the left and right. If the monkey moves its eyes to a target, it is given a reward or not, depending on which side it was asked to look toward. Sometimes the left is rewarded and other times the right. These reward contingencies remain the same for a while and then reverse in a way not signaled to the animal, except by the rewards themselves. So, let’s say that the left is rewarded all the time and the right is not, but suddenly, the right is rewarded all the time and that continues for a while.
Initially, the monkey receives a reward for looking left, and dopamine is released. In this phase, if the monkey looks right, dopamine is not released because the monkey does not expect a reward. At the moment of reversal, the monkey expects a reward for looking left but receives nothing. When the rewarded target changes to the right, the monkey receives a reward for looking there instead. Once the animal has experienced the new contingencies, looking to the left should no longer trigger the dopamine response because the animal has experience and evidence that there has been a reversal. The target that used to excite the dopamine system now disappoints it, and the target that did not previously stimulate the dopamine system now does. The animal has experienced a stimulus-reward association, and the dopamine system adjusts to that.
But consider a different scenario. The animal was rewarded for looking left, but in the next trial, the right is the target. It has no experience with the right in this new regime. But what you find is that if the right was not rewarded before and the animal infers that the right should be rewarded, then dopamine is released. Since the monkey knows that there has been a reversal now, it can tell that the next target should be rewarded. This is a model-based inference since it draws on the knowledge of the task, and that presumed reward is called inferred value.
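The difference between experienced value and inferred value can be sketched as follows. This is an illustrative toy under stated assumptions, not the analysis used in the experiment: the function names are hypothetical, and the model-based learner is simply handed the task rule that exactly one side is rewarded at any time.

```python
def model_free_update(values, side, reward, learning_rate=0.5):
    """Update only the value of the side that was actually chosen."""
    updated = dict(values)
    updated[side] += learning_rate * (reward - updated[side])
    return updated


def model_based_update(side, reward):
    """Use the task rule 'exactly one side is rewarded' to value both sides."""
    other = "right" if side == "left" else "left"
    rewarded_side = side if reward > 0 else other
    return {s: 1.0 if s == rewarded_side else 0.0 for s in ("left", "right")}


# Before the reversal: left has been rewarded many times, right never.
model_free = {"left": 0.0, "right": 0.0}
for _ in range(20):
    model_free = model_free_update(model_free, "left", reward=1.0)

# First trial after the unsignaled reversal: looking left yields nothing.
model_free = model_free_update(model_free, "left", reward=0.0)
model_based = model_based_update("left", reward=0.0)

print("model-free value of right: ", model_free["right"])
print("model-based value of right:", model_based["right"])
```

After a single unrewarded look to the left, the model-free learner still assigns no value to the right because it has never been rewarded there. The model-based learner already expects a reward on the right, and that expectation is the inferred value.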
Given the concept of inferred value, it is possible to determine that some parts of the brain learn via model-free reinforcement learning and others via model-based reinforcement learning. The dopamine response clearly does not show inferred value because it is not based on a model of the task, but the brain still performs model-based reinforcement learning in its prefrontal cortex circuitry. The technique to show this is called a two-step task and works as follows.
Let’s say you play a game where you drive a car. The only two actions are turning left or right. If you turn left, then you die and lose the game. But if you turn right, then you continue playing the game.
If the driver plays the game again, a model-free system says, “If I turned right and did not die last time, then I should turn right again. Turning right is ‘good.’” A model-based system will understand the task at hand and will turn right when the road goes to the right and turn left when the road goes to the left. Therefore, someone who learns driving using a model-free reinforcement learning algorithm will never learn how to drive these roads properly. But a driver who learns to drive with a model-based algorithm will do just fine.
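The driving example can be written as a small simulation. It follows the simplification used in the text: the model-free driver keeps one value per action and ignores which way the road bends, while the model-based driver consults a rule describing the task. The class names, the `survives` rule, and the training setup are assumptions made for illustration.

```python
def survives(road, action):
    """Task rule assumed here: the driver survives only by following the road."""
    return action == road


class ModelFreeDriver:
    """Keeps one value per action and ignores which way the road bends."""

    def __init__(self):
        self.value = {"left": 0.0, "right": 0.0}

    def learn(self, action, survived, learning_rate=0.5):
        reward = 1.0 if survived else -1.0
        self.value[action] += learning_rate * (reward - self.value[action])

    def act(self, road):
        return max(self.value, key=self.value.get)


class ModelBasedDriver:
    """Consults a model of the task to predict each action's outcome."""

    def act(self, road):
        for action in ("left", "right"):
            if survives(road, action):
                return action


# Training: every road bends right, and the model-free driver tries both turns.
model_free = ModelFreeDriver()
for _ in range(10):
    for action in ("left", "right"):
        model_free.learn(action, survives("right", action))

test_roads = ["right", "left", "right", "left"]
model_free_score = sum(survives(r, model_free.act(r)) for r in test_roads)
model_based_score = sum(survives(r, ModelBasedDriver().act(r)) for r in test_roads)

print("model-free survived: ", model_free_score, "of", len(test_roads))
print("model-based survived:", model_based_score, "of", len(test_roads))
```

The model-free driver keeps turning right because that was "good" during training, so it fails on every road that bends left. The model-based driver survives on all four roads.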
This simple task gives us a way of teasing apart model-free and model-based action selection. If you plot the behavior at the beginning of each trial, then you can show whether the system follows a model-free or a model-based reinforcement learning algorithm. The two-step task shows the fingerprint of the algorithm.
Studies of humans and animals, including rats, that measure brain signals during the two-step task show that the prefrontal cortex presents the model-based pattern. In 2015, Nathaniel Daw demonstrated this pattern in the human prefrontal circuit using brain signals recorded during the two-step task.* This implies that the prefrontal circuit learns from its own autonomous reinforcement learning procedure, which is distinct from the dopamine-based model-free reinforcement learning algorithm used to set the neural network weights.
Model-Free and Model-Based Learning, Working Together
These two types of circuits work together to form what is known as Learning to Learn. Dopamine works on top of the prefrontal cortex as part of a model-free reinforcement learning system to update the circuit connections, while the prefrontal cortex circuit learns via model-based reinforcement learning.
The type of reinforcement learning implemented in the prefrontal circuit can be executed even when the synaptic weights are frozen. That means that the neural circuitry in the brain does not need to update the synapses’ weights to implement this form of reinforcement learning.
It is different from the reinforcement learning algorithm accomplished by dopamine that trains the synaptic weights in the prefrontal cortex. In the prefrontal circuit, the task structure sculpts the learned reinforcement learning algorithm, which means that each task will have a different type of model-based reinforcement learning algorithm that runs in the prefrontal circuit.
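The two levels of learning can be illustrated with a deliberately small sketch. It is not the prefrontal circuit or any published network. A single number, `weight`, stands in for the slowly trained synaptic weights, a two-element list of estimates stands in for the circuit's activity, and a plain search over candidate values stands in for the slow dopamine-driven training. The reversing two-target task, the drift of the unchosen estimate toward 0.5, and every name here are assumptions.

```python
def run_episode(weight, trials=40, reversal_every=10):
    """Inner loop: adapt to a reversing two-target task with `weight` fixed."""
    estimates = [0.5, 0.5]  # activity-like state, reset for every episode
    total_reward = 0
    for t in range(trials):
        rewarded_target = (t // reversal_every) % 2
        choice = 0 if estimates[0] >= estimates[1] else 1
        reward = 1.0 if choice == rewarded_target else 0.0
        total_reward += int(reward)
        other = 1 - choice
        # Only the state changes here; `weight` is never modified.
        estimates[choice] = (1 - weight) * estimates[choice] + weight * reward
        # Assumption: the unchosen estimate relaxes toward 0.5 (uncertainty).
        estimates[other] = (1 - weight) * estimates[other] + weight * 0.5
    return total_reward


# Outer loop: slowly pick the weight that earns the most reward per episode.
baseline_weight = 0.05
baseline_reward = run_episode(baseline_weight)
best_weight, best_reward = baseline_weight, baseline_reward
for candidate in (0.25, 0.5, 0.75, 1.0):
    candidate_reward = run_episode(candidate)
    if candidate_reward > best_reward:
        best_weight, best_reward = candidate, candidate_reward

print("baseline reward:", baseline_reward)
print("best weight:", best_weight, "reward:", best_reward)
```

Within an episode the weight stays frozen, yet behavior still adapts after each reversal because the estimates change. Across episodes, the outer loop changes the weight and with it how quickly the inner loop adapts. That nesting of a slow learner around a fast one is the intuition behind Learning to Learn.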
In a different type of experiment, monkeys have two targets, A and B, in front of them, and the reward probabilities of the two targets change over time.* The monkey looks at the center point between the targets, then chooses to stare at one target or the other, and receives a reward after a minute or so. This experiment showed that the brain has the two types of reinforcement learning algorithms working together, a model-free dopamine-based one on top of a model-based algorithm.
With that in mind, Matthew Botvinick designed a deep learning neural network that had the same characteristics as the brains of monkeys, that is, that learned to learn.
The results showed that if you train a deep learning system on this task using a reinforcement learning algorithm and without any additional assumptions, the network itself instantiates a separate reinforcement learning algorithm; that is, the network imitates what was found in the brain.*