How DeepMind learns to win

About a year ago, Google bought DeepMind for half a billion dollars on the strength of software that could learn to beat video games. Over the past year, DeepMind has detailed how they did it.


Let us say that you were an artificial intelligence with access to a computer screen, a way to play the game (an imaginary video game controller, say), and the current score. How should you learn to beat the game? Well, you have access to three things: the state of the screen (your input), a selection of actions, and a reward (the score). What you would want to do is find the best action to go along with every state.

A well-established way to do this without any explicit modeling of the environment is through Q-learning (a form of reinforcement learning). In Q-learning, every time you encounter a certain state and take an action, you have some guess of what reward you will get. But the world is a complicated, noisy place, so you won't necessarily get the same reward back in seemingly identical situations. So you take the difference between what you actually get back (the reward, plus your estimate of the value of the state you land in) and what you expected, and nudge your guess a little closer.
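In code, that nudge is a one-liner. Here is a tabular sketch of the standard Q-learning update (the state names, learning rate, and discount factor are all made up for illustration; this is not DeepMind's implementation):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # TD target: the reward we actually got, plus the discounted value
    # of the best action we could take from the next state
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = (r + gamma * best_next) - Q[(s, a)]
    Q[(s, a)] += alpha * td_error   # nudge the old guess a little closer
    return Q[(s, a)]

# toy usage: all estimates start at zero; one step earns a reward of 1,
# so the guess moves a tenth of the way toward what we observed
Q = defaultdict(float)
q_update(Q, "s0", "right", 1.0, "s1", actions=["left", "right"])
```

The learning rate `alpha` controls how far each nudge goes: small values average out the noise, at the cost of learning more slowly.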

This is all fine and dandy, though when you’re looking at a big screen you’ve got a large number of pixels – and a huge number of possible states. Some of them you may never even get to see! Every twist and contortion of two pixels is, theoretically, a completely different state. This makes it infeasible to visit each state, choose each action, and play it again and again to get a good estimate of the reward.

What we could do, if we were clever about it, is use a neural network to learn features of the screen. Maybe sometimes this part of the screen is important as a whole, and maybe other times those two parts of the screen together are a real danger.

But that is difficult for the Q-learning algorithm. The DeepMind authors list three reasons: (1) correlations in the sequence of observations, (2) small updates to Q can significantly change the policy and the data distribution, and (3) correlations between action values and target values. How they tackle these problems is the paper's main contribution to the literature.

The strategy is to use a deep convolutional neural network to find ‘filters’ that can more compactly represent the state space. The network takes in the states – the images on the screen – processes them, and outputs a value for each possible action. To get around problems (1) and (2) above (the correlated, shifting stream of observations), they take a ‘replay’ approach. Experienced transitions – the state, the action taken, the reward, and the resulting state – are stored in memory; when it is time to update the neural network, they grab a random sample of old transitions out of their bag of memories and learn from those. They liken this to consolidation during sleep, where the brain replays things that had happened during the day.
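A replay memory of that kind is easy to sketch. This is a minimal version with an assumed capacity and transition format, not the authors' implementation:

```python
import random
from collections import deque

class ReplayMemory:
    """A fixed-size bag of past (state, action, reward, next_state)
    transitions; sampling uniformly at random breaks up the correlations
    between consecutive frames."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest memories fall out

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # grab a random minibatch of old memories to learn from
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# toy usage: remember five steps, then learn from three random ones
memory = ReplayMemory(capacity=100)
for t in range(5):
    memory.store(t, "noop", 0.0, t + 1)
batch = memory.sample(3)
```

Because the minibatch mixes transitions from many different moments of play, consecutive updates are no longer dominated by whatever just happened on screen.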

Further, even though they train the network with these memories after every action, this is not the network used to compute the learning targets. A second copy of the network stays in stasis, supplying the target values, and only ‘updates itself’ with what the trained network has learned after a certain stretch of time – again, like it is going to “sleep” to better consolidate what it had done during the day. Freezing the targets this way deals with problem (3) above, the correlation between action values and target values.
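That periodic ‘waking up’ is just a scheduled copy of parameters. A minimal sketch, assuming the weights are a plain dict of lists (the numbers and the sync interval are made up, not the paper's):

```python
import copy

SYNC_EVERY = 4   # hypothetical interval; the paper syncs every C updates

def sync_target(online, target):
    # overwrite the frozen weights with deep copies of the learned ones
    target.clear()
    target.update(copy.deepcopy(online))

online = {"w": [0.5, -0.2], "b": [0.1]}   # made-up 'weights'
target = {"w": [0.0, 0.0], "b": [0.0]}

for step in range(1, 9):
    # (a gradient update to `online` would happen here on every step)
    if step % SYNC_EVERY == 0:
        sync_target(online, target)   # only now does the frozen copy change
```

Between syncs the targets stay fixed, so the network is chasing a stationary goal rather than its own moving estimates.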

Here is an explanation of the algorithm in a hopefully useful form:
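Putting the pieces together in toy, tabular form – everything below is my own simplification, with a five-state corridor in place of Atari and a lookup table in place of the convolutional network, but the replay memory and the periodically-synced frozen copy play the same roles as in the paper:

```python
import random
from collections import defaultdict, deque

def train(episodes=100, alpha=0.2, gamma=0.9, eps=0.5,
          batch=8, sync_every=20, seed=0):
    random.seed(seed)
    actions = [-1, +1]                  # step left or step right
    Q = defaultdict(float)              # the table being trained
    Q_frozen = defaultdict(float)       # the frozen copy that supplies targets
    memory = deque(maxlen=1000)         # the bag of remembered transitions
    step = 0
    for _ in range(episodes):
        s = 0
        while s < 4:                    # episode ends at the far end
            # epsilon-greedy action from the current estimates
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s2 = max(0, s + a)
            r = 1.0 if s2 == 4 else 0.0          # reward only at the far end
            memory.append((s, a, r, s2))         # store the transition
            # replay: learn from a random minibatch of old memories,
            # computing targets with the frozen copy
            for ms, ma, mr, ms2 in random.sample(memory,
                                                 min(batch, len(memory))):
                target = mr if ms2 == 4 else mr + gamma * max(
                    Q_frozen[(ms2, a2)] for a2 in actions)
                Q[(ms, ma)] += alpha * (target - Q[(ms, ma)])
            step += 1
            if step % sync_every == 0:           # periodically wake the copy up
                Q_frozen.update(Q)
            s = s2
    return Q

Q = train()
```

After training, the learned values near the goal prefer stepping right over stepping left, which is the whole point: replay plus frozen targets lets a plain TD update learn from raw interaction.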


Throughout the article, the authors claim that this may point to new directions for neuroscience research. This being published in Nature, any claims to utility should be taken with a grain of salt. That being said! I am always excited to see what lessons arise when theories are forced to confront reality!

What this shows is that reinforcement learning is a good way to train a neural network in a model-free way. Given that all learning is temporal difference learning (or: TD learning is semi-Hebbian?), this is a nice result, though I am not sure how original it is. It also shows that the replay way of doing it – which I believe is quite novel – is a good one. But is this something that sleep/learning/memory researchers can learn from? Perhaps it is a stab in the direction of why replay is useful (to deal with correlations).


Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, & Hassabis D (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. PMID: 25719670

Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, & Riedmiller M (2013). Playing Atari with deep reinforcement learning. arXiv: 1312.5602v1