Yeah, but what has ML ever done for neuroscience?

This question has been going round the neurotwitters over the past day or so.

Let’s limit ourselves to ideas that came from machine learning that have had an influence on neural implementation in the brain. Physics doesn’t count!

  • Reinforcement learning is always my go-to, though we have to remember the initial connection from neuroscience! In Sutton and Barto 1990, they explicitly note that “The TD model was originally developed as a neuron-like unit for use in adaptive networks”. There is also the obvious connection to the Rescorla-Wagner model of Pavlovian conditioning. But the work showing dopamine as a prediction error is too strong to ignore.
  • ICA is another great example. Tony Bell was specifically thinking about how neurons represent the world when he developed the Infomax-based ICA algorithm (according to a story from Terry Sejnowski). This is obviously the canonical example of V1 receptive-field construction.
    • Conversely, I personally would not count sparse coding. Although developed as another way of thinking about V1 receptive fields, it was not – to my knowledge – an outgrowth of an idea from ML.
  • Something about Deep Learning for hierarchical sensory representations, though I am not yet clear on what the principle is that we have learned. Progressive decorrelation through hierarchical representations has long been the canonical view of sensory and systems neuroscience. Just see the preceding paragraph! But can we say something has flowed back from ML/DL? From Yamins and DiCarlo (and others), can we say that optimizing performance at the output layer is sufficient to get decorrelation similar to the nervous system’s?
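The TD idea in the first bullet can be sketched as a simple delta rule. This is a toy illustration; the learning rate, discount, and one-step setup are made up for the example, not taken from any paper:

```python
def td_update(value, reward, next_value, alpha=0.1, gamma=0.9):
    """One TD(0) step: move the value estimate toward the bootstrapped target."""
    delta = reward + gamma * next_value - value  # the "dopamine-like" prediction error
    return value + alpha * delta, delta

# A cue that reliably predicts a reward of 1: the learned value climbs toward
# the reward, and the prediction error shrinks toward zero over trials.
v, delta = 0.0, None
for _ in range(200):
    v, delta = td_update(v, reward=1.0, next_value=0.0)
```

After enough trials `v` approaches 1.0 and `delta` approaches 0 – the same signature reported for dopamine neurons once a cue fully predicts its reward.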

And yet… what else? Bayes goes back to Helmholtz, in a way, and at least precedes “machine learning” as a field. Are there examples of the brain implementing… an HMM? t-SNE? SVMs? Discriminant analysis (okay, maybe this is another example)?

My money is on ideas from Deep Learning filtering back into neuroscience – dropout and LSTMs and so on – but I am not convinced they have made a major impact yet.

RIP Marvin Minsky, 1927-2016

Marvin Minsky in Detroit

I awoke to sad news this morning – Marvin Minsky passed away at the age of 88. Minsky’s was the first serious work on artificial intelligence that I ever read and one of the reasons I am in neuroscience today.

Minsky is perhaps most infamous for his book Perceptrons, which showed that the neural networks of the time had problems with computations such as XOR (here is the solution, which every neuroscientist should know!).

Minsky is also known for the Dartmouth Summer Research Conference, whose proposal is really worth reading in full.

Fortunately, Minsky put many of his writings online which I have been rereading this morning. You could read his thoughts on communicating with Alien Intelligence:

All problem-solvers, intelligent or not, are subject to the same ultimate constraints–limitations on space, time, and materials. In order for animals to evolve powerful ways to deal with such constraints, they must have ways to represent the situations they face, and they must have processes for manipulating those representations.

ECONOMICS: Every intelligence must develop symbol-systems for representing objects, causes and goals, and for formulating and remembering the procedures it develops for achieving those goals.

SPARSENESS: Every evolving intelligence will eventually encounter certain very special ideas–e.g., about arithmetic, causal reasoning, and economics–because these particular ideas are very much simpler than other ideas with similar uses.

He also mentions this, which sounds fascinating. I was not aware of this but cannot find the actual paper. If anyone can send me the citation, please leave a comment!

A TECHNICAL EXPERIMENT. I once set out to explore the behaviors of all possible processes–that is, of all possible computers and their programs. There is an easy way to do that: one just writes down, one by one, all finite sets of rules in the form which Alan Turing described in 1936. Today, these are called “Turing machines.” Naturally, I didn’t get very far, because the variety of such processes grows exponentially with the number of rules in each set. What I found, with the help of my student Daniel Bobrow, was that the first few thousand such machines showed just a few distinct kinds of behaviors. Some of them just stopped. Many just erased their input data. Most quickly got trapped in circles, repeating the same steps over again. And every one of the remaining few that did anything interesting at all did the same thing. Each of them performed the same sort of “counting” operation: to increase by one the length of a string of symbols–and to keep repeating that. In honor of their ability to do what resembles a fragment of simple arithmetic, let’s call them “A-Machines.” Such a search will expose some sort of “universe of structures” that grows and grows. For our combinations of Turing machine rules, that universe seems to look something like this:

minsky turing machines

In Why Most People Think Computers Can’t, he gets off a couple of cracks at people who think computers can’t do anything humans can:

Most people assume that computers can’t be conscious, or self-aware; at best they can only simulate the appearance of this. Of course, this assumes that we, as humans, are self-aware. But are we? I think not. I know that sounds ridiculous, so let me explain.

If by awareness we mean knowing what is in our minds, then, as every clinical psychologist knows, people are only very slightly self-aware, and most of what they think about themselves is guess-work. We seem to build up networks of theories about what is in our minds, and we mistake these apparent visions for what’s really going on. To put it bluntly, most of what our “consciousness” reveals to us is just “made up”. Now, I don’t mean that we’re not aware of sounds and sights, or even of some parts of thoughts. I’m only saying that we’re not aware of much of what goes on inside our minds.

Finally, he has some things to say on Symbolic vs Connectionist AI:

Thus, the present-day systems of both types show serious limitations. The top-down systems are handicapped by inflexible mechanisms for retrieving knowledge and reasoning about it, while the bottom-up systems are crippled by inflexible architectures and organizational schemes. Neither type of system has been developed so as to be able to exploit multiple, diverse varieties of knowledge.

Which approach is best to pursue? That is simply a wrong question. Each has virtues and deficiencies, and we need integrated systems that can exploit the advantages of both. In favor of the top-down side, research in Artificial Intelligence has told us a little—but only a little—about how to solve problems by using methods that resemble reasoning. If we understood more about this, perhaps we could more easily work down toward finding out how brain cells do such things. In favor of the bottom-up approach, the brain sciences have told us something—but again, only a little—about the workings of brain cells and their connections.

Apparently, he viewed the symbolic/connectionist split like so:

minsky connectionist vs symbolic

How a neural network can create music

Playing chess, composing classical music, __: computer programmers love creating ‘AIs’ that can do this stuff. Music, especially, is always fun: there is a long history of programs that can create new songs so good that they fool professional musicians (who cannot tell the difference between a Chopin song and a generated song – listen to some here; here is another video).

I do not know how these have worked; I would guess a genetic algorithm, hidden Markov model, or neural network of some sort. Thankfully, Daniel Johnson has just created such a neural network and laid out the logic behind it in beautiful detail:

Music composing neural network

The power of this is that it enables the network to have a simple version of memory, with very minimal overhead. This opens up the possibility of variable-length input and output: we can feed in inputs one-at-a-time, and let the network combine them using the state passed from each time step.

One problem with this is that the memory is very short-term. Any value that is output in one time step becomes input in the next, but unless that same value is output again, it is lost at the next tick. To solve this, we can use a Long Short-Term Memory (LSTM) node instead of a normal node. This introduces a “memory cell” value that is passed down for multiple time steps, and which can be added to or subtracted from at each tick. (I’m not going to go into all of the details, but you can read more about LSTMs in the original paper.)…

However, there is still a problem with this network. The recurrent connections allow patterns in time, but we have no mechanism to attain nice chords: each note’s output is completely independent of every other note’s output. Here we can draw inspiration from the RNN-RBM combination above: let the first part of our network deal with time, and let the second part create the nice chords. But an RBM gives a single conditional distribution of a bunch of outputs, which is incompatible with using one network per note.

The solution I decided to go with is something I am calling a “biaxial RNN”. The idea is that we have two axes (and one pseudo-axis): there is the time axis and the note axis (and the direction-of-computation pseudo-axis). Each recurrent layer transforms inputs to outputs, and also sends recurrent connections along one of these axes. But there is no reason why they all have to send connections along the same axis!
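The LSTM “memory cell” described in the excerpt above can be sketched in miniature. This is a toy, scalar version with no biases or recurrent input weights – real LSTM layers use vectors and matrices – but it shows the gating logic that lets a value persist across ticks:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, c_prev, w_f, w_i, w_o, w_g):
    """One step of a scalar LSTM cell. The cell state c is gated and added
    to, rather than overwritten, which is what gives it longer-term memory."""
    f = sigmoid(w_f * x)      # forget gate: how much of the old cell to keep
    i = sigmoid(w_i * x)      # input gate: how much new content to write
    g = math.tanh(w_g * x)    # candidate content
    c = f * c_prev + i * g    # additive memory update: the key LSTM idea
    h = sigmoid(w_o * x) * math.tanh(c)  # output gate reads out the cell
    return c, h
```

Because the update to `c` is additive and gated, gradients (and information) can survive many more time steps than in a plain recurrent node, which is exactly the short-term-memory problem the quote describes.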

What blows me away – and yes, I am often blown away these days – is how relatively simple all these steps are. By using logical, standard techniques for neural networks (and these are not deep), the programmer on the street can create programs that are easily able to do things that were almost unfathomable a decade ago. This is not just pattern separation, but also generation.

Rationality and the machina economicus

Science magazine had an interesting series of review articles on Machine Learning last week. Two of them were different perspectives of the exact same question: how does traditional economic rationality fit into artificial intelligence?

At the core of much AI work are concepts of optimal ‘rational decision-makers’. That is, the intelligent program is essentially trying to maximize some defined objective function, known in economics as maximizing utility. Where the computer and economic traditions diverge is in their implementation: computers need explicit algorithms, and often must account for non-traditional resource constraints such as computation time, whereas economics leaves these unspecified outside of trivial cases.

economics of thinking

How can we move from the classical view of a rational agent who maximizes expected utility over an exhaustively enumerable state-action space to a theory of the decisions faced by resource-bounded AI systems deployed in the real world, which place severe demands on real-time computation over complex probabilistic models?

We see the attainment of an optimal stopping time, in which attempts to compute additional precision come at a net loss in the value of action. As portrayed in the figure, increasing the cost of computation would lead to an earlier ideal stopping time. In reality, we rarely have such a simple economics of the cost and benefits of computation. We are often uncertain about the costs and the expected value of continuing to compute and so must solve a more sophisticated analysis of the expected value of computation.

Humans and other animals appear to make use of different kinds of systems for sequential decision-making: “model-based” systems that use a rich model of the environment to form plans, and a less complex “model-free” system that uses cached values to make decisions. Although both converge to the same behavior with enough experience, the two kinds of systems exhibit different tradeoffs in computational complexity and flexibility. Whereas model-based systems tend to be more flexible than the lighter-weight model-free systems (because they can quickly adapt to changes in environment structure), they rely on more expensive analyses (for example, tree-search or dynamic programming algorithms for computing values). In contrast, the model-free systems use inexpensive, but less flexible, look-up tables or function approximators.
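The model-based/model-free tradeoff in that passage can be made concrete with a toy example. The tiny two-state world and all its numbers are made up for illustration:

```python
# A one-decision world: from state 's', action 'a' or 'b' leads to a reward.
# The "model" maps (state, action) -> (next_state, reward).
model = {('s', 'a'): ('t1', 1.0), ('s', 'b'): ('t2', 5.0)}

# Model-free system: decisions come from a cheap cached-value table,
# learned slowly from experience.
cached_q = {('s', 'a'): 1.0, ('s', 'b'): 5.0}

def model_free_choice(state):
    return max(('a', 'b'), key=lambda a: cached_q[(state, a)])

# Model-based system: decisions come from (more expensive) lookahead
# in the internal model of the environment.
def model_based_choice(state):
    return max(('a', 'b'), key=lambda a: model[(state, a)][1])

# With enough experience the two agree...
assert model_free_choice('s') == model_based_choice('s') == 'b'

# ...but if the world changes, only the model-based system adapts at once.
model[('s', 'b')] = ('t2', 0.0)
print(model_based_choice('s'))  # 'a' immediately
print(model_free_choice('s'))   # still 'b' until the cache is relearned
```

The cached table is a single lookup per decision; the model-based choice has to evaluate the model, which in realistic settings means tree search or dynamic programming – hence the flexibility-versus-cost tradeoff the review describes.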

That being said, what does economics have to offer machine learning? Parkes and Wellman try to offer an answer and basically say – game theory. Which is not something that economics can ‘offer’ so much as ‘offered a long, long time ago’. A recent interview with Parkes puts this in perspective:

Where does current economic theory fall short in describing rational AI?

Machina economicus might better fit the typical economic theories of rational behavior, but we don’t believe that the AI will be fully rational or have unbounded abilities to solve problems. At some point you hit the intractability limit—things we know cannot be solved optimally—and at that point, there will be questions about the right way to model deviations from truly rational behavior…But perfect rationality is not achievable in many complex real-world settings, and will almost surely remain so. In this light, machina economicus may need its own economic theories to usefully describe behavior and to use for the purpose of designing rules by which these agents interact.

Let us admit that economics is not fantastic at describing trial-to-trial individual behavior. What can economics offer the field of AI, then? Systems for multi-agent interaction. After all, markets are what are at the heart of economics:

At the multi-agent level, a designer cannot directly program behavior of the AIs but instead defines the rules and incentives that govern interactions among AIs. The idea is to change the “rules of the game”…The power to change the interaction environment is special and distinguishes this level of design from the standard AI design problem of performing well in the world as given.

For artificial systems, in comparison, we might expect AIs to be truthful where this is optimal and to avoid spending computation reasoning about the behavior of others where this is not useful…. The important role of mechanism design in an economy of AIs can be observed in practice. Search engines run auctions to allocate ads to positions alongside search queries. Advertisers bid for their ads to appear in response to specific queries (e.g., “personal injury lawyer”). Ads are ranked according to bid amount (as well as other factors, such as ad quality), with higher-ranked ads receiving a higher position on the search results page.

Early auction mechanisms employed first-price rules, charging an advertiser its bid amount when its ad receives a click. Recognizing this, advertisers employed AIs to monitor queries of interest, ordered to bid as little as possible to hold onto the current position. This practice led to cascades of responses in the form of bidding wars, amounting to a waste of computation and market inefficiency. To combat this, search engines introduced second-price auction mechanisms, which charge advertisers based on the next-highest bid price rather than their own price. This approach (a standard idea of mechanism design) removed the need to continually monitor the bidding to get the best price for position, thereby ending bidding wars.

But what comes across most in the article is how much economics needs to seriously consider AI (and ML more generally):

The prospect of an economy of AIs has also inspired expansions to new mechanism design settings. Researchers have developed incentive-compatible multiperiod mechanisms, considering such factors as uncertainty about the future and changes to agent preferences because of changes in local context. Another direction considers new kinds of private inputs beyond preference information.

I would have loved to see an article on “what machine learning can teach economics” or how tools in ML are transforming the study of markets.

Science also had one article on “trends and prospects” in ML and one on natural language processing.


Parkes, D., & Wellman, M. (2015). Economic reasoning and artificial intelligence Science, 349 (6245), 267-272 DOI: 10.1126/science.aaa8403

Gershman, S., Horvitz, E., & Tenenbaum, J. (2015). Computational rationality: A converging paradigm for intelligence in brains, minds, and machines Science, 349 (6245), 273-278 DOI: 10.1126/science.aac6076

Small autonomous drones

Nature has a fascinating review on drones – and especially microdrones!


For those who don’t have access, here are some highlights (somewhat technical):

Propulsive efficiencies for rotorcraft degrade as the vehicle size is reduced; an indicator of the energetic challenges for flight at small scales. Smaller size typically implies lower Reynolds numbers, which in turn suggests an increased dominance of viscous forces, causing greater drag coefficients and reduced lift coefficients compared with larger aircraft. To put this into perspective, this means that a scaled-down fixed-wing aircraft would be subject to a lower lift-to-drag ratio and thereby require greater relative forward velocity to maintain flight, with the associated drag and power penalty reducing the overall energetic efficiency. The impacts of scaling challenges (Fig. 3) are that smaller drones have less endurance, and that the overall flight times range from tens of seconds to tens of minutes — unfavourable compared with human-scale vehicles.

There are, however, manoeuvrability benefits that arise from decreased vehicle size. For example, the moment of inertia is a strong function of the vehicle’s characteristic dimension — a measure of a critical length of the vehicle, such as the chord length of a wing or length of a propeller in a similar manner as used in Reynolds number scaling. Because the moment of inertia of the vehicle scales with the characteristic dimension, L, raised to the fifth power, a decrease in size from an 11 m wingspan, four-seat aircraft such as the Cessna 172 to a 0.05 m rotor-to-rotor separation Blade Pico QX quadcopter implies that the Cessna has about 5 × 10^11 times the inertia of the quadcopter (with respect to roll)…This enhanced agility, often achieved at the expense of open-loop stability, requires increased emphasis on control — a challenge also exacerbated by the size, weight and power constraints of these small vehicles.
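The quoted L^5 scaling is easy to sanity-check with the review’s own numbers:

```python
# Moment of inertia scales as (characteristic length)^5, per the quote.
cessna_span = 11.0      # m, Cessna 172 wingspan
quadcopter_sep = 0.05   # m, Blade Pico QX rotor-to-rotor separation

inertia_ratio = (cessna_span / quadcopter_sep) ** 5
print(f"{inertia_ratio:.1e}")  # ~5.2e11, matching the quoted ~5 x 10^11
```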

microdrone flight vs mass


Improvements in microdrones will come from becoming more insect-like and adapting knowledge from biological models:


In many situations, such as search and rescue, parcel delivery in confined spaces and environmental monitoring, it may be advantageous to combine aerial and terrestrial capabilities (multimodal drones). Perching mechanisms could allow drones to land on walls and power lines in order to monitor the environment from a high vantage point while saving energy. Agile drones could move on the ground by using legs in conjunction with retractable or flapping wings. In an effort to minimize the total cost of transport, which will be increased by the additional locomotion mode, these future drones may benefit from using the same actuation system for flight control and ground locomotion…

Many vision-based insect capabilities have been replicated with small drones. For example, it has been shown that small fixed-wing drones and helicopters can regulate their distance from the ground using ventral optic flow while a GPS was used to maintain constant speed and an IMU was used to regulate roll angle. The addition of lateral optic flow sensors also allowed a fixed-wing drone to detect near-ground obstacles. Optic flow has also been used to perform both collision-free navigation and altitude control of indoor and outdoor fixed-wing drones without a GPS. In these drones, the roll angle was regulated by optic flow in the horizontal direction and the pitch angle was regulated by optic flow in the vertical direction, while the ground speed was measured and maintained by wind-speed sensors. In this case, the rotational optic flow was minimized by flying along straight lines interrupted by short turns or was estimated with on-board gyroscopes and subtracted from the total optic flow, as suggested by biological models.

The future ecology of stock traders

I am beyond fascinated by the interactions between competing intelligences that exist in the stock market. It is a bizarre mishmash of humans, AIs, and both (cyborgpeople?).

One recent strategy that exploits this interaction is ‘spoofing‘. The description from the link:

  • You place an order to sell a million widgets at $104.
  • You immediately place an order to buy 10 widgets at $101.
  • Everyone sees the million-widget order and is like, “Wow, lotta supply, the market is going down, better dump my widgets!”
  • So someone is happy to sell you 10 widgets for $101 each.
  • Then you immediately cancel your million-widget order, leaving you with 10 widgets for which you paid $1,010.
  • Then you place an order to buy a million widgets for $101, and another order to sell 10 widgets at $104.
  • Everyone sees the new million-widget order, and since no one has any attention span at all, they are like, “Wow, lotta demand, the market is going up, better buy some widgets!”
  • So someone is happy to buy 10 widgets from you for $104 each.
  • Then you immediately cancel your million-widget order, leaving you with no widgets, no orders and $30 in sweet sweet profits.
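The round-trip arithmetic from the list above, spelled out:

```python
# The spoofer's two small, real trades (the million-widget orders never fill).
bought = 10 * 101   # pay $1,010 for 10 widgets while spoofing supply
sold = 10 * 104     # sell them for $1,040 while spoofing demand
profit = sold - bought
print(profit)  # 30 -- the "sweet sweet profits"
```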

Amusingly enough, you don’t even need a fancy computer program for it – you can just hire a bunch of people who are really good at fast video games and they can click click click those keys fast enough for you.

Now some day trader living in his parents’ basement is accused of using this technique and causing the flash crash of 2010 (it possibly wasn’t him directly, but he could have set off a cascade that led to it).

I’m sitting here with popcorn, waiting to see how the ecosystem of varied intelligences evolves in competition with each other. Sounds like Wall Street needs to take some crash courses in ecology.

How Deep Mind learns to win

About a year ago, DeepMind was bought for half a billion dollars by Google for creating software that could learn to beat video games. Over the past year, DeepMind has detailed how they did it.


Let us say that you were an artificial intelligence with access to a computer screen, a way to play the game (an imaginary video game controller, say), and the current score. How should you learn to beat the game? Well, you have access to three things: the state of the screen (your input), a selection of actions, and a reward (the score). What you would want to do is find the best action to go along with every state.

A well-established way to do this without any explicit modeling of the environment is through Q-learning (a form of reinforcement learning). In Q-learning, every time you encounter a certain state and take an action, you have some guess of what reward you will get. But the world is a complicated, noisy place, so you won’t necessarily always get the same reward back in seemingly-identical situations. So you can just take the difference between the reward you find and what you expected, and nudge your guess a little closer.
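A minimal tabular version of that update might look like this (the state and action names, learning rate, and discount are illustrative, not DeepMind’s):

```python
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> current guess of reward-to-go
alpha, gamma = 0.1, 0.99    # learning rate and discount, chosen arbitrarily here

def q_update(state, action, reward, next_state, actions):
    """Take the difference between what you got (plus the discounted best
    guess for what comes next) and what you expected, and nudge the guess."""
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```

Repeated visits to the same state-action pair average out the noise, which is exactly why the huge, rarely revisited state space of raw pixels is a problem for the plain tabular form.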

This is all fine and dandy, though when you’re looking at a big screen you’ve got a large number of pixels – and a huge number of possible states. Some of them you may never even get to see! Every twist and contortion of two pixels is, theoretically, a completely different state. This would make it implausible to check each state, choose the action and play it again and again to get a good estimate of reward.

What we could do, if we were clever about it, is to use a neural network to learn features about the screen. Maybe sometimes this part of the screen is important as a whole and maybe other times those two parts of the screen are a real danger.

But that is difficult for the Q-learning algorithm. The DeepMind authors list three reasons: (1) correlations in sequence of observations, (2) small updates to Q significantly change the policy and the data distribution, and (3) correlations between action values and target values. It is how they tackle these problems that is the main contribution to the literature.

The strategy is to implement a Deep Convolutional Neural Network to find ‘filters’ that can more easily represent the state space. The network takes in the states – the images on the screen – processes them, and then outputs a value. In order to get around problems (1) and (3) above (the correlations in observations), they take a ‘replay’ approach. Actions that have been taken are stored into memory; when it is time to update the neural network, they grab some of the old state-action pairs out of their bag of memories and learn from that. They liken this to consolidation during sleep, where the brain replays things that had happened during the day.
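The replay idea reduces to a very small amount of machinery; a sketch (buffer size and batch size are illustrative):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)  # oldest memories eventually fall out

def remember(state, action, reward, next_state):
    """Store a transition instead of training on it immediately."""
    replay_buffer.append((state, action, reward, next_state))

def sample_minibatch(batch_size=32):
    """A uniform random draw breaks the temporal correlation of gameplay,
    since consecutive frames are no longer learned in order."""
    return random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))
```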

Further, even though they train the network with their memories after every action, this is not the network that is playing the game. The network that is playing the game stays in stasis and only ‘updates itself’ with what it has learned after a certain stretch of time – again, like it is going to “sleep” to better learn what it had done during the day.
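That periodic-sync arrangement can be sketched as two copies of the network’s parameters (a dict of scalars stands in for real network weights here, and the sync interval is arbitrary):

```python
# Two copies of the parameters: one trained continuously, one held fixed.
online_weights = {"w": 0.0}
target_weights = dict(online_weights)  # the frozen copy, synced only rarely

def train_step(step, sync_every=1000):
    """Stand-in for one gradient update, plus the periodic 'wake-up' sync."""
    online_weights["w"] += 0.01  # placeholder for a real gradient step
    if step % sync_every == 0:
        target_weights.update(online_weights)  # copy learning into the frozen net
```

Between syncs the frozen copy provides stable targets, which is the paper’s fix for the correlation between action values and target values.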

Here is an explanation of the algorithm in a hopefully useful form:


Throughout the article, the authors claim that this may point to new directions for neuroscience research. This being published in Nature, any claims to utility should be taken with a grain of salt. That being said! I am always excited to see what lessons arise when theories are forced to confront reality!

What this shows is that reinforcement learning is a good way to train a neural network in a model-free way. Given that all learning is temporal difference learning (or: TD learning is semi-Hebbian?), this is a nice result though I am not sure how original it is. It also shows that the replay way of doing it – which I believe is quite novel – is a good one. But is this something that sleep/learning/memory researchers can learn from? Perhaps it is a stab in the direction of why it is useful (to deal with correlations).


Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, & Hassabis D (2015). Human-level control through deep reinforcement learning. Nature, 518 (7540), 529-533 PMID: 25719670

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, & Martin Riedmiller (2013). Playing Atari with Deep Reinforcement Learning arXiv arXiv: 1312.5602v1

The future is now, kind of

I normally do most of my blog-writing on the weekend but I was sick! So enjoy these videos of robots.

The new Big Dog, spot. But don’t worry:


(via reddit)

Walter Pitts was Will Hunting

Apparently Walter Pitts (of McCulloch-Pitts) was Good Will Hunting:

Standing face to face, they were an unlikely pair. McCulloch, 42 years old when he met Pitts, was a confident, gray-eyed, wild-bearded, chain-smoking philosopher-poet who lived on whiskey and ice cream and never went to bed before 4 a.m. Pitts, 18, was small and shy, with a long forehead that prematurely aged him, and a squat, duck-like, bespectacled face. McCulloch was a respected scientist. Pitts was a homeless runaway. He’d been hanging around the University of Chicago, working a menial job and sneaking into Russell’s lectures, where he met a young medical student named Jerome Lettvin.

This article is so great I could quote the whole thing. But I won’t! You only get this, and then you must go and read all of it:

Pitts was soon to make a similar impression on one of the towering intellectual figures of the 20th century, the mathematician, philosopher, and founder of cybernetics, Norbert Wiener. In 1943, Lettvin brought Pitts into Wiener’s office at the Massachusetts Institute of Technology (MIT). Wiener didn’t introduce himself or make small talk. He simply walked Pitts over to a blackboard where he was working out a mathematical proof. As Wiener worked, Pitts chimed in with questions and suggestions. According to Lettvin, by the time they reached the second blackboard, it was clear that Wiener had found his new right-hand man. Wiener would later write that Pitts was “without question the strongest young scientist whom I have ever met … I should be extremely astonished if he does not prove to be one of the two or three most important scientists of his generation, not merely in America but in the world at large.”

…His work with Wiener was “to constitute the first adequate discussion of statistical mechanics, understood in the most general possible sense, so that it includes for example the problem of deriving the psychological, or statistical, laws of behavior from the microscopic laws of neurophysiology … Doesn’t it sound fine?”

That winter, Wiener brought Pitts to a conference he organized in Princeton with the mathematician and physicist John von Neumann, who was equally impressed with Pitts’ mind. Thus formed the beginnings of the group who would become known as the cyberneticians, with Wiener, Pitts, McCulloch, Lettvin, and von Neumann its core. And among this rarified group, the formerly homeless runaway stood out. “None of us would think of publishing a paper without his corrections and approval,” McCulloch wrote. “[Pitts] was in no uncertain terms the genius of our group,” said Lettvin. “He was absolutely incomparable in the scholarship of chemistry, physics, of everything you could talk about history, botany, etc. When you asked him a question, you would get back a whole textbook … To him, the world was connected in a very complex and wonderful fashion.”

Here is the original research article – fascinating historically and not at all what I would have expected. Very Principia Mathematica-y for neuroscience, but I suppose that was the time?

The Talking Machines

There’s a great new Machine Learning podcast out called Talking Machines. They only have two episodes out but they are quite serious. They have traveled to NIPS and interviewed researchers, they have discussed A* sampling, and more.

On the most recent episode, they interviewed Ilya Sutskever on Deep Learning. He had two interesting things to say.

First, that DL works well (now) partially because we have figured out the appropriate initialization conditions: weights between units should be small, but not too small (specifically, the eigenvalues of the weight matrix should be ~1). This is what allows the backpropagation to work. Given that real neural networks don’t use backprop, how much thought should neuroscientists give to this? We know that homeostasis and plasticity keep things in a balanced range – you don’t want epilepsy, after all.
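The eigenvalue condition Sutskever mentions can be sketched directly: draw a random recurrent weight matrix and rescale it so its spectral radius (largest eigenvalue magnitude) is ~1. The matrix size and distribution here are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))  # a random recurrent weight matrix

# Spectral radius: the largest |eigenvalue|. Eigenvalues scale linearly
# with the matrix, so dividing by the radius pins it to ~1.
radius = max(abs(np.linalg.eigvals(W)))
W_scaled = W / radius

print(round(max(abs(np.linalg.eigvals(W_scaled))), 3))  # ~1.0
```

With the radius well above 1, repeated application of `W` blows activity (and backpropagated gradients) up; well below 1, everything decays to zero – loosely analogous to the epilepsy-versus-silence range that homeostasis keeps real circuits within.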

Second, that recursion in artificial networks is mostly interesting for temporal sequences. Recurrent connections – such as those to the thalamus – always seem to be understudied (or at least, I don’t pay enough attention to them).