How a neural network can create music

Playing chess, composing classical music, __: computer programmers love creating ‘AIs’ that can do this stuff. Music, especially, is always fun: there is a long history of programs that create new pieces good enough to fool professional musicians, who cannot tell the difference between a Chopin piece and a generated one (listen to some here; here is another video).

I do not know how these earlier programs worked; I would guess a genetic algorithm, hidden Markov model, or neural network of some sort. Thankfully, Daniel Johnson has just created such a neural network and laid out the logic behind it in beautiful detail:

Music composing neural network

The power of this is that it enables the network to have a simple version of memory, with very minimal overhead. This opens up the possibility of variable-length input and output: we can feed in inputs one-at-a-time, and let the network combine them using the state passed from each time step.
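To make the idea of state passed between time steps concrete, here is a minimal sketch of a single recurrent unit. It is not Johnson's code: the scalar state, the tanh nonlinearity, and the weight values (`w_in`, `w_rec`, `bias`) are all assumptions chosen for illustration.

```python
import math

def rnn_step(x, h, w_in, w_rec, bias):
    """One time step: combine the new input x with the previous state h."""
    return math.tanh(w_in * x + w_rec * h + bias)

def run_rnn(inputs, w_in=0.5, w_rec=0.9, bias=0.0):
    """Feed inputs one at a time; works for a sequence of any length."""
    h = 0.0  # initial state (assumed to be zero)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states

# A single pulse followed by silence: the state carries a fading trace of it.
print(run_rnn([1.0, 0.0, 0.0]))
```

Because the same step function is reused at every tick, the loop accepts input sequences of any length, which is the variable-length property described above.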

One problem with this is that the memory is very short-term. Any value that is output in one time step becomes input in the next, but unless that same value is output again, it is lost at the next tick. To solve this, we can use a Long Short-Term Memory (LSTM) node instead of a normal node. This introduces a “memory cell” value that is passed down for multiple time steps, and which can be added to or subtracted from at each tick. (I’m not going to go into all of the details, but you can read more about LSTMs in the original paper.)…
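As a rough illustration of the memory cell, the sketch below implements one step of a scalar LSTM in its commonly used form with input, forget, and output gates. This formulation, the parameter dictionary `p`, and its key names are assumptions for the example; they are not taken from Johnson's post or from the original paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One step of a scalar LSTM; p maps hypothetical weight names to floats."""
    i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate value
    c_new = f * c + i * g           # memory cell: keep part of the old value, add new
    h_new = o * math.tanh(c_new)    # output exposed to the rest of the network
    return h_new, c_new
```

When the forget gate stays close to 1 and the input gate close to 0, the cell value `c` passes through almost unchanged from tick to tick, which is how a value can survive for many time steps without being output again.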

However, there is still a problem with this network. The recurrent connections allow patterns in time, but we have no mechanism to attain nice chords: each note’s output is completely independent of every other note’s output. Here we can draw inspiration from the RNN-RBM combination above: let the first part of our network deal with time, and let the second part create the nice chords. But an RBM gives a single conditional distribution of a bunch of outputs, which is incompatible with using one network per note.

The solution I decided to go with is something I am calling a “biaxial RNN”. The idea is that we have two axes (and one pseudo-axis): there is the time axis and the note axis (and the direction-of-computation pseudo-axis). Each recurrent layer transforms inputs to outputs, and also sends recurrent connections along one of these axes. But there is no reason why they all have to send connections along the same axis!
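The two-axis idea can be sketched with a toy example: one recurrent layer runs along the time axis separately for each note, and a second layer runs along the note axis separately for each time step. This is only an illustration under strong assumptions (scalar tanh units, shared hypothetical weights `w_in` and `w_rec`, a low-to-high note order); Johnson's actual network is considerably richer.

```python
import math

def recur(seq, w_in, w_rec):
    """Simple recurrent chain along one axis, starting from a zero state."""
    h = 0.0
    out = []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        out.append(h)
    return out

def biaxial_pass(grid, w_in=1.0, w_rec=0.5):
    """grid[t][n] is the input at time step t for note n."""
    n_steps, n_notes = len(grid), len(grid[0])
    # Time-axis layer: each note gets its own recurrent chain over time.
    time_out = [[0.0] * n_notes for _ in range(n_steps)]
    for n in range(n_notes):
        chain = recur([grid[t][n] for t in range(n_steps)], w_in, w_rec)
        for t in range(n_steps):
            time_out[t][n] = chain[t]
    # Note-axis layer: at each time step, recur across the notes.
    return [recur(time_out[t], w_in, w_rec) for t in range(n_steps)]
```

In this sketch the output at a given time step and note depends on earlier time steps (through the first layer) and on lower notes at the same time step (through the second layer), so notes sounding together are no longer independent of one another.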

What blows me away – and yes, I am often blown away these days – is how relatively simple all these steps are. By using logical, standard techniques for neural networks (and these are not deep), the programmer on the street can create programs that are easily able to do things that were almost unfathomable a decade ago. This is not just pattern separation, but also generation.