Deep learning and vision

Object recognition is hard. Famously, an attempt to use computers to automatically identify tanks in photos in the 1980s failed in a clever way:

But the scientists were worried: had it actually found a way to recognize if there was a tank in the photo, or had it merely memorized which photos had tanks and which did not? This is a big problem with neural networks: after they have been trained, you have no idea how they arrive at their answers; they just do. The question was, did it understand the concept of tanks vs. no tanks, or had it merely memorized the answers? So the scientists took out the photos they had been keeping in the vault and fed them through the computer. The computer had never seen these photos before, so this would be the big test. To their immense relief, the neural net correctly identified each photo as either having a tank or not having one…

Eventually someone noticed that in the original set of 200 photos, all the images with tanks had been taken on a cloudy day while all the images without tanks had been taken on a sunny day. The neural network had been asked to separate the two groups of photos and it had chosen the most obvious way to do it – not by looking for a camouflaged tank hiding behind a tree, but merely by looking at the colour of the sky. The military was now the proud owner of a multi-million dollar mainframe computer that could tell you if it was sunny or not.
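To make the failure mode concrete, here is a toy sketch of it in Python. Everything in it is an assumption for illustration: the photos are synthetic two-number feature vectors (sky brightness plus a weak "tank-like" cue), the set sizes are made up, and the learner is a one-feature threshold rule (a decision stump) standing in for the neural network, not the system from the story.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_photos(n, confounded):
    """Return (features, labels) for n synthetic 'photos'.

    Column 0 is sky brightness, column 1 is a weak, noisy tank-like cue.
    Both features and all numbers here are hypothetical.
    """
    has_tank = rng.integers(0, 2, size=n)
    if confounded:
        cloudy = has_tank.copy()             # every tank photo is cloudy
    else:
        cloudy = rng.integers(0, 2, size=n)  # weather unrelated to tanks
    brightness = np.where(cloudy == 1,
                          rng.uniform(0.2, 0.4, size=n),
                          rng.uniform(0.6, 0.8, size=n))
    tank_cue = rng.normal(0.45 + 0.10 * has_tank, 0.2)
    return np.column_stack([brightness, tank_cue]), has_tank

def fit_stump(X, y):
    """Pick the single feature, threshold and direction with the best training accuracy."""
    best = (0.0, 0, 0.0, 1)  # (accuracy, feature, threshold, sign)
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        for t in (values[:-1] + values[1:]) / 2.0:
            for sign in (1, -1):
                pred = (sign * (X[:, f] - t) > 0).astype(int)
                acc = np.mean(pred == y)
                if acc > best[0]:
                    best = (acc, f, t, sign)
    return best[1:]

def predict(stump, X):
    f, t, sign = stump
    return (sign * (X[:, f] - t) > 0).astype(int)

def accuracy(stump, X, y):
    return float(np.mean(predict(stump, X) == y))

X_train, y_train = make_photos(200, confounded=True)   # training photos
X_vault, y_vault = make_photos(100, confounded=True)   # held out, same weather bias
X_new, y_new = make_photos(200, confounded=False)      # fresh photos, no weather bias

stump = fit_stump(X_train, y_train)
train_acc = accuracy(stump, X_train, y_train)
vault_acc = accuracy(stump, X_vault, y_vault)
new_acc = accuracy(stump, X_new, y_new)

print("feature chosen:", ["sky brightness", "tank cue"][stump[0]])
print("train / vault / new accuracy:", train_acc, vault_acc, new_acc)
```

Under these assumptions the rule latches onto sky brightness, because it separates the training photos perfectly. It also aces the "vault" photos, which share the same weather bias, and drops to roughly coin-flip accuracy once weather and tanks are decoupled. A held-out set only protects you if it does not share the training set's accident.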

But deep learning, together with huge data sets, has driven a major breakthrough over the last few years:

Today, Olga Russakovsky at Stanford University in California and a few pals review the history of this competition and say that in retrospect, SuperVision’s comprehensive victory was a turning point for machine vision. Since then, they say, machine vision has improved at such a rapid pace that today it rivals human accuracy for the first time. [NE: I don’t think this is quite true…]

Convolutional neural networks consist of several layers of small neuron collections that each look at small portions of an image. The results from all the collections in a layer are made to overlap to create a representation of the entire image. The layer below then repeats this process on the new image representation, allowing the system to learn about the makeup of the image.
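The description above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real network: the 6x6 toy image, the hand-picked edge filter and the averaging filter are all assumptions chosen for readability, whereas an actual convolutional network learns its filter weights from data.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide a small kernel over a 2-D image (no padding)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output unit sees only a small patch of the input;
            # with stride 1, neighbouring patches overlap.
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

# Toy 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked 3x3 vertical-edge filter (hypothetical; normally learned).
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

# First layer: a 4x4 map of where the edge is.
layer1 = relu(conv2d(image, edge_kernel))

# Second layer repeats the process on the first layer's output.
layer2 = relu(conv2d(layer1, np.ones((2, 2)) / 4.0))

print(layer1)
print(layer2)
```

Each unit in `layer1` depends on a 3x3 patch of the image, and each unit in `layer2` depends on a 2x2 patch of `layer1`, so it indirectly sees a 4x4 region of the original image. That growing field of view is how stacked layers build up from small local features to larger structures.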

An interesting question is how the top algorithms compare with humans when it comes to object recognition. Russakovsky and co have compared humans against machines and their conclusion seems inevitable. “Our results indicate that a trained human annotator is capable of outperforming the best model (GoogLeNet) by approximately 1.7%,” they say… But the trend is clear. “It is clear that humans will soon outperform state-of-the-art image classification models only by use of significant effort, expertise, and time,” say Russakovsky and co.