Learning to see through semantics

Humans have a visual bias: everything in vision seems easy and natural to us, and it can seem a bit of a mystery why computers are so bad at it. But there is a reason such a massive chunk (about 30%) of cortex is devoted to it. It’s really hard! To do everything that it needs to, the brain splits up the stream of visual information into a few different streams. One of these streams, which goes down the ventral (purple, above) portion of the brain, is linked to object recognition and representing abstract forms.

For companies like Facebook or Google, copying this would be something of a holy grail. Think how much better image search would be if you could properly pull out what objects are in the image. As it is, though, reliably recognizing the objects in an image is still fairly hard for a computer.

Jon Shlens recently visited from Google and gave a talk about their recent research on improving image search (which I see will be presented as a poster at NIPS this week). In order to extract abstract form, they decided, they must find a way to abstract the concept of each image. There is one really obvious way to do this: use words. Semantic space is rich and very easily trainable (and something Google has ample practice at).

[Figure: Shlens filters]

First, they want a way to do things very quickly. One way to get at the structure of an image is to use different ‘filters’ that represent underlying properties of the image. When moved across an image, the combination of these filters can reconstruct the image and identify its important underlying components. Unfortunately, this is slow: every filter has to be compared against every position in the image, which adds up to many, many dot products. Instead, they just choose a few points on the filters to compare (left), which speeds things up without a loss of sensitivity.
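To make the "few points" idea concrete, here is a minimal sketch in Python. It is not the method from the paper, just an illustration under my own assumptions: the filter is random, the patches are synthetic, and the names `full_response` and `sampled_response` are made up for this example. The full response is a dot product over every pixel; the sampled one looks at only a handful of positions and rescales.

```python
import numpy as np

def full_response(patch, filt):
    """Filter response as a dot product over every pixel."""
    return float(np.dot(patch.ravel(), filt.ravel()))

def sampled_response(patch, filt, idx):
    """Approximate the response using only the pixel positions in idx.

    The partial dot product is rescaled so it is comparable to the full one.
    """
    scale = patch.size / len(idx)
    return float(scale * np.dot(patch.ravel()[idx], filt.ravel()[idx]))

rng = np.random.default_rng(0)
filt = rng.standard_normal((16, 16))                 # hypothetical filter
match = filt + 0.1 * rng.standard_normal((16, 16))   # patch resembling the filter
other = rng.standard_normal((16, 16))                # unrelated patch

# Compare 32 of the 256 positions instead of all of them (an arbitrary choice).
idx = rng.choice(filt.size, size=32, replace=False)

for name, patch in [("match", match), ("other", other)]:
    print(name, full_response(patch, filt), sampled_response(patch, filt, idx))
```

The sampled version does one eighth of the multiplications here. How few points you can get away with before matches and non-matches blur together depends on the filters and the images, which is exactly the trade-off the talk was about.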

Once they can do that quickly, they train a deep-learning artificial neural network (ANN) on the images to try to classify them. This does okay. The fancy-pants part is where they also train an ANN on words in Wikipedia. This gives them relationships between all sorts of words and puts the words in an underlying continuous space. Now words have a ‘distance’ between them that tells how similar they are.
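As a rough illustration of what a ‘distance’ between words means, here is a toy sketch. The three-dimensional vectors are invented by me for the example (real embeddings are learned from text and have hundreds of dimensions), and cosine similarity is just one common choice of measure, not necessarily the one used in the talk.

```python
import numpy as np

# Hypothetical word vectors, made up for illustration only.
embeddings = {
    "oboe":         np.array([0.9, 0.8, 0.1]),
    "bassoon":      np.array([0.8, 0.9, 0.2]),
    "punching bag": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for word in ("bassoon", "punching bag"):
    sim = cosine_similarity(embeddings["oboe"], embeddings[word])
    print(f"oboe vs {word}: {sim:.2f}")
```

With these made-up numbers, the oboe sits much closer to the bassoon than to the punching bag, which is the kind of structure a learned word space is supposed to capture.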

[Figure: ANN guess]

By combining the word data with the visual data, they get a ~83% improvement in performance. More importantly, even when the system is wrong it is only kind of wrong. Look at the sample above: on the left are the guesses of the combined semantic-visual engine and on the right are those of the vision-only guesser. With vision-only, guesses vary widely for the same object: a punching bag, a whistle, a bassoon, and a letter opener may all be long straight objects but they’re not exactly in the same class of things. On the other hand, an English horn, an oboe and a bassoon are pretty similar (good guesses); even a hand is related, in that it is used to play an instrument. Clearly the semantic-visual engine can pick out the class of object it is looking at even if it can’t get the precise word 100% of the time. This engine does very well on unseen data and scales very well across many labels.
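Here is one way to picture why the combined system is only "kind of wrong". This is a sketch under my own assumptions, not the published model: I assume the visual network's output gets mapped into the word space by a linear projection (the matrix `W` below is hypothetical and hand-picked, where a real one would be learned), and that labels are then ranked by how close their word vectors are to the projected image. The label vectors are invented too.

```python
import numpy as np

# Hypothetical word vectors for the candidate labels (made up for illustration).
label_vecs = {
    "oboe":          [0.9, 0.8, 0.1],
    "english horn":  [0.85, 0.85, 0.1],
    "bassoon":       [0.8, 0.9, 0.2],
    "punching bag":  [0.1, 0.2, 0.9],
    "letter opener": [0.2, 0.1, 0.8],
}

def rank_labels(image_vec, label_vecs):
    """Return labels sorted from most to least similar to image_vec (cosine)."""
    v = np.asarray(image_vec, dtype=float)
    v = v / np.linalg.norm(v)
    scores = {}
    for label, vec in label_vecs.items():
        u = np.asarray(vec, dtype=float)
        scores[label] = float(np.dot(v, u / np.linalg.norm(u)))
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical projection from 4-d visual features into the 3-d word space.
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
visual_features = np.array([0.8, 0.9, 0.1, 0.5])  # made-up output of a vision net

ranking = rank_labels(W @ visual_features, label_vecs)
print(ranking)
```

Because the image lands in a neighbourhood of the word space rather than on a single label, the top guesses all come from that neighbourhood. In this toy setup that means the three woodwinds outrank the punching bag and the letter opener, whichever one ends up first.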

This all makes me wonder: what other sensory modalities could they add? It’s Google, so potentially they could be crawling data from a ‘link-space’ representation. In animals we could add auditory and mechanosensory (touch) input. And does this mean that the study of vision is missing something? Could animals have a sort of ‘semantic’ representation of the world in order to better understand visual or other sensory information? Perhaps multimodal integration is actually the key to understanding our senses.


Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Ranzato M, & Mikolov T (2013). DeViSE: A Deep Visual-Semantic Embedding Model. NIPS.

Dean T, Ruzon MA, Segal M, Shlens J, Vijayanarasimhan S, & Yagnik J (2013). Fast, Accurate Detection of 100,000 Object Classes on a Single Machine. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. DOI: 10.1109/CVPR.2013.237