And with this hurricane of digital records, carried along in its wake, comes a simple question: How can we have this much data and still not understand collective human behavior?
There are several issues implicit in a question like this. To begin with, it’s not about having the data, but about the ideas and computational follow-through needed to make use of it—a distinction that seems particularly acute with massive digital records of human behavior. When you personally embed yourself in a group of people to study them, much of your data-collection there will be guided by higher-level structures: hypotheses and theoretical frameworks that suggest which observations are important. When you collect raw digital traces, on the other hand, you enter a world where you’re observing both much more and much less—you see many things that would have escaped your detection in person, but you have much less idea what the individual events mean, and have no a priori framework to guide their interpretation. How do we reconcile such radically different approaches to these questions?
In other words, this strategy of recording everything is conceptually very simple in one sense, but it relies on a complex premise: that we must be able to take the resulting datasets and define richer, higher-level structures that we can build on top of them.
What could a higher-level structure look like? Consider one more example—suppose you have a passion for studying the history of the Battle of Gettysburg, and I offer to provide you with a dataset containing the trajectory of every bullet fired during that engagement, and all the movements and words uttered by every soldier on the battlefield. What would you do with this resource? For example, if you processed the final day of the data, here are three distinct possibilities. First, maybe you would find a cluster of actions, movements, and words that corresponded closely to what we think of as Pickett’s Charge, the ill-fated Confederate assault near the close of the action. Second, maybe you would discover that Pickett’s Charge was too coarse a description of what happened—that there is a more complex but ultimately more useful way to organize what took place on the final day at Gettysburg. Or third, maybe you wouldn’t find anything interesting at all; your analysis might spin its wheels but remain mired in a swamp of data that was recorded at the wrong granularity.
We don’t have that dataset for the Battle of Gettysburg, but for public reaction to the 2012 U.S. Presidential Election, or the 2012 U.S. Christmas shopping season, we have a remarkable level of action-by-action detail. And in such settings, there is an effort underway to try defining what the consequential structures might be, and what the current datasets are missing—for even with their scale, they are missing many important things. It’s a convergence of researchers with backgrounds in computation, applied mathematics, and the social and behavioral sciences, at the start of what is by every indication a very hard problem. We see glimpses of the structures that can be found—Trending Topics on Twitter, for example, is in effect a collection of summary news events induced by computational means from the sheer volume of raw tweets—but a general attack on this question is still in its very early stages.
What is the question about your field that you dread being asked?
(In neuroscience? Anything.)