# Explaining the structure of science through information theory

tl;dr: I propose a half-baked and totally wrong theory in the philosophy of science, based on information theory, that explains why some fields are more data-oriented and some more theory-oriented.

Which is better: data or theory? Here’s a better question: why do some people do theory, and some people analyze data? Or rather, why do some fields tend to have a large set of theorists, and some don’t?

If we look at any individual scientist, we could reasonably say that they are trying to understand the world as well as possible. We can describe this in information-theoretic terms: given some set of data, they are trying to maximize the information they have about their description of the world. One way to think about information is that it reduces uncertainty. In other words, given a set of data we want to reduce our uncertainty about our description of the world as much as possible. When you have no information about something, you are totally uncertain about it. You know nothing! But the more information you have, the less uncertain you are. How do we do that?

Thanks to Shannon, we have an equation that tells us how much information two things share. In other words, how much will knowing one thing tell us about the other:

I(problem; data) = H(problem) – H(problem | data)

This tells you (in bits!) how much more certain you will be about a problem (our description of the world) if you get some set of data.

H(problem) is the entropy of the problem; it tells us how many different possibilities we have for describing it. Is there only one way? Many possible ways? Similarly, H(problem | data) is how many possible ways we have of describing the problem once we've seen some data. If we see data and there are still tons of possibilities, the data has not told us much; we won't have much information about the problem. But if the data is so precise that for each set of data we know exactly how to describe the problem, then we will have a lot of information!
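If you want to play with this, here's a quick sketch of Shannon's equation in code. The two joint distributions at the bottom are toy examples I made up: one where the data pins down the description exactly, and one where the data is completely uninformative.

```python
import math

def entropy(dist):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def mutual_information(joint):
    """I(problem; data) = H(problem) - H(problem | data), for a joint
    distribution given as a dict {(description, observation): probability}."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    # Marginals p(description) and p(observation).
    px = [sum(joint.get((x, y), 0) for y in ys) for x in xs]
    py = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    # H(problem | data) = sum over observations of p(y) * H(problem | data=y).
    h_x_given_y = 0.0
    for y in ys:
        if py[y] == 0:
            continue
        cond = [joint.get((x, y), 0) / py[y] for x in xs]
        h_x_given_y += py[y] * entropy(cond)
    return entropy(px) - h_x_given_y

# Precise data: each observation identifies the description uniquely.
precise = {("A", 0): 0.5, ("B", 1): 0.5}
# Useless data: observations are independent of the description.
useless = {("A", 0): 0.25, ("A", 1): 0.25, ("B", 0): 0.25, ("B", 1): 0.25}

print(mutual_information(precise))  # 1.0 bit: uncertainty fully removed
print(mutual_information(useless))  # 0.0 bits: the data told us nothing
```

Same equation as above: with two equally likely descriptions, H(problem) is one bit, and the data either eliminates all of it or none of it.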

This tells us that there are two ways to maximize our information about our problem if we have a set of data. We can either increase our set of descriptions of the problem or we can decrease how many possible ways there are to describe the problem when we see data.

In a high-quality, data-rich world we can mostly get away with the second one: the data isn’t noisy, and will tell us what it represents. Information can simply be maximized by collecting more data. But what happens when the data is really noisy? Collecting more data gives us a smaller marginal improvement in information than working on the set of descriptions – modeling and theory.
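To see the diminishing returns concretely, here's a toy simulation (my own made-up noise model, not anything from the literature): a fair binary hypothesis observed through repeated, independently corrupted measurements, where each observation is flipped with some probability.

```python
import math
from itertools import product

def entropy(ps):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def info_from_n_samples(n, flip=0.25):
    """I(hypothesis; n noisy observations): the hypothesis is a fair binary
    variable, and each observation independently flips it with probability
    `flip` (a toy noise model)."""
    # Marginal distribution over all 2^n observation patterns.
    p_patterns = []
    for ys in product([0, 1], repeat=n):
        p = 0.0
        for x in (0, 1):
            p_obs = 1.0
            for y in ys:
                p_obs *= flip if y != x else 1 - flip
            p += 0.5 * p_obs
        p_patterns.append(p)
    # I = H(data) - H(data | hypothesis); given the hypothesis, the n flips
    # are independent, so H(data | hypothesis) = n * H(flip).
    return entropy(p_patterns) - n * entropy([flip, 1 - flip])

for n in range(1, 6):
    print(n, round(info_from_n_samples(n), 3))
```

With no noise (`flip=0.0`) a single observation delivers the full bit and more data is pointless; with noise, the total information still climbs toward one bit, but each extra observation adds less than the one before. That shrinking marginal return is the regime where working on the theory side starts to pay off.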

This explains why some fields have more theory than others. One of the hopes of Big Data is that it will reduce the noise in the data, shifting fields toward the H(problem | data) part of the equation. On the other hand, the data in economics, honestly, kind of sucks. It's dirty and noisy and we can't even agree on what it's telling us. Hence, marginal improvements come from creating new theory!

Look at the history of physics; for a long time, physics was getting high-quality data and had a lot of experimentalists. Since, oh, the 50s or so it’s been getting progressively harder to get new, good data. Hence, theorists!

Biology, too, has such an explosion of data it’s hard to know what to do with it. If you put your mind to it, it’s surprisingly easy to get good data that tells you about something.

To summarize: theory proposes possible alternatives for how the world could work, expanding H(problem); data limits H(problem | data). The problem is that the data itself is noisy.

# Monsanto: out with GMOs, in with big data

Monsanto is (partially) switching from GMOs to “naturally” grown plants:

> The lettuce is sweeter and crunchier than romaine and has the stay-fresh quality of iceberg. The peppers come in miniature, single-serving sizes to reduce leftovers. The broccoli has three times the usual amount of glucoraphanin, a compound that helps boost antioxidant levels…Frescada lettuce, BellaFina peppers, and Beneforté broccoli—cheery brand names trademarked to an all-but-anonymous Monsanto subsidiary called Seminis—are rolling out at supermarkets across the US.
>
> But here’s the twist: The lettuce, peppers, and broccoli—plus a melon and an onion, with a watermelon soon to follow—aren’t genetically modified at all. Monsanto created all these veggies using good old-fashioned crossbreeding…
>
> In 2006, Monsanto developed a machine called a seed chipper that quickly sorts and shaves off widely varying samples of soybean germplasm from seeds. The seed chipper lets researchers scan tiny genetic variations, just a single nucleotide, to figure out if they’ll result in plants with the traits they want—without having to take the time to let a seed grow into a plant. Monsanto computer models can actually predict inheritance patterns, meaning they can tell which desired traits will successfully be passed on. It’s breeding without breeding, plant sex in silico. In the real world, the odds of stacking 20 different characteristics into a single plant are one in 2 trillion. In nature, it can take a millennium. Monsanto can do it in just a few years.
>
> …There they slice open a classic cantaloupe and their own Melorange for comparison. Tolla’s assessment of the conventional variety is scathing. “It tastes more like a carrot,” he says. Mills agrees: “It’s firm. It’s sweet, but that’s about it. It’s flat.” I take bites of both too. Compared with the standard cantaloupe, the Melorange tastes supercharged; it’s vibrant, fruity, and ultrasweet. I want seconds.

I think this neatly illustrates the silliness of much of the debate over GMOs versus natural breeding techniques. One of the interesting facts to come out of this article is the number of GMOs Monsanto has made that haven’t made it out into the world!

> Big agricultural companies say the next revolution on the farm will come from feeding data gathered by tractors and other machinery into computers that tell farmers how to increase their output of crops like corn and soybeans…
>
> The world’s biggest seed company, Monsanto, estimates that data-driven planting advice to farmers could increase world-wide crop production by about \$20 billion a year, or about one-third the value of last year’s U.S. corn crop.
>
> The technology could help improve the average corn harvest to more than 200 bushels an acre from the current 160 bushels, companies say. Such a gain would generate an extra \$182 an acre in revenue for farmers, based on recent prices. Iowa corn farmers got about \$759 an acre last year.

File this under ‘intentional control of our ecology’ and ‘hacking our taste buds’. Next thing you know, they’ll have artificial taste buds…

# What is the question about your field that you dread being asked? (Human collective behavior)

At Edge:

> And with this hurricane of digital records, carried along in its wake, comes a simple question: How can we have this much data and still not understand collective human behavior?
>
> There are several issues implicit in a question like this. To begin with, it’s not about having the data, but about the ideas and computational follow-through needed to make use of it—a distinction that seems particularly acute with massive digital records of human behavior. When you personally embed yourself in a group of people to study them, much of your data-collection there will be guided by higher-level structures: hypotheses and theoretical frameworks that suggest which observations are important. When you collect raw digital traces, on the other hand, you enter a world where you’re observing both much more and much less—you see many things that would have escaped your detection in person, but you have much less idea what the individual events mean, and have no a priori framework to guide their interpretation. How do we reconcile such radically different approaches to these questions?
>
> In other words, this strategy of recording everything is conceptually very simple in one sense, but it relies on a complex premise: that we must be able to take the resulting datasets and define richer, higher-level structures that we can build on top of them.
>
> What could a higher-level structure look like? Consider one more example—suppose you have a passion for studying the history of the Battle of Gettysburg, and I offer to provide you with a dataset containing the trajectory of every bullet fired during that engagement, and all the movements and words uttered by every soldier on the battlefield. What would you do with this resource? For example, if you processed the final day of the data, here are three distinct possibilities. First, maybe you would find a cluster of actions, movements, and words that corresponded closely to what we think of as Pickett’s Charge, the ill-fated Confederate assault near the close of the action. Second, maybe you would discover that Pickett’s Charge was too coarse a description of what happened—that there is a more complex but ultimately more useful way to organize what took place on the final day at Gettysburg. Or third, maybe you wouldn’t find anything interesting at all; your analysis might spin its wheels but remain mired in a swamp of data that was recorded at the wrong granularity.
>
> We don’t have that dataset for the Battle of Gettysburg, but for public reaction to the 2012 U.S. Presidential Election, or the 2012 U.S. Christmas shopping season, we have a remarkable level of action-by-action detail. And in such settings, there is an effort underway to try defining what the consequential structures might be, and what the current datasets are missing—for even with their scale, they are missing many important things. It’s a convergence of researchers with backgrounds in computation, applied mathematics, and the social and behavioral sciences, at the start of what is by every indication a very hard problem. We see glimpses of the structures that can be found—Trending Topics on Twitter, for example, is in effect a collection of summary news events induced by computational means from the sheer volume of raw tweets—but a general attack on this question is still in its very early stages.

What is the question about your field that you dread being asked?

(In neuroscience? Anything.)