*tl;dr I propose a half-baked and totally wrong theory of the philosophy of science based on information theory that explains why some fields are more data oriented, and some more theory oriented*

Which is better: data or theory? Here’s a better question: why do some people do theory, and some people analyze data? Or rather, why do some fields tend to have a large set of theorists, and some don’t?

If we look at any individual scientist, we could reasonably say that they are trying to understand the world as well as possible. We could describe this in information theory terms: they are trying to *maximize the information* they have about their description of the world, when given some set of data. One way to think about information is that it *reduces uncertainty*. In other words, when given a set of data we want to reduce our uncertainty about our description of the world as much as possible. When you have no information about something, you are totally uncertain about it. You know nothing! But the more information you have, the less uncertain you are. How do we do that?

Thanks to Shannon, we have an equation that tells us how much information two things share. In other words, how much will knowing one thing tell us about the other:

I(problem; data) = H(problem) – H(problem | data)

This tells you (in bits!) how much more certain you will be about a *problem*, or our *description of the world*, if you get some set of *data*.

H(problem) is the *entropy* of the problem; roughly, it tells us how many different possibilities we have of describing this problem. Is there only one way? Many possible ways? Similarly, H(problem | data) is how many possible ways we have of describing the problem once we’ve seen some data. If we see data, and there are still tons of possibilities, the data has not told us much; we won’t have much information about the problem. But if the data is so precise that for each set of data we know exactly how to describe the problem, then we will have a lot of information!
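To make the two extremes concrete, here is a minimal Python sketch of the equation. The function names and the two toy joint distributions (`precise` and `useless`) are my own illustrative assumptions, not anything taken from real data: four equally likely descriptions, and data that either pins the description down exactly or says nothing about it.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(problem; data) = H(problem) - H(problem | data).

    joint[i][j] is P(problem = i, data = j); rows are descriptions,
    columns are possible data sets.
    """
    p_problem = [sum(row) for row in joint]
    p_data = [sum(col) for col in zip(*joint)]
    h_problem = entropy(p_problem)
    # H(problem | data) averages the leftover uncertainty over each data set.
    h_problem_given_data = 0.0
    for j, pd in enumerate(p_data):
        if pd > 0:
            posterior = [row[j] / pd for row in joint]
            h_problem_given_data += pd * entropy(posterior)
    return h_problem - h_problem_given_data

# Hypothetical case 1: each data set corresponds to exactly one description.
precise = [[0.25 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Hypothetical case 2: the data is independent of the description.
useless = [[1 / 16] * 4 for _ in range(4)]

print(mutual_information(precise))
print(mutual_information(useless))
```

In the first case the data removes all of the uncertainty among the four descriptions; in the second it removes none of it.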

This tells us that there are two ways to maximize our information about our problem if we have a set of data. We can either increase our set of *descriptions* of the problem or we can decrease how many possible ways there are to describe the problem when we see data.

In a high-quality, data-rich world we can mostly get away with the second one: the data isn’t noisy, and will tell us what it represents. Information can simply be maximized by collecting more data. But what happens when the data is really noisy? Collecting more data gives us a smaller marginal improvement in information than working on the set of descriptions – modeling and theory.
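As a rough illustration of how noise eats into the information, here is a toy Python sketch. The model is purely my assumption for the sake of the example: `n_descriptions` equally likely descriptions, and data that points at the true one with probability `1 - noise` and at a uniformly random wrong one otherwise. Real fields are obviously messier than this.

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a coin that lands heads with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_bits(n_descriptions, noise):
    """I(problem; data) under the assumed symmetric-noise toy model.

    Requires n_descriptions >= 2 and 0 <= noise <= 1.
    """
    h_problem = math.log2(n_descriptions)
    # After seeing data we are unsure (a) whether it was corrupted and,
    # if so, (b) which of the other descriptions is the true one.
    h_problem_given_data = binary_entropy(noise) + noise * math.log2(n_descriptions - 1)
    return h_problem - h_problem_given_data

for noise in (0.0, 0.1, 0.4):
    print(noise, information_bits(4, noise))
```

With clean data we get the full H(problem) back; as the noise grows, H(problem | data) grows with it and the information shrinks, which is the regime the argument here is about.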

This explains why some fields have more theory than others. One of the hopes of Big Data is that it will reduce the noise in the data, shifting fields to focusing on the H(problem|data) part of the equation. On the other hand, the data in economics, honestly, kind of sucks. It’s dirty and noisy and we can’t even agree on what it’s telling us. Hence, marginal improvements come by creating new theory!

Look at the history of physics; for a long time, physics was getting high-quality data and had a lot of experimentalists. Since, oh, the 50s or so it’s been getting progressively harder to get new, good data. Hence, theorists!

Biology, too, has such an explosion of data it’s hard to know what to do with it. If you put your mind to it, it’s surprisingly easy to get good data that tells you about *something*.

**Theory: proposes possible alternatives for how the world could work [H(problem)]; data limits H(problem | data). The problem is that data itself is noisy.**

I meant to comment on this last week, but somehow it slipped into a dusty back corner of my mind to only see the light of late evening now. I love half-baked theories for the philosophy of science. It is an active pastime for me! The appeal to information theory fits particularly well with my advocacy of an algorithmic philosophy of science.

Two more specific comments:

[1] I think you need your system to distinguish passive data (naturalistic observation) from active data (experimental realization of edge events), along with their effects on learning. Unfortunately, information theory does not capture this easily: because you are concerned with maximizing information, you will always prefer experimental realization of edge events, since they are informative. However, if you take a look at computational learning theory, and in particular at the PAC variant of Angluin’s minimally adequate teacher model, then you will notice the importance of both types of observations. Naturalistic observations teach you about the distribution of the ‘typical’ world, which is essential for low generalization error and thus for engineering, while experiments help you pull out potential falsifiers that are information rich but unlikely to be sampled by chance.

[2] I doubt that many theorists would think of themselves as simply trying to maximize the number and variety of stories (i.e. maximizing H(problem)). In fact, once a paradigm is established, I feel like most theorists are actually concerned with separating reasonable from unreasonable theories, and thus with segmenting and reducing the space of problems in your analogy. I think that, for your story to work, both the theorists and the experimentalists in it are actually experimentalists. The ones trying to maximize H(problem) are the ones writing grants and trying to come up with potential falsifiers, while the ones trying to minimize H(problem|data) are the graduate students running the experiments to please their boss.