Alert: empirical parasites are taking advantage of data scientists

The aerial view of the concept of collecting data is beautiful. What could be better than high-quality information carefully examined to give a p-value less than .05? The potential for leveraging these results for narrow papers in high-profile journals, never to be checked except by other independent studies costing thousands – tens of thousands – is a moral imperative to honor those who put the time and effort into collecting that data.

However, many of us who have actually performed data analyses, managed large data sets and analyses, and curated data sets have concerns about the details. The first concern is that someone who is not regularly involved in the analysis of data may not understand the choices involved in statistical testing. Special problems arise if data are to be combined from independent experiments and considered comparable. How heterogeneous were the study populations? Does the underlying data fulfill the assumptions for each test? Can it be assumed that the differences found are due to chance or improper correction for complex features of the data set?

A second concern held by some is that a new class of research person will emerge – people who have very little mathematical and computational training but analyze data for their own ends, possibly stealing from the research productivity of those who have invested much of their career in these very areas, or even use the data to try to prove what the original investigators had posited before data collection! There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “empirical parasites”.

Wait wait, sorry, that was an incredibly stupid argument. I don’t know how I could have even come up with something like that… It’s probably something more like this:

A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

Yes, that’s it: open science could lead the way to research parasites analyzing other people’s data. I now look forward to the many other subtle insights on science that the editors of NEJM have to offer.

4 thoughts on “Alert: empirical parasites are taking advantage of data scientists”

  1. I enjoy the tongue-in-cheek tone of this article, but there is an aspect of this discussion that I have only recently come to realize is relevant. The realization came from starting to work with experimentalists, and trying to make some of my purely theoretical work more empiricism-friendly. And the realization is that the experiments that experimentalists do are not the ones that theorists need done, and the theories that theorists build are not the ones that help experimentalists. In biomedical fields, there is just a disconnect and the only way to get over it (right now, at least) is to get experimentalists and theorists working together: designing experiments together, building math together. The obvious way to do it, and the way it has happened to me, is to have experimentalists and theorists physically next to each other and thinking about similar questions.

    But how open data will affect this is not obvious to me. On the positive side, it could be that as theorists look at more data sets, they will realize that their theories are inadequate and thus start building more useful ones. Those theories will then become useful to experimentalists, who will start to use them instead of just p < 0.01 from a statistical black box, and that will make them adjust their experiments. Or the experimentalists who already build things that are more useful to theorists will get more citations to their data and thus propagate. But on the negative side, it could just let theorists and experimentalists continue to exist as isolated communities. Theorists will then start to cherry-pick black-box experiments that play nice with their theories, ignoring all the subtleties of the experiment that only the experimentalists understand, and end up using these databases of experimental results in much the same way that the caricature of the math-phobic experimentalist uses statistical techniques.

    I obviously hope for the positive case, and think there are arguments for open data that go beyond the efficiency of theory-experiment collaboration. But I can see where somebody might make an argument in the other direction.
