My data is bigger than yours. Analyzing the fragility of BigData with Nassim N. Taleb

Gil Press (a BigData guru writing for Forbes), as well as others [1][2][3], has recently suggested that organizations can become Antifragile (gain from disorder) by adopting a BigData strategy. The concept of Antifragility has been described at length by Nassim Nicholas Taleb in his brilliant book Antifragile.

For those who are new to the subject of Antifragility here is a video summary where Taleb gives a general overview:

Now, in the context of BigData, it seems that Gil Press, as well as the marketing departments of many BigData product vendors, have completely misunderstood the concept of Antifragility. Surely, had they actually read the book, they would have noticed what Taleb has to say about BigData, and that it isn’t at all positive. Here is a reminder of his view:

BigData is a version of cherry-picking that destroys the entire spirit of research and makes the abundance of data extremely harmful to knowledge. More data means more information, perhaps, but it also means more false information. We are discovering that fewer and fewer papers replicate—textbooks in, say, psychology need to be revised. As to economics, fuhgetaboudit. You can hardly trust many statistically oriented sciences—especially when the researcher is under pressure to publish for his career. Yet the claim will be “to advance knowledge.”

Recall the notion of epiphenomenon as a distinction between real life and libraries. Someone looking at history from the vantage point of a library will necessarily find many more spurious relationships than one who sees matters in the making, in the usual sequences one observes in real life. He will be duped by more epiphenomena, one of which is the direct result of the excess of data as compared to real signals.

Noise becomes a worse problem, because there is an optionality on the part of the researcher, no different from that of a banker. The researcher gets the upside, truth gets the downside. The researcher’s free option is in his ability to pick whatever statistics can confirm his belief—or show a good result—and ditch the rest. He has the option to stop once he has the right result. But beyond that, he can find statistical relationships—the spurious rises to the surface. There is a certain property of data: in large data sets, large deviations are vastly more attributable to noise (or variance) than to information (or signal).

FIGURE 18. The Tragedy of Big Data.

The more variables, the more correlations that can show significance in the hands of a “skilled” researcher. Falsity grows faster than information; it is nonlinear (convex) with respect to data. There is a difference in medical research between (a) observational studies, in which the researcher looks at statistical relationships on his computer, and (b) the double-blind cohort experiments that extract information in a realistic way that mimics real life. The former, that is, observation from a computer, produces all manner of results that
tend to be, as last computed by John Ioannidis, now more than eight times out of ten, spurious—yet these observational studies get reported in the papers and in some scientific journals. Thankfully, these observational studies are not accepted by the Food and Drug Administration, as the agency’s scientists know better. The great Stan Young, an activist against spurious statistics, and I found a genetics-based study in The New England Journal of Medicine claiming significance from statistical data—while the results to us were no better than random. We wrote to the journal, to no avail.

Figure 18 shows the swelling number of potential spurious relationships. The idea is as follows. If I have a set of 200 random variables, completely unrelated to each other, then it would be near impossible not to find in it a high correlation of sorts, say 30 percent, but that is entirely spurious. There are techniques to control the cherry-picking (one of which is known as the Bonferroni adjustment), but even then they don’t catch the culprits—much as regulation doesn’t stop insiders from gaming the system.

This explains why in the twelve years or so since we’ve decoded the human genome, not much of significance has been found. I am not saying that there is no information in the data: the problem is that the needle comes in a haystack. Even experiments can be marred with bias: the researcher has the incentive to select the experiment that corresponds to what he was looking for, hiding the failed attempts. He can also formulate a hypothesis after the results of the experiment—thus fitting the hypothesis to the experiment. The bias is smaller, though, than in the previous case. The fooled-by-data effect is accelerating. There is a nasty phenomenon called “BigData” in which researchers have brought cherry-picking to an industrial level.

Modernity provides too many variables (but too little data per variable), and the spurious relationships grow much, much faster than real information, as noise is convex and information is concave.

Increasingly, data can only truly deliver via negativa–style knowledge—it can be effectively used to debunk, not confirm. The tragedy is that it is very hard to get funding to replicate—and reject—existing studies. And even if there were money for it, it would be hard to find takers: trying to replicate studies will not make anyone a hero. So we are crippled with a distrust of empirical results, except for those that are negative. To return to my romantic idea of the amateur and tea-drinking English clergyman: the professional researcher competes to “find” relationships. Science must not be a competition; it must not have rankings—we can see how such a system will end up blowing up. Knowledge must not have an agency problem.
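Taleb’s 200-variable thought experiment from the quote above is easy to check for yourself. The following is a sketch (not from the book; the sample size, seed, and 0.3 threshold are arbitrary illustrative choices): generate 200 completely independent random variables and look at all of their pairwise correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 completely unrelated random variables, 100 observations each.
n_obs, n_vars = 100, 200
data = rng.standard_normal((n_obs, n_vars))

# All 200 * 199 / 2 = 19,900 pairwise correlations between them.
corr = np.corrcoef(data, rowvar=False)
upper = corr[np.triu_indices(n_vars, k=1)]

print(f"pairs tested:          {upper.size}")
print(f"largest |correlation|: {np.abs(upper).max():.2f}")
print(f"pairs with |r| > 0.3:  {(np.abs(upper) > 0.3).sum()}")
```

Even though every variable is pure noise, dozens of pairs typically clear the 30 percent correlation mark, simply because so many pairs are being tested—exactly the “falsity grows faster than information” effect.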
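The “researcher’s free option” to stop once the result looks right can also be simulated. Below is a minimal sketch, assuming a made-up protocol (peek at a nominal 5 percent two-sided z-test after every batch of ten observations, up to 200): the data are pure noise, yet the realized false-positive rate ends up far above the advertised 5 percent.

```python
import numpy as np

rng = np.random.default_rng(1)

def peeking_researcher(max_batches=20, batch=10):
    """Collect null (pure-noise) data in batches; stop as soon as the
    test looks 'significant' at the nominal 5% two-sided level."""
    data = np.empty(0)
    for _ in range(max_batches):
        data = np.append(data, rng.standard_normal(batch))
        z = data.mean() / (data.std(ddof=1) / np.sqrt(len(data)))
        if abs(z) > 1.96:      # nominal 5% threshold
            return True        # a "discovery" -- but the true effect is zero
    return False

trials = 2000
false_hits = sum(peeking_researcher() for _ in range(trials))
print(f"nominal rate: 5%, realized false-positive rate: {false_hits / trials:.0%}")
```

The option to stop early is free for the researcher and costly for the truth: merely checking the same test repeatedly multiplies the chance of a spurious hit.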

This post was written by Joachim Bauernberger. If you like it, why not connect with him on LinkedIn?

Valbonne Consulting provides research & consulting on emerging technologies in the Internet/Web of Things (WoT/IoT/M2M). We specialise in decentralisation, security and privacy. We work across a variety of traditional industry verticals (telecommunications, automotive, energy, ...). We support Open Source and technologies built on open standards.

Joachim Bauernberger

Passionate about Open Source, GNU/Linux and Security since 1996. I write about future technology and how to make R&D faster. Expatriate, Entrepreneur, Adventurer and Foodie, currently living near Nice, France.

4 thoughts on “My data is bigger than yours. Analyzing the fragility of BigData with Nassim N. Taleb”

  1. nice to be called a “guru” and I guess being misquoted goes with the title. I never said that “organizations can become Antifragile (gain from disorder) by adopting a BigData strategy.” What I said in the post you refer to (did you read it?) is that big data (like other new developments the IT department has to deal with) is an opportunity rather than a threat to IT departments if they adopt Taleb’s advice and become “Antifragile” in their practices (thrive on chaos rather than resist it). And here is what I did say about Taleb and big data
