If you work anywhere remotely close to an information technology supplier or in an information technology business unit, you have heard the term “big data”, even if you don’t know what it is or what it means. With Big Data: A Revolution that will Transform How We Live, Work and Think, Viktor Mayer-Schönberger and Kenneth Cukier attempt to answer both questions, and raise even more.
As one of the first big books on big data, Big Data could be THAT BOOK. You know, the one your manager hands out at a meeting and asks everyone to read. Houghton Mifflin Harcourt certainly hopes that will happen, as they gave their marketing team permission to create a subtitle that needs neon to do it true justice. Mayer-Schönberger and Cukier produce a text that is much more contemplative and humble than the subtitle suggests.
Mayer-Schönberger and Cukier attempt to balance the hype with risk and reality. Don’t get me wrong, this book is not skeptical of big data’s transformative power, but unlike some hype books written as self-serving marketing tools for an idea about to explode, this book recognizes the long road ahead and the bumps along the way.
Before I go further, I should say I’m a big data skeptic, not as an idea, but in its implementation. I wrote a post for Fast Company titled “Why Big Data Won’t Make You Smart Rich or Pretty” in response to the €1-billion project proposed by Dirk Helbing of the Swiss Federal Institute of Technology in Zurich, the topic of the December 2011 Scientific American cover story. Helbing seeks to do nothing less than foretell the future. Mayer-Schönberger and Cukier don’t mention Helbing, which places them a notch closer to reality even at first glance.
For those of you not familiar with big data, think about everything Amazon knows about what you have purchased, or Google knows about you from your searches, your Google+ and YouTube interactions, along with your connections to other Google properties. Anything and everything you do, or that takes place, in the digital world requires a digital record. A few thousand of those records make a database; several million or billion start to be big data. Consider Twitter alone, which produces 12 terabytes of tweets each day. That is really big data.
Big Data starts with an ample historical context in chapters called “Now” and “More”. “More” concludes with a major change in the use of data: a movement from statistical sampling to being able to digest and analyze everything that has been captured about a particular topic. That observation leads into a chapter titled “Messy” that begins by stating, “Using all available data is feasible in an increasing number of contexts. But it comes at a cost. Increasing the volume opens the door to inexactitude.”
To the authors’ credit, a healthy skepticism runs throughout the book. They point out the profound implications of running science through the big data engine in the cloud not just with random samples of people not related to you, but with all data available about you and everyone else. Amazon does not need to guess what the American public is buying or is likely to buy; it knows exactly what people have purchased, and for the majority of products, it knows exactly what people will likely buy tomorrow. Amazon’s practice includes social science, not just marketing.
The revolution takes place across retail and hard science. Perhaps one of Mayer-Schönberger and Cukier’s more dramatic assertions concerns the elimination of the hypothesis in science. No longer do we need to ask hypothetical questions, says Wired Magazine’s Chris Anderson. If the data fits our suppositions, researchers and clinicians can simply ask the data, in near anthropomorphic fashion, what it knows and the answer will be revealed. The thoughtful pair of Big Data authors call such an end to scientific process and theory “preposterous”.
Another point for rational thinking. They go on to point out that big data itself is a set of theories and that each attempt to interpret the vast sea of data requires conscious, human decisions about what data to use and how to ask the question. In the chapter “Datafication”, they tackle humanity’s propensity to render more and more of what we do, see, hear, say and otherwise experience into some digital representation. We will have ever more data, about more and more things, so that we can eventually ask computers about almost anything.
Of course, asking data for an answer involves serious programming: selecting relevant data, normalizing it so it can be processed by the algorithm and, finally, producing results that a human or machine can act upon. In the chapter called “Correlation” the authors tell the story of the University of Ontario Institute of Technology and IBM teaming up to analyze data coming from premature babies. As they collect 1,260 data points a second, the system can detect the onset of an infection a full 24 hours before the baby presents symptoms. Data from multiple instruments, collected, correlated and acted upon.
Mayer-Schönberger and Cukier, as authors do with this sort of book produced at the peak of a technology’s hype cycle, discuss the value of big data, attempting to make the business case that more big data is better than less big data because the more we can process the more we can understand the world. They frankly say that data isn’t really worth much by itself; it is the option value, how the data might be used in the future, repurposed and reapplied to solve problems or create insight, that provides the value, not the space it occupies on a hard disk.
If you haven’t thought of George Orwell or 1984, let me put him into a big data context. At the beginning of “Risks” Mayer-Schönberger and Cukier remind us that in 2007 the British media reported that 30 surveillance cameras were deployed within 200 yards of the London apartment where Orwell wrote 1984. Big Brother was indeed watching; Orwell, however, had left the building long ago. If you are worried about big data, despite its miraculous correlations, you should be, and the authors are right there with you.
The “Risks” chapter focuses on privacy and free will. In it the authors ask the very disturbing question (to paraphrase): Does predicting what may happen actually increase the likelihood it will happen when not knowing might lead to a different outcome? Unfortunately, in today’s world, it’s hard to know, because most very public things we do, from checking a book out of a public library to attending a concert to buying ingredients for dinner, sit under the scrutiny of an algorithm designed to help us fulfill our inclination to do what we are thinking about doing. And in the obverse to that, if we think someone is highly likely to commit a crime, shouldn’t we prevent them, even punish them, à la Minority Report, before they commit that crime?
The fact that a trade technology book digs into such deep moral territory is a credit to the authors and the publisher. I don’t have the data, because there is no data about the future, but I wonder if people reading this book at Amazon, at IBM, at Accenture, will heed the warnings and cautions and yellow flags thrown out by Mayer-Schönberger and Cukier or if they will blissfully analyze the detail out of the world, oblivious to all but the question at hand.
Perhaps the most commonsense attack against big data and the value of its outcomes can be derived from every person who has experienced an inaccurate credit report, missing inventory in a manufacturing plant, a lost delivery or some other glitch in his or her life at the hands of inaccurate data. Just because we can cognitively describe a better approach to racial profiling, as the authors do, does not mean that we have accurate data in an operational system available to inform us that the person with the Arabic name is not the terrorist being tracked, but a third-generation school teacher from Cincinnati who coaches his son’s soccer team and contributes to NPR.
Just this week (early April 2013), as I write these words, Atlanta educators continue to surrender themselves to authorities for changing test answers to improve scores in order to meet test score standards. Instead of “test score” read “data.” These educators were changing data to affect other data that was being used to measure their performance and to determine program funding. The Atlanta educators had very clear guidance on what good was and very obvious instruments for affecting the goodness of the data. If, as Mayer-Schönberger and Cukier suggest, big data drives behavior as much as it anticipates it, should we not ask: if we measure everything, then how will we know what really matters?
At the most abstract level, algorithms are data, so we must also ask who or what is watching the algorithm to see if it is behaving badly or not. The cover of Wired 17.03 reads: The Secret Formula that Destroyed Wall Street, a cover story that discusses the risk analysis algorithm used by Wall Street whose keepers didn’t understand when its underlying assumptions could no longer be assumed. The authors remind the reader often of the vigilance necessary to obtain positive value from data.
Big Data presents a well-rounded discussion of big data, including its risks and implications. Despite its hyperbolic subtitle, it does so with reason and reflection. What it doesn’t do is address the risks in a way that lets businesses and policy makers remediate them. Most of the recommendations in the “Control” chapter read as Orwellian prescriptions where beneficent overseers make sure the algorithmists don’t do anything too wrong. I’m not sure that translates into actionable data.
Finally, in the “Next” chapter, the authors explore the future and make what I think is one of their more disturbing industrial-age observations: “Because correlations can be found far faster and cheaper than causation, they’re often preferable… for many everyday needs, knowing what not why is good enough.” This implies that because big data machinery pumps out a low-cost product we should be satisfied with it, and should, except in highly sensitive cases, like airline parts, not care why something is, just accept that the what can be applied in a way that results in a positive outcome. That too is a risk: a risk that we become so reliant on the quick and the ready that we cease to explore the underlying principles that govern our existence.
My guess is that Big Data will indeed become THAT BOOK because its wide-ranging examples, strong storytelling moments and exhaustive references (accounting for 27 of the book’s 242 pages) steep it with credibility and relevance. As reviewer Sally Adee pointed out in New Scientist, the authors seem to dialog in the book. We find likely caution from Mayer-Schönberger, author of Delete: The Virtue of Forgetting in the Digital Age, and likely enthusiasm from Economist data editor Cukier. The tension between the authors ultimately delivers a kind of intellectual Feng Shui for the purveyors of big data.
People visit PopMatters for cultural insight. Services from Google to Alexa know much about the traffic running through our site, and about those who traffic in popular culture. You are consuming data as you read this, and you left a series of digital bread crumbs on your way here that can be used to reconstruct and interpret your motivations and your inclinations. When I’m the “you”, that doesn’t bother me much, because for the most part, organizations, from big companies to governments, are generally well-meaning and very often borderline dysfunctional. In other words, they don’t intend to harm us, and if they did, they would probably botch it up in some way. I find that comforting.
If Google continuously improves the accuracy of the ads it displays on sites that run Google ads, however, that should invoke a little fear. If healthcare practitioners can improve the odds of surviving a hospital visit and do it at a lower cost, so much the better for society. We should worry less about the state of an incremental erosion of privacy and convenience, and more about a Black Swan event in which a power rises intent on employing big data in a deliberately malevolent way.
Mayer-Schönberger and Cukier say little about the analog world: the analog world of politics and relationships that complements, tempers and creates context for the digitization of everything. The analog world is the reason that digitization matters at all. Through big data we attempt to build better models of that world so that we can understand it, perhaps make it more meaningful. The analog sources the questions. If maliciousness arises, will we see it coming as we ask our computers not where evil hides, but who will win this season of American Idol?
The authors have the final word: “Big data is a resource and a tool. It is meant to inform, rather than explain; it points us toward understanding, but it can still lead to misunderstanding, depending on how well or poorly it is wielded. And however dazzling we find the power of big data to be, we must never let its seductive glimmer blind us to its inherent imperfections.”