Lies, Damn Lies and Statistics

Uncharted puts the "big data" of Google Books through the lens of a tool called Ngram, but the meaning of the results, and even their validity, turn a great read into a cautionary tale.

Big data isn’t just limited to the population’s phone metadata or buying history at Amazon.com. It isn’t just numbers or short pieces of text that can be honed and squeezed into the rows and columns of a database. Big data also arises from words, word upon word, that build first books, then libraries. Such is the big data of Google Books.

If you are a graduate student, you are likely myopically focused on some subset of a topic, an intellectual island from which you can both establish your academic street cred and, hopefully, discover something interesting that will catapult you toward enough recognition that you can publish for years and reach tenure.

The topic of verb evolution is one such island of academic pursuit. It was the island upon which authors Erez Aiden and Jean-Baptiste Michel started their explorations. And like the isolated domains of Stephen Jay Gould’s punctuated equilibrium theory of evolution, these two graduate students found inspiration chasing the cladistics of verbs through their metamorphosis under cultural influence, neglect and laziness. That journey, and the subsequent revelations made by analyzing Google Books, are the topic of Uncharted: Big Data as a Lens on Human Culture.

Now big data isn’t just about data. Big data is about the patterns revealed in the data, patterns that can sometimes lead to a buying recommendation at Amazon that seems rather prescient, or a phone call at dinner time from a political campaign or not-for-profit that has profiled you into its target demographic. Unlike politics or commerce, where the outcome drives the questions, academia can be more free flowing, and less relevant.

Uncharted explores the use of Google Books and its Ngram feature, which looks at the frequency of words across Google Books’ repository of data and metadata. Uncharted reveals very little that could not be ascertained from history, and in fact, the team spends more time reinforcing history with their data than using their data to unveil new aspects of history. Although their background information about each query would be inconsequential to real science, it is the most relevant and engaging aspect of the book.

References to the United States as a collective whole, for instance, didn’t gain steam until around 1880, several years after the Civil War. Historian James McPherson asserted it happened after the end of the Civil War. He was right, but going viral in the 1800s wasn’t the same as it is now. So his “after” use of “The United States is” (vs. “The United States are”) took a few years to find its way into common use.
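To make the idea of an Ngram query concrete, here is a minimal sketch of counting how often a phrase appears per year, as a share of all phrases of the same length. It is an illustration only, not Google's implementation: the `toy_corpus` dictionary, the function names, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Map each year to the share of that year's n-grams equal to `phrase`.

    `corpus` is a hypothetical {year: [text, ...]} dictionary; a real
    corpus would be vastly larger and tokenized more carefully.
    """
    target = tuple(phrase.split())
    n = len(target)
    result = {}
    for year, texts in corpus.items():
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text.split(), n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result


# Invented sentences, used only to show the mechanics.
toy_corpus = {
    1850: ["the United States are divided", "these United States are large"],
    1900: ["the United States is growing", "the United States is one nation"],
}

print(yearly_frequency(toy_corpus, "United States is"))
print(yearly_frequency(toy_corpus, "United States are"))
```

Note what the sketch does and does not provide: it reports when one phrasing overtakes the other, but nothing in the counts says why.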

Likewise, Hitler and cultural henchman Goebbels were successful in eliminating references to Marc Chagall and Paul Klee from German literature during World War II. Degenerate artists had no place in the Third Reich, but the artists lived, and went on to outshine the society that tried to eliminate them.

Indeed, it’s interesting trivia that an oppressive regime can keep references out of official sources, but a look at Syria or North Korea today would offer the same insight. And if you look back at the reign of Egyptian Pharaoh Akhenaten and his monotheistic bent toward worshiping the sun disk Aten, the reference itself, thousands of years later, belies the efforts of his successors to have him erased from history, and his own effort to eliminate the traditional religion. Although Akhenaten is an outlier when it comes to data, we still not only know of him, but of what came before and after.

Big data promises not just to reinforce what is known, but also to anticipate it. Although Aiden and Michel demonstrate admirably that the Hollywood Ten took a downturn in mentions during the McCarthy-era House Un-American Activities Committee hearings and subsequent years, they could not have anticipated who would be dropped from history until the events actually occurred. In some ways, this analysis is like religious history where someone fulfills an ancient prophecy. How much easier it is to align an action to a forecast in the past than it is to make a new forecast that comes true.

It may be academically worthy, even fulfilling, to prove a point asserted by intuition rather than data-backed insight, but a better question, and one this data set can’t answer, is what was it that made Marc Chagall famous in the first place? As with much of today’s emphasis on big data, questions are limited by available data. Fame, even though its outward manifestations, even its duration, can be measured historically, says nothing about the attributes that lead to adoration first by a few, and then to an attachment to the cultural memory of a society.

The authors point out that cultural osmosis occurs even under the most oppressive circumstances. Lady Gaga is probably known to more North Koreans than the country’s leaders might like, but the porous fabric that is air and Internet can’t stop information from moving, any more than it could stop exiled artists from painting in other countries, keeping both them and their names alive while former compatriots sought their elimination. Could Hitler have erased all traces of Chagall from cultural memory had the Third Reich succeeded? We will never know. There is no data to support any hypothesis about a past that didn’t occur, nor any future that has yet to happen.

Big data then isn’t just driven by a good data source, it is also compelled by good, meaningful questions. And Uncharted doesn’t do itself any favors when it trivializes fame even more than the idea trivializes itself, by observing that at the peak of his popularity, President Bill Clinton was as popular as lettuce.

“What does it mean when we make a million discoveries, but can’t explain a single one?” That phrase from early in the book demonstrates that the authors understand the issues of big data, but it doesn’t dissuade them from hundreds of pages more of “insights” that suggest associations, and maybe correlations, but little of causation that isn’t already known. (Chagall’s name disappeared because Hitler wanted it to. There is no mystery in that causation.)

Uncharted takes the reader on an uncharted foray into using big data to ask little questions. The pair tells compelling stories that put their analysis in context. At the end of the book, I asked myself if Uncharted needed its data, and I concluded that the storytelling was more engaging than the analysis. The Heath brothers of Made to Stick fame suggest that good stories benefit from compellingly positioned data. I did not find the data all that compelling. In fact, in my own experiments, I found the entire approach flawed (see the analysis below).

Between the attempt to hype the idea of “culturomics” and the references to Isaac Asimov’s fictional “psychohistory”, the book moves into speculative areas of future prediction, asking if society is driven by laws that can be discovered so that models and predictions can be made. Even if that were to be true, would it benefit society? Would knowing about toy shortages before they occurred eliminate them, thus removing some core emotional element from the scarcity equation of human existence? Would people behave in ways they are told they should likely behave, because evidence now supports the rightness of the behavior versus any individual, internal inclination to act differently? Would humankind benefit from being statistically smoothed? Those are the questions that must be asked if humankind is to protect itself from where big data and its adherents are leading.

Read the book with skepticism and some trepidation. Big data could be taking the entire world down a path of unsubstantiated conclusions based on incomplete data, kludged algorithms, poor questions and overreactions to “fact” and “evidence”. Read between the lines, because the lines here are not as straight as scientists would like them to appear. There is more insight in the observations about process than in the conclusions drawn from the data. Read the book not for the conclusions drawn by its authors, but for the hair-raising questions that it raises.

Rather than playing with old texts in Ngrams, you may find a more contemporary, though no less trivial, use of statistics by monitoring what’s trending on Twitter. At least then you can talk about something statistically likely to be of interest to your children, neighbors or co-workers.

A Reviewer’s Experiment

Readers can explore their own Ngrams at Google. I wasn’t impressed. I tried “T.S. Eliot” and “Shakespeare,” but the system found nothing for “T.S. Eliot”. “Eliot”, yes, but which “Eliot”? I ran “James Joyce” and “Shakespeare” and it worked fine. “e. e. cummings” returned the same null results as “T.S. Eliot”. It seems that Ngrams have a problem with abbreviated names.
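One possible explanation for null results on abbreviated names, offered as a guess rather than a description of how Google's tool actually works, is a mismatch between how the indexed text and the query are split into tokens. The sketch below assumes a hypothetical index that separates punctuation from words and a query that is split only on spaces; the function names and the sample sentence are invented for illustration.

```python
import re


def split_on_whitespace(text):
    """Naive tokenizer: 'T.S. Eliot' becomes ['T.S.', 'Eliot']."""
    return text.split()


def split_on_punctuation(text):
    """Tokenizer that separates punctuation: ['T', '.', 'S', '.', 'Eliot']."""
    return re.findall(r"\w+|[^\w\s]", text)


def contains_phrase(doc_tokens, query_tokens):
    """True if query_tokens occurs as a consecutive run inside doc_tokens."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


# Hypothetical indexed sentence, tokenized with punctuation split off.
indexed = split_on_punctuation("The poet T. S. Eliot admired Shakespeare")

# Query tokenized differently from the index: the abbreviation never matches.
mismatched = contains_phrase(indexed, split_on_whitespace("T.S. Eliot"))

# Query tokenized the same way as the index: the name is found.
matched = contains_phrase(indexed, split_on_punctuation("T.S. Eliot"))

# A plain single-word name matches under either tokenizer.
plain = contains_phrase(indexed, split_on_whitespace("Shakespeare"))

print(mismatched, matched, plain)
```

Under these assumptions, names without periods behave the same either way, which would be consistent with “James Joyce” working while “T.S. Eliot” does not.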

I also ran Clinton and Shakespeare, and although the phrase Clinton briefly eclipsed Shakespeare in the ’90s, it wasn’t President Clinton, Bill Clinton or William Jefferson Clinton. It was the amalgamation of Clintons, including Hillary, healthcare reform and It Takes a Village. When “William Shakespeare” is run against “Bill Clinton”, however, Clinton comes out on top, which demonstrates data anomalies, not any conclusion about popularity.

The correlation between “Shakespeare” and “William Shakespeare” seems lacking, which is clearly not something derived from the seemingly more intelligent Google search engine, which knows the two are the same person. I ran “Shakespeare” vs. “William Shakespeare” and concluded two things. First, the data sets weren’t rationalized so that the last name and full name of a major literary figure, one with very little historical competition for name recognition given the lack of heirs, were treated as the same data object. Second, people tend not to use Shakespeare’s full name (to prove my own point) when writing about him. Interesting, but I don’t think I will be awarded a Ph.D. for the insight.

I then ran my name, as I always use it in print: “Daniel W. Rasmus” and nothing was returned. However, running just “rasmus” did return results, and I was in them, along with the rock band The Rasmus and programming expert Rasmus Lerdorf. If Uncharted is built upon imprecise inputs and the inability to process abbreviations, then even the trivial conclusions asserted in it need to be seriously questioned.

As a final test, I ran queries using the exact phrases in the appendix, and the results turned out to match those in the book, which makes me skeptical about how refined Google’s Ngram engine is, and how well it parsed the intent of the questions being posed in Uncharted.