Reading through computer eyes
This is a reprint of an earlier post of mine on the Resilience Science blog. It is inspired by the TED video above, where the authors nicely explain what it is all about.
A 1-gram is a string of characters uninterrupted by a space in a text; it may be a word, a number or a combination of both. An N-gram is a sequence of N such 1-grams, for example a phrase. The concept of N-grams simplifies the application of statistical methods to assess the frequency of a word or a phrase in a body of text. N-gram statistical analyses have been around for years, but recently Jean-Baptiste Michel and collaborators had the opportunity to apply N-gram text analysis techniques to the massive Google Books collection of digitized books. They analyzed over 5 million documents, which they estimate to be about 4% of all books ever published, and published their work in Science [doi].
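To make the idea concrete, here is a minimal Python sketch of N-gram counting. It assumes the simplest possible tokenization (splitting on whitespace); the function names `ngrams` and `ngram_frequency` and the sample sentence are my own illustrations, not the pipeline used for the Google Books data.

```python
from collections import Counter

def ngrams(text, n):
    """Return the n-grams in `text` as tuples of whitespace-separated tokens."""
    tokens = text.split()  # assumption: a token is any run of non-space characters
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequency(text, n):
    """Map each n-gram to its share of all n-grams of length n in `text`."""
    grams = ngrams(text, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}

# Invented sample sentence, for illustration only.
sample = "the reef bleached and the reef recovered"
bigram_freq = ngram_frequency(sample, 2)
```

Dividing by the total number of N-grams, rather than reporting raw counts, is what makes frequencies comparable between texts of different sizes.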
Exploring huge amounts of text, more than any single person could read, provides the opportunity to trace the use of words over time. This allows researchers to track the impact of events on word use and even the evolution of language, grammar and culture. For example, by counting the words used in English books, the team found that in the year 2000 the English lexicon had over one million words, and that it has been growing by about 8,500 words per year. Similarly, they were able to track word fads, for example the changes in the regular or irregular forms of verb conjugations over time (e.g. burned vs burnt). More interestingly, based on particular events and famous names, they found that our collective memory, as recorded in books, has both a short-term and a long-term component; we are forgetting our past faster than before, but we are also learning faster when it comes to, for example, the adoption of technologies.
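Tracing a word over time comes down to computing, for each year, how often the word occurs relative to everything published that year. The sketch below shows this for burned vs burnt. The corpus is a toy one I invented for illustration, and `yearly_frequency` is a hypothetical helper, not part of any Google Books tool.

```python
from collections import Counter

# Hypothetical toy corpus: (publication year, text) pairs, invented for illustration.
toy_corpus = [
    (1900, "the candle burnt low and the toast burnt too"),
    (1900, "he burned the letter"),
    (2000, "the fire burned all night"),
    (2000, "she burned the map and burnt her hand"),
]

def yearly_frequency(corpus, word):
    """Return {year: occurrences of `word` / total words published that year}."""
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    # Skip years with no words at all to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

burnt = yearly_frequency(toy_corpus, "burnt")
burned = yearly_frequency(toy_corpus, "burned")
```

In this made-up example the share of "burnt" falls between 1900 and 2000 while the share of "burned" rises, which is the kind of pattern the frequency curves make visible.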
The options for reading books with machine eyes do not end there. Censorship during the German Nazi regime was identified by comparing the frequency of authors' names in the German and English corpora. The researchers could detect a fingerprint of the suppression of a person's ideas in the language corpus.
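One simple way to picture this comparison is as a ratio of a name's frequency in one corpus to its frequency in another, computed inside and outside the period of interest. The following is only a rough sketch under that assumption, not the measure used in the paper; the frequencies are invented and `period_mean` and `frequency_ratio` are hypothetical helpers.

```python
from statistics import mean

def period_mean(series, start, end):
    """Mean frequency over the years of `series` that fall in [start, end]."""
    values = [freq for year, freq in series.items() if start <= year <= end]
    return mean(values) if values else 0.0

def frequency_ratio(series_a, series_b, start, end):
    """Mean frequency in corpus A divided by mean frequency in corpus B.

    Values well below 1 mean the name is much rarer in corpus A than in
    corpus B during those years.
    """
    baseline = period_mean(series_b, start, end)
    return period_mean(series_a, start, end) / baseline if baseline else float("nan")

# Invented frequencies (occurrences per million words) for a single name.
german = {1925: 4.0, 1930: 4.0, 1935: 0.5, 1940: 0.5, 1950: 3.0}
english = {1925: 2.0, 1930: 2.0, 1935: 2.0, 1940: 2.0, 1950: 2.0}

before = frequency_ratio(german, english, 1920, 1932)  # baseline before the regime
during = frequency_ratio(german, english, 1933, 1945)  # years of the Nazi regime
```

With these toy numbers the ratio drops from 2.0 before 1933 to 0.25 during 1933 to 1945. Using the English corpus as a reference helps separate a drop specific to German books from a general decline in interest in the person.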
The researchers call this quantitative analysis of our historic knowledge and culture, based on this huge amount of data, culturomics. They plan for further research to incorporate newspapers, manuscripts, artwork, maps and other human creations. Possible future applications are the development of methods for historical epidemiology (e.g. influenza peaks), the analysis of conflicts and wars, the evolution of ideas (e.g. feminism), and I think, why not ecological regime shifts?
Above you can see the frequency of some of the regime shifts we are working with in the English corpus. Soil salinization and lake eutrophication appear in the 1940s and 1960s respectively, probably with the first descriptions of such shifts. Similarly, coral bleaching takes off during the 1980s, when reef degradation in the Caribbean basin began to be documented. The concept of regime shift has also been used more and more since the 1980s, probably not only to describe ecological shifts but also political and managerial transitions.
Although the data may be noisy, the frequency of shock events may be tracked as well. Here, for example, we plot "oil spill" and see the peak corresponding to the case of January 1989 in Floreffe, Pennsylvania. Note that it does not show the oil spill in the Gulf of Mexico last year, because the database only runs through 2008.