The future of automated textual analysis is Crimson Hexagon, a patent-pending text-analysis technology that lets users define the questions they want to ask and crawl the blogosphere (or any text-based source) for fast, accurate answers. The technology was created under the aegis of Harvard University Professor Gary King.
I met with the new company’s CEO this week to learn more about the group’s parsing technology and underlying statistical models. Some UN colleagues and I are particularly interested in the technology’s potential application to conflict monitoring and analysis. At present, early warning units within the UN, and other international (regional) organizations such as the OSCE, use manual labor to collect relevant information from online sources. Most units employ full-time staff for this, often meaning that 80% of an analyst’s time is actually used to collect pertinent articles and reports, leaving only 20% of the time for actual analysis, interpretation and policy recommendations. We can do better. Analysts ought to be spending 80% of their time analyzing.
Crimson Hexagon is of course not the first company to carry out automated textual analysis. Virtual Research Associates (VRA) and the EC’s Joint Research Center (JRC) have both been important players in this space. VRA developed GeoMonitor, a natural language parser that reads the headlines of Reuters and AFP news wires and codes “who did what, to whom, where and when?” for each event reported by the two media companies. According to an independent review of the VRA parser by Gary King and Will Lowe (2003),
The results are sufficient to warrant a serious reconsideration of the apparent bias against using events data, and especially automatically created events data, in the study of international relations. If events data are to be used at all, there would now seem to be little contest between the machine and human coding methods. With one exception, performance is virtually identical, and that exception (the higher propensity of the machine to find “events” when none exist in news reports) is strongly counterbalanced by both the fact that these false events are not correlated with the degree of conflict of the event category, and by the overwhelming strength of the machine: the ability to code huge numbers of events extremely quickly and inexpensively.
However, as Gary King noted when we met this month, VRA’s approach faces some important limitations. First, the parser reads only the headline of each newswire. Second, adding new media sources such as the BBC requires significant investment in adjusting the parser. Third, the parser cannot handle languages other than English.
The JRC has developed the European Media Monitor (EMM). Unlike VRA’s tool, EMM is based on a keyword-search algorithm, much like a web search engine: it crawls online news media for keywords and places each article into a corresponding category, such as terrorism. The advantage of this approach over VRA’s is that EMM can parse thousands of different news sources, in many different languages. The JRC recently set up an “African Media Monitor” for the African Union’s Continental Early Warning System (CEWS). This approach nevertheless faces limitations, since analysts still need to read each article to understand the nature of the event it reports.
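To make the keyword-search approach concrete, here is a minimal Python sketch of an EMM-style categorizer. The category names and keyword lists below are invented for illustration; EMM’s actual categories and matching rules are more sophisticated.

```python
# A minimal sketch of keyword-based article categorization, in the
# spirit of EMM. The categories and keyword lists are hypothetical.
CATEGORIES = {
    "terrorism": {"bomb", "attack", "hostage"},
    "health": {"outbreak", "epidemic", "vaccine"},
}

def categorize(article: str) -> list:
    """Return every category whose keywords appear in the article."""
    words = set(article.lower().split())
    return [cat for cat, keywords in CATEGORIES.items() if words & keywords]

print(categorize("Hostage crisis after bomb attack in the capital"))
# -> ['terrorism']
```

The limitation described above is visible here: the categorizer tells you an article mentions terrorism, but an analyst still has to read it to learn what actually happened.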
Google.org is also pursuing text-based parsing. This initiative stems from Larry Brilliant’s TED 2006 prize to expand the Global Public Health Information Network (GPHIN) for the purposes of prediction and prevention:
Rapid ecological and social changes are increasing the risk of emerging threats, from infectious diseases to drought and other environmental disasters. This initiative will use information and technology to empower communities to predict and prevent emerging threats before they become local, regional, or global crises.
Larry’s idea led to the new non-profit InSTEDD, but last time I spoke with the team, they were not pursuing this initiative. In any case, I wouldn’t be surprised if Google were to express an interest in buying out Crimson Hexagon before year’s end. Hexagon’s immediate clients are private-sector companies that want to monitor, in real time, their brand perception as reported in the blogosphere. The challenge?
115 million blogs, with 120,000 more added each day. As pundits proclaim the death of email, social web content is exploding. Consumers are generating their own media through blogs and comments, social network profiles and interactions, and myriad microcontent publishing tools. How do we begin to know and accurately quantify the relevant opinion that’s out there? How can we get answers to specific questions about online opinion as it relates to a particular topic?
The accuracy and reliability of Crimson Hexagon are truly astounding. Equally remarkable is the fact that the technology developed by Gary King’s group parses every word in a given text. How does the system work? Say we were interested in monitoring the Iranian blogosphere, as in the Berkman Center’s recent study. If we wanted to know what liberal bloggers think about riots (hypothetically taking place now in Tehran), we would select 10-30 examples of pro-democratic blog entries addressing the ongoing riots. These would then be fed into the system to teach the algorithm what to look for. A useful analogy that Gary likes to give is speech recognition.
The Crimson Hexagon parser uses a stemming approach, meaning that every word in a given text is reduced to its root. For example, “rioting”, “riots”, and “rioters” are all reduced to “riot”. The technology creates a vector of word stems to characterize each blog entry so that thousands of Iranian blogs can be automatically compared. By feeding the algorithm a sample of 10 or more blog entries reflecting, say, positive perceptions of rioting in Tehran (were this happening now), the technology could quantify liberal Iranian bloggers’ changing opinion on the rioting in real time by aggregating the stem vectors.
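The stemming-and-vector idea can be illustrated with a short Python sketch. The suffix-stripping rules below are a crude stand-in for a real stemmer (such as Porter’s); Crimson Hexagon’s actual stemming rules are not public.

```python
from collections import Counter

# Crude suffix-stripping rules -- a stand-in for a real stemmer such
# as Porter's; Crimson Hexagon's actual rules are not public.
SUFFIXES = ("ing", "ers", "er", "s")

def stem(word: str) -> str:
    word = word.lower().strip(".,!?")
    for suffix in SUFFIXES:
        # Only strip if a reasonable root (3+ letters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_vector(text: str) -> Counter:
    """Reduce a blog entry to a bag of stem counts."""
    return Counter(stem(w) for w in text.split())

print(stem_vector("Rioters were rioting as the riots spread")["riot"])
# -> 3
```

Once every entry is a stem vector, comparing thousands of blogs reduces to arithmetic on those vectors rather than reading the posts themselves.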
Crimson Hexagon is truly pioneering a fundamental shift in the paradigm of textual analysis. Instead of trying to find the needle in the haystack, the technology seeks to characterize the haystack itself with astonishing reliability, so that any change in the haystack (amount of hay, density, structure) is immediately picked up by the parser in real time. Furthermore, the technology can parse any language, say Farsi, as long as the sample blogs provided are in Farsi. The system has also returned highly reliable results with fewer than 10 samples, and even when the actual blog entry had fewer than 10 words. Finally, the parser is by no means limited to blog entries; any piece of text will do.
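To see what “characterizing the haystack” might look like in code, here is a simplified sketch: given a handful of hand-labeled example entries per opinion category, it assigns each new entry to the nearest category centroid (by cosine similarity over stem vectors) and reports each category’s share of the corpus. King’s published approach estimates category proportions directly rather than classifying entries one by one, so treat this as an illustrative stand-in, not the actual algorithm; the category names and vectors are hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-stems vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Sum a category's example vectors into one profile."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def estimate_proportions(corpus, examples):
    """Label each entry with its nearest category centroid, then
    report each category's share of the whole corpus."""
    centroids = {cat: centroid(vs) for cat, vs in examples.items()}
    labels = Counter(
        max(centroids, key=lambda c: cosine(entry, centroids[c]))
        for entry in corpus
    )
    return {cat: labels[cat] / len(corpus) for cat in centroids}

# Hypothetical stem vectors: labeled training examples and an
# unlabeled corpus of blog entries.
examples = {
    "pro-riot": [Counter({"riot": 2, "freedom": 1})],
    "anti-riot": [Counter({"calm": 2, "order": 1})],
}
corpus = [Counter({"riot": 1}), Counter({"calm": 1}),
          Counter({"riot": 1, "freedom": 1})]
print(estimate_proportions(corpus, examples))
```

The output is not a list of flagged posts but a distribution over opinion categories, which is exactly the haystack-level summary described above: tracked over time, shifts in those proportions are the signal.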
The potential for significantly improving conflict monitoring and analysis is, in my opinion, considerable. Imagine parsing Global Voices in real time, or ReliefWeb and the weekly situation reports of field-based agencies worldwide. Crimson Hexagon’s CEO immediately saw the potential during our meeting. We therefore hope to carry out a joint pilot study with colleagues of mine at the UN and the Harvard Humanitarian Initiative (HHI). Of course, as with any early warning initiative, the link to early response will dictate the ultimate success or failure of this project.