The Components of SPEED

Global News Archive

To meet the information needs of the SPEED project, we assembled a comprehensive set of global news sources for the post-1945 period. Since 2006 our SEARCH program has crawled news websites (over 5,000 news feeds in 120 countries) several times each day, scraping news reports and storing them on our server; we currently add roughly 100,000 articles each day. Acquiring news sources from before 2006 required a different approach. We secured the complete digitized historical archives of the New York Times and Wall Street Journal for the 1946-2006 period, but these were not deemed to have sufficient international coverage. We therefore secured microfiche and microfilm records for two intelligence-agency news services: the Foreign Broadcast Information Service (CIA) and the Summary of World Broadcasts (BBC). These contain millions of news articles and broadcasts translated into English from scores of languages, drawn from tens of thousands of news outlets and covering developments in every country in the world. To access this information, over 1,000 reels of microfilm and 50,000 microfiche had to be scanned and digitized, and each individual report had to be "segmented" and joined with its header information (e.g., source and date). Processing this information has therefore required a multi-year effort, but it yielded a highly inclusive global news archive of over 40M reports that grows daily.
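The daily crawl described above implies a deduplication step before new reports enter the archive. A minimal sketch of that step, assuming hypothetical field names (`url`, `title`) and a simple hash-based key (the source does not describe SEARCH's internals):

```python
# Hypothetical sketch of a per-cycle ingest step: deduplicate newly
# scraped reports against previously stored ones before archiving.
import hashlib

def article_key(article: dict) -> str:
    """Stable key for a scraped report, derived from URL and headline."""
    raw = (article["url"] + "|" + article["title"]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def ingest(scraped: list[dict], seen_keys: set[str]) -> list[dict]:
    """Return only reports not already archived; record their keys."""
    fresh = []
    for article in scraped:
        key = article_key(article)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(article)
    return fresh

seen: set[str] = set()
batch = [
    {"url": "http://example.com/a", "title": "Protest in capital"},
    {"url": "http://example.com/a", "title": "Protest in capital"},  # duplicate feed item
    {"url": "http://example.com/b", "title": "Election results announced"},
]
new_reports = ingest(batch, seen)
print(len(new_reports))  # 2 unique reports stored
```

Keying on a hash rather than the raw URL keeps the seen-set compact when tracking tens of millions of reports.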

Classification Scheme

Our event classification scheme emerged from the needs of the SID project and our assessment of the capacity of news reports to meet those needs. Not every information need in a project such as SID is adequately covered by news reports, and some needs can be met by procedures more efficient than those SPEED requires (e.g., economic and demographic information). Thus, a great deal of time and effort was invested in identifying the types of information that could optimally be secured through an event analysis project such as SPEED. This process produced an event classification scheme that guided the subsequent phases of SPEED. The final version of the scheme includes events pertaining to such diverse topics as societal stability, human rights, electoral integrity, the supremacy of law, the security of property rights, the viability of governmental checks and balances, and the government's economic role. Each category within the scheme has a multi-tier ontology of relevant events that is reflected in the design of category-specific protocols.
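A multi-tier ontology of this kind can be represented as a nested mapping from categories to tiers to event types. In the sketch below, the top-level category names come from the text, but the sub-tiers and event types are invented for illustration; the actual ontology is not published in this section:

```python
# Hypothetical sketch of a multi-tier event ontology. Top-level category
# names are from the classification scheme; the tiers and leaf event
# types below them are invented stand-ins.
ONTOLOGY = {
    "societal_stability": {
        "political_expression": ["demonstration", "strike"],
        "political_violence": ["riot", "armed_attack"],
    },
    "electoral_integrity": {
        "vote_administration": ["ballot_fraud", "voter_intimidation"],
    },
}

def leaf_events(category: str) -> list[str]:
    """Flatten one category's tiers into its leaf event types."""
    return [event for tier in ONTOLOGY[category].values() for event in tier]

print(leaf_events("societal_stability"))
```

Structuring the scheme this way lets each category-specific protocol be generated from, and validated against, a single shared ontology.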

Classifying News Reports: The BIN Module

Assembling a global news archive and identifying relevant event categories are only preliminary steps in rigorously generating event data. An archive of over 40M news reports requires automated techniques to identify reports containing information about events that fall within the classification scheme and to sort them into the appropriate category (societal stability, electoral integrity, property rights, etc.). To do this we developed an automatic text categorization program (BIN). BIN uses statistical algorithms that draw on key words, word correlations, and semantic structures to identify and categorize relevant reports: it generates the probability that a news report belongs to a particular category within the classification scheme, and a report is assigned to a category if that probability is sufficiently high. Because reports often contain information on several different events, BIN can sort a single report into multiple category-specific bins. BIN's algorithms were developed by using thousands of human-categorized reports to "teach" the computer to recognize the semantic attributes that characterize reports belonging to a specific category, and they have proven to be very robust. Thresholds for inclusion were set relatively low, so as not to discard news reports with information on relevant events. Repeated tests examining random samples of discarded news reports (i.e., those not deemed relevant to any category within the classification scheme) suggest that BIN has a false negative rate of just 1%.
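The binning rule described above can be sketched as a simple thresholding step over per-category probabilities. The probabilities and threshold value below are stand-ins, not BIN's actual model outputs or settings:

```python
# Hypothetical sketch of BIN's binning rule: a report is routed to every
# category whose model probability clears a deliberately low threshold,
# so one report can land in several bins. The probabilities here stand
# in for the output of the trained statistical classifier.
THRESHOLD = 0.15  # illustrative; set low to minimize false negatives

def bin_report(category_probs: dict[str, float]) -> list[str]:
    """Return all categories whose probability meets the threshold."""
    return [cat for cat, p in category_probs.items() if p >= THRESHOLD]

probs = {
    "societal_stability": 0.82,
    "electoral_integrity": 0.21,
    "property_rights": 0.04,
}
print(bin_report(probs))  # ['societal_stability', 'electoral_integrity']
```

The multi-label design matters: a report describing election-day violence belongs in both the societal-stability and electoral-integrity bins, and a single-category assignment would lose one of those events.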

Text Annotation within Binned Reports: The EAT Module

Correctly identifying and electronically categorizing events is essential to generating event data in a project of SPEED's scope. But the sheer volume of text that must be processed - even with perfectly binned reports - poses formidable cognitive challenges for information extraction. To meet these challenges we developed an "event annotation tool" (EAT) that annotates binned news reports. EAT employs a variety of computational procedures rooted in natural language processing (NLP) to highlight text containing relevant information about events belonging to a specific event ontology; human-coded training data teach the computer which types of information are relevant. Achieving the requisite level of accuracy with a tool such as EAT requires an extended iterative process between computer-generated models and human coders. EAT is currently in an advanced developmental stage; when properly calibrated, its annotations will greatly enhance the efficiency, accuracy, and reliability of information extraction within SPEED.
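The annotation step can be illustrated with a toy version of the idea: flag the spans of a report that a relevance model would highlight for a human coder. Here a keyword set stands in for EAT's trained NLP models, and the cue terms are invented:

```python
# Hypothetical sketch of EAT-style annotation: mark the spans of a
# report judged relevant to an event category so a human coder can jump
# straight to them. A trivial keyword set stands in for the statistical
# NLP models trained on human-coded examples.
import re

RELEVANCE_CUES = {"riot", "protest", "clash"}  # stand-in for a learned model

def annotate(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, sentence) spans judged relevant."""
    spans = []
    for match in re.finditer(r"[^.]+\.", text):
        sentence = match.group()
        if any(cue in sentence.lower() for cue in RELEVANCE_CUES):
            spans.append((match.start(), match.end(), sentence.strip()))
    return spans

report = "Markets were calm. A riot erupted near the parliament. Rain is expected."
print(annotate(report))
```

Returning character offsets rather than just matched sentences is what lets an interface highlight the relevant text in place for the coder.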

Information Extraction: The EXTRACT Suite of Programs

To extract large sets of complex information from the millions of binned reports, we developed EXTRACT, a suite of electronic modules that facilitates the work of human operators. At its core is a set of category-specific protocols and a web-based interface that integrates the digitized news reports with those protocols. The protocols are carefully designed and pretested, and the human operators are extensively trained in both the protocols and EXTRACT's modules. EXTRACT also provides ongoing quality control: it can feed a set of pre-coded "test" articles to all operators and generate reports on each operator's accuracy and reliability by question set. Several modules extract information efficiently and accurately. A calendaring module helps establish the date on which an event occurred. A geocoder module uses NLP techniques in conjunction with two large geospatial databases containing 8M place names (GIS, GNIS) to identify the event's location. In addition, EXTRACT employs chaining technologies to link related events reported in different news stories (antecedent events, post-hoc reactions, etc.). NLP techniques are also used with lexicons of social group names (religious, ethnic, racial, tribal, nationality, insurgent, etc.) to capture the identities of event participants (initiators, targets, victims, etc.) and external facilitators/collaborators (other nations, NGOs, etc.).
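The geocoder's core lookup can be sketched as matching report text against a gazetteer, preferring longer place names so that, for example, "New York City" is not mistaken for "York". The toy gazetteer below is invented; the actual module matches against roughly 8M names:

```python
# Hypothetical sketch of a gazetteer lookup inside a geocoder module:
# scan a report for known place names (longest first) and return their
# coordinates. This three-entry table stands in for the ~8M-name
# geospatial databases used by the real module.
GAZETTEER = {
    "new york city": (40.71, -74.01),
    "york": (53.96, -1.08),
    "cairo": (30.04, 31.24),
}

def geocode(text: str) -> dict[str, tuple[float, float]]:
    """Return coordinates for each gazetteer name found in the text."""
    found = {}
    lowered = text.lower()
    for name in sorted(GAZETTEER, key=len, reverse=True):
        # Skip names already covered by a longer match (e.g. "york"
        # inside "new york city").
        if name in lowered and not any(name in hit for hit in found):
            found[name] = GAZETTEER[name]
    return found

print(geocode("Protesters gathered in New York City on Friday."))
```

A production geocoder would add disambiguation (many distinct places share a name), which is where the NLP context techniques mentioned above come in.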


Related Item: SPEED Project White Papers