The Office of the Director of National Intelligence (ODNI) is building a computer system capable of automatically analyzing the massive quantities of data gathered across the entire intelligence community and extracting information on specific entities and their relationships to one another. The system, called Catalyst, is part of a larger ODNI effort to create software and computer systems capable of knowledge management, entity extraction and semantic integration, enabling greater analysis and understanding of complex, multi-source intelligence throughout the government.
The intelligence community has been working for years to develop software and analytical frameworks capable of large-scale data analysis and extraction. Technological advances have now made it possible for spy agencies not only to capture the enormous amount of data flowing through public and private networks around the world, but also to parse, contextualize and understand the intelligence that is being gathered. Automated software programs are now capable of integrating data into semantic systems, providing context and meaning to names, dates, photographs and practically any other kind of data.
Many agencies within the intelligence community have already created systems to do this sort of semantic integration. The Office of Naval Intelligence uses a system called AETHER “to correlate seemingly disparate entities and relationships, to identify networks of interest, and to detect patterns.” The NSA runs a program called APSTARS that provides “semantic integration of data from multiple sources in support of intelligence processing.” The CIA has a program called Quantum Leap that is designed to “find non-obvious linkages, new connections, and new information” from within a dataset. ODNI itself has initiated several similar programs, including BLACKBOOK and the Large Scale Internet Exploitation Project (LSIE).
Catalyst is an attempt to create a unified system capable of automatically extracting complex information on entities as well as the relationships between them while contextualizing this information within semantic systems. According to its specifications, Catalyst will be capable of creating detailed histories of people, places and things while mapping the interrelations that detail those entities’ interactions with the world around them. A study conducted by IARPA states that Catalyst is designed to incorporate data from across the entire intelligence community, creating a centralized repository of available information gathered from all agencies:
Many IC organizations have recognized this problem and have programs to extract information from the resources, store it in an appropriate form, integrate the information on each person, organization, place, event, etc. in one data structure, and provide query and analysis tools that run over this data. Whereas this is a significant step forward for an organization, no organization is looking at integration across the entire IC. The DNI has the charter to integrate information from all organizations across the IC; this is what Catalyst is designed to do with entity data. The promise of Catalyst is to provide, within the security constraints on the data, access to “all that is known” within the IC on a person, organization, place, event, or other entity. Not what the CIA knows, then what DIA knows, and then what NSA knows, etc., and put the burden on the analyst to pull it all together, but have Catalyst pull it all together so that analysts can see what CIA, DIA, NSA, etc. all know at once. The value to the intelligence mission, should Catalyst succeed, is nothing less than a significant improvement in the analysis capability of the entire IC, to the benefit of the national security of the US.
To fully grasp the capabilities of such a system, it is important to understand the concepts of “semantic integration” and “entity extraction” that Catalyst will perform. Using an example described in the IARPA study, we will follow data through the stages of processing in a Catalyst system:
For example, some free text may include “… Joe Smith is a 6’11” basketball player who plays for the Los Angeles Lakers…” from which the string “Joe Smith” may be delineated as an entity of class Athlete (a subclass of People) having property Name with value JoeSmith and Height with value 6’11” (more on this example below). Note that it is important to distinguish between an entity and the name of the entity, for an entity can have multiple names (JoeSmith, JosephSmith, JosephQSmith, etc.).
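To make the distinction between an entity and its names concrete, the extracted record from the study's example could be represented roughly as follows. This is only an illustrative sketch in Python: the `Entity` class, the `SUBCLASS_OF` table and the `is_a` helper are hypothetical names chosen here, not part of Catalyst or the IARPA study, and the sketch models the output of extraction rather than the extraction step itself.

```python
from dataclasses import dataclass, field

# Hypothetical class hierarchy (child class -> parent class), taken from
# the study's example in which Athlete is a subclass of People.
SUBCLASS_OF = {"Athlete": "People"}


@dataclass
class Entity:
    """One extracted entity: a class, its properties, and all known names."""
    entity_class: str
    properties: dict = field(default_factory=dict)
    names: set = field(default_factory=set)

    def is_a(self, cls: str) -> bool:
        """Return True if the entity's class is cls or a subclass of cls."""
        current = self.entity_class
        while current is not None:
            if current == cls:
                return True
            current = SUBCLASS_OF.get(current)
        return False


# A single entity can carry several names, so the names are stored
# separately from the entity's identity.
joe = Entity(
    entity_class="Athlete",
    properties={"Height": "6'11\""},
    names={"JoeSmith", "JosephSmith", "JosephQSmith"},
)
```

Keeping the names in a set on the entity, rather than using a name as the entity's key, mirrors the study's point that one entity may be known by several names.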
Once entities and their associated relationship values are determined, the information is then integrated into a knowledge base to produce a semantic graph:
To continue the example, one entry in the knowledge base is the entity of class Athlete with (datatype property) Name having value JoeSmith, another is the entity of class SportsFranchise with Name having value Lakers, and another is an entity of class City having value LosAngeles. If each of these is viewed as a node in a graph, then an edge connecting the node (entity) with Name JoeSmith to the node with Name Lakers is named MemberOf and the edge connecting the node with Name Lakers to the node with Name LosAngeles is named LocatedIn. Such edges correspond to relationships (object properties) and have a direction; for example, JoeSmith is a MemberOf the Lakers, but the Lakers are not a MemberOf JoeSmith (there may be an inverse relationship, such as HasMember, that is between the Lakers and JoeSmith).
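The graph described in this passage can be sketched as a small directed, labeled graph. The Python below is a minimal illustration under stated assumptions: the `SemanticGraph` class and its method names are hypothetical, and for simplicity nodes are keyed by a single name, which a real knowledge base would avoid given that entities can have multiple names.

```python
from collections import defaultdict


class SemanticGraph:
    """A tiny directed graph whose edges are named relationships."""

    def __init__(self):
        self.nodes = {}                # entity name -> entity class
        self.edges = defaultdict(set)  # (subject, relation) -> set of objects

    def add_entity(self, name, entity_class):
        self.nodes[name] = entity_class

    def relate(self, subject, relation, obj):
        """Add a directed edge subject --relation--> obj."""
        self.edges[(subject, relation)].add(obj)

    def has_edge(self, subject, relation, obj):
        return obj in self.edges.get((subject, relation), set())


graph = SemanticGraph()
graph.add_entity("JoeSmith", "Athlete")
graph.add_entity("Lakers", "SportsFranchise")
graph.add_entity("LosAngeles", "City")

graph.relate("JoeSmith", "MemberOf", "Lakers")
graph.relate("Lakers", "LocatedIn", "LosAngeles")
# The inverse relationship is a separate edge pointing the other way.
graph.relate("Lakers", "HasMember", "JoeSmith")
```

Because each edge is stored under its subject, asking whether the Lakers are a MemberOf JoeSmith returns false even though the reverse edge exists, which is the directionality the study describes.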
Data that has been extracted and integrated can then produce patterns that determine unknown relations between an entity and other entities that may be of concern to a particular intelligence agency:
Another simple pattern could be: JoeSmith Owns Automobile, or Person Owns an instance of the class Automobile with Manufacturer Lexus and LicensePlate VA-123456 or even JoeSmith has-unknown-relationship-with an instance of the class Automobile with Manufacturer Lexus and LicensePlate VA-123456. In these last three examples, one of the entities or the relationship is uninstantiated. Note that JoeSmith Owns an instance of the class Automobile with Manufacturer Lexus and LicensePlate VA-123456 is not a pattern, for it has no uninstantiated entities or relationships. A more complex pattern could be: Person Owns Automobile ParticipatedIn Crime HasUnknownRelationshipWith Organization HasAffiliationWith TerroristOrganization. Any one or more of the entities and the has-unknown-relationship-with relationship (but not all) can be instantiated and it would still be a pattern, such as JoeSmith Owns Automobile ParticipatedIn Crime PerpetratedBy Organization HasAffiliationWith HAMAS.
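The idea of a pattern with uninstantiated entities or relationships can be illustrated with a simple matcher over subject-relationship-object triples. This is a sketch only, written in Python under several assumptions: the `Var` marker, the `CLASS_OF` table, the sample `TRIPLES` and the function names are all hypothetical, and the matcher handles only single-step patterns rather than the longer chains quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Var:
    """An uninstantiated slot, optionally restricted to an entity class."""
    entity_class: Optional[str] = None


# Hypothetical knowledge base: entity classes and fully instantiated facts.
CLASS_OF = {
    "JoeSmith": "Person",
    "Lexus VA-123456": "Automobile",
    "Lakers": "SportsFranchise",
}
TRIPLES = [
    ("JoeSmith", "Owns", "Lexus VA-123456"),
    ("JoeSmith", "MemberOf", "Lakers"),
]


def is_pattern(triple):
    """A triple is a pattern only if at least one slot is uninstantiated."""
    return any(isinstance(slot, Var) for slot in triple)


def slot_matches(slot, value):
    """Concrete slots must match exactly; a Var matches by class, if given."""
    if isinstance(slot, Var):
        return slot.entity_class is None or CLASS_OF.get(value) == slot.entity_class
    return slot == value


def match(pattern, triples):
    """Return every stored triple that is consistent with the pattern."""
    return [t for t in triples
            if all(slot_matches(s, v) for s, v in zip(pattern, t))]


# The three simple patterns from the study's example:
owns_some_car = match(("JoeSmith", "Owns", Var("Automobile")), TRIPLES)
who_owns_car = match((Var("Person"), "Owns", "Lexus VA-123456"), TRIPLES)
unknown_relation = match(("JoeSmith", Var(), "Lexus VA-123456"), TRIPLES)
```

In this toy knowledge base all three patterns resolve to the same stored fact, while the fully instantiated triple is not a pattern because it leaves nothing to be discovered.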
While this example provides only a limited view of Catalyst's functionality, it nonetheless helps to demonstrate the potential capabilities of the system. Far more detailed explanations of the system, as well as a useful overview of similar government systems across the intelligence community, are provided in IARPA’s 122-page study.