Software searches Air Force documents for new meaning

By Ben Ames

ROME, N.Y. — Federal intelligence analysts could soon be using an advertising and marketing tool to catch terrorists before they strike.

Brand Dashboard 2.0 is a software program from Cymfony Inc. of Newton, Mass. Marketing agencies use it to track how advertisements change the public's perception of certain brands.

Under the terms of a new research grant, Cymfony engineers will look for ways to use a more powerful version to sort through mountains of documents and leads. Government researchers could use this data-mining tool to find connections between suspected terrorists and their colleagues, locations, and plans.

Cymfony won two SBIR (Small Business Innovation Research) grants from the Air Force Research Laboratory (AFRL) Information Directorate in Rome, N.Y. Each Phase-One grant is worth $100,000 and lasts nine months.

Data mining has three levels of sophistication — shallow extraction, intermediate extraction, and deep extraction — explains Carrie Pine, program manager in the Information and Intelligence Exploitation Division at AFRL.

In shallow extraction, a program searches the surface level of documents for simple strings of text such as names or dates. This is the level that most commercial products achieve in applications today, she says.
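As a rough illustration of shallow extraction, the Python sketch below pulls simple date and name strings off the surface of a text. The two patterns and the function name `shallow_extract` are assumptions made for this example and do not come from any product named in this article; the name pattern in particular is deliberately naive.

```python
import re

# Hypothetical surface patterns, chosen only for illustration.
# A date such as "Sept. 11, 2001" or "March 5, 2004".
DATE_RE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept?|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}\b"
)
# A naive "name": two capitalized words in a row.
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def shallow_extract(text):
    """Return date and name strings found on the surface of the text.

    Nothing is resolved: the function does not know who a name refers to
    or whether two strings describe the same person.
    """
    return {
        "dates": DATE_RE.findall(text),
        "names": NAME_RE.findall(text),
    }
```

A pattern matcher of this kind cannot tell one "Buffalo" from another; resolving that sort of ambiguity is the job of the intermediate level.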

In intermediate extraction, a program uses grammar and statistics to resolve ambiguity in the language of documents. Such a program could see the verb "bombing" and choose the appropriate meaning, or see the location "Buffalo" and determine the correct state, or see the pronoun "he" and determine to whom it refers. This type of performance is the goal of the SBIR grant to Cymfony.

Finally, deep extraction programs could resolve ambiguous meanings not just within single documents, but across a range of events, like the Sept. 11, 2001, terror attacks. Such programs do not exist yet, but are the ultimate goal of AFRL's research, Pine says.

The core software engine that drives Brand Dashboard and the future military version is called InfoXtract, says Rohini Srihari, founder and chief scientist of Cymfony.

Internet search engines match specific keywords, then return simple URLs and documents. But InfoXtract generates three levels of new information from the raw documents. It matches related entities, such as people, organizations, weapons, and targets. It calculates their attributes, such as employer, family, age, and place within an organization's hierarchy. And it lists specific events for those entities: the basic who, what, where, and when.
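One way to picture those three levels is as a set of simple records. The Python sketch below is a hypothetical layout invented for this example; the class and field names are assumptions and do not describe InfoXtract's actual output format.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Levels one and two: a named thing plus its attributes."""
    name: str
    kind: str  # for example "person", "organization", "weapon", "target"
    attributes: dict = field(default_factory=dict)  # employer, age, rank, ...


@dataclass
class Event:
    """Level three: the basic who, what, where, and when."""
    who: str
    what: str
    where: str
    when: str


def events_for(entity, events):
    """Return the events in which the given entity is the 'who'."""
    return [e for e in events if e.who == entity.name]
```

A keyword search returns documents; records like these are the kind of structured output an analyst could sort, filter, and join.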

In the commercial arena, Brand Dashboard digs down only to the first two levels by searching vast databases of content from sources like Dow Jones, Factiva, and press clipping services. Cymfony runs the program on its own computers, and offers it to customers as a Web-based application.

For the government, Cymfony offers more advanced features, such as a classified version and intermediate data-mining power. In one project, Cymfony designers built a domain porting tool kit for Veridian Engineering of Arlington, Va. (now part of General Dynamics). Veridian researchers then used the InfoXtract engine to analyze classified documents at the National Air and Space Intelligence Center (NAIC) in Dayton, Ohio, without exposing them to the Internet, Srihari says.

Cymfony's latest project is the Intelligence Discovery Portal, which the AFRL grant supports. The portal is a next-generation tool that will search for trends and patterns that the user did not know about. Using statistical algorithms like clustering and grouping, the program can find which people are mentioned in contexts similar to those of known targets. For instance, when it finds people who turn up in the same groups, dates, and places as known criminals, it will list the new names.
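The idea of listing people who share groups, dates, and places with known targets can be sketched with plain set overlap. The Python example below counts shared contexts and is not a clustering algorithm; the function name, the `min_shared` threshold, and the tag format are all assumptions made for illustration, not the portal's method.

```python
def find_associates(records, known_targets, min_shared=2):
    """List names that share contexts with known targets.

    records: dict mapping a person's name to a set of context tags,
             such as "group:X", "place:Rome", or "date:2004-01-05".
    known_targets: set of names already under suspicion.
    min_shared: hypothetical threshold for how many shared tags count.
    """
    # Pool every context in which a known target appears.
    target_contexts = set()
    for name in known_targets:
        target_contexts |= records.get(name, set())

    # Keep each other person who overlaps with that pool often enough.
    hits = {}
    for name, contexts in records.items():
        if name in known_targets:
            continue
        shared = contexts & target_contexts
        if len(shared) >= min_shared:
            hits[name] = shared

    # Strongest overlap first, ties broken alphabetically.
    return sorted(hits, key=lambda n: (-len(hits[n]), n))
```

A real system would weight contexts statistically, since sharing a large city with a suspect means far less than sharing a small group and a specific date.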

To do that, it uses grammar rules to hone the data. In the passive sentence "Mary was hit by John," a simple program would list Mary as the subject, but the new portal would recognize that John had committed the action.
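The passive-voice example can be sketched with a single pattern. This toy Python example handles only sentences of the exact shape "X was VERB by Y"; the pattern and the function name are assumptions for illustration, and a real grammar-based engine would rely on full parsing rather than one regular expression.

```python
import re

# Hypothetical pattern for a minimal passive sentence: "<patient> was <verb> by <agent>".
PASSIVE_RE = re.compile(
    r"^(?P<patient>\w+) (?:was|were|is|are) (?P<verb>\w+) by (?P<agent>\w+)\.?$"
)


def who_did_what(sentence):
    """Return agent, action, and patient for a simple passive sentence.

    Returns None when the sentence does not fit the pattern.
    """
    match = PASSIVE_RE.match(sentence.strip())
    if match is None:
        return None
    return {
        "agent": match["agent"],      # the one who committed the action
        "action": match["verb"],
        "patient": match["patient"],  # the grammatical subject
    }
```

For "Mary was hit by John" the function reports John as the agent even though Mary is the grammatical subject.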

And it performs "time normalization" to assign specific dates to casual references like "yesterday" or "next Friday," Srihari says. A separate federal grant, from the U.S. Navy's Joint Warfare Analysis Center (JWAC) at Dahlgren, Va., supports some of this work.
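The time normalization Srihari describes can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions: the function name is hypothetical, the reference date stands in for the document's publication date, and "next Friday" is taken to mean the first Friday strictly after that date, a reading that real usage does not always follow.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize_time(expression, reference):
    """Map a casual time reference to a calendar date.

    reference: the date the document was written (an assumption here).
    Returns None for expressions this sketch does not cover.
    """
    text = expression.lower().strip()
    if text == "today":
        return reference
    if text == "yesterday":
        return reference - timedelta(days=1)
    if text == "tomorrow":
        return reference + timedelta(days=1)
    if text.startswith("next "):
        day = text[len("next "):]
        if day in WEEKDAYS:
            target = WEEKDAYS.index(day)
            # Days ahead, always between 1 and 7.
            ahead = (target - reference.weekday() - 1) % 7 + 1
            return reference + timedelta(days=ahead)
    return None
```

With a reference date of Monday, March 1, 2004, "yesterday" resolves to Feb. 29, 2004, and "next Friday" resolves to March 5, 2004.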

The Air Force and Navy are not the only government agencies tapping private-sector software companies to help them use data mining to fight terrorism. Four other recent initiatives follow.

The first initiative is ClearResearch, from ClearForest of New York. This software, which companies in chemistry, manufacturing, and publishing use, attaches XML meta-tags to data throughout masses of documents, then searches them to find connections. This is the engine behind the U.S. Federal Bureau of Investigation's TID (terrorism intelligence and data) program in the Trilogy network.

The second initiative is Matrix (multi-state anti-terrorism information exchange), from Seisint Inc. of Boca Raton, Fla. It finds links and patterns between police records and commercially available data. One hundred and thirty-five police agencies in Florida subscribe to the database, and the U.S. departments of Justice and Homeland Security have pledged $12 million to help other states adopt it soon.

Third is Coplink, from Knowledge Computing Corp. of Tucson, Ariz. This software sifts through arrest records, accident reports, 911 recordings, and homicide investigations to find connections. Users include city police forces in Arizona, Iowa, Massachusetts, Texas, Virginia, and Washington.

Fourth is PowerCase 4.0, from Xanalys LLC of Waltham, Mass. It finds relationships between past investigations, automobile records, and witness statements. Police departments throughout Ontario, Canada, now use it.
