Judy Hochberg - Computer Research and Applications Group (CIC-3) at Los Alamos National Laboratory
Automatic identification of classified documents
Feb 25, 2000

How can one automatically identify classified documents? This is a vital question for the Department of Energy (DOE), which is reviewing millions of classified documents for possible declassification, and for Los Alamos National Laboratory (LANL), which is checking its unclassified computing storage systems for the presence of classified documents.
The DOE, having already developed an expert rule system for automatic document classification, provided LANL with a small set of documents with which to explore a statistical classifier as an alternative. We represented documents as vectors of character trigram frequencies, used a chi-square statistic to select the optimal trigrams, and trained a linear classifier to distinguish classified from unclassified documents. Results ranged from 60% to 87% accuracy, depending on the training set size and other variables.
In contrast, the LANL effort started "from scratch" and needed to be moved rapidly into large-scale production. We implemented an expert system tailored to the classified documents of most concern to LANL. The talk will discuss the practical issues that arose in canvassing large numbers of files in a variety of formats, and the security issues involved in the sampling, analysis, and notification processes.
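For readers unfamiliar with the statistical approach mentioned in the abstract (character trigram frequencies, chi-square trigram selection, a linear classifier), the following is a minimal illustrative sketch and not the speaker's implementation. Everything specific in it is an assumption: the abstract does not say which chi-square formulation or which linear classifier was used, so the sketch scores each trigram with a 2x2 presence/absence table and swaps in a simple perceptron. The function names, the toy texts, and the labels are hypothetical.

```python
from collections import Counter


def trigram_freqs(text):
    """Map each overlapping character trigram to its relative frequency."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values()) or 1  # guard for texts shorter than 3 chars
    return {gram: n / total for gram, n in counts.items()}


def chi_square(docs, labels, gram):
    """Chi-square of the 2x2 table: trigram present/absent vs. label 1/0.
    (Assumed formulation; the abstract does not specify one.)"""
    a = b = c = d = 0
    for freqs, label in zip(docs, labels):
        present = gram in freqs
        if present and label == 1:
            a += 1
        elif present:
            b += 1
        elif label == 1:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_trigrams(docs, labels, k):
    """Keep the k trigrams with the highest chi-square score.
    Ties are broken alphabetically so the result is deterministic."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda g: (-chi_square(docs, labels, g), g))[:k]


def vectorize(freqs, grams):
    """Turn a frequency map into a fixed-length feature vector."""
    return [freqs.get(g, 0.0) for g in grams]


def train_perceptron(X, y, epochs=50):
    """Train a perceptron, used here as a stand-in linear classifier."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = target - (1 if score > 0 else 0)
            if err:
                w = [wi + err * xi for wi, xi in zip(w, x)]
                b += err
                mistakes += 1
        if mistakes == 0:  # every training document is classified correctly
            break
    return w, b


def predict(text, grams, w, b):
    """Return 1 or 0 for a new text using the selected trigrams and weights."""
    x = vectorize(trigram_freqs(text), grams)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical toy data; label 1 stands in for the "classified" category.
texts = [
    "alpha reactor alpha reactor",
    "reactor core alpha",
    "lunch menu lunch menu",
    "menu for friday lunch",
]
labels = [1, 1, 0, 0]

docs = [trigram_freqs(t) for t in texts]
grams = select_trigrams(docs, labels, k=6)
w, b = train_perceptron([vectorize(d, grams) for d in docs], labels)
```

With the toy data above, `predict("alpha reactor status", grams, w, b)` can be used to label an unseen text. A real system would need far more documents and a held-out test set to produce accuracy figures comparable to those quoted in the abstract.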