Abstract
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized “ground truth” news corpus, showing that this technique is effective in identifying “topics” in collections of news articles.
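The frequent-itemset step described above can be illustrated with a minimal sketch. It assumes each article has already been reduced to a set of entity names; the function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy corpus are all hypothetical and not taken from the paper. The counting is brute force rather than an optimized miner such as Apriori, and the hypergraph partitioning step is not covered.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return entity sets that co-occur in at least `min_support` articles.

    Brute-force counting for illustration only (an assumption of this
    sketch, not the paper's implementation).
    """
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))  # canonical order so tuples compare equal
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Hypothetical toy corpus: each article is represented by its entity set.
articles = [
    {"Acme", "Baker", "Chen"},
    {"Acme", "Chen", "Dublin"},
    {"Acme", "Baker", "Chen", "Eriksen"},
    {"Xenon Ltd", "Yorke", "Zurich"},
    {"Xenon Ltd", "Zurich", "Acme"},
]

result = frequent_itemsets(articles, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (-kv[1], kv[0])):
    print(itemset, support)
```

In this toy corpus the entities `Acme`, `Baker`, and `Chen` form a frequent itemset, as do `Xenon Ltd` and `Zurich`; groups like these are the input that a later clustering step would merge into topics.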
Note
3rd European Conference
on Principles and Practice of Knowledge Discovery in Databases
September 15-18, 1999 in Prague, Czech Republic
Lecture Notes in Artificial Intelligence
1704, Springer-Verlag (Draft Available)