TopCat: Data Mining for Topic Identification in a Text Corpus

Author

Christopher Clifton

Tech report number

CERIAS TR 2001-91

Entry type

conference

Abstract

TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized “ground truth” news corpus showing this technique is effective in identifying “topics” in collections of news articles.
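The frequent-itemset step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the sample entity names are all assumptions made for illustration, and the hypergraph partitioning step is omitted. Each article is treated as a set of entities, and the sketch counts which small groups of entities co-occur in at least `min_support` articles.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count itemsets of up to max_size items and keep those that
    appear in at least min_support articles (brute-force counting,
    an illustrative stand-in for a real frequent-itemset miner)."""
    counts = Counter()
    for items in articles:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical articles, each reduced to a set of named entities.
articles = [
    {"Clinton", "Washington", "Congress"},
    {"Clinton", "Congress", "budget"},
    {"Prague", "NATO"},
    {"Clinton", "Congress"},
]

result = frequent_itemsets(articles, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

In this toy data, only the entities that co-occur across three articles survive the support threshold; a clustering step such as the hypergraph partitioning the abstract describes would then group related itemsets into topics.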

Date

1999 – 09

URL

http://people.csail.mit.edu/jr ... apers/topcat-tkde2000.pdf

Key alpha

Clifton

Note

3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, September 15-18, 1999, Prague, Czech Republic. Lecture Notes in Artificial Intelligence 1704, Springer-Verlag (Draft Available)

Publication Date

2001-09-01

BibTex-formatted data

To refer to this entry, you may select and copy the text below and paste it into your BibTex document. Note that the text may not contain all macros that BibTex supports.

@Conference{ Clifton,
	title = "TopCat:  Data Mining for Topic Identification in a Text Corpus",
	author = "Christopher Clifton",
	year = "1999",
	month = "09",
	note = "3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, September 15-18, 1999, Prague, Czech Republic. Lecture Notes in Artificial Intelligence 1704, Springer-Verlag (Draft Available)",
	abstract = "TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on ``traditional'' data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized ``ground truth'' news corpus showing this technique is effective in identifying ``topics'' in collections of news articles.",
	url = "http://people.csail.mit.edu/jrennie/papers/topcat-tkde2000.pdf",
}