TopCat: data mining for topic identification in a text corpus

Download

PDF

Author

Christopher Clifton

Tech report number

CERIAS TR 2004-90

Entry type

article

Abstract

TopCat (topic categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground truth news corpus; it shows this technique is effective in identifying topics in collections of news articles.
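
The pipeline the abstract describes (articles as sets of entities, then frequent itemsets, then clustering) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy articles, the function names `frequent_itemsets` and `merge_overlapping`, and the support threshold are all hypothetical. In particular, the clustering step here simply merges itemsets that share an item (connected components); it is a stand-in for, not an implementation of, the hypergraph partitioning scheme the abstract mentions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return entity combinations (size 2..max_size) found in >= min_support articles.

    Brute-force counting, fine for a toy corpus; not the paper's algorithm.
    """
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))
        for size in range(2, max_size + 1):
            counts.update(combinations(items, size))
    return {frozenset(combo) for combo, n in counts.items() if n >= min_support}


def merge_overlapping(itemsets):
    """Stand-in for hypergraph partitioning: merge itemsets that share any item."""
    clusters = []  # invariant: clusters are pairwise disjoint sets
    for itemset in itemsets:
        merged = set(itemset)
        kept = []
        for cluster in clusters:
            if cluster & merged:
                merged |= cluster  # absorb every cluster touching this itemset
            else:
                kept.append(cluster)
        kept.append(merged)
        clusters = kept
    return clusters


# Hypothetical toy corpus: each article is the set of key entities extracted from it.
articles = [
    {"mayor", "city council", "budget"},
    {"mayor", "budget", "transit"},
    {"mayor", "city council", "budget"},
    {"striker", "league", "coach"},
    {"striker", "league"},
]

itemsets = frequent_itemsets(articles, min_support=2)
topics = sorted(sorted(cluster) for cluster in merge_overlapping(itemsets))
for topic in topics:
    print(topic)
```

Entities that appear together in only one article (such as "transit" or "coach" here) fall below the support threshold and drop out, which is the sense in which frequent itemsets capture recurring topics. A real partitioning scheme would also split large, loosely connected groups, which this simple merge does not attempt.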

Date

2004-08

URL

http://ieeexplore.ieee.org/sea ... pper.jsp?arnumber=1318580

Address

Los Alamitos, CA

Journal

Transactions on Knowledge and Data Engineering

Key alpha

Clifton

Number

8

Pages

949-964

Publisher

IEEE Computer Society Press

Volume

16

Publication Date

2004-08-01

Language

English

BibTeX-formatted data

To cite this entry, copy the text below and paste it into your BibTeX file. Note that the text may not use every macro that BibTeX supports.

@Article{ Clifton,
	title = "TopCat: data mining for topic identification in a text corpus",
	author = "Christopher Clifton",
	year = "2004",
	month = "08",
	address = "Los Alamitos, CA",
	journal = "Transactions on Knowledge and Data Engineering",
	number = "8",
	pages = "949-964",
	publisher = "IEEE Computer Society Press",
	volume = "16",
	abstract = "TopCat (topic categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground truth news corpus; it shows this technique is effective in identifying topics in collections of news articles.",
	language = "English",
	url = "http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=1318580",
}