The Center for Education and Research in Information Assurance and Security (CERIAS)

Identifying Rare Classes with Sparse Training Data

Author

Christopher Clifton

Tech report number

CERIAS TR 2007-97

Entry type

conference

Abstract

Building models and learning patterns from a collection of data are essential tasks for decision making and the dissemination of knowledge. A common way to extract knowledge is to build a classifier. However, when the training dataset is sparse, it is difficult to build an accurate classifier. This is especially true in the biological sciences, where data are hard to produce and error-prone. Through empirical results, this paper shows the challenges of building an accurate classifier from a sparse biological training dataset. Our findings reveal inadequacies in well-known classification techniques. Although certain clustering techniques, such as seeded k-Means, show some promise, there is still room for improvement. In addition, we propose a novel idea that could be used to produce a more balanced classifier when training samples are very limited.

Date

2007-09

Booktitle

Database and Expert Systems Applications

Key alpha

Clifton

Pages

251-260

Publisher

Springer Berlin / Heidelberg

Volume

4653

Publication Date

2007-09-01
