Summary structures for frequency queries on large transaction sets

Author

Dow-Yung Yang, A. Johar, A. Grama, W. Szpankowski

Entry type

conference

Abstract

As large-scale databases become commonplace, there has been significant interest in mining them for commercial purposes. One of the basic tasks that underlies many of these mining operations is querying of transaction sets for frequencies of specified attribute values. The size of these databases makes it important to develop summary structures capable of high compression ratios as well as supporting fast frequency queries. The nature of the problem and its differences with respect to traditional text compression allows very high compression ratios. In this paper, we propose a binary trie-based summary structure for representing transaction sets. We demonstrate that this trie structure, when augmented with an appropriate set of horizontal pointers, can support frequency queries several orders of magnitude faster than raw transaction data. We improve the memory characteristics of our scheme by compressing the trie into a Patricia trie and demonstrate that this does not have a significant adverse effect on frequency query time. We further reduce the size of this trie by selectively pruning branches to compute a “dominant” trie that is capable of approximate frequency querying. The complement trie called the “deviant” trie is also useful in many data mining applications. Recompressing the “dominant” trie into a Patricia trie results in further compression of the trie. Finally, we demonstrate that our binary compressed trie structure has better memory (compression) characteristics compared to related schemes. We support our claims with experimental results on datasets from the IBM synthetic association data generator.
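
The basic idea in the abstract, a binary trie that summarizes transactions and answers frequency queries, can be illustrated with a minimal sketch. This is not the authors' implementation: it leaves out the horizontal pointers, the Patricia compression, and the dominant/deviant pruning, and it assumes each transaction is a set of integer item ids in the range 0..num_items-1. The names TrieNode, TransactionTrie, insert, and frequency are hypothetical and chosen only for this illustration.

```python
from typing import Iterable, List, Optional, Set


class TrieNode:
    """One node of a binary trie; count = transactions whose path passes through it."""

    def __init__(self) -> None:
        self.count = 0
        self.children: List[Optional["TrieNode"]] = [None, None]


class TransactionTrie:
    """Uncompressed binary trie over fixed-length item bit-vectors (illustrative only)."""

    def __init__(self, num_items: int) -> None:
        self.num_items = num_items
        self.root = TrieNode()

    def insert(self, transaction: Iterable[int]) -> None:
        # Level i of the trie branches on whether item i is present (1) or absent (0).
        present = set(transaction)
        node = self.root
        node.count += 1
        for item in range(self.num_items):
            bit = 1 if item in present else 0
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
            node.count += 1

    def frequency(self, itemset: Iterable[int]) -> int:
        """Return the number of stored transactions that contain every item in itemset."""
        wanted = set(itemset)
        if not wanted:
            return self.root.count
        if min(wanted) < 0 or max(wanted) >= self.num_items:
            return 0  # an item outside the universe cannot occur in any transaction
        return self._count(self.root, 0, wanted, max(wanted))

    def _count(self, node: Optional[TrieNode], depth: int,
               wanted: Set[int], last: int) -> int:
        if node is None:
            return 0
        if depth > last:
            # All queried items are already fixed, so the stored count covers this subtree.
            return node.count
        if depth in wanted:
            # A queried item must be present: follow only the 1-branch.
            return self._count(node.children[1], depth + 1, wanted, last)
        # Unconstrained item: both branches may contribute.
        return (self._count(node.children[0], depth + 1, wanted, last)
                + self._count(node.children[1], depth + 1, wanted, last))


trie = TransactionTrie(num_items=3)
for t in ([0, 1], [0, 2], [0, 1, 2], [1]):
    trie.insert(t)
print(trie.frequency([0, 1]))
```

Because every node stores a count, a query can stop as soon as its last queried item has been matched, without reading the remaining levels or the raw transactions. The structures described in the abstract go further than this sketch in both speed and memory use.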

Date

2000

URL

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=838182

Booktitle

Data Compression Conference, 2000. Proceedings. DCC 2000

Key alpha

Grama

Affiliation

Purdue University

Publication Date

2000-00-00

BibTex-formatted data

To refer to this entry, you may select and copy the text below and paste it into your BibTex document. Note that the text may not contain all macros that BibTex supports.

@Conference{ Grama,
	title = "Summary structures for frequency queries on large transaction sets",
	author = "Dow-Yung Yang and A. Johar and A. Grama and W. Szpankowski",
	year = "2000",
	booktitle = "Data Compression Conference, 2000. Proceedings. DCC 2000",
	abstract = "As large-scale databases become commonplace, there has been significant interest in mining them for commercial purposes. One of the basic tasks that underlies many of these mining operations is querying of transaction sets for frequencies of specified attribute values. The size of these databases makes it important to develop summary structures capable of high compression ratios as well as supporting fast frequency queries. The nature of the problem and its differences with respect to traditional text compression allows very high compression ratios. In this paper, we propose a binary trie-based summary structure for representing transaction sets. We demonstrate that this trie structure, when augmented with an appropriate set of horizontal pointers, can support frequency queries several orders of magnitude faster than raw transaction data. We improve the memory characteristics of our scheme by compressing the trie into a Patricia trie and demonstrate that this does not have a significant adverse effect on frequency query time. We further reduce the size of this trie by selectively pruning branches to compute a ``dominant'' trie that is capable of approximate frequency querying. The complement trie called the ``deviant'' trie is also useful in many data mining applications. Recompressing the ``dominant'' trie into a Patricia trie results in further compression of the trie. Finally, we demonstrate that our binary compressed trie structure has better memory (compression) characteristics compared to related schemes. We support our claims with experimental results on datasets from the IBM synthetic association data generator.",
	affiliation = "Purdue University",
	url = "http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=838182",
}