A Fixed-Size Bloom Filter for Searching Textual Documents

Get BibTex-formatted data

Author

M. A. Shepherd,W. J. Phillips,C. K. Chu

Entry type

article

Abstract

The empirical false drop rate associated with a fixed-size Bloom filter used to represent textual documents may be quite different than the theoretical rate. This problem arises when the filter size is based on the expectation of a uniform distribution of the number of different terms per document. The distribution is, in fact, not uniform. This paper describes a method to determine the filter size for a database of textual documents, based on the desired false drop rate and the actual distribution of different words over the documents for that database. Theoretical and experimental results are reported and indicate that a filter size based on this method produces empirical false drop rates equivalent to the theoretical rates. The filter was also compared to variable-length filters with respect to storage requirements and search times.

URL

http://comjnl.oxfordjournals.org/cgi/reprint/32/3/212

Journal

The Computer Journal

Key alpha

Shepherd

Number

Volume

Publication Date

2001-01-01

BibTex-formatted data

To refer to this entry, you may select and copy the text below and paste it into your BibTex document. Note that the text may not contain all macros that BibTex supports.

@Article{ Shepherd,
	title = "A Fixed-Size Bloom Filter for Searching Textual Documents",
	author = "M. A. Shepherd,W. J. Phillips,C. K. Chu",
	journal = "The Computer Journal",
	number = "3",
	volume = "32",
	abstract = "The empirical false drop rate associated with a fixed-size Bloom filter used to
represent textual documents may be quite different than the theoretical rate.
This problem arises when the filter size is based on the expectation of a uniform
distribution of the number of different terms per document. The distribution
is, in fact, not uniform. This paper describes a method to determine the filter
size for a database of textual documents, based on the desired false drop rate
and the actual distribution of different words over the documents for that
database. Theoretical and experimental results are reported and indicate that
a filter size based on this method produces empirical false drop rates equivalent
to the theoretical rates. The filter was also compared to variable-length filters
with respect to storage requirements and search times.",
	url = "http://comjnl.oxfordjournals.org/cgi/reprint/32/3/212",
}