Abstract
The empirical false drop rate associated with a fixed-size Bloom filter used to
represent textual documents may be quite different than the theoretical rate.
This problem arises when the filter size is based on the expectation of a uniform
distribution of the number of different terms per document. The distribution
is, in fact, not uniform. This paper describes a method to determine the filter
size for a database of textual documents, based on the desired false drop rate
and the actual distribution of different words over the documents for that
database. Theoretical and experimental results are reported and indicate that
a filter size based on this method produces empirical false drop rates equivalent
to the theoretical rates. The filter was also compared to variable-length filters
with respect to storage requirements and search times.