Author
P. Flajolet, Y. Guivarch, W. Szpankowski, B. Vallee
Abstract
Two fundamental problems in combinatorics on words and string manipulation
are string matching and sequence comparison. In string matching one
searches for all occurrences of a given {it string}, understood as a
sequence of consecutive symbols, in a text. In sequence
comparison a {it subsequence} rather than a string is searched in a text
The string matching problem has been
extensively studied in literature from algorithmic and
probabilistic points of view. The sequence
comparison problem, also known as
{it hidden pattern} problem, is harder and
it has been much less investigated. In this
paper we study the number of occurrences of a given pattern~$w$
of length $m$ as a subsequence in a random text of length $n$ generated
by a memoryless source. In particular, we consider two
versions of this problem, namely
the {\\it unconstrained} one in which the subsequence $w$ can appear anywhere in
the text, and the {\\it constrained} one that puts bounds on the
distances between symbols of the word $w$.
We determine the mean and the
variance of the number of occurrences, and establish a Gaussian limit law.
These results are obtained via combinatorics on words, formal
languages, and methods of analytic combinatorics based on
generating functions and moment methods.
The motivation to study this problem comes from an attempt at finding a
reliable
threshold for intrusion detections, from textual data processing
applications, and from molecular biology.
Affiliation
INRIA, IRMAR, Purdue, GREYC