Hidden Patterns Statistics

Get BibTex-formatted data

Author

P. Flajolet, Y. Guivarch, W. Szpankowski, B. Vallee

Entry type

article

Abstract

Two fundamental problems in combinatorics on words and string manipulation are string matching and sequence comparison. In string matching one searches for all occurrences of a given {it string}, understood as a sequence of consecutive symbols, in a text. In sequence comparison a {it subsequence} rather than a string is searched in a text The string matching problem has been extensively studied in literature from algorithmic and probabilistic points of view. The sequence comparison problem, also known as {it hidden pattern} problem, is harder and it has been much less investigated. In this paper we study the number of occurrences of a given pattern~$w$ of length $m$ as a subsequence in a random text of length $n$ generated by a memoryless source. In particular, we consider two versions of this problem, namely the {\\it unconstrained} one in which the subsequence $w$ can appear anywhere in the text, and the {\\it constrained} one that puts bounds on the distances between symbols of the word $w$. We determine the mean and the variance of the number of occurrences, and establish a Gaussian limit law. These results are obtained via combinatorics on words, formal languages, and methods of analytic combinatorics based on generating functions and moment methods. The motivation to study this problem comes from an attempt at finding a reliable threshold for intrusion detections, from textual data processing applications, and from molecular biology.

URL

http://www.cs.purdue.edu/homes/spa/papers/icalp01.ps

Key alpha

hidden patterns, subsequences, intrusion detection

Affiliation

INRIA, IRMAR, Purdue, GREYC

Publication Date

0000-00-00

BibTex-formatted data

To refer to this entry, you may select and copy the text below and paste it into your BibTex document. Note that the text may not contain all macros that BibTex supports.

@Article{ hidden patterns, subsequences, intrusion detection,
	title = "Hidden Patterns Statistics",
	author = "P. Flajolet, Y. Guivarch, W. Szpankowski, B. Vallee",
	abstract = "Two fundamental problems in combinatorics on words and string manipulation
are string matching and sequence comparison. In string matching one
searches for all occurrences of a given {it string}, understood as a
sequence of consecutive symbols, in a text. In sequence
comparison a {it subsequence} rather than a string is searched in a text
The string matching problem has been
extensively studied in literature from algorithmic and
probabilistic points of view. The sequence
comparison problem,  also known as
{it hidden pattern} problem, is harder and
it has been much less investigated. In this
paper we study the number of occurrences of a given pattern~$w$
of length $m$ as a subsequence in a random text of length $n$ generated
by a memoryless source. In particular, we consider two
versions of this problem, namely
the {\\it unconstrained} one in which the subsequence $w$ can appear anywhere in
the text, and the {\\it constrained} one that puts bounds on the
distances between symbols of the word $w$.
We determine the mean and the
variance of the number of occurrences, and establish a Gaussian limit law.
These results are obtained via combinatorics on words, formal
languages, and methods of analytic combinatorics based on
generating functions and moment methods.
The motivation to study this problem comes from an attempt at finding a
reliable
threshold for intrusion detections, from textual data processing
applications, and from molecular biology.",
	affiliation = "INRIA, IRMAR, Purdue, GREYC",
	url = "www.cs.purdue.edu/homes/spa/papers/icalp01.ps",
}