Compact Recognizers of Episode Sequences

Get BibTex-formatted data

Download

PDF

Author

Alberto Apostolico, Mikhail J. Atallah

Tech report number

CERIAS TR 97-20

Entry type

techreport

Abstract

Given two strings T = a 1 : : :a n and P = b 1 : : :b m over an alphabet , the problem of testing whether P occurs as a subsequence of T is trivially solved in linear time. It is also known that a simple O ( n log j j) time preprocessing of T makes it easy to decide subsequently for any P and in at most j P j log j j character comparisons, whether P is a subsequence of T . These problems become more complicated if one asks instead whether P occurs as a subsequence of some substring Y of T of bounded length. This paper presents an automaton built on the textstring T and capable of identifying all distinct minimal substrings Y of X having P as a subsequence. By a substring Y being minimal with respect to P , it is meant that P is not a subsequence of any proper substring of Y . For every minimal substring Y , the automaton recognizes the occurrence of P having lexicographically smallest sequence of symbol positions in Y . It is not di cult to realize such an automaton in time and space O ( n 2 ) for a text of n characters. One result of this paper consists of bringing those bounds down to linear or O ( n log n ), respectively, depending on whether the alphabet is bounded or of arbitrary size, thereby matching the respective complexities of o -line exact string searching. Having built the automaton, the search for all lexicographically earliest occurrences of P in X is carried out in time O ( n + P m i=1 rocc i i log n log j j), where rocc i is the number of distinct minimal substrings of T having b 1 : : :b i as a subsequence. All log factors appearing in the above bounds can be further reduced to log log by resort to known integer-handling data structures.

Type

COAST TR

Download

PDF

Date

1997

URL

http://66.102.1.104/scholar?hl ... rchive/archive/97-20.pdf+

Institution

Department of Computer Sciences

Journal

Information & Computation

Key alpha

apostolico

Number

97-20

School

Purdue University

Type

COAST TR

Publication Date

2001-01-01

BibTex-formatted data

To refer to this entry, you may select and copy the text below and paste it into your BibTex document. Note that the text may not contain all macros that BibTex supports.

@Techreport{ apostolico,
	title = "Compact Recognizers of Episode Sequences",
	author = "Alberto Apostolico, Mikhail J. Atallah",
	year = "1997",
	institution = "Department of Computer Sciences",
	journal = "Information & Computation",
	number = "97-20",
	school = "Purdue University",
	type = "COAST TR",
	abstract = "Given two strings T = a 1 : : :a n and P = b 1 : : :b m over an alphabet  , the problem of testing
whether P occurs as a subsequence of T is trivially solved in linear time. It is also known
that a simple O ( n log j j) time preprocessing of T makes it easy to decide subsequently for any
P and in at most j P j log j j character comparisons, whether P is a subsequence of T . These
problems become more complicated if one asks instead whether P occurs as a subsequence of
some substring Y of T of bounded length. This paper presents an automaton built on the
textstring T and capable of identifying all distinct minimal substrings Y of X having P as a
subsequence. By a substring Y being minimal with respect to P , it is meant that P is not
a subsequence of any proper substring of Y . For every minimal substring Y , the automaton
recognizes the occurrence of P having lexicographically smallest sequence of symbol positions
in Y . It is not di cult to realize such an automaton in time and space O ( n
2
) for a text of
n characters. One result of this paper consists of bringing those bounds down to linear or
O ( n log n ), respectively, depending on whether the alphabet is bounded or of arbitrary size,
thereby matching the respective complexities of o -line exact string searching. Having built
the automaton, the search for all lexicographically earliest occurrences of P in X is carried out
in time O ( n + P
m
i=1 rocc i   i   log n   log j j), where rocc i is the number of distinct minimal
substrings of T having b 1 : : :b i as a subsequence. All log factors appearing in the above bounds
can be further reduced to log log by resort to known integer-handling data structures.
",
	url = "http://66.102.1.104/scholar?hl=en&lr=&q=cache:FKlHXYFFKA4J:https://www.cerias.purdue.edu/tools_and_resources/bibtex_archive/archive/97-20.pdf+",
}