Abstract
In this thesis we present a solution to the problem of identification of significant sets of episodes in event sequences. In order to determine the significance of an episode in a monitored event sequence, we compare its observed frequency to its frequency in a reference sequence. The reference sequence in our work is represented by a variable-length Markov model of generating symbols in the reference sequence. An episode is significant if the probability that it would have a given frequency by chance, in the reference sequence, is very small. In order to identify significant episodes we first show how to select the sliding window size to ensure that a discovered episode is meaningful and then we show how to compute a lower threshold for under-represented and an upper threshold for overrepresented significant episodes. The frequency of occurrence alone is not enough to determine significance, i.e., an
infrequent episode can be more significant than a frequent one, and the significance depends on the structure of the episode and on probabilistic characteristics of the
reference and monitored event streams. As an extension, we propose a novel method for providing approximate answers, with probabilistic guarantees, to a class of ad
hoc sliding window queries referencing past data in data streams. The queries in that class compute the frequency of past windows that satisfy given join conditions among tuples in a window comprising multiple streams. To represent the join conditions
consisting of intra-stream and inter-stream constraints between tuples in the window we introduce a concept of a 2D-episode.
Contents
1. Introduction
2. Mining stream data
3. Notation
4. Mining frequent episodes
5. Finding occurrences of episodes
6. Mining significant episodes
7. Identification of significant sets of episodes
8. Analysis of the significance thresholds
9. Variable-length Markov model
10. Analysis of the probability of existence of an episode
11. Experimental results
12. 2D-episodes
13. Summary