The Center for Education and Research in Information Assurance and Security (CERIAS)

The Center for Education and Research in
Information Assurance and Security (CERIAS)

Reports and Papers Archive


Browse All Papers »       Submit A Paper »

Hidden Pattern Statistics

CERIAS TR 2002-06
Philippe Flajolet, Yves Guivarc'h, Wojciech Szpankowski, and Brigitte Vallee
Download: PDF

Two fundamental problems in combinatorics on words and string manipulation are string matching and sequence comparison. In string matching one searches for all occurrences of a given string, understood as a sequence of consecutive symbols, in a text. In sequence comparison a subsequence rather than a string is searched in a text.  The string-matching problem has been extensively studied in literature from algorithmic and probabilistic points of view. The sequence comparison problem, also known as hidden pattern problem, is harder and it has been much less investigated. In this paper we study the number of occurrences of a given pattern w of length m as a subsequence in a random text of length n generated by a memoryless source. In particular, we consider two versions of this problem, namely the unconstrained one in which the subsequence w can appear anywhere in the text, and the constrained one that puts bounds on the distances between symbols of the word w. We determine the mean and the variance of the number of occurrences, and establish a Gaussian limit law. These results are obtained via combinatorics on words, formal languages, and methods of analytic combinatorics based on generating functions and moment methods. The motivation to study this problem comes from an attempt at finding a reliable threshold for intrusion detections, from textual data processing applications, and from molecular biology.

Added 2002-07-26

Trustworthiness Based Authorization on WWW

CERIAS TR 2002-08
Yuhui Zhong, Bharat Bhargava, and Malika Mahoui
Download: PDF

Current approaches for authorization on Web servers are mostly based on a predefined set of users or domains.  They are not suitable for Internet Web sites where the user set is unbounded and authorized users can be non-predefined.  We propose an authorization approach that applies role-based access control (RBAC) to WWW.  Under this approach, system administrators predefine roles, role-permission relations, and the policies that assign roles to users (user-role assignment policy).  The system automatically collects trustworthy information (valid evidence) and assigns roles to Internet users according to user-role assignment policies.  Trustworthiness information plays an important role in user-role assignment.  The validity of evidence is assessed based on the trustworthiness information of the evidence provider.  In addition, system administrators can specify the trustworthiness constraints that users have to satisfy for holding roles.  In this paper, the schema of using RBAC on the Web and the procedure of user-role assignment are presented.  The classification and evaluation of trustworthiness are discussed.

Added 2002-07-26

Separating Between Trust and Access Control Policies: A necessity for Web Applications

CERIAS TR 2002-07
Malika Mahoui, Bharat Bhargava, and Yuhui Zhong
Download: PDF

As Security is the key of success for Web Applications most of the efforts that have been put in this domain have focused on wining users

Added 2002-07-26

The Height of a Binary Search Tree: The Limiting Distribution Perspective

CERIAS TR 2002-09
Charles Knessl, Wojciech Szpankowski
Download: PDF

We study the height of the binary search tree - the most fundamental data structure used for searching. We assume that the binary search tree is built from a random permutation of n elements. Under this assumption, we study the limiting distribution of the height as n approaches infinity. We show that the distribution has six asymptotic regions (scales). These correspond to different ranges of k and n where Pr{Hn <= k} is the height distribution.  In the critical region (the so-called central region), where most of the probability mass is concentrated, the limiting distribution satisfies a non-linear integral equation. While we cannot solve this equation exactly, we show that both tails of the distribution are roughly of a double exponential form. From our analysis we conclude that the average height E[Hn] ~ A log n

Added 2002-07-26

The CROSS/Linux Value-added Services Router

Prem Gopalan
Added 2002-07-26

A Universal Predictor Based on Pattern Matching

CERIAS TR 2002-10
Philippe Jacquet, Wojciech Sqpankowski, Izydor Apostol
Download: PDF

We consider a universal predictor based on pattern matching: Given a sequence X1

Added 2002-07-26

Multicast Tree Structure and the Power Law

CERIAS TR 2002-11
Cedric Adjih, Philippe Jacquet, Leonidas Georgiadia and Wojciech Szpankowski
Download: PDF

One of the main benefits of multicast communication is the overall reduction of network load. To quantify this reduction, when compared to traditional unicast, experimental studies by Chuang and Sirbu indicated the so-called power law which asserts that the ratio R(m) of the average number of links in a multicast delivery tree connecting a source to m (distinct) sites to the average number of links in a unicast path, satisfies R(m) ~ cm^0.8 where c is a constant. In order to explain theoretically this behavior, Phillips, Shenker, and Tangmunarunkit examined approximately R(m) for a V -ary complete tree topology, and concluded that R(m) grows nearly linearly with m, thus not obeying the power law. We first re-examine the analysis by Phillips et.al. and provide precise asymptotic expansion for R(m) that confirms the nearly linear (with some wobbling) growth. Claiming that the essence of the problem lies in the modeling assumptions, we replace the V -ary complete tree topology by a V -ary self-similar tree with similarity factor 0 < T <1. In such a tree a node at level k is replicated CV^(D_ktT) times, where D is the depth of the tree and C is a constant. Under this assumption, we analyze again R(m) and prove that R(m) ~ cm^(1_T) where c is an explicitly computable constant. Hence self-similar trees provide a plausible explanation of the multicast power law. Next, we examine more general conditions for general trees, under which the power law still holds. We also discuss some experimental results in real networks that reaffirm the power law and show that in these networks the general conditions hold. In particular, our experiments show that for the tested networks T ~ 0.12.

Added 2002-07-26

1999 CSI/FBI Computer Crime and Security Survey

Richard Power
Added 2002-07-26


2000 IEEE Symposium on Security and Privacy

CERIAS TR 2003-57
IEEE Computer Society
Download: PDF
Added 2002-07-26

Generalized Shannon Code Minimizes the Maximal Redundancy

CERIAS TR 2002-12
Michael Drmota and Wojciech Szpankowski
Download: PDF

Source coding, also known as data compression, is an area of information theory that deals with the design and performance evaluation of optimal codes for data compression. In 1952 Huffman constructed his optimal code that minimizes the average code length among all prefix codes for known sources. Actually, Huffman codes minimize the average redundancy defined as the difference between the code length and the entropy of the source. Interestingly enough, no optimal code is known for other popular optimization criterion such asthemaximal redundancy defined as the maximum of the point-wise redundancy over all source sequences. We first prove that a generalized Shannon code minimizes the maximal redundancy among all prefix codes, and present an efficient implementation of the optimal code. Then we compute precisely its redundancy for memory less sources. Finally, we study universal codes for unknown source distributions. We adopt the minimax approach and search for the best code for the worst source. We establish that such redundancy is a sum of the likelihood estimator and the redundancy of the generalize code computed for the maximum likelihood distribution. This replaces Shtarkov\‘s bound by an exact formula. We also compute precisely the maximal minimax for a class of memory less sources. The main findings of this paper are established by techniques that belong to the toolkit of the

Added 2002-07-26

Precise Average Redundancy of an Idealized Arithmetic Coding

CERIAS TR 2002-13
Michael Drmota, Hsien-Kuei Hwang, Wojciech Szpankowski
Download: PDF

Redundancy is defined as the excess of the code length over the optimal (ideal) code length. We study the average redundancy of an idealized arithmetic coding (for memory less sources with unknown distributions) in which the Krichevsky and Trofimov estimator is followed by the Shannon-Fano code. We shall ignore here important practical implementation issues such as finite precisions and finite buffer sizes. In fact, our idealized arithmetic code can be viewed as an adaptive infinite precision implementation of arithmetic encoder that resembles Elias coding. However, we provide very precise results for the average redundancy that takes into account integer-length constraints. These findings are obtained by analytic methods of analysis of algorithms such as theory of distribution of sequences modulo 1 and Fourier series. These estimates can be used to study the average redundancy of codes for tree sources, and ultimately the context-tree weighting algorithms.

Added 2002-07-26

Improved Behaviour of Tries by the "Symmetrization" of the Source

CERIAS TR 2002-14
Yuriy A. Reznik and Wojciech Szpankowski
Download: PDF

In this paper, we propose and study a pre-processing technique for improving performance of digital tree (trie)-based search algorithms under asymmetric memory less sources. This technique (which we call a symmetrization of the source) bijectively maps the sequences of symbols from the original (asymmetric) source into symbols of an output alphabet resulting in a more uniform distribution. We introduce a criterion of efficiency for such a mapping, and demonstrate that a problem of finding an optimal for a given source (or universal) symmetrization transform is equivalent to a problem of constructing a minimum redundancy variable-length-to-block code for this source (or class of sources). Based on this result, we propose search algorithms that incorporate known (optimal for a given source and universal) variable-length-to-block codes and study their asymptotic behavior. We complement our analysis with a description of an efficient algorithm for universal symmetrization of binary memory less sources, and compare the performance of the resulting search structure with the standard tries.

Added 2002-07-26


Privacy Risks in Recommender Systems

Naren Ramakrishnan, Benjamin J. Keller, Batul J. Mirza, Anantha Y. Grama, George Karypis
Added 2002-07-26