The Center for Education and Research in Information Assurance and Security (CERIAS)

The Center for Education and Research in
Information Assurance and Security (CERIAS)

Reports and Papers Archive


Browse All Papers »       Submit A Paper »

Building Verifiable Sensing Applications Through Temporal Logic Specification

Asad Awan, Ahmed Sameh, Suresh Jagannathan, Ananth Grama

Sensing is at the core of virtually every DDDAS application. Sensing applications typically involve distributed communication and coordination over large self-organized networks of heterogeneous devices with severe resource constraints. As a consequence, developers must explicitly deal with low-level details, making programming time-consuming and error-prone. To reduce this burden, current sensor network programming languages espouse a model that relies on packaged reusable components to implement relevant pieces of a distributed communication infrastructure. Unfortunately, programmers are often forced to understand the mechanisms used by these implementations in order to optimize resource utilization and performance, and to ensure application requirements are met. To address these issues, we propose a novel and high-level programming model that directly exposes control over sensor network behavior using temporal logic specifications, in conjunction with a set of system state abstractions to specify, generate, and automatically validate resource and communication behavior for sensor network applications. TLA+ (the temporal logic of actions) is used as the underlying specification language to express global state abstractions as well as user invariants. We develop a synthesis engine that utilizes TLC (a temporal logic model-checker) to generate detailed actions so that user-provided behavioral properties can be satisfied, guaranteeing program correctness. The synthesis engine generates specifications in TLA+, which are compiled down to sensor node primitive actions. We illustrate our model using a detailed experimental evaluation on our structural sensing and control testbed. The proposed framework is integrated into the COSMOS macroprogramming environment, which is extensively used to develop sensing and control applications at the Bowen Lab for Structural Engineering at Purdue.

Added 2008-05-15

Semi-Discrete Matrix Transforms (SDD) for Image and Video Compression

Sacha Zyto, Ananth Grama, Wojciech Szpankowski
Added 2008-05-14

Biclustering Gene-Feature Matrices for Statistically Significant Dense Patterns

Mehmet Koyuturk, Wojciech Szpankowski, Ananth Grama

Biclustering is an important problem that arises in diverse applications, including analysis of gene expression and drug interaction data. The problem can be formalized in various ways through different interpretation of data and associated optimization functions. We focus on the problem of finding unusually dense patterns in binary (0-1) matrices. This formulation is appropriate for analyzing experimental datasets that come from not only binary quantization of gene expression data, but also more comprehensive datasets such as gene-feature matrices that include functions of coded proteins and motifs in the coding sequence. We formalize the notion of an “unusually” dense submatrix to evaluate the interestingness of a pattern in terms of statistical significance based on the assumption of a uniform memoryless source. We then simplify it to assess statistical significance of discovered patterns. Using statistical significance as an objective function, we formulate the problem as one of finding significant dense submatrices of a large sparse matrix. Adopting a simple iterative heuristic along with randomized initialization techniques, we derive fast algorithms for discovering binary biclusters. We conduct experiments on a binary gene-feature matrix and a quantized breast tumor gene expression matrix. Our experimental results show that the proposed method quickly discovers all interesting patterns in these datasets.

Added 2008-05-14

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Mehmet Koyuturk, Ananth Grama, Naren Ramakrishnan

With the availability of large scale computing platforms and instrumentation for data gathering, increased emphasis is being placed on efficient techniques for analyzing large and extremely high-dimensional datasets. In this paper, we present a novel algebraic technique based on a variant of semi-discrete matrix decomposition (SDD), which is capable of compressing large discrete-valued datasets in an error bounded fashion. We show that this process of compression can be thought of as identifying dominant patterns in underlying data. We derive efficient algorithms for computing dominant patterns, quantify their performance analytically as well as experimentally, and identify applications of these algorithms in problems ranging from clustering to vector quantization.We demonstrate the superior characteristics of our algorithm in terms of (i) scalability to extremely high dimensions; (ii) bounded error; and (iii) hierarchical nature, which enables multiresolution analysis. Detailed experimental results are provided to support these claims.

Added 2008-05-14

Mining Scientific Data

N. Ramakrishnan, A. Grama
Added 2008-05-14

Locality in structured peer-to-peer networks

Ronaldo A. Ferreira, Suresh Jagannathan, Ananth Grama

Distributed hash tables (DHTs), used in a number of structured peer-to-peer (P2P) systems, provide efficient mechanisms for resource placement and location. A key distinguishing feature of current DHT systems, such as Chord, Pastry, CAN and Tapestry, is the way they handle locality in the underlying network. Topology-based node identifier assignment, proximity routing, and proximity neighbor selection are examples of heuristics used to minimize message delays in the underlying network. While these heuristics are sometimes effective, they all rely on a single global overlay that may install the key of a popular object at a node far from most of the nodes accessing it. Furthermore, a response to a lookup message does not contain any locality information about the nodes holding a copy of the object. We address these issues in Plethora, a novel two-level overlay P2P network. A local overlay in Plethora acts as a locality-aware cache for the global overlay, grouping nodes close together in the underlying network. Local overlays are constructed by exploiting the organization of the Internet into autonomous systems (ASs). We present a detailed experimental study that demonstrates performance gains in response time of up to 60% compared to a single global Pastry overlay. We also present efficient distributed algorithms for maintaining local overlays in the presence of node arrivals and departures.

Added 2008-05-14

2D-Pattern Matching Image and Video Compression Theory, Algorithms, and Experiments

Marc Alzina, Wojciech Szpankowski, Ananth Grama
Added 2008-05-14

Enhancing locality in structured peer-to-peer networks

R.A. Ferreira, S. Jagannathan, A. Grama

Distributed hash tables (DHTs), used in a number of structured peer-to-peer systems, provide efficient mechanisms for resource location. A key distinguishing feature of current DHT systems such as Chord, Pastry, and Tapestry is the way they handle locality in the underlying network. Topology-based node identifier assignment, proximity routing, and proximity neighbor selection are examples of heuristics used to minimize message delays in the underlying network. While these heuristics are sometimes effective, they rely on a single global overlay that may install the key of a popular object at a node far from most of the nodes accessing it. Furthermore, a response to a lookup does not contain any locality information about the nodes holding a copy of the object. We address these issues by proposing a novel two-level overlay peer-to-peer architecture. In our architecture, local overlays act as locality-aware caches for the global overlay, grouping nodes close together in the underlying network. Local overlays are constructed by exploiting the structure of the Internet as autonomous systems. We present detailed experimental results demonstrating the practicality of the system, and showing performance gains in response time of up to 60% compared to a single global overlay with state-of-the-art localization schemes. We also present efficient distributed algorithms for maintaining local overlays in the presence of node arrivals and departures.

Added 2008-05-14

Detecting Conserved Interaction Patterns in Biological Networks

Mehmet Koyuturk, Yohan Kim, Shankar Subramaniam, Wojciech Szpankowski, Ananth Grama

Molecular interaction data plays an important role in understanding biological processes at a modular level by providing a framework for understanding cellular organization, functional hierarchy, and evolutionary conservation. As the quality and quantity of network and interaction data increases rapidly, the problem of effectively analyzing this data becomes significant. Graph theoretic formalisms, commonly used for these analysis tasks, often lead to computationally hard problems due to their relation to subgraph isomorphism. This paper presents an innovative new algorithm, MULE, for detecting frequently occurring patterns and modules in biological networks. Using an innovative graph simplification technique based on ortholog contraction, which is ideally suited to biological networks, our algorithm renders these problems computationally tractable and scalable to large numbers of networks. We show, experimentally, that our algorithm can extract frequently occurring patterns in metabolic pathways and protein interaction networks from the KEGG, DIP, and BIND databases within seconds. When compared to existing approaches, our graph simplification technique can be viewed either as a pruning heuristic, or a closely related, but computationally simpler task. When used as a pruning heuristic, we show that our technique reduces effective graph sizes significantly, accelerating existing techniques by several orders of magnitude! Indeed, for most of the test cases, existing techniques could not even be applied without our pruning step. When used as a stand-alone analysis technique, MULE is shown to convey significant biological insights at near-interactive rates. The software, sample input graphs, and detailed results for comprehensive analysis of nine eukaryotic PPI networks are available at www.cs.purdue.edu/homes/koyuturk/mule.

Added 2008-05-14

Coverage preserving redundancy elimination in sensor networks

B. Carbunar, A. Grama, J. Vitek, O. Carbunar

In this paper, we study the problem of detecting and eliminating redundancy in a sensor network with a view to improving energy efficiency, while preserving the network’s coverage. We also examine the impact of redundancy elimination on the related problem of coverage-boundary detection. We reduce both problems to the computation of Voronoi diagrams, prove and achieve lower bounds on the solution of these problems, and present efficient distributed algorithms for computing and maintaining solutions in cases of sensor failures or insertion of new sensors. We prove the correctness and termination properties of our distributed algorithms, and analytically characterize the time complexity and the traffic generated by our algorithms. Our simulations show that the traffic generated per sensor insertion or removal (failure) experiences a dramatic decrease with an increase in sensor density, (up to 300% when the number of sensors deployed in the same 1000 /spl times/ 1000 m/sup 2/ area increases from 150 to 800), and with an increase in radio transmission range (up to 200% when the sensor’s transmission range increases from 70 m to 200 m).

Added 2008-05-14

Assessing Significance of Connectivity and Conservation in Protein Interaction Networks

Mehmet Koyuturk, Wojciech Szpankowski, Ananth Grama

Comparative analyses of cellular interaction networks enable understanding of the cell’s modular organization through identification of functional modules and complexes. These techniques often rely on topological features such as connectedness and density, based on the premise that functionally related proteins are likely to interact densely and that these interactions follow similar evolutionary trajectories. Significant recent work has focused on efficient algorithms for identification of such functional modules and their conservation. In spite of algorithmic advances, development of a comprehensive infrastructure for interaction databases is in relative infancy compared to corresponding sequence analysis tools. One critical, and as yet unresolved aspect of this infrastructure is a measure of the statistical significance of a match, or a dense subcomponent. In the absence of analytical measures, conventional methods rely on computationally expensive simulations based on ad-hoc models for quantifying significance. In this paper, we present techniques for analytically quantifying statistical significance of dense components in reference model graphs. We consider two reference models—a G(n, p) model in which each pair of nodes in a graph has an identical likelihood, p, of sharing an edge, and a two-level G(n, p) model, which accounts for high-degree hub nodes generally observed in interaction networks. Experiments performed on a rich collection of protein interaction (PPI) networks show that the proposed model provides a reliable means of evaluating statistical significance of dense patterns in these networks. We also adapt existing state-of-the-art network clustering algorithms by using our statistical significance measure as an optimization criterion. Comparison of the resulting module identification algorithm, SIDES, with existing methods shows that SIDES outperforms existing algorithms in terms of sensitivity and specificity of identified clusters with respect to available GO annotations.

Added 2008-05-14

Summary structures for frequency queries on large transaction sets

Dow-Yung Yang, A. Johar, A. Grama, W. Szpankowski

As large-scale databases become commonplace, there has been significant interest in mining them for commercial purposes. One of the basic tasks that underlies many of these mining operations is querying of transaction sets for frequencies of specified attribute values. The size of these databases makes it important to develop summary structures capable of high compression ratios as well as supporting fast frequency queries. The nature of the problem and its differences with respect to traditional text compression allows very high compression ratios. In this paper, we propose a binary trie-based summary structure for representing transaction sets. We demonstrate that this trie structure, when augmented with an appropriate set of horizontal pointers, can support frequency queries several orders of magnitude faster than raw transaction data. We improve the memory characteristics of our scheme by compressing the trie into a Patricia trie and demonstrate that this does not have a significant adverse effect on frequency query time. We further reduce the size of this trie by selectively pruning branches to compute a “dominant” trie that is capable of approximate frequency querying. The complement trie called the “deviant” trie is also useful in many data mining applications. Recompressing the “dominant” trie into a Patricia trie results in further compression of the trie. Finally, we demonstrate that our binary compressed trie structure has better memory (compression) characteristics compared to related schemes. We support our claims with experimental results on datasets from the IBM synthetic association data generator

Added 2008-05-14

Compression, Clustering, and Pattern Discovery in Very High-Dimensional Discrete-Attribute Data Sets

Mehmet Koyuturk, Ananth Grama, Naren Ramakrishnan

This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which frequently arise in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, provides a technique for reducing large data sets into a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, and performance in terms of capability to represent data in a compact form and discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that use of the compressed data for association rule mining provides excellent precision and recall values (above 90 percent) across a range of problem parameters while reducing the time required for analysis drastically. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool for both preprocessing data before applying computationally expensive algorithms and directly extracting correlated patterns.

Added 2008-05-14

Distributed Uniform Sampling in Unstructured Peer-to-Peer Networks

Asad Awan, Ronaldo A. Ferreira, Suresh Jagannathan, Ananth Grama

Uniform sampling in networks is at the core of a wide variety of randomized algorithms. Random sampling can be performed by modeling the system as an undirected graph with associated transition probabilities and defining a corresponding Markov chain (MC). A random walk of prescribed minimum length, performed on this graph, yields a stationary distribution, and the corresponding random sample. This sample, however, is not uniform when network nodes have a non-uniform degree distribution. This poses a significant practical challenge since typical large scale real-world unstructured networks tend to have non-uniform degree distributions, e.g., power-law degree distribution in unstructured peer-to-peer networks.

In this paper, we present a distributed algorithm that enables efficient uniform sampling in large unstructured non-uniform networks. Specifically, we prescribe necessary conditions for uniform sampling in such networks and present distributed algorithms that satisfy these requirements. We empirically evaluate the performance of our algorithm in comparison to known algorithms. The performance parameters include computational complexity, length of random walk, and uniformity of the sampling. Simulation results support our claims of performance improvements due to our algorithm.

Added 2008-05-14

Distributed and dynamic voronoi overlays for coverage detection and distributed hash tables in ad-hoc networks

B. Carbunar, A. Grama, J. Vitek

In this paper we study two important problems - coverage-boundary detection and implementing distributed hash tables in ad-hoc wireless networks. These problems frequently arise in service location and relocation in wireless networks. For the centralized coverage-boundary problem we prove a /spl Omega/(n log n) lower bound for n devices. We show that both problems can be effectively reduced to the problem of computing Voronoi overlays, and maintaining these overlays dynamically. Since the computation of Voronoi diagrams requires O(n log n) time, our solution is optimal for the computation of the coverage-boundary. We present efficient distributed algorithms for computing and dynamically maintaining Voronoi overlays, and prove the stability properties for the latter - i.e., if the nodes stop moving, the overlay stabilizes to the correct Voronoi overlay. Finally, we present experimental results in the context of the two selected applications, which validate the performance of our distributed and dynamic algorithms.

Added 2008-05-14