Molecular interaction data plays an important role in understanding biological processes at a modular level by providing a framework for studying cellular organization, functional hierarchy, and evolutionary conservation. As the quality and quantity of network and interaction data increase rapidly, effectively analyzing this data becomes a significant problem. Graph-theoretic formalisms, commonly used for these analysis tasks, often lead to computationally hard problems due to their relation to subgraph isomorphism. This paper presents a new algorithm, MULE, for detecting frequently occurring patterns and modules in biological networks. Using a graph simplification technique based on ortholog contraction, which is ideally suited to biological networks, our algorithm renders these problems computationally tractable and scalable to large numbers of networks. We show, experimentally, that our algorithm can extract frequently occurring patterns in metabolic pathways and protein interaction networks from the KEGG, DIP, and BIND databases within seconds. Compared to existing approaches, our graph simplification technique can be viewed either as a pruning heuristic or as a closely related but computationally simpler analysis task. When used as a pruning heuristic, it reduces effective graph sizes significantly, accelerating existing techniques by several orders of magnitude; indeed, for most of the test cases, existing techniques could not even be applied without our pruning step. When used as a stand-alone analysis technique, MULE conveys significant biological insights at near-interactive rates. The software, sample input graphs, and detailed results for a comprehensive analysis of nine eukaryotic PPI networks are available at www.cs.purdue.edu/homes/koyuturk/mule.
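A minimal sketch of the ortholog-contraction idea, assuming networkx and a hypothetical precomputed protein-to-ortholog-group assignment (e.g., from COG or KO annotations); MULE's actual mining of frequent subgraphs on the contracted graphs is not reproduced here.

```python
import networkx as nx

def contract_orthologs(ppi: nx.Graph, ortholog_group: dict) -> nx.Graph:
    """Collapse each ortholog group into a single node.

    ppi            : protein-protein interaction network (nodes are proteins)
    ortholog_group : maps each protein to its ortholog-group identifier
                     (a hypothetical precomputed assignment)
    """
    contracted = nx.Graph()
    for u, v in ppi.edges():
        gu, gv = ortholog_group[u], ortholog_group[v]
        if gu != gv:                      # drop within-group self-loops
            contracted.add_edge(gu, gv)
    return contracted

# Toy usage: two proteins in group "G1" interacting with one protein in "G2"
g = nx.Graph([("p1", "p3"), ("p2", "p3")])
groups = {"p1": "G1", "p2": "G1", "p3": "G2"}
print(contract_orthologs(g, groups).edges())   # [('G1', 'G2')]
```

Because interactions between members of the same pair of ortholog groups collapse onto a single edge, frequent-pattern mining across species then operates on graphs that are much smaller than the original networks.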
In this paper, we study the problem of detecting and eliminating redundancy in a sensor network with a view to improving energy efficiency while preserving the network’s coverage. We also examine the impact of redundancy elimination on the related problem of coverage-boundary detection. We reduce both problems to the computation of Voronoi diagrams, prove and achieve lower bounds on the solution of these problems, and present efficient distributed algorithms for computing and maintaining solutions in cases of sensor failures or insertion of new sensors. We prove the correctness and termination properties of our distributed algorithms, and analytically characterize the time complexity and the traffic generated by our algorithms. Our simulations show that the traffic generated per sensor insertion or removal (failure) decreases dramatically as sensor density increases (by up to 300% when the number of sensors deployed in the same 1000 m × 1000 m area increases from 150 to 800) and as radio transmission range increases (by up to 200% when the sensor’s transmission range increases from 70 m to 200 m).
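One commonly used Voronoi-based redundancy test, shown below as a simplified centralized sketch (not necessarily the paper's exact distributed criterion), marks a sensor as removable when its Voronoi cell appears to be covered by the remaining sensors; the example assumes scipy and a common sensing range, and uses a vertex-only check as a heuristic.

```python
import numpy as np
from scipy.spatial import Voronoi

def redundant_sensors(points, sensing_range):
    """Heuristically flag sensors whose Voronoi cell is covered by the others.

    A sensor with a bounded Voronoi cell is marked redundant when every
    vertex of its cell lies within sensing_range of some *other* sensor;
    this vertex-only check is a simplification, and unbounded cells (on
    the hull of the deployment) are conservatively kept.
    """
    points = np.asarray(points, dtype=float)
    vor = Voronoi(points)
    redundant = []
    for i, region_idx in enumerate(vor.point_region):
        region = vor.regions[region_idx]
        if not region or -1 in region:          # unbounded cell: keep sensor
            continue
        others = np.delete(points, i, axis=0)
        covered = all(
            np.min(np.linalg.norm(others - vor.vertices[v], axis=1)) <= sensing_range
            for v in region
        )
        if covered:
            redundant.append(i)
    return redundant

pts = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
print(redundant_sensors(pts, sensing_range=8))  # the center sensor -> [4]
```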
Comparative analyses of cellular interaction networks enable understanding of the cell’s modular organization through identification of functional modules and complexes. These techniques often rely on topological features such as connectedness and density, based on the premise that functionally related proteins are likely to interact densely and that these interactions follow similar evolutionary trajectories. Significant recent work has focused on efficient algorithms for identification of such functional modules and their conservation. In spite of algorithmic advances, development of a comprehensive infrastructure for interaction databases is in relative infancy compared to corresponding sequence analysis tools. One critical, and as yet unresolved, aspect of this infrastructure is a measure of the statistical significance of a match or a dense subcomponent. In the absence of analytical measures, conventional methods rely on computationally expensive simulations based on ad hoc models for quantifying significance. In this paper, we present techniques for analytically quantifying the statistical significance of dense components in reference model graphs. We consider two reference models: a G(n, p) model in which each pair of nodes in a graph has an identical likelihood, p, of sharing an edge, and a two-level G(n, p) model, which accounts for the high-degree hub nodes generally observed in interaction networks. Experiments performed on a rich collection of protein-protein interaction (PPI) networks show that the proposed model provides a reliable means of evaluating the statistical significance of dense patterns in these networks. We also adapt existing state-of-the-art network clustering algorithms by using our statistical significance measure as an optimization criterion. Comparison of the resulting module identification algorithm, SIDES, with existing methods shows that SIDES outperforms existing algorithms in terms of sensitivity and specificity of identified clusters with respect to available GO annotations.
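As an illustration of the G(n, p) reference model, the tail of a binomial distribution gives the probability that a fixed set of r nodes induces at least m edges; the paper's analytical machinery (including the two-level model and corrections over all candidate node sets) goes further, but the sketch below conveys the basic calculation.

```python
from math import comb

def dense_subgraph_pvalue(r: int, m: int, p: float) -> float:
    """P[a fixed set of r nodes induces >= m edges] under G(n, p).

    Under G(n, p) the number of edges among r fixed nodes is
    Binomial(C(r, 2), p); this returns the upper tail of that distribution.
    A Bonferroni-style correction over the C(n, r) candidate node sets
    then gives a (loose) significance bound for an observed dense component.
    """
    pairs = comb(r, 2)
    return sum(comb(pairs, k) * p**k * (1 - p)**(pairs - k)
               for k in range(m, pairs + 1))

# Example: 8 proteins with 20 induced interactions in a sparse network (p = 0.05)
print(dense_subgraph_pvalue(8, 20, 0.05))   # vanishingly small -> significant
```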
As large-scale databases become commonplace, there has been significant interest in mining them for commercial purposes. One of the basic tasks that underlies many of these mining operations is querying of transaction sets for frequencies of specified attribute values. The size of these databases makes it important to develop summary structures capable of high compression ratios as well as supporting fast frequency queries. The nature of the problem and its differences with respect to traditional text compression allow very high compression ratios. In this paper, we propose a binary trie-based summary structure for representing transaction sets. We demonstrate that this trie structure, when augmented with an appropriate set of horizontal pointers, can support frequency queries several orders of magnitude faster than raw transaction data. We improve the memory characteristics of our scheme by compressing the trie into a Patricia trie and demonstrate that this does not have a significant adverse effect on frequency query time. We further reduce the size of this trie by selectively pruning branches to compute a "dominant" trie that is capable of approximate frequency querying. The complement trie, called the "deviant" trie, is also useful in many data mining applications. Recompressing the "dominant" trie into a Patricia trie results in further compression. Finally, we demonstrate that our binary compressed trie structure has better memory (compression) characteristics compared to related schemes. We support our claims with experimental results on datasets from the IBM synthetic association data generator.
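A plain binary trie with per-node transaction counts already supports superset-frequency queries; the sketch below illustrates that counting idea only, omitting the horizontal pointers, Patricia compression, and dominant/deviant pruning described above.

```python
class TrieNode:
    """Node of a binary trie over fixed-length transaction bit vectors."""
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}   # bit (0 or 1) -> TrieNode
        self.count = 0       # number of transactions reaching this node

def insert(root, bits):
    node = root
    node.count += 1
    for b in bits:
        node = node.children.setdefault(b, TrieNode())
        node.count += 1

def frequency(node, pattern, depth=0):
    """Count transactions containing all attributes required by `pattern`.

    `pattern` is a set of attribute positions that must be 1; the other
    positions are unconstrained, so both branches are explored for them.
    """
    if node is None:
        return 0
    if not pattern or depth > max(pattern):
        return node.count
    if depth in pattern:
        return frequency(node.children.get(1), pattern, depth + 1)
    return (frequency(node.children.get(0), pattern, depth + 1) +
            frequency(node.children.get(1), pattern, depth + 1))

# Three transactions over four binary attributes
root = TrieNode()
for t in [(1, 0, 1, 1), (1, 1, 1, 0), (0, 0, 1, 1)]:
    insert(root, t)
print(frequency(root, {0, 2}))   # transactions containing attributes 0 and 2 -> 2
```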
This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which frequently arise in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, provides a technique for reducing large data sets into a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, and its ability to represent data compactly while supporting discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that use of the compressed data for association rule mining provides excellent precision and recall values (above 90 percent) across a range of problem parameters while drastically reducing the time required for analysis. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool both for preprocessing data before applying computationally expensive algorithms and for directly extracting correlated patterns.
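At the core of PROXIMUS-style pattern discovery is the approximation of a 0/1 matrix by an outer product of two binary vectors. The following alternating-update sketch is a simplified illustration of one such rank-one approximation (not the full recursive PROXIMUS procedure), using numpy.

```python
import numpy as np

def rank_one_binary(A, iters=20):
    """Greedy rank-one approximation A ~ outer(x, y) with binary x and y.

    Alternately fixes one factor and updates the other so as not to increase
    the Hamming error: with y fixed, row i joins the pattern when covering it
    is at least as good as not covering it (2 * a_i.y >= |y|), and symmetrically
    for columns. This mirrors, in simplified form, the alternating heuristic
    underlying recursive pattern-discovery schemes such as PROXIMUS.
    """
    m, n = A.shape
    y = np.ones(n, dtype=int)                 # start with the all-ones pattern
    for _ in range(iters):
        x = (A @ y >= y.sum() / 2).astype(int)     # update row memberships
        y = (A.T @ x >= x.sum() / 2).astype(int)   # update column pattern
    return x, y

A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
x, y = rank_one_binary(A)
print(x, y)   # isolates the first row group: x = [1 1 0], y = [1 1 0 0]
```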
Uniform sampling in networks is at the core of a wide variety of randomized algorithms. Random sampling can be performed by modeling the system as an undirected graph with associated transition probabilities and defining a corresponding Markov chain (MC). A random walk of prescribed minimum length, performed on this graph, yields a stationary distribution and a corresponding random sample. This sample, however, is not uniform when network nodes have a non-uniform degree distribution. This poses a significant practical challenge, since typical large-scale real-world unstructured networks tend to have non-uniform degree distributions, e.g., power-law degree distributions in unstructured peer-to-peer networks.
In this paper, we present a distributed algorithm that enables efficient uniform sampling in large unstructured non-uniform networks. Specifically, we prescribe necessary conditions for uniform sampling in such networks and present distributed algorithms that satisfy these requirements. We empirically evaluate the performance of our algorithm in comparison to known algorithms. The performance parameters include computational complexity, length of random walk, and uniformity of the sampling. Simulation results support our claims of performance improvements due to our algorithm.
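One standard way to satisfy the uniformity requirement on a graph with non-uniform degrees is a Metropolis-Hastings correction of the walk's transition probabilities; the paper's prescribed conditions and distributed protocol may differ in detail, but the sketch below illustrates the degree-correction idea.

```python
import random

def metropolis_walk(adj, start, steps):
    """Random walk whose stationary distribution is uniform over nodes.

    A transition from u to a uniformly chosen neighbor v is accepted with
    probability min(1, deg(u) / deg(v)); otherwise the walk stays at u.
    This Metropolis-Hastings correction removes the degree bias of the
    simple random walk, under which high-degree nodes are oversampled.
    """
    u = start
    for _ in range(steps):
        v = random.choice(adj[u])
        if random.random() < min(1.0, len(adj[u]) / len(adj[v])):
            u = v
    return u

# Star graph: node 0 is a hub with degree 4, leaves have degree 1
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
samples = [metropolis_walk(adj, 0, 200) for _ in range(2000)]
print({n: samples.count(n) for n in adj})   # counts should be roughly equal
```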
In this paper, we study two important problems: coverage-boundary detection and the implementation of distributed hash tables in ad hoc wireless networks. These problems frequently arise in service location and relocation in wireless networks. For the centralized coverage-boundary problem, we prove an Ω(n log n) lower bound for n devices. We show that both problems can be effectively reduced to the problem of computing Voronoi overlays and maintaining these overlays dynamically. Since the computation of Voronoi diagrams requires O(n log n) time, our solution is optimal for the computation of the coverage-boundary. We present efficient distributed algorithms for computing and dynamically maintaining Voronoi overlays, and prove stability properties for the latter; that is, if the nodes stop moving, the overlay stabilizes to the correct Voronoi overlay. Finally, we present experimental results in the context of the two selected applications, which validate the performance of our distributed and dynamic algorithms.
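For intuition, a sensor whose bounded Voronoi cell lies entirely inside its own sensing disk cannot be on the coverage boundary, since every point closer to it than to any other sensor is covered by it. The centralized sketch below applies this test using scipy; the paper's contribution is the distributed computation and dynamic maintenance of such overlays, which this example does not attempt.

```python
import numpy as np
from scipy.spatial import Voronoi

def coverage_boundary(points, sensing_range):
    """Identify sensors on the coverage boundary via their Voronoi cells.

    A sensor is treated as interior when its (bounded) Voronoi cell lies
    entirely inside its own sensing disk; otherwise part of its cell may
    be uncovered, so it is reported as a coverage-boundary sensor.
    Unbounded cells (on the hull of the deployment) are always boundary.
    """
    points = np.asarray(points, dtype=float)
    vor = Voronoi(points)
    boundary = []
    for i, region_idx in enumerate(vor.point_region):
        region = vor.regions[region_idx]
        if not region or -1 in region:        # unbounded cell -> boundary
            boundary.append(i)
            continue
        dists = np.linalg.norm(vor.vertices[region] - points[i], axis=1)
        if dists.max() > sensing_range:       # cell extends past the disk
            boundary.append(i)
    return boundary

pts = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
print(coverage_boundary(pts, sensing_range=8))   # corner sensors: [0, 1, 2, 3]
```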
Motivated by the needs and success of projects such as SETI@home and genome@home, we propose an architecture for a sustainable large-scale peer-to-peer environment for distributed cycle sharing among Internet hosts. Such networks are characterized by highly dynamic state due to high arrival and departure rates. This makes it difficult to build and maintain structured networks and to use state-based resource allocation techniques. We build our system to work in an environment similar to current file-sharing networks such as Gnutella and Freenet. In doing so, we are able to leverage vast network resources while providing resilience to random failures, low network overhead, and an open architecture for resource brokering. This paper describes the underlying analytical and algorithmic substrates, based on randomization, for job distribution, replication, monitoring, aggregation, and oblivious resource sharing and communication among participating hosts. We support our claims of robustness and scalability analytically, with high probabilistic guarantees. Our algorithms do not introduce any state dependencies, and hence are resilient to dynamic node arrivals, departures, and failures. We support all analytical claims with a detailed simulation-based evaluation of our distributed framework.
Simulation and control are two critical elements of Dynamic Data-Driven Application Systems (DDDAS). Simulation of dynamical systems such as weather phenomena, when augmented with real-time data, can yield precise forecasts. In other applications such as structural control, the presence of real-time data relating to system state can enable robust active control. In each case, there is an ever increasing need for improved accuracy, which leads to models of higher complexity. The basic motivation for system approximation is the need, in many instances, for a simplified model of a dynamical system that captures the main features of the original complex model. This need arises from limited computational capability, accuracy of measured data, and storage capacity. The simplified model may then be used in place of the original complex model, either for simulation and prediction or for active control. As sensor networks and embedded processors proliferate throughout our environment, technologies for such approximations and real-time control emerge as the next major technical challenge. This paper outlines the state of the art and outstanding challenges in the development of efficient and robust methods for producing reduced-order models of large state-space systems.
Node clustering in sensor networks increases scalability, robustness, and energy efficiency. In hostile environments, unexpected failures or attacks on cluster heads (through which communication takes place) may partition the network or degrade application performance. We propose REED (Robust Energy-Efficient Distributed clustering), a low-complexity protocol that clusters sensors deployed in hostile environments in an interleaved manner. Our primary objective is to construct a k-fault-tolerant (i.e., k-connected) clustered network, where k is a constant determined by the application. Fault tolerance is achieved by selecting k independent sets of cluster heads (i.e., cluster head overlays) on top of the physical network, so that each node can quickly switch to other cluster heads in case of failures. The independent cluster head overlays also provide multiple vertex-disjoint routing paths for load balancing and security. Network lifetime is prolonged by selecting cluster heads with high residual energy and low communication cost, and by periodically re-clustering the network. We prove that REED asymptotically achieves k-connectivity if certain conditions on node density are met. We also discuss inter-cluster routing and MAC layer considerations, and investigate REED's clustering properties via extensive simulations.
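The k-overlay structure can be illustrated with a simple centralized greedy procedure: build k disjoint sets of heads, each covering all nodes within cluster range, preferring nodes with high residual energy. This is only a sketch of the structure REED targets; REED itself constructs the overlays distributedly, and the node-density conditions mentioned above determine whether k overlays exist at all.

```python
def build_overlays(nodes, in_range, energy, k):
    """Construct up to k disjoint cluster-head overlays (simplified sketch).

    nodes    : iterable of node ids
    in_range : dict node -> set of nodes within cluster range (incl. itself)
    energy   : dict node -> residual energy
    k        : desired number of independent overlays (fault-tolerance parameter)

    Each overlay is a set of heads such that every node has a head within
    cluster range; heads are chosen greedily by residual energy, and a node
    serves as head in at most one overlay.
    """
    used = set()
    overlays = []
    for _ in range(k):
        heads, uncovered = set(), set(nodes)
        for cand in sorted(nodes, key=lambda n: -energy[n]):
            if cand in used or not uncovered:
                continue
            if in_range[cand] & uncovered:       # candidate covers someone new
                heads.add(cand)
                used.add(cand)
                uncovered -= in_range[cand]
        if uncovered:                            # density too low for another overlay
            break
        overlays.append(heads)
    return overlays

# Toy 4-node line network where every node hears only its immediate neighbors
in_range = {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3, 4}, 4: {3, 4}}
energy = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6}
print(build_overlays([1, 2, 3, 4], in_range, energy, k=2))
# [{1, 2, 3}] -- this sparse network cannot support a second overlay
```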
We study the sleep/wake scheduling problem in the context of clustered sensor networks. We conclude that the design of any sleep/wake scheduling algorithm must take into account the impact of synchronization error. Our work consists of two parts. In the first part, we show that there is an inherent tradeoff between energy consumption and message delivery performance (defined as the message capture probability in this work). We formulate an optimization problem to minimize the expected energy consumption, subject to the constraint that the message capture probability be no less than a threshold; in this part, we assume the threshold is given. The resulting problem is non-convex; however, by exploiting its unique structure, we transform it into a convex equivalent and solve it using an efficient search method. In the second part, we remove the assumption that the capture probability threshold is given, and study how to choose it to meet the quality-of-service (QoS) requirement of the application. We observe that in many sensor network applications, a group of sensors collaborates to perform common tasks. Therefore, the QoS is usually not decided by the performance of any individual node, but by the collective performance of all the related nodes. To achieve the collective performance with minimum energy consumption, intuitively we should provide differentiated services for the nodes and favor the more important ones. We thus formulate an optimization problem that aims to set the capture probability threshold for messages from each individual node such that the expected energy consumption is minimized while the collective performance is guaranteed. The problem turns out to be non-convex and hard to solve exactly; therefore, we use approximation techniques to obtain a suboptimal solution that approximates the optimum. Simulations show that our approximate solution significantly outperforms a scheme without differentiated treatment of the nodes.
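Schematically, the first-part problem has the following form (generic symbols, not the paper's exact notation), where s ranges over sleep/wake schedules, E(s) is the energy consumed under schedule s, and p_th is the given capture-probability threshold:

```latex
\begin{aligned}
\min_{s}\quad & \mathbb{E}\left[\, E(s) \,\right] \\
\text{s.t.}\quad & P_{\mathrm{capture}}(s) \;\ge\; p_{\mathrm{th}} .
\end{aligned}
```

The second part then treats the per-node thresholds themselves as decision variables, minimizing total expected energy subject to a constraint on the collective capture performance of the group.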
Mobile communications technologies have reached significant penetration today, and the development of technologies and applications is still emerging. The Internet has become the major core network around which several wireless access networks are interconnected. These access networks are not just single wireless links, but are becoming diverse and complex. For example, sensors are connected via sensor networks, cars connect via mobile ad hoc networks, and users in areas without GSM/UMTS coverage might only be reached via satellites. Moreover, the requirements on mobile communications are increasing. Security is a major concern in such networks, and small wireless/mobile devices need to save as much power as possible to ensure long lifetimes. The 4th International Conference on Wired/Wireless Internet Communications (WWIC 2006) took place at the University of Bern (Switzerland) from May 10 to 12, 2006. WWIC 2006 addressed relevant research issues such as wireless networks, UMTS and OFDM, mobile ad hoc networks, power saving and sensor networks, voice and video over wireless networks, mobility, transport protocol issues, as well as signalling, charging, and security. The goal of the conference was to present high-quality results in the field. The international conference program committee selected 29 papers out of 142 submissions for conference presentation. Finally, five of the papers presented at the conference have been selected for this special issue. The selected papers have been improved based on the conference reviews and extended in order to present the latest research results in more detail.
Fair bandwidth allocation is critical in wireless communication networks, since the wireless channel is often shared by a number of stations in the same neighborhood. With fair scheduling, bandwidth can be shared by competing flows in proportion to their assigned weights. In this paper, we propose a credit-based distributed protocol for fair allocation of bandwidth in IEEE 802.11 wireless LANs. Our protocol is derived from the distributed coordination function in the IEEE 802.11 medium access control (MAC) protocol. Analytical and simulation results demonstrate that the protocol achieves the desired bandwidth allocations. An important feature of our protocol is its backward compatibility, which allows legacy IEEE 802.11 stations to coexist with stations adopting the new MAC protocol.
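The credit-based idea can be illustrated independently of 802.11: stations earn credit in proportion to their weights and spend it when they transmit, so long-run airtime tracks the weight ratios. The simulation below is a generic sketch of that mechanism, not the paper's DCF-compatible MAC protocol.

```python
def simulate_credit_sharing(weights, num_slots, tx_time=1.0):
    """Generic credit-based weighted sharing (illustration only).

    Each station earns credit in proportion to its weight every slot, the
    station with the largest credit transmits, and it pays for the channel
    time it used. Over many slots, airtime converges to the weight ratios.
    The paper's protocol realizes a similar idea on top of 802.11 DCF by
    adjusting contention behavior; that mechanism is not reproduced here.
    """
    credits = {s: 0.0 for s in weights}
    airtime = {s: 0.0 for s in weights}
    total_w = sum(weights.values())
    for _ in range(num_slots):
        for s, w in weights.items():             # earn credit pro rata
            credits[s] += tx_time * w / total_w
        tx = max(credits, key=credits.get)       # highest credit transmits
        credits[tx] -= tx_time
        airtime[tx] += tx_time
    return airtime

print(simulate_credit_sharing({"A": 2, "B": 1, "C": 1}, num_slots=1000))
# roughly {'A': 500.0, 'B': 250.0, 'C': 250.0}
```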