Predicting Failures in Distributed Cloud-Based Systems
Project Members
Sebastian Moreno, Andrew Newell, Rahul Potharaju, Cristina Nita-Rotaru, and Jennifer Neville
Abstract
Distributed cloud-based systems consist of a set of geographically distributed routers organized in an overlay network, and they promise to deliver high-quality networking services to their customers (e.g., packet delivery within 200 ms to or from any client). To meet this requirement, the overlay network needs to be functional 24/7. Even a minor failure, such as a routing path that goes down for a couple of seconds, could degrade the performance of the system. However, to date there are few methods to predict or adaptively prevent failures in these distributed systems. In this poster, we present an analysis of 2 TB of distributed system log files, conducted to identify example "failures" (i.e., signatures) that can be used to develop automated prediction methods via machine learning. Although the majority of the log information reflects normal behavior, we were able to characterize an important "outage" type of event in which a significant number of customers jointly drop or change configurations in the network. Based on this pattern definition, we identified several new examples of outage problems in the data. Using this new set of training examples, we are now working on automated methods to discriminate among different types of failures and to predict possible outages ahead of time, before they lead to large-scale failures.
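As a rough illustration of the "outage" pattern described above, the following Python sketch flags time windows in which several distinct customers drop or change configuration. It is not the project's method: the event tuple layout, the event type names "drop" and "config_change", the 60-second window, and the three-customer threshold are all assumptions chosen for the example.

```python
from collections import defaultdict


def find_outage_windows(events, window_seconds=60, min_customers=3):
    """Return (window_start, customer_count) pairs for candidate outages.

    `events` is an iterable of (timestamp_seconds, customer_id, event_type)
    tuples. This layout and the event type names below are hypothetical,
    not taken from the logs analyzed in the project.
    """
    relevant = {"drop", "config_change"}  # assumed event type names
    customers_per_window = defaultdict(set)

    for timestamp, customer_id, event_type in events:
        if event_type not in relevant:
            continue
        # Assign the event to a fixed, non-overlapping time window.
        window_start = int(timestamp // window_seconds) * window_seconds
        customers_per_window[window_start].add(customer_id)

    # Keep only windows where enough distinct customers were affected.
    return sorted(
        (start, len(customers))
        for start, customers in customers_per_window.items()
        if len(customers) >= min_customers
    )


sample_events = [
    (0, "a", "drop"),
    (10, "b", "config_change"),
    (20, "c", "drop"),
    (30, "a", "drop"),
    (70, "a", "drop"),
    (80, "d", "heartbeat"),
]
candidate_windows = find_outage_windows(sample_events)
```

Counting distinct customers rather than raw events keeps one noisy customer from triggering a match, which is in keeping with the "jointly" part of the pattern. Windows flagged this way could serve as candidate training examples, subject to manual review.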