Mining Console Logs for Large-Scale System Problem Detection


Today's large-scale Internet services run on large server clusters in datacenters and cloud computing environments. The scale and complexity of such systems make it very difficult to monitor, debug and maintain the services. On the other hand, modern computers have more and more computing cores, and multiple cores can be allocated to monitoring the system itself; moreover, cloud computing makes it easy to use massively parallel infrastructure to process large-scale data for delivering timely monitoring and diagnosis results.

There is one source of information that is built into almost every piece of software that provides detailed information that reflects the original developers' ideas about noteworthy or unusual events, but is typically ignored: the humble console log. In this project, we take advantage of the abundant computing cycles to mine console logs for system monitoring, problem detection and diagnosis. Instead of asking users to search, we provide tools to automatically find "interesting" log messages. Since unusual log messages often indicate the source of the problem, it is natural to formalize log analysis as an anomaly detection problem in machine learning.

We studied logs and source code of many popular software systems used in Internet services, and observed that a typical console log is much more structured than it appears: the definition of its ``schema'' is implicit in the log printing statements, which can be recovered from program source code. This observation is key to our log parsing approach, which yields detailed and accurate message structure recovery, feature construction and problem detection.


Figure 1. From raw logs to structured logs: Using source code information to parse console logs

Our novel approach for mining console logs integrates source code analysis with text mining to extract structured information from textual console logs. This makes it very easy and flexible for system operators to create a variety of (application-specific) features, so that powerful machine learning methods can be applied to perform high quality pattern mining and accurate problem detection for the system. Our research yielded the first automated log mining process that can not only detect a large portion of runtime anomalies, but also provide easy-to-understand explanations to system operators.


Figure 2. Our four-step methodology that allows source code analysis, information retrieval and machine learning techniques to be applied to textual console logs to find system operational problems without any manual input

Publications


  • Detecting Large-Scale System Problems by Mining Console Logs. Wei Xu, Ling Huang, Armando Fox, David Patterson and Michael I. Jordan. To appear in Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP'09), Big Sky, October 2009.
  • Mining Console Logs for Large-Scale System Problem Detection. Wei Xu, Ling Huang, Armando Fox, David Patterson and Michael I. Jordan. In Proceedings of the Third Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML) , San Diego, December 2008. [pdf]

Researchers


Collaborators