Anomaly Detection Engine for Linux Logs (ADE)
Why run ADE
To answer the question: are your systems behaving badly?
Many everyday activities can introduce system anomalies and initiate failures in complex, integrated data centers; these activities include:
- Increased volume of business activity
- Application modifications to comply with changing regulatory requirements
- Standard operational changes, such as adding or upgrading hardware or software, or changing network configurations.
When problems occur, system administrators can examine the message logs to help solve the problem, but the sheer volume of messages can make this a daunting task.
ADE helps you look through the massive volumes of log data to find the portions of the log to focus on for further detailed review.
Running ADE
Running ADE to detect anomalies in Linux logs requires the following manual steps to understand the problem before ADE is run:
- Determine whether you want to find anomalies in a single time period for root cause analysis, or whether you want anomaly detection continuously available to help support staff manage Linux systems
- Determine which Linux systems you want anomaly detection for
- Understand the workload run on those Linux systems
Basic approach
The basic approach is:
- Pick data to prime the model
- Determine how to group systems into “model groups”
- Prime the database with Linux logs
- Create a model of normal behavior from a set of Linux logs
- Analyze additional logs to detect anomalies
- Examine the results written to the file system using your favorite web browser
Pick data to prime the model
For ADE to create a model that generates useful analytic results, the logs need to contain a sufficient number of unique message IDs (message keys). Because ADE uses unsupervised learning, it does not require the user to label either messages or intervals; it does, however, require that the systems being analyzed are “relatively” stable.
Almost any Linux system that is used to support production will be stable enough for ADE to find anomalies.
Determining how to group systems into "model groups"
ADE supports grouping similar systems together when building the model. Here is an example of eleven servers and one way you can assign them to model groups:
- model group 1
  - primary mail server
  - secondary mail server that handles traffic overflowing from the primary server
  - deployment mail server (the Linux image on which a new version of the mail server code is deployed)
- model group 2
  - external web server 1
  - external web server 2
  - external web server 3
- model group 3
  - internal web server 1
  - internal web server 2
  - internal web server 3
- model group 4
  - database server 1
- model group 5
  - database server 2
Examine a time period to determine whether unusual behavior occurred during it - root cause analysis
Prime the database with Linux logs
To prime the ADE database:
- Delete any information left in the database: `controldb delete`
- Identify when the potential anomaly occurred
- Load Linux logs from the time period immediately before the anomaly: `upload -f <filename or directory name>`
- Check whether sufficient data has been loaded: `verify <model group name>`
- If there is sufficient information, proceed to training; otherwise:
  - If there are additional logs from before the set already loaded, load those additional Linux logs
  - If there aren't any more logs available, reduce the number of model groups. Using the example above, if the problem is with model group 2, consider combining model groups 2 and 3
  - If, after loading additional logs and simplifying the model group structure, `verify` still indicates there is insufficient information, try training anyway, but remember that the results may be questionable.
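Put together as a shell session, the priming steps above might look like the following sketch. It assumes ADE's command-line scripts (controldb, upload, verify) are on the PATH; the log path and model group name are placeholders, not ADE defaults.

```shell
# Sketch of the root-cause priming sequence (path and group name are placeholders).
controldb delete                      # clear any information left in the database
upload -f /var/log/messages-archive   # load logs from just before the anomaly
verify model_group_2                  # check whether enough data was loaded
```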
Create a model of normal behavior from a set of Linux logs
To create a model of the normal behavior of the Linux systems:
- Issue the train command for each model group: `train <model group name>`
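For example, with the five model groups from the earlier grouping, training every group might look like this sketch; it assumes the train script is on the PATH and that these group names match your configuration.

```shell
# Train each model group in turn; the group names are illustrative.
for group in model_group_1 model_group_2 model_group_3 model_group_4 model_group_5; do
  train "$group"
done
```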
Analyze additional logs to detect anomalies
To analyze the time period for anomalies:
- Issue the analyze command for the time periods of interest: `analyze -f <filename or directory name>`
Examine the results written to the file system using your favorite web browser
To examine the results for a period, point your web browser at the index.xml file for the time period and system of interest. To select a specific interval for further review:
- Click on the box in the graph
- Click on link XML for the interval
To examine a specific ten-minute interval, point your web browser at the interval_nnn.xml file for the time period, interval, and system of interest. The XSLT provided will display a summary of the interval.
The following samples illustrate how the analysis output is written to files using the defaults specified in setup.props:
- directory system_name 1
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
- directory system_name 2
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
Continuously process Linux logs so anomaly information is always available
To set up ADE to provide continuous analysis results, the following steps need to be scheduled to run automatically:
- Creating a model, which should be rerun after a certain period of time has elapsed
- Creating analysis results, which should be run either:
  - when the logs are rotated, or
  - after a certain period of time
- Removing results that are no longer needed
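One way to schedule these steps is with cron. The entries below are an illustrative sketch; the schedules, group name, log path, and output directory are assumptions rather than ADE defaults.

```shell
# Illustrative crontab entries for continuous ADE operation.
30 0 * * * analyze -f /var/log/messages.1   # analyze after nightly log rotation
0  2 1 * * train model_group_1              # retrain monthly (repeat per model group)
0  3 * * 0 find /var/ade/output -mindepth 2 -maxdepth 2 -type d -mtime +365 -exec rm -rf {} +   # weekly cleanup of year-old results
```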
If there are duplicate time periods in the logs, ADE will overlay the existing time period in the database with the time period being added by either upload or analyze.
Prime the database with Linux logs
To prime the ADE database:
- Load Linux logs from the time period immediately before today: `upload -f <filename or directory name>`
- Check whether sufficient data has been loaded: `verify <model group name>`
- If there is sufficient data, proceed to training; otherwise:
  - If there are additional logs from before the set already loaded, load those additional Linux logs
  - If there aren't any more logs available, reduce the number of model groups; using the example above, if the problem is with model group 2, consider combining model groups 2 and 3
  - If, after loading additional logs and simplifying the model group structure, `verify` still indicates there is insufficient information, wait until the Linux systems generate more logs before trying training.
Create a model of normal behavior from a set of Linux logs
To create a model of the normal behavior of the Linux systems:
- Issue the train command for each model group: `train <model group name>`
- The suggested length of the training interval (time from the start date to the end date) is 120 days
- Redo training every 30 days, or after a substantial change to the Linux systems, such as deploying additional software on them
Analyze additional logs to detect anomalies
Routinely analyze the available logs so anomaly information is available when needed:
- Issue the analyze command for the next available logs: `analyze -f <filename or directory name>`
Examine the results written to the file system using your favorite web browser
To examine the results for a period, point your web browser at the index.xml file for the time period and system of interest. To select a specific interval for further review:
- Click on the box in the graph
- Click on link XML for the interval
To examine a specific ten-minute interval, point your web browser at the interval_nnn.xml file for the time period, interval, and system of interest. The XSLT provided will display a summary of the interval.
After the automation has been running for a few days, you will probably want to make sure that it is generating the appropriate results.
Delete results that are no longer needed
After training has run, use standard Linux commands to delete the results that are no longer valuable. For example, you could choose to delete all ADE results that are older than one year.
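As a sketch, that cleanup can be a single find command over the output tree described earlier; the output directory below is an assumption, so substitute the directory configured in setup.props.

```shell
# Prune ADE analysis results older than one year.
# RESULTS_DIR is an assumed location; use the output directory from setup.props.
RESULTS_DIR="${RESULTS_DIR:-/var/ade/output}"
if [ -d "$RESULTS_DIR" ]; then
  # Each system_name/yearMonthDay directory sits at depth 2 under the output root.
  find "$RESULTS_DIR" -mindepth 2 -maxdepth 2 -type d -mtime +365 -exec rm -rf {} +
fi
```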