Anomaly Detection Engine for Linux Logs (ADE)
Why run ADE
To answer the question: are your systems behaving badly?
Many everyday activities can introduce system anomalies and initiate failures in complex, integrated data centers; these activities include:
- Increased volume of business activity
- Application modifications to comply with changing regulatory requirements
- Standard operational changes, such as adding or upgrading hardware or software, or changing network configurations.
When problems occur, system administrators can examine the message logs to help solve the problem, but the sheer volume of messages can make this a daunting task.
ADE helps you look through the massive volumes of log data to find the portions of the log to focus on for further detailed review.
Running ADE
Running ADE to detect anomalies in Linux logs requires the following manual steps to understand the problem before ADE is run:
- Determine whether you want to find anomalies in a single time period for root cause analysis, or whether you want anomaly detection continuously available to help support staff manage Linux systems
- Determine which Linux systems you want anomaly detection for
- Understand the workload run on those Linux systems
Basic approach
The basic approach is:
- Pick data to prime the model
- Determine how to group systems into “model groups”
- Prime the database with Linux logs
- Create a model of normal behavior from a set of Linux logs
- Analyze additional logs to detect anomalies
- Examine the results written to the file system using your favorite web browser
Pick data to prime the model
For ADE to create a model that generates useful analytic results, the logs need to contain a sufficient number of unique message IDs (message keys). Because ADE uses unsupervised learning, it does not require the user to label either messages or intervals; it does, however, require that the systems being analyzed are “relatively” stable.
Almost any Linux system that is used to support production will be stable enough for ADE to find anomalies.
Determining how to group systems into "model groups"
ADE supports grouping similar systems together when building the model. Here is an example of eleven servers and one way you can assign them to model groups:
- model group 1
  - primary mail server
  - secondary mail server that handles traffic overflowing from the primary server
  - deployment mail server (the Linux image on which a new version of the mail server code is deployed)
- model group 2
  - external web server 1
  - external web server 2
  - external web server 3
- model group 3
  - internal web server 1
  - internal web server 2
  - internal web server 3
- model group 4
  - database server 1
- model group 5
  - database server 2
Examine a time period to determine whether unusual behavior occurred during it - root cause analysis
Prime the database with Linux logs
To prime the ADE database:
- Delete any information left in the database: `controldb delete`
- Identify when the potential anomaly occurred
- Load Linux logs from the time period immediately before the anomaly: `upload -f <filename or directory name>`
- Check whether sufficient data has been loaded: `verify <model group name>`
- If there is sufficient information, proceed to training; otherwise:
  - If there are additional logs from before the set already loaded, load those additional Linux logs
  - If there aren't any more logs available, reduce the number of model groups. Using the example above, if the problem is with model group 2, consider combining model groups 2 and 3
  - If, after loading additional logs and simplifying the model group structure, `verify` still indicates there is insufficient information, try training anyway, but remember that the results may be questionable.
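Put together as a shell session, the priming steps above might look like the following sketch. It assumes ADE's command-line scripts (controldb, upload, verify) are on the PATH; the log path and model group name are placeholders, not ADE defaults.

```shell
# Sketch of the root-cause priming sequence (path and group name are placeholders).
controldb delete                      # clear any information left in the database
upload -f /var/log/messages-archive   # load logs from just before the anomaly
verify model_group_2                  # check whether enough data was loaded
```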
Create a model of normal behavior from a set of Linux logs
To create a model of the normal behavior of the Linux systems:
- Issue the train command for each model group: `train <model group name>`
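For example, with the five model groups from the earlier grouping, training every group might look like this sketch; it assumes the train script is on the PATH and that these group names match your configuration.

```shell
# Train each model group in turn; the group names are illustrative.
for group in model_group_1 model_group_2 model_group_3 model_group_4 model_group_5; do
  train "$group"
done
```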
Analyze additional logs to detect anomalies
To analyze the time period for anomalies:
- Issue the analyze command for the time periods of interest: `analyze -f <filename or directory name>`
Examine the results written to the file system using your favorite web browser
To examine the results for a period, point your web browser at the index.xml file for the time period and system of interest. To select a specific interval for further review:
- Click on the box in the graph
- Click on link XML for the interval
To examine a specific ten-minute interval, point your web browser at the interval_nnn.xml file for the time period, interval, and system of interest. The XSLT provided will display a summary of the interval.
The following samples illustrate how the analysis output is written to files using the defaults specified in setup.props:
- directory system_name 1
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
- directory system_name 2
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
    - index.xml (summary of intervals within the period)
    - directory intervals
      - interval_nnn.xml (details of messages issued during this interval)
      - interval_nnn_debug.xml.gz (information to debug problems with scorers, gzipped)
  - directory yearMonthDay
Continuously process Linux logs so anomaly information is always available
To set up ADE to provide continuous analysis results, the following steps need to be scheduled to run automatically:
- Creating a model, which should be rerun after a certain period of time has elapsed
- Creating analysis results, which should be run either:
  - when the logs are rotated, or
  - after a certain period of time
- Removing results that are no longer needed
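One way to schedule these steps is with cron. The entries below are an illustrative sketch; the schedules, group name, log path, and output directory are assumptions rather than ADE defaults.

```shell
# Illustrative crontab entries for continuous ADE operation.
30 0 * * * analyze -f /var/log/messages.1   # analyze after nightly log rotation
0  2 1 * * train model_group_1              # retrain monthly (repeat per model group)
0  3 * * 0 find /var/ade/output -mindepth 2 -maxdepth 2 -type d -mtime +365 -exec rm -rf {} +   # weekly cleanup of year-old results
```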
If there are duplicate time periods in the logs, ADE will overlay the existing time period in the database with the time period being added by either upload or analyze.
Prime the database with Linux logs
To prime the ADE database:
- Load Linux logs from the time period immediately before today: `upload -f <filename or directory name>`
- Check whether sufficient data has been loaded: `verify <model group name>`
- If there is sufficient data, proceed to training; otherwise:
  - If there are additional logs from before the set already loaded, load those additional Linux logs
  - If there aren't any more logs available, reduce the number of model groups; using the example above, if the problem is with model group 2, consider combining model groups 2 and 3
  - If, after loading additional logs and simplifying the model group structure, `verify` still indicates there is insufficient information, wait until the Linux systems generate more logs before trying training.
Create a model of normal behavior from a set of Linux logs
To create a model of the normal behavior of the Linux systems:
- Issue the train command for each model group: `train <model group name>`
- The suggested length of the training interval (time from the start date to the end date) is 120 days
- Redo training every 30 days, or after a substantial change to the Linux systems, such as deploying additional software on them
Analyze additional logs to detect anomalies
Routinely analyze the available logs so anomaly information is available when needed:
- Issue the analyze command for the next available logs: `analyze -f <filename or directory name>`
Examine the results written to the file system using your favorite web browser
To examine the results for a period, point your web browser at the index.xml file for the time period and system of interest. To select a specific interval for further review:
- Click on the box in the graph
- Click on link XML for the interval
To examine a specific ten-minute interval, point your web browser at the interval_nnn.xml file for the time period, interval, and system of interest. The XSLT provided will display a summary of the interval.
After the automation has been running for a few days, you will probably want to make sure that it is generating the appropriate results.
Delete results that are no longer needed
After training has run, use standard Linux commands to delete the results that are no longer valuable. For example, you could choose to delete all ADE results that are older than one year.
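As a sketch, that cleanup can be a single find command over the output tree described earlier; the output directory below is an assumption, so substitute the directory configured in setup.props.

```shell
# Prune ADE analysis results older than one year.
# RESULTS_DIR is an assumed location; use the output directory from setup.props.
RESULTS_DIR="${RESULTS_DIR:-/var/ade/output}"
if [ -d "$RESULTS_DIR" ]; then
  # Each system_name/yearMonthDay directory sits at depth 2 under the output root.
  find "$RESULTS_DIR" -mindepth 2 -maxdepth 2 -type d -mtime +365 -exec rm -rf {} +
fi
```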