Anomaly Detection Engine for Linux Logs (ADE)
How ADE analyzes a Linux system to find anomalies
Using unsupervised machine learning, ADE extracts and organizes message data to build a model of behavior for each Linux model group. ADE uses the model for the model group that contains the Linux system to compare expected behavior with actual behavior, and flags the differences as anomalies. ADE analyzes each message within a time slice (interval) to determine how much the behavior of the messages within the interval differs from what is expected. It then totals the differences for all the messages within the interval and compares this value with the normal value from the model to calculate the interval anomaly score.
Time slice (interval)
To produce meaningful analysis results for a monitored system, ADE divides the log into time slices. Each time slice is called an analysis interval, and its length varies depending on the volume of message traffic. For Linux systems, which tend to produce lower-volume and less consistent message traffic, the default analysis interval length in flowlayout.xml is 60 minutes.
To display and record the results of analysis intervals, ADE produces an analysis snapshot every 10 minutes for each monitored system. Each analysis snapshot is a point-in-time record of the anomaly score for an analysis interval. For example:
- For a Linux system, the snapshot recorded at 09:00 UTC represents the anomaly score and number of unique messages issued by that system from 08:00 to 09:00 UTC. The next snapshot is taken at 09:10 UTC for the analysis interval from 08:10 to 09:10, and so on. Because of the 60-minute analysis interval, every snapshot for a Linux system overlaps with previous snapshots.
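The relationship between snapshots and analysis intervals can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not ADE code: the function names and the example date are hypothetical, and only the 60-minute interval and 10-minute snapshot period come from the text.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the text; everything else here is illustrative.
ANALYSIS_INTERVAL = timedelta(minutes=60)
SNAPSHOT_PERIOD = timedelta(minutes=10)


def interval_for_snapshot(snapshot_time):
    """Return (start, end) of the analysis interval that a snapshot covers."""
    return snapshot_time - ANALYSIS_INTERVAL, snapshot_time


def snapshots_between(first, last):
    """Yield snapshot times from first to last (inclusive), one per period."""
    current = first
    while current <= last:
        yield current
        current += SNAPSHOT_PERIOD


# Arbitrary example date; the times of day mirror the example above.
first = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
for snap in snapshots_between(first, first + timedelta(minutes=20)):
    start, end = interval_for_snapshot(snap)
    print(f"snapshot {snap:%H:%M} covers {start:%H:%M} to {end:%H:%M}")
```

Consecutive snapshots share 50 of their 60 minutes, which is why a single unusual message can influence several snapshots in a row.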
Model Group
Because the message traffic on Linux systems can often be relatively light, and because Linux images are typically configured in pools of dynamically activated images, ADE is designed to provide analysis results for Linux systems through the use of model groups. Through model groups, multiple systems contribute to the generation of a single model for the group; the more systems in the group, the more data ADE can use to build the model.
Defining model groups and their member systems
A model group is a collection of one or more systems that handle the same type of workload, and thus can be expected to exhibit similar behavior. When considering Linux systems to group together in a single model group, use the following guidelines:
- Group together Linux systems that support very similar workloads. For example, group a set of Linux web servers in one model group, and a set of Linux database servers in another model group.
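As a way to picture the grouping, the sketch below represents model groups as a plain mapping from group name to member systems. This is not ADE's configuration format; the group names, host names, and the `group_for_system` helper are all hypothetical.

```python
# Hypothetical model groups: one per workload type, as the guideline suggests.
MODEL_GROUPS = {
    "web-servers": ["web01", "web02", "web03"],
    "db-servers": ["db01", "db02"],
}


def group_for_system(hostname, groups=MODEL_GROUPS):
    """Return the name of the model group that contains hostname, or None."""
    for name, members in groups.items():
        if hostname in members:
            return name
    return None


print(group_for_system("web02"))
print(group_for_system("db01"))
```

Each system belongs to exactly one group here, so the model used to analyze a system is always the model of the group that the lookup returns.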
Building a model for a model group
ADE builds one model for a group of Linux systems with similar workloads, and compares current syslog data from each system in the group against that model. To build a robust model of Linux system behavior, ADE generally needs a minimum of 120 days of message data. Analysis can begin, however, as soon as the system data that is available for training meets the criteria for building a valid model.
Measuring behavior of an interval
ADE provides four measures of how unusual an interval is:
- Number of unique message IDs
- Interval anomaly score
- Number of messages not in the model
- Number of messages that have not been seen by analyze
Number of unique message IDs
Interval anomaly scores
The interval anomaly score indicates the difference between current behavior and the expected behavior that is reflected in the model. If the analysis interval contains messages that are relatively normal, common messages for that system, ADE assigns a low score to the analysis interval and to the analysis snapshot.
For example, suppose that you have analyzed a relatively stable test system. On this test system, various daemons are recycled on a regular basis. This behavioral pattern is reflected in the model that ADE uses for analysis. When a current daemon recycle completes normally, the intervals for daemon recycling receive a low interval anomaly score, because the pattern of messages issued during a successful recycle matches an expected behavior in the model. However, if any unexpected messages are issued during a current daemon recycle, ADE assigns a higher interval anomaly score to those analysis intervals that contain the unexpected or unique messages.
The possible interval anomaly scores are:
0 through 99.4
The analysis interval contains messages and message clusters that match, or exhibit relatively insignificant differences from, expected behavior as defined in the ADE model. A score of 0 is possible because ADE eliminates all expected, in-context messages from its scoring calculation. A score of 0 indicates intervals that exhibit no difference in behavior compared to the group model.
Analysis intervals with scores that are greater than 0 but less than 99.5 contain some rarely seen, unexpected, or out-of-context messages. Generally speaking, scores in this range indicate analysis intervals that differ somewhat from the system or group model but do not contain messages of much diagnostic value.
99.5 through 100
Analysis intervals with this score contain rarely seen messages (these messages appear in the model only once or twice), or many messages that are unexpected or issued out of context. This score indicates analysis intervals with more differences from the group model; these intervals can contain messages that might help you diagnose anomalous system behavior.
101
Analysis intervals with this score exhibit the most significant differences from the group model; these intervals contain messages that merit investigation. ADE assigns this score to analysis intervals that contain:
- Unusual or unexpected messages.
- A much higher volume of messages than expected.
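The score ranges can be read as a simple lookup. The sketch below is only an aid for interpreting scores, not part of ADE: the function name and the band labels are hypothetical, and the boundary between the first two ranges follows the "less than 99.5" wording above.

```python
# Hypothetical labels that summarize the descriptions of each score range.
NO_DIFFERENCE = "no difference from the group model"
MINOR = "some differences, little diagnostic value"
NOTABLE = "rarely seen or out-of-context messages, may help diagnosis"
SIGNIFICANT = "most significant differences, merits investigation"


def describe_interval_score(score):
    """Map an interval anomaly score (0 through 101) to a short description."""
    if not 0 <= score <= 101:
        raise ValueError(f"interval anomaly score out of range: {score}")
    if score == 0:
        return NO_DIFFERENCE
    if score < 99.5:
        return MINOR
    if score <= 100:
        return NOTABLE
    return SIGNIFICANT  # the text assigns 101 to the most unusual intervals


for value in (0, 42.0, 99.7, 101):
    print(value, "->", describe_interval_score(value))
```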
Messages not in the model
In the message traffic for a Linux system, ADE detected one or more messages that are not in the current model that is in use for analysis. These messages might have been issued by the Linux system before, and therefore might have been included in previous models, but are not in the current model. In the Anomaly Detection Engine Interval View, the entry for this type of new message displays one of the following values in the Periodicity Status column:
- NEW, if the message has never been reported in analysis results
- NOT_PERIODIC, if the message was in a previous model and was random
- NOT_IN_SYNC, if the message was in a previous model at an expected time
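The three values can be summarized as a small decision function. This is a sketch of the rules as listed above, not ADE's implementation: the function and parameter names are hypothetical, and treating "at an expected time" as "was periodic in the previous model" is an interpretation.

```python
from enum import Enum


class PeriodicityStatus(Enum):
    NEW = "NEW"
    NOT_PERIODIC = "NOT_PERIODIC"
    NOT_IN_SYNC = "NOT_IN_SYNC"


def status_for_message_not_in_model(reported_before, in_previous_model,
                                    was_periodic):
    """Pick the Periodicity Status for a message missing from the current model.

    Returns None for combinations that the text does not describe.
    """
    if not reported_before:
        return PeriodicityStatus.NEW
    if in_previous_model:
        # Assumption: "at an expected time" means the message was periodic.
        if was_periodic:
            return PeriodicityStatus.NOT_IN_SYNC
        return PeriodicityStatus.NOT_PERIODIC
    return None


print(status_for_message_not_in_model(False, False, False).value)
print(status_for_message_not_in_model(True, True, True).value)
```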
Message new to analyze
In the message traffic for a specific Linux system, ADE detected one or more messages that the system has issued for the first time since the day on which ADE began reporting analysis results for this system. In the Anomaly Detection Engine Interval View, the entry for this type of new message displays the following attributes:
- The value NEW in the Periodicity Status column
- A dash (–) in the Last Issued column. The Last Issued column contains a date and time only if the initial occurrence of the message is followed by subsequent occurrences within the same analysis interval.
Measuring behavior of a message
A message anomaly score is created by comparing the patterns of message traffic observed during the interval being analyzed with the expected pattern of message traffic observed during all the intervals that are within the training period.
- Through the training process, ADE determines which messages are issued during routine system events. For such system events, ADE identifies and recognizes groups of messages that are associated with each event. The message groups are called clusters and define the normal context for the messages in the cluster. When ADE detects a specific message that is issued outside of its expected context (that is, without the other messages in the cluster), ADE assigns a higher message anomaly score, which is combined with the other message anomaly scores in the interval to assign the interval anomaly score.
- ADE also detects messages that are issued periodically; for example, a message that is issued every 11 minutes. This attribute affects the anomaly score when a periodic message is not issued as expected.
- Also through the training process, ADE determines the distribution of each unique message key (message ID) within a collection of intervals in the message data used for training. This distribution influences the interval anomaly score that ADE creates for an interval.
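Two of the ideas above, cluster context and periodicity, can be illustrated with a deliberately simplified sketch. This is not ADE's scoring algorithm: the function names, the message IDs, and the one-minute tolerance are all hypothetical, and real scoring also weighs the trained message distributions.

```python
from datetime import datetime, timedelta


def out_of_context(observed_ids, clusters):
    """Return message IDs that appear without the rest of their cluster."""
    observed = set(observed_ids)
    flagged = set()
    for cluster in clusters:
        present = observed & cluster
        # Some, but not all, cluster members were seen in this interval.
        if present and present != cluster:
            flagged |= present
    return flagged


def is_overdue(last_issued, now, period, tolerance=timedelta(minutes=1)):
    """Return True if a periodic message has not been issued when expected."""
    return now - last_issued > period + tolerance


# Hypothetical clusters learned in training: messages that occur together.
clusters = [{"sshd_start", "sshd_listen"}, {"cron_start", "cron_ready"}]
print(out_of_context(["sshd_start", "sshd_listen", "cron_start"], clusters))

# A message expected every 11 minutes, last seen 20 minutes ago.
last = datetime(2024, 1, 1, 9, 0)
print(is_overdue(last, last + timedelta(minutes=20), timedelta(minutes=11)))
```

In this toy version a message is either in context or not; the text describes a graded message anomaly score, so treat the boolean results only as a way to see which messages would attract attention.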