Log Analysis Basics

By Martin Brown
Posted Jun 10, 2004


Converting Logs Into Useful Information

Contents
Log Types
Log Contents
Converting Logs Into Useful Information
Tracking Rather Than Analysis

A host of log analysis tools translate a range of information based on the content of various logs into useful statistics, but it is actually relatively easy and straightforward to build your own. Although you can achieve this in any language, a scripting solution like Perl or Python is the most practical way to go because of its flexible data handling and built-in data types, like the Perl hash and Python dictionary.

Once you know the log format, pulling out the information is relatively easy. For example, here's a very simple parser, in Perl, for a standard Apache access log:

# Open the access log (adjust the path to suit)
open(INLOG, '<', 'access.log') or die "Cannot open log: $!";
while (<INLOG>)
{
    chomp;

    # Common log format: host ident user [time] "request" status bytes
    ($host,$ident,$user,$time,$url,$success,$bytes) =
        m/^(\S+)\s+(\S+)\s+(\S+)\s+\[(.*)\]\s+"(.*)"\s+(\S+)\s+(\S+)[ ]*$/;
    ($day,$mon,$year,$hour,$min,$sec) =
        ($time =~ m%(..)/(...)/(....):(..):(..):(..)%);
}
close(INLOG);

Here's a similar solution, in Python:

# Open the access log (adjust the path to suit)
logfile = open('access.log')
while 1:
    line = logfile.readline()
    if not line:
        break
    splitline = line.split()
    if len(splitline) < 10:
        print(splitline)
        continue
    (host, ident, user, time, offset, req,
     loc, httpver, success, bytes) = splitline[:10]
logfile.close()
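Splitting on whitespace works for well-formed lines, but a regular expression mirroring the Perl version is more robust when the request string contains embedded spaces. Here is a minimal sketch; the pattern follows the common log format described above, and the sample line and field names are illustrative:

```python
import re

# Common log format: host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(\S+)\s+(\S+)\s+(\S+)\s+\[(.*?)\]\s+"(.*?)"\s+(\S+)\s+(\S+)\s*$'
)

# An invented sample line for demonstration purposes.
sample = ('192.0.2.1 - - [10/Jun/2004:12:34:56 +0100] '
          '"GET /index.html HTTP/1.0" 200 1043')

match = LOG_PATTERN.match(sample)
if match:
    host, ident, user, time, request, success, bytes_sent = match.groups()
    print(host, request, success)  # 192.0.2.1 GET /index.html HTTP/1.0 200
```

In a real script you would apply the pattern to each line of the log file instead of a single sample string.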

Now that we have the basic fields, we can build counters and cross-referencing systems to track and report on different elements. For example, to get a list of the unique URLs accessed we can use a hash or dictionary to count them up. In Perl this looks like:

$urlaccesses{$url} += 1;

while in Python we have to embed it in a try statement to set the initial value:

try:
    urlaccess[loc] = urlaccess[loc] + 1
except KeyError:
    urlaccess[loc] = 1
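More recent Python makes the try block unnecessary: collections.Counter supplies a default of zero for missing keys. A small sketch, using invented URL values in place of lines pulled from a real log:

```python
from collections import Counter

# Simulated 'loc' values as they might come out of the parsing loop.
locs = ['/index.html', '/about.html', '/index.html']

urlaccess = Counter()
for loc in locs:
    urlaccess[loc] += 1  # no KeyError: missing keys default to 0

print(urlaccess['/index.html'])  # 2
```

The counting line is identical to the Perl version, which is part of the appeal.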
            

You can repeat the same basic process with any of the other values in the log that we've picked out through the field information. Once you've processed the log, simply output the summary information generated as the log was processed.
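For example, once the counts are in a dictionary, a few lines are enough to print a summary sorted by popularity. The counts below are invented for illustration:

```python
# Hypothetical per-URL access counts gathered during the parsing pass.
urlaccess = {'/index.html': 42, '/about.html': 7, '/cgi-bin/form': 19}

# Report URLs in descending order of hits.
for url, count in sorted(urlaccess.items(),
                         key=lambda item: item[1], reverse=True):
    print('%6d %s' % (count, url))
```

The same pattern works for any of the other counters (hosts, status codes, and so on).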

If you just want to extract specific fields of information from a log to report on the contents and ignore the unnecessary parts, you can ignore the statistical gathering and use the parser as a reformatting tool. The syslog extraction tool mentioned earlier, which extracts the mail source and destination, is written in Perl and looks like this:

my (%from, %to, %time, @id);
open(DATA, "/var/log/syslog") or die "Cannot open syslog: $!";
while(<DATA>)
{
    if (m/mail.info/)
    {
        if (m/(\S+\s+\d+\s+[\d:]+).*?mail.info\]\s+(\S+):.*?from=<(.*?)>/)
        {
            push @id, $2;
            $from{$2} = $3;
            $time{$2} = $1;
        }
        if (m/mail.info\]\s+(\S+):.*?to=(.*?),/)
        {
            $to{$1} = $2;
        }            
    }
}
close(DATA);

We use a regular expression to pick out the necessary information, and in the process build a set of hashes that map the unique ID of each e-mail to its sender, destination, and date/time. To report on the information, we iterate over the list of IDs and print out the corresponding data:

foreach my $id (@id)
{
    if (exists($to{$id}))
    {
        $to{$id} =~ s/[<>]//g;
        my ($pre,$post) = split /@/,$to{$id};
        next if ($pre =~ /ESMTP/);
        next if ($from{$id} =~ /admin\@mcslp.com/);
        my ($frombat,$fromaat) = split /\@/,$from{$id};
        $frombat = substr($frombat,0,8);
        printf("%s %-40s => %s\n",$time{$id},"$frombat\@$fromaat",$pre);
    }
}

This example also demonstrates how to filter out information or parts of the log we are not interested in. In the above example, e-mail destined for the main administration account of the domain is ignored, because we are interested only in user-related e-mails.

Finally, again as shown in the previous example, there is nothing to stop you from adjusting or reformatting the source data to suit your requirements. In this case, we've removed the domain information from the destination address. With Web logs, you might want to restrict the report to a particular part of the site, or ignore URLs that do not relate to HTML or CGI content (images and movies, for example).
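That kind of URL filtering reduces to a simple predicate applied to each parsed record. A sketch in Python; the list of extensions and the treatment of /cgi-bin/ paths are assumptions you would adjust for your own site:

```python
# Keep only HTML and CGI content; the extension list is illustrative.
WANTED_EXTENSIONS = ('.html', '.htm', '.cgi')

def wanted(url):
    # Strip any query string before checking the extension.
    path = url.split('?', 1)[0]
    return path.endswith(WANTED_EXTENSIONS) or '/cgi-bin/' in path

print(wanted('/index.html'))       # True
print(wanted('/images/logo.gif'))  # False
```

In the counting loop, a `continue` on records that fail the predicate keeps them out of every statistic at once.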

Tracking Rather Than Analysis

Summarizing logs and generating statistics is relatively easy — we're talking about counting information based on a specific field's content, such as the URL or host. This information is fine if all you want is basic counts and statistics, but it may not provide for all of your information needs, as occasionally you may want to trace the progress of a user, issue, or element through the history of the log.

If you are tracking an individual user through the system, for example, you would want to identify which pages he or she viewed. This type of analysis goes beyond the basic statistical systems highlighted here. The basic processing and parsing of the log into an internal data structure remains the same, but how you later analyse that information differs.
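As a starting point, tracking means keying the parsed records by user (or host) and keeping them in order, rather than collapsing them into counts. A minimal sketch, using invented records in place of parser output:

```python
from collections import defaultdict

# Invented (host, time, url) tuples as they might come out of the parser.
records = [
    ('192.0.2.1', '12:00:01', '/index.html'),
    ('192.0.2.9', '12:00:03', '/about.html'),
    ('192.0.2.1', '12:00:07', '/cgi-bin/form'),
]

# Map each host to the ordered list of pages it requested.
trail = defaultdict(list)
for host, time, url in records:
    trail[host].append((time, url))

for time, url in trail['192.0.2.1']:
    print(time, url)
```

Real session tracking also has to decide when one visit ends and the next begins, which is where the analysis becomes genuinely harder than counting.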

If you want to know more about tracking and more complex log analysis techniques, let me know, and I'll cover the topic in an upcoming article. Please include any specific examples that you might want covered and provide details on the type of information you want to track and report.
