Advanced Logging Techniques With Apache


July 2, 2004

Logs in Apache are more configurable than most people realize. Not only can you organize the fields in your logs, but you can also create formats and layouts. Access logs can be split and divided up to make them easier to process by reporting specific items or ignoring those items that have no relevance.

Logging in Apache

Contents
Logging in Apache
Creating Custom Log Formats
Conditional Logging
Subdividing the Logs
Logging Directly to a Database or Application
Speeding up the Logging Process

When was the last time you seriously thought about the content of your Web logs? Not processing the logs and producing statistics, but the actual raw content and information the logs store.

Many people will never choose a format beyond the standard Common Log Format most systems use. They will generate logs containing all the requests, errors, and other information. This can be wasteful of resources, both in terms of Apache reporting the information and the disk space required to store it. There are, however, ways in which you can simplify and customize the logging system to suit your needs.

Creating Custom Log Formats

Both Apache 1.3 and Apache 2.0 allow you to create custom log formats containing only the fields you want, in the format and layout you desire.

The custom log format system enables the clear definition of exactly which fields of information to include in the log and the surrounding format. This means you can use spaces, tabs, or other delimiters to specify the format of each line written to the file.

Individual fields within the log format are specified using a special format string. For example, %a is replaced in the actual log with the IP address of the remote host. For a full list of the fields available, see the Apache mod_log_config documentation.

To specify a format, use the LogFormat directive. For example, the Common Log Format would be set using the directive

LogFormat "%h %l %u %t \"%r\" %>s %b" common
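As a sketch of how the delimiters can vary, the definition below writes a handful of fields separated by tabs rather than spaces. The format name tabbed and the choice of fields are assumptions made for illustration; \t is the escape that mod_log_config documents for a tab character.

```apache
# Hypothetical tab-separated format:
# client host, time, URL path, final status, bytes sent
LogFormat "%h\t%t\t%U\t%>s\t%b" tabbed
```

A tab-separated layout like this tends to be easier to load into a spreadsheet or database than the space-separated common format, because the bracketed timestamp no longer needs special handling.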

The actual format of the information (i.e., the text surrounding the fields you include) is entirely up to you. For example, the following log format generates a log using an XML structure. It is split across lines here for clarity; the actual definition must be on a single line, and each embedded \n sequence generates a new line when the log is written.

LogFormat "<logitem><host>%h</host>\n
<ident>%l</ident>\n
<user>%u</user>\n
<datetime>%t</datetime>\n
<url>%r</url>\n
<statuscode>%>s</statuscode>\n
<bytes>%b</bytes>\n</logitem>" commonxml
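Written the way it would have to appear in the configuration file, the same definition becomes one long line. This is only a restatement of the format above, not a new format.

```apache
# The commonxml format from above, joined onto a single line
LogFormat "<logitem><host>%h</host>\n<ident>%l</ident>\n<user>%u</user>\n<datetime>%t</datetime>\n<url>%r</url>\n<statuscode>%>s</statuscode>\n<bytes>%b</bytes>\n</logitem>" commonxml
```

Bear in mind that Apache does not XML-escape field values, so a request containing characters such as & or < would leave the output malformed. Treat this as a sketch of the idea rather than a robust XML logger.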

Formats can be given a name (as they are in the examples above), and that name is then used when specifying a log for a host or virtual host to request the desired log format:

CustomLog logs/access_log common

It's possible to create a variety of log formats and for the individual logs to contain different information, so we can spread access, errors, scripts, referrer, and other information across a range of different files. We can also be selective about what information appears in each file by using the conditional logging system.
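A minimal sketch of spreading information across several files might look like the following. The format names referer and agent and the log file names are illustrative choices; each CustomLog line simply pairs a file with one of the named formats.

```apache
# Three named formats, each written to its own file
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent

CustomLog logs/access_log common
CustomLog logs/referer_log referer
CustomLog logs/agent_log agent
```

Every request produces one line in each of the three files, so this arrangement trades extra disk writes for logs that are simpler to process individually.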

Conditional Logging

Not all log entries are useful or interesting. For example, you might want to ignore all the log entries the testing department generates but continue recording all user errors. Or, you might want to ignore everything but HTML accesses in the logs, therefore ignoring images and other sundry items that might make up your page.

You can ignore these elements when you are processing the logs, but additional processes are then required. It's also worth remembering that anything recorded and not used carries CPU and disk overhead. For example, a single page with 20 images on it will generate 21 entries in the log — a lot of additional work and storage space for something that will be ignored.

Conditional logging relies on Apache's environment variables. A directive such as SetEnvIf sets a variable based on some attribute of the request, and the CustomLog directive accepts an env= condition that decides whether an entry is written to that log. For example, let's say you wanted to log requests for only .html or .shtml files from within the Web server. First, create an environment rule specifying which files to include in the logs:

SetEnvIf Request_URI "(\.html|\.shtml)$" html

Now, apply this environment variable setting to the CustomLog directive:

CustomLog logs/access.log common env=html

To reject some files, for example to include everything except image files, create the rule the same way:

SetEnvIf Request_URI "(\.gif|\.png|\.jpg)$" image

But use an exclamation mark before the environment rule in the CustomLog directive to exclude those matching the environment variable that has been set:

CustomLog logs/access.log common env=!image

The SetEnvIf directive is quite extensive. More information about using the directive is found in the online Apache HTTPD documentation.
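As a sketch of the testing-department case mentioned earlier, several rules can set the same variable, and a single negated condition then excludes everything that matched any of them. The address range and the variable name dontlog are assumptions made for illustration.

```apache
# Hypothetical address range used by the testing department
SetEnvIf Remote_Addr "^192\.168\.10\." dontlog

# SetEnvIfNoCase matches .GIF as well as .gif
SetEnvIfNoCase Request_URI "\.(gif|png|jpe?g)$" dontlog

# Write an entry only when dontlog has not been set
CustomLog logs/access.log common env=!dontlog
```

Note that this drops every request from the test machines, errors included. Keeping their errors would need a separate log or a different rule.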

The effects achievable with this are quite extensive. On some sites, for example, requests for different departments (as identified by the first directory in the URI) may be written to separate logs in directories available only to each department:

      
SetEnvIf Request_URI "^/weather/" weather_dept
SetEnvIf Request_URI "^/news/" news_dept
CustomLog /groups/weather/www-access_log common env=weather_dept
CustomLog /groups/news/www-access_log common env=news_dept

Conditional logs can restrict or split the information however you want. But keep processing overhead in mind.
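One possible extension of the department example is a catch-all log for requests that belong to no department. SetEnvIf can set more than one variable per rule, so each rule below also sets a shared marker; the variable name departmental and the main log path are assumptions.

```apache
# Each rule sets a department variable plus a shared marker
SetEnvIf Request_URI "^/weather/" weather_dept departmental
SetEnvIf Request_URI "^/news/" news_dept departmental

CustomLog /groups/weather/www-access_log common env=weather_dept
CustomLog /groups/news/www-access_log common env=news_dept

# Everything that matched neither department goes to the main log
CustomLog logs/access_log common env=!departmental
```

With this arrangement each request is written exactly once, which keeps the overhead close to that of a single unconditional log.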

Subdividing the Logs

Conditional logging enables you to configure information written to a log, but it's limited to controlling the output of a particular CustomLog definition. Sometimes it's preferable to divide up the content of the log file based on the available information, such as host, referrer, or HTTP headers.

The typical log configuration creates a separate access and error log for each server and virtual host the Apache installation is serving. To configure logs for virtual hosts, place the ErrorLog and CustomLog directives within the VirtualHost directive:

<VirtualHost *>
        ServerAlias www.mcslp.pri www
        DocumentRoot /export/http/webs/pri.mcslp
        ServerName www.mcslp.pri
        ErrorLog logs/www-error_log
        CustomLog logs/www-access_log common
        ScriptAlias /cgi-bin/ /export/http/webs/pri.mcslp/cgi-bin/
</VirtualHost>
        

The logs can be divided even further to capture additional information, for example the error output from scripts and CGI applications, as well as referrer information. For Web programmers in particular, the first option is invaluable, as it lets them monitor the errors in a Web application without having to sift through the general error or access log to find the needed information.

A number of options are available for dividing up logs without using the conditional logging system:

  • The CookieLog directive configures Apache to log both incoming and outgoing cookies sent from the browser and returned by applications on the Web server. This is very useful when testing but generally useless in a production environment.
  • The ScriptLog directive configures Apache to place the error output from CGI applications into a separate log. Again, this is not recommended on production servers. You can further configure the log with a maximum buffer size (ScriptLogBuffer) and a maximum log length (ScriptLogLength). For more information, consult the mod_cgi documentation.
  • The RewriteLog directive configures Apache to record results of rewrite rules. Although you can use this on a production server, it's likely that once the rewrite system is configured, you won't need the output.
  • A separate log of referrers (the URLs pointing to a site), User Agents (browsers), and other HTTP request header information can be configured through the CustomLog directive by specifying a format using the %{}i format string. For example, %{Referer}i would log the referrer info (the HTTP header name is spelled with a single "r").

All of these directives apply globally or within each VirtualHost directive.
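A minimal sketch of the script and rewrite logs follows, assuming the Apache 1.3 or 2.0 series this article covers. The file names and the numeric limits are illustrative values, not recommendations.

```apache
# CGI error output, with a cap on total log size and on the
# amount of request body recorded per entry (both in bytes)
ScriptLog logs/cgi_log
ScriptLogLength 1048576
ScriptLogBuffer 2048

# Rewrite trace; higher levels (up to 9) are more verbose and slower
RewriteLog logs/rewrite_log
RewriteLogLevel 2
```

Later Apache releases dropped RewriteLog in favor of the general LogLevel mechanism, so check the mod_rewrite documentation for the version you are running before copying this fragment.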

One final trick with the custom log formatting system is to log fields selectively, based on the status code of the response. Normally, all requests are logged in full, even if the request failed for some reason. By placing a list of status codes between the % and the field letter, you can create status-based logs. If, for example, you wanted to log all redirected requests to a specific log to determine whether an old URL was still being used, you could parse the main log to find out, or you could create a special log containing just this information using the following fragment in the configuration file:

LogFormat "%301U" redirects
CustomLog logs/redirects.log redirects

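Now it is evident at a glance which URLs are still being requested and redirected to new locations. Requests that return any other status are still written to this log, but with a - in place of the URL.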
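The status list can hold several codes and can be negated with an exclamation mark. The sketch below shows both forms; the format name badlinks and the log file names are assumptions.

```apache
# Log the URL path only for permanent and temporary redirects
LogFormat "%301,302U" redirects

# Log the referrer only when the status is NOT 200, 304, or 302
LogFormat "%!200,304,302{Referer}i" badlinks

CustomLog logs/redirects.log redirects
CustomLog logs/badlinks.log badlinks
```

The second format is a common way to find pages elsewhere that link to missing or broken URLs on your site.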

Logging Directly to a Database or Application

Many people will take their logs and, using various techniques, reformat the information into a more useful layout. There are lots of ways of doing this, and a number of tools, like analog, simplify the process.

For large installations with a high number of servers or sites, it can be more practical to write the information into a database, which is then used to report on the information directly. Running an SQL query to pick out the number of hits for a given URI is quicker than parsing a 20 MB text file and picking out the information.

There are three ways you can achieve this: pipes, third-party modules, and post-processing. The first pipes the output of a log directive to a script, which parses each entry and inserts it into a database. For example, the following line would send each log entry, in the common format, to a script called apachetosql, which writes it to an SQL database:

CustomLog "|/usr/local/apachetosql" common

The script would work just like a post-processor, reading the line Apache sent to it, extracting the fields, and then writing a suitable INSERT query to add the line to the database. If you are going to use this method, consider using the %v custom format to write the virtual host name into the log entry so accesses to a specific site are trackable.
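Adding the virtual host name to the piped format could be sketched as follows. The format name vcommon is an assumption; apachetosql is the same hypothetical script used above.

```apache
# Common Log Format with the virtual host name (%v) as the first field
LogFormat "%v %h %l %u %t \"%r\" %>s %b" vcommon
CustomLog "|/usr/local/apachetosql" vcommon
```

The script then has one extra leading field to parse, which it can store in its own column so queries can be filtered by site.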

The only issues with the piped method are security and the additional overhead required to process the contents. The script or application used is another load on your server, and it needs to be error proof. The script will also be executed with the privileges of the user that started Apache, which is usually root, so it must be written with care.

A more extensive solution is provided by modules such as the mod_log_sql module. This module automatically inserts the log information into a suitable table within a MySQL database. Unlike the pipe method, it uses a direct connection to the MySQL socket, reducing the overhead required to process the information. Because it's built into Apache, an extra process is not required; nor do we need to worry about the security of the script.

With the two direct methods, the main issue is the availability of the SQL server itself. Any error in the availability of the SQL server runs the risk of information being lost. Even with the module method, additional processing is required to log the information. That means either additional local processing or network bandwidth on the log server.

The post-processing method allows continued logging to standard text files. You can later import the log data into your MySQL or other database, without worrying about the concurrent processing overhead or connectivity issues. Then, if anything goes wrong, the text files are there to fall back on.

Speeding up the Logging Process

Although for most users logging is a critical part of the monitoring process, it can put a burden on machines serving a large number of sites and virtual hosts as well as particularly busy Web servers. In theory, the logging process shouldn't cause too much of a problem, but for those worried about its effects on the server, a few tricks are available.

  • Make sure you are tracking only those files on which you will later want to report. This limits the number of lines and data reported. Get your file selections right, because you can't go back and get the information at a later date!
  • Switch off hostname lookups in log data. With lookups switched off, Apache will record only the IP address, and these are easily resolved into hostnames at a later stage. To disable, use the HostnameLookups directive with the option Off.
  • Unless you absolutely need it, leave the IdentityCheck directive off. When it is enabled, Apache queries an ident (RFC 1413) daemon on the client machine for the remote user name recorded in the %l field. The lookup is time consuming and the result is rarely reliable, so switching it off (or leaving it off, since this is the default) should help to reduce the load.
  • Use a single log file for all virtual hosts. This limits the number of open files within Apache used for logging. You can split up the files later using the split-logfile program supplied with Apache. To enable a single central log file, omit logging directives from within the VirtualHost directives and specify a custom access log format starting with the %v pattern, which inserts the virtual host name in each line of log.
  • Unless you really need the information, create only an access log and an error log. Referrer logs, cookie logs, and other specialized logs are generally useless on a production server. On a test server, have as many logs as you want, provided you are not testing performance!
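Taken together, the suggestions above might translate into a global configuration along these lines. This is a sketch, not a tuned setup; the format name vhost_common and the log file name are assumptions.

```apache
# Record IP addresses only; resolve them later if needed
HostnameLookups Off

# Skip ident lookups (already the default)
IdentityCheck Off

# One shared access log for every virtual host, keyed by %v.
# No CustomLog lines appear inside the VirtualHost sections.
LogFormat "%v %h %l %u %t \"%r\" %>s %b" vhost_common
CustomLog logs/access_log vhost_common
```

Because the virtual host name is the first field on each line, the shared file is in the layout that the split-logfile program expects when you divide it up later.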

While these techniques will not guarantee a massive improvement in performance, they will make a difference. It's always important to strike a balance between the amount of information needed and the performance cost of collecting it. Log everything, and you'll waste time and disk space; don't record enough, and you may run afoul of your marketing department and enterprise regulatory requirements.