Advanced Logging Techniques With Apache
July 2, 2004
Logs in Apache are more configurable than most people realize. Not only can you organize the fields in your logs, but you can also create formats and layouts. Access logs can be split and divided up to make them easier to process by reporting specific items or ignoring those items that have no relevance.
Logging in Apache
When was the last time you seriously thought about the content of your Web logs? Not the processing of the logs and the production of statistics, but the actual raw content and information the logs store?
Many people will never choose a format beyond the standard common log format most systems use. They will generate logs containing all the requests, errors, and other information. This can be wasteful of resources, both in terms of the work Apache does to report the information and the disk space required to store it. There are, however, ways in which you can simplify and customize the logging system to suit your needs.
Creating Custom Log Formats
Both major versions of Apache (1.3 and 2.x) allow you to create custom log formats containing only the fields you want, in the format and layout you desire.
The custom log format system enables the clear definition of exactly which fields of information to include in the log and the surrounding format. This means you can use spaces, tabs, or other delimiters to specify the format of each line written to the file.
Individual fields within the log format are specified using a special format string. For example, %a is replaced in the actual log with the IP address of the remote host. For a full list of the fields available, see the Apache mod_log_config documentation.
To specify a format, use the LogFormat directive. For example, the Common Log Format would be set using the directive:
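The Common Log Format is defined in the mod_log_config documentation as shown here. The word common at the end of the line is the nickname that later directives use to refer to the format:

```apache
# Common Log Format: host, identity, user, time, request line, status, bytes sent
LogFormat "%h %l %u %t \"%r\" %>s %b" common
```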
The actual format of the information (i.e., the text surrounding the fields you include) is entirely up to you. For example, the following log format generates a log using an XML structure. It is laid out for clarity; the actual definition would have to be on a single line, but each embedded \n sequence will generate a new line when the log is written.
Formats can be given a name (as they are in both of the examples above), and that name can then be used when specifying a log for a host or virtual host to request the desired log format:
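A minimal sketch of such an XML format might look like the following. The element names and the nickname xmllog are illustrative assumptions, not a standard schema. The commented lines show the intended layout of one record, while the directive itself stays on a single line:

```apache
# Layout of one record, shown for reading only:
#   <request>
#     <host>%h</host>
#     <time>%t</time>
#     <url>%U</url>
#     <status>%>s</status>
#     <bytes>%b</bytes>
#   </request>
# Hypothetical element names and nickname; each \n becomes a newline in the log.
LogFormat "<request>\n  <host>%h</host>\n  <time>%t</time>\n  <url>%U</url>\n  <status>%>s</status>\n  <bytes>%b</bytes>\n</request>" xmllog
```

Apache does not XML-escape logged values, so a URL containing a character such as & can produce a record that is not well-formed XML. A consumer of such a log should be prepared for that.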
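For instance, assuming a format nicknamed common has already been defined with LogFormat, a log that uses it could be declared like this (the log path is illustrative):

```apache
# Write the access log using the format nicknamed "common"
CustomLog logs/access_log common
```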
It's possible to create a variety of log formats and for the individual logs to contain different information, so we can spread access, errors, scripts, referrer, and other information across a range of different files. We can also be selective about what information appears in each file by using the conditional logging system.
Not all log entries are useful or interesting. For example, you might want to ignore all the log entries the testing department generates but continue recording all user errors. Or, you might want to ignore everything but HTML accesses in the logs, therefore ignoring images and other sundry items that might make up your page.
You can ignore these elements when you are processing the logs, but additional processes are then required. It's also worth remembering that anything recorded and not used carries CPU and disk overhead. For example, a single page with 20 images on it will generate 21 entries in the log: a lot of additional work and storage space for something that will be ignored.
Conditional logging relies on the environment variable system: a rule sets a variable based on the request, and the CustomLog directive accepts a condition on that variable that decides whether the entry is written. For example, let's say you wanted to log only requests for .html or .shtml files on the Web server. First, create an environment rule specifying which files to include in the logs. To set up an environment rule for our html/shtml files:
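One way to write such a rule is with SetEnvIf from mod_setenvif, matching on the request path. The variable name html_request is a hypothetical choice for this sketch:

```apache
# Set html_request when the requested path ends in .html or .shtml
SetEnvIf Request_URI "\.s?html$" html_request
```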
Now, apply this environment variable setting to the CustomLog directive:
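As a sketch, assuming a SetEnvIf rule has set a variable named html_request (a hypothetical name) for .html and .shtml requests, the condition is attached with the env= argument:

```apache
# Only requests for which html_request is set are written to this log
CustomLog logs/access_log common env=html_request
```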
To reject some files, for example to include everything except image files, create the rule the same way:
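A rule that marks image requests might look like this. The variable name image_request and the list of extensions are assumptions made for illustration:

```apache
# Set image_request when the requested path ends in a common image extension
SetEnvIf Request_URI "\.(gif|jpe?g|png)$" image_request
```

SetEnvIf matches case-sensitively; the companion directive SetEnvIfNoCase can be used instead if files such as PHOTO.JPG should also match.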
But use an exclamation mark before the environment rule in the CustomLog directive to exclude those matching the environment variable that has been set:
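For example, assuming a SetEnvIf rule has set a variable named image_request (a hypothetical name) for image files, the negated condition looks like this:

```apache
# Log every request except those for which image_request is set
CustomLog logs/access_log common env=!image_request
```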
The SetEnvIf directive is quite extensive. More information about using the directive is found in the online Apache HTTPD documentation.
The effects achievable with this are wide-ranging. On some sites, for example, requests for different departments (as identified by the first directory in the URI) may be written to separate logs in directories that are available only to each department:
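A sketch for a single department follows. The /marketing/ prefix, the variable name, and the log path are all hypothetical, and the pair of lines would be repeated for each department:

```apache
# Mark requests whose path starts with /marketing/
SetEnvIf Request_URI "^/marketing/" dept_marketing
# Write only those requests to a log in the department's own directory
CustomLog /usr/local/marketing/logs/access_log common env=dept_marketing
```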
Conditional logs can restrict or split the information however you want. But keep processing overhead in mind.
Conditional logging enables you to configure information written to a log, but it's limited to controlling the output of a particular CustomLog definition. Sometimes it's preferable to divide up the content of the log file based on the available information, such as host, referrer, or HTTP headers.
The typical log configuration creates a separate access and error log for each server and virtual host the Apache installation is serving. To configure logs for virtual hosts, place the ErrorLog and CustomLog directives within the VirtualHost directive:
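A minimal sketch of a virtual host with its own logs is shown here. The host name, document root, and file names are placeholders, and the common format is assumed to be defined elsewhere in the configuration:

```apache
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example
    # Errors and accesses for this host only
    ErrorLog logs/example-error_log
    CustomLog logs/example-access_log common
</VirtualHost>
```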
The logs can be divided up even further to record additional information separately, for example the error output from scripts and CGI applications, as well as referrer information. For Web programmers in particular, the first option is invaluable, as it enables an admin to monitor the errors in a Web application without having to sift through the general error or access log to find the needed information.
A number of options are available for dividing up logs without using the conditional logging system:
All of these directives apply globally or within each VirtualHost directive.
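As an illustration of this kind of split, the sketch below sends CGI error output to its own file with ScriptLog and records referrer and browser information in separate logs using CustomLog with an inline format. The file names are illustrative, and these are not necessarily the exact directives the text above refers to:

```apache
# Error output from CGI scripts (mod_cgi); intended mainly as a debugging aid
ScriptLog logs/cgi_log
# Referring page and the local URL path it led to
CustomLog logs/referer_log "%{Referer}i -> %U"
# Browser identification string
CustomLog logs/agent_log "%{User-agent}i"
```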
One final trick with the custom log formatting system is to selectively log fields based on the status code of the request. Normally, all requests are logged, even if the request failed for some reason. By prefixing a list of status codes before each field, status-based logs are created. If, for example, you wanted to log all redirection requests to a specific log to determine whether an old URL was still being used, you could parse the log to determine this, or you could create a special log containing just this information using the following fragment in the configuration file:
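A minimal sketch of such a fragment follows; the nickname redirects and the file name are hypothetical. The status list is placed between the % and the field letter, and a field whose status condition is not met is written as a literal "-". Apache still writes one line per request, so entries for other status codes appear as placeholder lines that are easy to filter out:

```apache
# URL path and Referer header, written only when the status is 301 or 302
LogFormat "%301,302U %301,302{Referer}i" redirects
CustomLog logs/redirect_log redirects
```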
Now, it is evident at a glance which URLs are still being used and redirected to new locations.
Many people will take their logs and, using various techniques, reformat the information into a more useful layout. There are lots of ways of doing this, and a number of tools, like analog, simplify the process.
For large installations with a high number of servers or sites, it can be more practical to write the information into a database, which is then used to report on the information directly. Running an SQL query to pick out the number of hits for a given URI is quicker than parsing a 20 MB text file and picking out the information.
There are three ways you can achieve this: pipes, third-party modules, and post-processing. The first pipes the output of a log directive to a script, which parses the information and inserts it directly into a database. For example, the following line would write a log entry in the common log format to an SQL database through a script called apachetosql:
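A piped log is declared by giving CustomLog a quoted command that begins with a pipe character in place of a file name. The installation path of the script below is an assumption:

```apache
# Apache starts the program and sends it each log line on standard input
CustomLog "|/usr/local/bin/apachetosql" common
```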
The script would work just like a post-processor, reading the line Apache sent to it, extracting the fields, and then writing a suitable INSERT query to add the line to the database. If you are going to use this method, consider using the %v custom format to write the virtual host name into the log entry so accesses to a specific site are trackable.
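One way to do that is to define a variant of the common format with %v at the front and pipe that instead. The nickname vcommon and the script path are hypothetical:

```apache
# Common Log Format prefixed with the name of the virtual host serving the request
LogFormat "%v %h %l %u %t \"%r\" %>s %b" vcommon
CustomLog "|/usr/local/bin/apachetosql" vcommon
```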
The main concerns with the piped method are security and the additional overhead required to process the contents. The script or application used is another load on your server, and it needs to be error proof. The script will also be executed with the privileges of the user that started Apache, which is usually root, so it must be kept simple and secure.
A more extensive solution is provided by modules such as the mod_log_sql module. This module automatically inserts the log information into a suitable table within a MySQL database. Unlike the pipe method, it uses a direct connection to the MySQL socket, reducing the overhead required to process the information. Because it's built into Apache, an extra process is not required; nor do we need to worry about the security of the script.
With the two direct methods, the main issue is the availability of the SQL server itself. Any error in the availability of the SQL server runs the risk of information being lost. Even with the module method, additional processing is required to log the information. That means either additional local processing or network bandwidth on the log server.
The post-processing method allows continued logging to standard text files. You can later import the log data into your MySQL or other database, without worrying about the concurrent processing overhead or connectivity issues. Then, if anything goes wrong, the text files are there to fall back on.
Although for most users logging is a critical part of the monitoring process, it can put a burden on machines serving a large number of sites and virtual hosts as well as particularly busy Web servers. In theory, the logging process shouldn't cause too much of a problem, but for those worried about its effects on the server, a few tricks are available.
While these techniques will not guarantee a massive improvement in performance, they will make a difference, and it's always important to strike the right balance between the amount of information needed and the performance cost of recording it. Log everything, and you'll waste time and disk space; don't record enough, and you may run foul of your marketing department and enterprise regulatory requirements.