Apache Guide: Logging, Part 4 -- Log-File Analysis Page 2
HTTP is a stateless, anonymous protocol. This is by design, and is not, at least in my opinion, a shortcoming of the protocol. If you want to know more about your visitors, you have to be polite, and actually ask them. And be prepared to not get reliable answers. This is amazingly frustrating for marketing types. They want to know the average income, number of kids, and hair color, of their target demographic. Or something like that. And they don't like to be told that that information is not available in the log files. However, it is quite beyond your control to get this information out of the log files. Explain to them that HTTP is anonymous.
And even what the log files do tell you is occasionally suspect. For example, I have numerous entries in my log files indicating that one particular machine visited my web site today. I can tell that this is a machine that is on the AOL network. But because of the way that AOL works, this might be one person visiting my site many times, or it might be many people visiting my site one time each.
AOL does something called proxying, and you can see from the machine address that it is a proxy server. A proxy server is one that one or more people sit behind. They type an address into their browser. It makes that request to the proxy server. The proxy server gets the page (generating the log file entry on my web site). It then passes that page back to the requesting machine. This means that I never see the request from the originating machine, but only the request from the proxy.
Another implication of this is that if, 10 minutes later, someone else sitting behind that same proxy requests the same page, they don't generate a log file entry at all. They type in the address, and that request goes to the proxy server. The proxy sees the request and thinks "I already have that document in memory. There's no point asking the web site for it again." And so instead of asking my web site for the page, it gives the copy that it already has to the client. So, not only is the address field suspect, but the number of requests is also suspect.
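To make the proxy behavior above concrete, here is a toy simulation in Python. It is only a sketch: the class names, the address proxy.example.net, and the user names are all made up for illustration, and a real proxy also honors expiry times and cache-control headers, which this ignores.

```python
class OriginServer:
    """Stands in for the web site; records one log entry per request it sees."""

    def __init__(self):
        self.log = []

    def fetch(self, client_addr, path):
        self.log.append((client_addr, path))
        return f"<html>contents of {path}</html>"


class CachingProxy:
    """Forwards a request to the origin only when it has no stored copy."""

    def __init__(self, addr, origin):
        self.addr = addr
        self.origin = origin
        self.cache = {}

    def get(self, path):
        if path not in self.cache:
            # The origin sees the proxy's address, never the real client's.
            self.cache[path] = self.origin.fetch(self.addr, path)
        return self.cache[path]


origin = OriginServer()
proxy = CachingProxy("proxy.example.net", origin)

# Three different people behind the same proxy ask for the same page.
pages = [proxy.get("/index.html") for _user in ("alice", "bob", "carol")]

print(len(pages))   # pages delivered to readers
print(origin.log)   # entries the web site actually recorded
```

Three people read the page, but the web site's log holds a single entry, and that entry carries the proxy's address rather than any of theirs.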
It might sound like the data that you receive is so suspect as to be useless. This is in fact not the case. It should just be taken with a grain of salt. The number of hits that your site receives is almost certainly not really the number of visitors that came to your site. But it's a good indication. And it still gives you some useful information. Just don't rely on it for exact numbers.
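As a rough illustration of the gap between hits and visitors, the following Python sketch counts both the total number of log entries and the number of distinct client addresses. The sample entries are invented, and the sketch assumes the client address is the first space-separated field on each line, as it is in the Common log format.

```python
# Made-up sample entries in Common Log Format; the hosts are placeholders.
SAMPLE_LOG = """\
proxy.example.net - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
proxy.example.net - - [10/Oct/2000:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 4512
proxy.example.net - - [10/Oct/2000:14:02:11 -0700] "GET /index.html HTTP/1.0" 200 2326
192.0.2.17 - - [10/Oct/2000:14:05:40 -0700] "GET /index.html HTTP/1.0" 200 2326
"""


def hits_and_hosts(log_text):
    """Return (number of hits, number of distinct client addresses)."""
    hosts = [line.split(" ", 1)[0] for line in log_text.splitlines() if line.strip()]
    return len(hosts), len(set(hosts))


hits, distinct_hosts = hits_and_hosts(SAMPLE_LOG)
print(f"{hits} hits from {distinct_hosts} distinct addresses")
```

Neither number is the visitor count. The hits include images and repeat views, and a single address may be a proxy with any number of people behind it. Both are still useful as rough indicators.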
So, to the real meat of all of this. How do you actually generate statistics from your Web-server logs?
There are two main approaches that you can take here. You can either do it yourself, or you can get one of the existing applications that are available to do it for you.
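If you want a feel for what the do-it-yourself route involves, here is a minimal Python sketch that parses lines in the Common log format and tallies the most requested pages. The regular expression, the function names, and the sample lines are my own assumptions for illustration, not part of any existing tool, and the pattern will not match custom log formats.

```python
import re
from collections import Counter

# Fields, in order: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line is not Common Log Format."""
    match = CLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def top_paths(lines, limit=10):
    """Count successful (2xx) requests per path; skip lines that do not parse."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None or not entry["status"].startswith("2"):
            continue
        parts = entry["request"].split()
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return counts.most_common(limit)


# Made-up sample lines; in practice you would iterate over your own log file.
sample = [
    '192.0.2.17 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.44 - - [10/Oct/2000:13:56:02 -0700] "GET /missing.html HTTP/1.0" 404 209',
    '192.0.2.17 - - [10/Oct/2000:13:57:10 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.44 - - [10/Oct/2000:13:58:45 -0700] "GET /docs/ HTTP/1.0" 200 1024',
]
print(top_paths(sample))
```

Even this small sketch has to decide what to do with failed requests and malformed lines, which hints at how much work a complete reporting tool takes on for you.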
Unless you have custom log files that don't look anything like the Common log format, you should probably get one of the available apps out there. There are some excellent commercial products, and some really good free ones, so you just need to decide what features you are looking for.
So, without further ado, here are some of the great apps out there that can help you with this task.