Secure Apache: Out, Damned Bot!


The introduction of almost any technology is closely followed by attempts to figure out how to abuse it. As the technology matures, the methods of abuse become more and more sophisticated.

Learn how to defend your Web server against abusive spiders and ‘flies.’

The Web is no different; almost as soon as people started publishing content, others began trying to figure out how to steal it. I’ll call these people and their ilk ‘perps.’ As soon as pages became read/write instead of just read-only, perps began figuring out how to use them to publish their own content on other people’s servers. Spam posted to wikis and blog comments is a familiar example of this.

Describing ways of dealing with such abuses is the end goal of this series of articles, but I’m going to cover some more basic issues first.

Spiders and Flies

The Bouncer

When the toxic spider problem first surfaced, it took the form of simply gathering too much information (and thereby occasionally affecting server performance). It wasn’t long before a solution appeared: the Robot Exclusion Standard (RES). It describes the format of a file called “robots.txt” that you can put on your site to indicate which areas are available for crawling and which are not.
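
To make the idea concrete, here is a minimal sketch of how a well-behaved bot might consult such a file, using Python's standard `urllib.robotparser` module. The file contents, the paths, and the bot names (`FriendlyBot`, `BadBot`) are made up for illustration, and the helper `may_fetch` is a hypothetical name, not part of any standard.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths and bot names are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/

User-agent: BadBot
Disallow: /
"""

# Parse the rules from memory instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, path: str) -> bool:
    """Return True if the parsed rules let this user agent crawl the path."""
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("FriendlyBot", "/index.html"),
        ("FriendlyBot", "/private/data.html"),
        ("BadBot", "/index.html"),
    ]:
        print(agent, path, may_fetch(agent, path))
```

Note that nothing here enforces anything: the check runs on the bot's side, so it only helps when the bot chooses to run it.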

The RES is intended to stand at the door of your site and control access by the paparazzi, er, spiders. The idea was that legitimate bots would check for the file and obey its restrictions. Of course, it doesn’t directly help block those that don’t check the file, but it was quickly adopted by the Spiders in the White Hats and has become a fixture of today’s Web.

However, a standard like this is a little like a traffic signal; it works only when people agree to abide by the rules. Just as drivers who ignore the signal can cause crashes, spiders that ignore the rules can cause trouble for your server.

With a little cleverness, we can use the toxic spiders’ RES non-compliance against them. To flog the analogy a little bit more, note that some municipalities have installed cameras to take photographs of malefactors who break the traffic laws.

We’re going to do something a little bit like that to deal with these naughty bots. Consider these possibilities, listed in order of increasing nastiness:

1. The spider checks for robots.txt, and doesn’t crawl prohibited areas. (Good bot! Here, have a cookie.)
2. The spider checks for robots.txt, but doesn’t comply with the restrictions.
3. The spider doesn’t even bother to check for robots.txt at all.
4. The spider reads robots.txt, scans for ‘allow’ stanzas¹ that apply to other spiders, and then masquerades as those in order to access the protected areas.
5. The spider reads robots.txt and explicitly tries to scan prohibited areas.

The first case covers the Spiders in the White Hats, so we will not worry about it. Handling the others requires applying some intelligence to the process, which means recording what a particular bot is doing and making decisions based on its activities.
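
As a rough illustration of recording what a bot does and deciding based on its activity, the sketch below sorts clients into three coarse groups from a list of requests. Everything in it is an assumption made for the example: the `(client, path)` record format, the prohibited prefixes, and the names `Verdict` and `classify`. It also cannot tell the masquerading or deliberately probing spiders apart from one that merely ignores the rules; from the request paths alone, they all look like a bot that read robots.txt and crawled a prohibited area anyway.

```python
from collections import defaultdict
from enum import Enum


class Verdict(Enum):
    COMPLIANT = "checked robots.txt and stayed out of prohibited areas"
    IGNORED_RULES = "checked robots.txt but crawled prohibited areas"
    NEVER_CHECKED = "never requested robots.txt"


# Hypothetical prohibited areas; in practice these would mirror your robots.txt.
DISALLOWED_PREFIXES = ("/private/", "/cgi-bin/")


def classify(requests):
    """Map each client to a Verdict, given an iterable of (client, path) pairs."""
    fetched_robots = defaultdict(bool)
    hit_prohibited = defaultdict(bool)

    # Record, per client, whether it asked for robots.txt and
    # whether it ever requested a prohibited path.
    for client, path in requests:
        fetched_robots[client] |= path == "/robots.txt"
        hit_prohibited[client] |= path.startswith(DISALLOWED_PREFIXES)

    verdicts = {}
    for client in fetched_robots:
        if not fetched_robots[client]:
            verdicts[client] = Verdict.NEVER_CHECKED
        elif hit_prohibited[client]:
            verdicts[client] = Verdict.IGNORED_RULES
        else:
            verdicts[client] = Verdict.COMPLIANT
    return verdicts


if __name__ == "__main__":
    sample_log = [
        ("bot-a", "/robots.txt"),
        ("bot-a", "/index.html"),
        ("bot-b", "/robots.txt"),
        ("bot-b", "/private/report.html"),
        ("bot-c", "/index.html"),
    ]
    for client, verdict in classify(sample_log).items():
        print(client, "->", verdict.value)
```

In a real deployment the client key would need more thought (IP address, user-agent string, or both), since a spider can change either one between requests.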

¹ The original RES didn’t support ‘allow’ stanzas, and not all RES-compliant bots recognize them. However, the basic issue is the same even for ‘disallow’ stanzas — a bot with evil intentions can conceivably change its access by pretending to be one of those for which you have explicit rules.

Page 2: Dynamic robots.txt

This article was originally published on Enterprise IT Planet.
