The introduction of almost any technology is closely followed by attempts to figure out how to abuse it. As the technology matures, the methods of abuse become more and more sophisticated.
Learn how to defend your Web server against abusive spiders and ‘flies.’
The Web is no different; almost as soon as people started publishing content, others began trying to figure out how to steal it. I’ll call these people and their ilk ‘perps.’ As soon as pages became read/write instead of just read-only, perps began figuring out how to use them to publish their own content on other people’s servers. Abuse of wikis and blog comments is one example of this.
Describing ways of dealing with such abuses is the end goal of this series of articles, but I’m going to cover some more basic issues first.
Spiders and Flies
The tools and robots that crawl the Web looking for content (for whatever reason) are frequently called ‘spiders,’ or sometimes ‘bots.’ Some spiders are good, such as the Google bot, which loads the Google search engine with what it finds. Others have a much more questionable goodness quotient, such as those that search Web pages for e-mail addresses to add to spam lists, or look for trademark references so that the information can be sold to the trademark holders for possible lawsuits.
While the term spider is in common use, I’ve never heard anyone give a name to the other type of abuse — that of hijacking writable Web pages such as blog comments and wikis. I’m going to coin the term ‘flies’ for abusive tools of this type, since they cluster around and crawl all over pages, leaving flyspecks and crap on them.
Abuses can be handled proactively or reactively, or, I suppose, there’s the third option of ‘not at all.’
Proactive measures include SSL, user memberships, credential-protected pages, and scrutiny of submitted content (called ‘moderation’) before acceptance. As usual, a common result is that innocent users suffer because of the bad behavior of the perps, having to jump through hoops, click through multiple pages, and solve CAPTCHA challenges. (You know, like the obfuscated images of warped words you must type in to prove you’re not a bot.)
Handling abuses reactively usually means you detect when someone misbehaves and enact restrictions that will prevent it from happening again. Doing this correctly can be an art, since making the conditions too narrow will let similar-but-not-identical abuses get through, while making them too broad can lock out legitimate visitors.
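To make the narrow-versus-broad tradeoff concrete, here is a minimal sketch that blocks by IP address. Everything in it is an assumption for illustration: the `is_blocked` helper is hypothetical, and the addresses come from a range reserved for documentation. A real rule might just as easily key on user agent or request pattern.

```python
import ipaddress

def is_blocked(client_ip: str, rule: str) -> bool:
    """Return True if client_ip falls inside the blocked address or network."""
    network = ipaddress.ip_network(rule, strict=False)
    return ipaddress.ip_address(client_ip) in network

# Hypothetical rules written after 203.0.113.57 misbehaved.
narrow_rule = "203.0.113.57/32"  # only the exact offending address
broad_rule = "203.0.113.0/24"    # the offender's whole /24 network

# Too narrow: the same perp, now on a neighboring address, gets through.
print(is_blocked("203.0.113.58", narrow_rule))   # False

# Too broad: an unrelated visitor on the same network is locked out.
print(is_blocked("203.0.113.200", broad_rule))   # True
```

Neither rule is wrong in itself; the art lies in picking the width that matches how the abuser actually moves around.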
When the toxic spider problem first surfaced, it took the form of simply gathering too much information (and thereby occasionally affecting server performance). It wasn’t long before a solution appeared: the Robot Exclusion Standard (RES). It described the format of a file called “robots.txt” that you could put on your site to indicate which areas are available for crawling and which are not.
The RES is intended to stand at the door of your site and control access by the paparazzi, er, spiders. The idea was that legitimate bots would check for the file and obey its restrictions. Of course, it doesn’t directly help block those that don’t check the file, but it was quickly adopted by the Spiders in the White Hats and has become a fixture of today’s Web.
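For a feel of what "check the file and obey it" looks like from the bot's side, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the paths, and the bot name `ExampleBot` are all made up for illustration; a real spider would download the file from the site rather than use a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/notes.html"):
        verdict = "allowed" if may_fetch("ExampleBot", url) else "prohibited"
        print(url, verdict)
```

Note that nothing here is enforced by the server. The check happens entirely inside the bot, which is exactly why it only restrains the well-behaved ones.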
However, a standard like this is a little like a traffic signal; it works only when people agree to abide by the rules. Spiders that don’t abide by the rules can often cause crashes.
With a little cleverness, we can use the toxic spiders’ RES non-compliance against them. To flog the analogy a little bit more, note that some municipalities have installed cameras to take photographs of malefactors who break the traffic laws.
We’re going to do something a little bit like that to deal with these naughty bots. Consider these possibilities, listed in order by increasing nastiness:
- The spider checks for robots.txt, and doesn’t crawl prohibited areas. (Good bot! Here, have a cookie.)
- The spider checks for robots.txt, but doesn’t comply with the restrictions.
- The spider doesn’t even bother to check for robots.txt at all.
- The spider reads robots.txt, scans for ‘allow’ stanzas [1] that apply to other spiders, and then masquerades as those in order to access the protected areas.
- The spider reads robots.txt and explicitly tries to scan prohibited areas.
The first case covers the Spiders in the White Hats, so we will not worry about it. Handling the others requires applying some intelligence to the process, which means recording what a particular bot is doing and making decisions based on its activities.
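As a rough illustration of "recording what a bot is doing," the sketch below tallies requests per client and sorts clients into coarse buckets. It is only a sketch under stated assumptions, not the method this series goes on to describe: the `BotTracker` class, the prohibited prefixes, and the bucket names are all hypothetical. It also cannot tell a bot that merely ignores the restrictions from one that deliberately targets prohibited areas, and it does not detect masquerading.

```python
from dataclasses import dataclass

# Hypothetical prohibited areas, mirroring 'disallow' lines in robots.txt.
DISALLOWED_PREFIXES = ("/private/", "/cgi-bin/")

@dataclass
class ClientRecord:
    fetched_robots: bool = False  # did this client ever request robots.txt?
    disallowed_hits: int = 0      # requests into prohibited areas
    allowed_hits: int = 0         # requests everywhere else

class BotTracker:
    def __init__(self):
        self.clients = {}  # maps a client id (e.g. IP address) to ClientRecord

    def record(self, client_id: str, path: str) -> None:
        """Update the tally for one request."""
        rec = self.clients.setdefault(client_id, ClientRecord())
        if path == "/robots.txt":
            rec.fetched_robots = True
        elif path.startswith(DISALLOWED_PREFIXES):
            rec.disallowed_hits += 1
        else:
            rec.allowed_hits += 1

    def classify(self, client_id: str) -> str:
        """Place a client into a coarse behavior bucket."""
        rec = self.clients.get(client_id)
        if rec is None:
            return "unknown"
        if rec.disallowed_hits == 0:
            return "compliant" if rec.fetched_robots else "unchecked-but-harmless"
        if rec.fetched_robots:
            return "checked-but-violated"
        return "never-checked-and-violated"

if __name__ == "__main__":
    tracker = BotTracker()
    for client, path in [("good", "/robots.txt"), ("good", "/index.html"),
                         ("bad", "/robots.txt"), ("bad", "/private/a.html"),
                         ("rude", "/cgi-bin/form.pl")]:
        tracker.record(client, path)
    for client in ("good", "bad", "rude"):
        print(client, tracker.classify(client))
```

In practice the client id would come from your server logs or request handler, and the decision about what to do with each bucket is a separate question.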
[1] The original RES didn’t support ‘allow’ stanzas, and not all RES-compliant bots recognize them. However, the basic issue is the same even for ‘disallow’ stanzas: a bot with evil intentions can conceivably change its access by pretending to be one of those for which you have explicit rules.
This article was originally published on Enterprise IT Planet.