Secure Apache: Out, Damned Bot! Page 2
So, let's make robots.txt a dynamic document: a PHP script. That lets us consult a database that other processes can update in real time, making our robots.txt rules truly dynamic.
Thoreau had a good idea when he advised us to "simplify, simplify!" Let's assume you have different restrictions for different bots, say one set for Google and another for Yahoo!. If your robots.txt file is static, it needs a stanza for each bot's rules, which means that every bot can see what its competitors' access rules are. (Can you see that I'm going after case #4?)
If it's a dynamic document, however, we can feed each bot only those rules that apply to it and it alone. The robots.txt that the bot sees is much simpler than our overall set of rules, and that's where "simplify, simplify" comes in, even though it means a little more work for our server. One of the truisms of security work is that increased security always costs something. I say that a lot, so get used to it.
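To make the idea concrete, here is a minimal sketch of a robots.txt script that picks one stanza per bot. Everything specific in it is an assumption for illustration: the `$rulesByBot` table, the `rulesFor()` helper, and the paths are hypothetical, and a hard-coded array stands in for the database the article has in mind.

```php
<?php
// Hypothetical rule table keyed by a substring of the crawler's User-Agent.
// In a real setup these rows would come from a database.
$rulesByBot = [
    'googlebot' => "User-agent: Googlebot\nDisallow: /private/\n",
    'slurp'     => "User-agent: Slurp\nDisallow: /private/\nDisallow: /archive/\n",
];

// Fallback stanza for any bot we have no specific rules for (also hypothetical).
$defaultRules = "User-agent: *\nDisallow: /private/\n";

// Return only the stanza that matches the requesting bot, or the default.
function rulesFor(string $userAgent, array $rulesByBot, string $defaultRules): string
{
    foreach ($rulesByBot as $token => $rules) {
        // Case-insensitive substring match against the User-Agent header.
        if (stripos($userAgent, $token) !== false) {
            return $rules;
        }
    }
    return $defaultRules;
}

header('Content-Type: text/plain');
echo rulesFor($_SERVER['HTTP_USER_AGENT'] ?? '', $rulesByBot, $defaultRules);
```

A request whose User-Agent contains "Googlebot" would get only the Googlebot stanza, and never see the rules meant for anyone else. Keep in mind that the User-Agent header is supplied by the client and can be faked, so this hides rules from honest bots rather than enforcing anything.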
For a first step, let's turn your existing robots.txt file into a PHP script that just returns its current contents. Nothing new or fancy, just making it dynamic in the most basic way. Where I need to mention server configuration directives, I'll use those for the Apache Web server. Make adjustments as appropriate for whatever server you're using.
1. First, make the server aware that robots.txt is a script and not an actual text file. Add a directive to your httpd.conf file that hands robots.txt to the PHP handler, and then restart Apache.
2. Edit your robots.txt file and add the following to the top:
<?php header('Content-Type: text/plain'); ?>
3. Try to access it from your Web browser. If all is working properly, you'll see only the normal rules, not the PHP code segment you just added.
(If you're not familiar or comfortable with PHP, feel free to use some other scripting language of your choice. All my code examples are going to be in PHP, though.)
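For step 1, one common way to tell Apache to run robots.txt through PHP is sketched below. It assumes PHP is loaded as an Apache module (mod_php); setups that use PHP-FPM or CGI need a different handler, so treat this as an illustration rather than the exact directive for your server.

```apache
# Assumes mod_php: treat the file named robots.txt as a PHP script
<Files "robots.txt">
    SetHandler application/x-httpd-php
</Files>
```

After restarting Apache, requesting /robots.txt should return the rules as plain text, with the PHP line at the top executed rather than displayed.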
You should now have a basic dynamic robots.txt document. Go right ahead and play with it to see what you can do. You may actually want to work with a different file (called new-robots.txt or something like that), so that if you make any mistakes you won't screw up the rules spiders are currently using to crawl your site.
In my next article I'll go into much more detail about fleshing this basic script out to do some actual work. This one is primarily intended to raise your consciousness, get you started, and possibly spur you to do a little research on your own.
This article was originally published on Enterprise IT Planet.