Web Automation: Dynamic Directory Indexing Page 2

Line 1 places the filesystem name of the root folder we want to index into a scalar variable. Line 2 places the URL we will (eventually) substitute for the filesystem name into a second scalar variable. Line 3 opens the directory and assigns a directory handle named PRJD to it. Line 4 places all the objects in that directory into the array called @dirs. Line 5 simply closes the open directory. From here, we have all of the directories (as well as files and symbolic links) stored in the @dirs array. It may not make sense yet why we need both of those scalars, but that will all be much clearer further on.
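
The open, read, close sequence described above can be sketched roughly as follows. This is only an illustration: the names $root_dir, $root_url, and read_entries are assumptions made for this example, not the identifiers used in the original listing, and a temporary directory stands in for the real document root so the sketch can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: return every entry name in a directory,
# including "." and "..", exactly as readdir reports them.
sub read_entries {
    my ($dir) = @_;
    opendir(PRJD, $dir) or die "Cannot open $dir: $!";
    my @entries = readdir(PRJD);
    closedir(PRJD);
    return @entries;
}

# Stand-ins for the two scalars from Step 1 (both names are assumptions).
my $root_dir = tempdir(CLEANUP => 1) . "/";  # filesystem name of the root folder
my $root_url = "/projects/";                 # URL to substitute for it later

mkdir "${root_dir}Apache" or die "mkdir failed: $!";
my @dirs = read_entries($root_dir);
print "$_\n" for sort @dirs;
```

Note that readdir hands back bare names, not full paths, and that the list includes the "." and ".." entries.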

Step 2: For every directory, open the index.html file if it exists
So now we need to iterate over our @dirs array. The simplest way to do this is with the for statement. Every time the loop goes around, a special scalar named $_ will contain the name of the current object (be it directory, file, or link), and the loop will terminate when all objects have been processed. For sanity's sake, the code snippet below will ignore any object that isn't a directory or that starts with a "." (period).

1: for(@dirs) {
2: if($_ =~ /^\./) { next; }
3: unless(-d $_) { next; }
4: my ="$_/index.html";
5: my =Get_Title("");
6: # print the entry here
7: }

Line 1 starts the loop, iterating over our array-o-directory-objects. Line 2 says "if this object begins with a dot, then skip this object and cycle the loop". Line 3 says "if this object is not a directory, then skip this object and cycle the loop". Line 4 builds the full path to this directory's index.html file and places it in a scalar variable. Line 5 calls a mystery function (that we will be writing very soon) to extract the title from the webpage. Line 6 is a comment, holding the place for some code we will shoe-horn in here during Step 4. Line 7 is the end of this loop.
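
The two skip rules can be tried out in isolation. The sketch below is a hedged illustration rather than the article's own code: indexable_dirs and $root_dir are hypothetical names, and the directory test is run against the full path, which is an assumption made here so the example works from any working directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: given a root directory (with trailing slash) and the
# raw entry names from readdir, return the names worth indexing.
sub indexable_dirs {
    my ($root_dir, @dirs) = @_;
    my @keep;
    for (@dirs) {
        next if /^\./;                  # skip ".", "..", and hidden entries
        next unless -d "$root_dir$_";   # skip anything that is not a directory
        push @keep, $_;
    }
    return @keep;
}

# Throwaway test tree: one real directory, one hidden directory, one file.
my $root_dir = tempdir(CLEANUP => 1) . "/";
mkdir "${root_dir}Apache"  or die "mkdir failed: $!";
mkdir "${root_dir}.hidden" or die "mkdir failed: $!";
open(my $fh, ">", "${root_dir}notes.txt") or die "open failed: $!";
close $fh;

my @names = (".", "..", ".hidden", "Apache", "notes.txt");
print join(",", indexable_dirs($root_dir, @names)), "\n";   # Apache
```

Testing the full path keeps the check independent of the current working directory, whereas testing the bare entry name only works when the script is running from inside the root directory.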

If you look at line 4 above, you'll notice a bit of magic in front of index.html: the string is built from three pieces. The first is the variable we set up in Step 1 that contains the filesystem name of the root directory we want to index. The $_ variable, as I mentioned before, contains the name of the object we're currently processing. The trailing / simply appends a slash to the end of the directory name. So, if we were currently processing the Apache object, the root directory variable would contain /usr/local/apache/htdocs/projects/, $_ would contain Apache, and the concatenation of all three would be /usr/local/apache/htdocs/projects/Apache/! What wonderful magic.
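
To see the concatenation in action, here is a tiny sketch using the example values from the paragraph above. The names index_path, $root_dir, and $entry are stand-ins chosen for this illustration; inside the real loop the entry name lives in $_.

```perl
use strict;
use warnings;

# Hypothetical helper: glue root directory, entry name, and a trailing
# slash together, then append the file we want to open.
sub index_path {
    my ($root_dir, $entry) = @_;
    return "$root_dir$entry/index.html";
}

my $root_dir = "/usr/local/apache/htdocs/projects/";
my $entry    = "Apache";

print "$root_dir$entry/\n";                  # /usr/local/apache/htdocs/projects/Apache/
print index_path($root_dir, $entry), "\n";   # /usr/local/apache/htdocs/projects/Apache/index.html
```

Because the root directory value already ends in a slash, no separator is needed between the first two pieces.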

Step 3: For every index.html file, extract the title of the page
As I mentioned when describing Line 5 of Step 2, we have to write a Get_Title function that takes in the name of the file, processes it, and returns the title of the page. Fortunately, titles are pretty easy to extract:

1: sub Get_Title {
2: my =shift;
3: unless(-f "") { return("NO INDEX"); }
4: open(HTML,"<");
5: while(<HTML>){
6: if($_ =~ /<title>(.*)<\/title>/i) {
7: close HTML;
8: return "";
9: }
10: }
11: close HTML;
12: return "Untitled";
13: }
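
For experimenting with the title extraction on its own, a variant along the following lines may help. It is a sketch under stated assumptions, not the article's listing: get_title, $file, and $line are illustrative names, the open is checked and uses a lexical filehandle, and the match is made non-greedy. The text captured by the parentheses in the pattern is what Perl makes available as $1.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Illustrative variant of the title extractor (names are assumptions).
sub get_title {
    my ($file) = @_;
    return "NO INDEX" unless -f $file;
    open(my $html, "<", $file) or return "NO INDEX";
    while (my $line = <$html>) {
        # Non-greedy match; $1 holds whatever sat between the tags.
        if ($line =~ /<title>(.*?)<\/title>/i) {
            close $html;
            return $1;
        }
    }
    close $html;
    return "Untitled";
}

# Build a throwaway page to run the function against.
my $dir  = tempdir(CLEANUP => 1);
my $page = "$dir/index.html";
open(my $out, ">", $page) or die "Cannot write $page: $!";
print $out "<html><head><TITLE>Apache Notes</TITLE></head></html>\n";
close $out;

print get_title($page), "\n";                 # Apache Notes
print get_title("$dir/missing.html"), "\n";   # NO INDEX
```

Like the line-by-line approach it is modeled on, this only finds a title whose opening and closing tags sit on the same line of the file.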

This article was originally published on Jun 2, 2000
Page 2 of 4
