Web Automation: Generating Dynamic Tables of Contents


June 21, 2000

In my last article, I described how to make a Dynamic Directory Index using a Perl CGI. That CGI did four things:

  1. Obtained a list of directories (only one directory deep)
  2. For every directory, opened the index.html file if it existed
  3. For every index.html file, extracted the title of the page
  4. For every pilfered title, printed it back to the user as a link to the given page

This is nice if you have a hierarchical Web directory structure and want to strategically place these CGIs in high-traffic or frequently updated areas. But what if you want a table of contents that lists every index.html page? By slightly modifying our script from last time, we can turn our one-deep directory index into a full-blown table of contents generator, and even a search engine, with a little ingenuity.

Configuring Apache

Apache is pretty much ready to go if you want to implement these features. You might want to change the AddHandler directive for cgi-script, as I have demonstrated below, and turn on ExecCGI for whatever directory houses this script:

AddHandler cgi-script .pl .cgi

Thinking About the Problem

As I mentioned before, the four things our last CGI did will continue to be the core of this script. We just need to add in some recursion and spiff up the output formatting a little bit. So here's what this script needs to do (condensed a bit):

  1. Obtain a list of directories
  2. For every directory, open the index.html file and extract the title of the page
  3. Print back the title, with a link to the page
  4. Dive into the directory, and back to Step 1
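
The "dive in, then go back to Step 1" idea can be tried out without touching the filesystem. The following is a minimal sketch, not part of the script itself: the %children hash is a made-up stand-in for real directories, and the loop shows how taking items off the front of a list and putting each item's children back on the front visits everything in depth-first order.

```perl
use strict;
use warnings;

# Hypothetical in-memory "directory tree" standing in for the filesystem.
my %children = (
    "/"           => ["/docs/", "/news/"],
    "/docs/"      => ["/docs/perl/"],
    "/news/"      => [],
    "/docs/perl/" => [],
);

my @queue = @{ $children{"/"} };   # step 1: list the top-level directories
my @visited;

while (@queue) {
    my $current = shift @queue;                  # take the next directory
    push @visited, $current;                     # steps 2 and 3 would happen here
    unshift @queue, @{ $children{$current} };    # step 4: dive in before moving on
}

print "$_\n" for @visited;
# /docs/
# /docs/perl/
# /news/
```

Because the children go on the front of the list, /docs/perl/ is handled before /news/, which is the order a nested table of contents needs.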

The Functions

There are three parts of this program that fit very nicely into their own functions.

  1. Given a directory, return a list of subdirectories (one deep)
  2. Given the path to an HTML page, extract and return its title
  3. Given any path, judge its "depth"

Function 1: Given a directory, return a list of subdirectories (one deep)

The method used in the last article to do this was fairly clumsy and not scalable. Instead of trying to brute-force the same code into this example, I've improved this method. The function below, named Get_Dirs, takes the name of a directory, and returns an array of all of the subdirectories (one deep):

1: sub Get_Dirs {
2: my $basedir=shift;
3: opendir(GD,"$basedir") or return;
4: my @DIRS;
5: for(readdir(GD)) {
6: my $temp="$_";
7: if($temp =~ /^\./) { next; }
8: if(-d "$basedir$temp") {
9: push(@DIRS,"$basedir$temp/");
10: }
11: }
12: closedir GD;
13: return @DIRS;
14: }

Line 1 starts the Get_Dirs function. Line 2 stuffs the argument (the path of the "base" directory, which must end in a slash because line 8 appends the item name directly to it) into the scalar $basedir. Line 3 opens the base directory and assigns the GD handle to it. Line 4 creates an array called @DIRS for later use. Line 5 starts looping through the contents of the base directory. Line 6 assigns the item to the scalar $temp. Line 7 essentially says, "if this item begins with a dot, then skip it." Line 8 checks to make sure that the item in question is a directory. Line 9 adds the item to the @DIRS array if it is a directory (as determined in the previous line). Line 10 ends the previous if statement. Line 11 ends the for loop. Line 12 closes the directory handle. Line 13 returns the contents of the @DIRS array.
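
To see this logic run somewhere harmless, here is a minimal sketch that builds a throwaway directory tree and lists it. The name get_dirs_sketch and the test directory names are made up for illustration. It differs from Get_Dirs in two ways: it uses a lexical directory handle instead of the GD bareword, and it sorts the entries so the output order is predictable.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical variant of Get_Dirs: same checks, lexical handle, sorted output.
sub get_dirs_sketch {
    my $basedir = shift;                  # expected to end in "/"
    opendir(my $dh, $basedir) or return;  # empty list if unreadable
    my @dirs;
    for my $entry (sort { $a cmp $b } readdir($dh)) {
        next if $entry =~ /^\./;          # skip ".", ".." and hidden entries
        push @dirs, "$basedir$entry/" if -d "$basedir$entry";
    }
    closedir $dh;
    return @dirs;
}

# Build a temporary tree: two visible directories, one hidden one, one file.
my $root = tempdir(CLEANUP => 1) . "/";
for my $name ("docs", "news", ".hidden") {
    mkdir "$root$name" or die "cannot create $name ($!)";
}
open(my $fh, ">", "${root}index.html") or die "cannot create file ($!)";
close $fh;

my @found = get_dirs_sketch($root);
print "$_\n" for @found;   # only the docs/ and news/ paths are listed
```

The hidden directory and the plain file are both left out, and every returned path ends in a slash, so it can be passed straight back in to list the next level down.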

Function 2: Given the path to an HTML page, extract and return its title

Here we have the same Get_Title function from the last script. This function takes an HTML filename as an argument and returns the title if one is found:

1: sub Get_Title {
2: my $filename=shift;
3: unless(-f "$filename") { return("NO INDEX"); }
4: open(HTML,"<$filename");
5: while(<HTML>){
6: if($_=~ /<title>(.*)<\/title>/i) {
7: close HTML;
8: return "$1";
9: }
10: }
11: close HTML;
12: return "Untitled";
13: }

Don't let this snippet scare you; it's actually quite logical once dissected. Line 1 declares the function Get_Title. Line 2 takes the parameter we passed to the function (that's the name of the HTML file), and shifts it into the scalar variable $filename. Line 3 says, "unless this is a file, return the text 'NO INDEX'." Line 4 opens the file for reading and assigns the handle HTML to it. Line 5 begins a while iteration over every line of the open file (every line will cause a new iteration of the loop; the contents of the line will be stored in the special variable $_). Line 6 says, "if this line contains a <title> and a </title>, place the stuff in between in the special variable $1 and continue inside the braces." Line 7 is inside the if statement and closes the HTML file. Line 8 returns the text of the title and exits the function. Line 9 ends the if statement. Line 10 ends the while statement. Line 11 will close the HTML file if no title has been found. Line 12 will return the word Untitled in the event that no title has been found. Line 13 ends the function. This function is a bit complex in code, but I like how it demonstrates a lot of Perl's power and flexibility. The if statement in line 6 contains a regular expression that is case-insensitive (note the i after the last /), so that different capitalizations all appear the same to the if.
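
Here is a minimal sketch of the same scan run against a temporary file, so the three possible return values can be seen in action. The name get_title_sketch and the sample page are made up for illustration. Unlike Get_Title, it checks that open succeeded and uses a non-greedy match. Like Get_Title, it only finds a title whose opening and closing tags are on the same line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical variant of Get_Title: line-by-line scan with a checked open.
sub get_title_sketch {
    my $filename = shift;
    return "NO INDEX" unless -f $filename;
    open(my $html, "<", $filename) or return "NO INDEX";
    while (my $line = <$html>) {
        if ($line =~ /<title>(.*?)<\/title>/i) {
            my $title = $1;     # save the capture before doing anything else
            close $html;
            return $title;
        }
    }
    close $html;
    return "Untitled";
}

# Write a small sample page with an upper-case TITLE tag.
my ($fh, $page) = tempfile(SUFFIX => ".html", UNLINK => 1);
print {$fh} "<html><head>\n<TITLE>Campus News</TITLE>\n</head></html>\n";
close $fh;

print get_title_sketch($page), "\n";             # Campus News
print get_title_sketch("$page.missing"), "\n";   # NO INDEX
```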

Function 3: Given any path, judge its "depth"

For the purposes of this example, I'm defining "depth" as the number of forward slashes. This function takes a path, and returns the number of forward slashes:

1: sub Get_Depth {
2: $_ = shift;
3: return tr/\///;
4: }

Line 1 begins the function Get_Depth. Line 2 stores the passed argument (the path in question) in the special variable $_. Line 3 uses the transliteration operator (tr, which is not a regular expression, although it looks like one) to count the number of forward slashes, and returns that count. Line 4 ends the function.
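
A quick way to check the counting is to feed the function a couple of known paths. The following minimal sketch uses a made-up name, get_depth_sketch, and binds tr to a lexical copy of the argument rather than assigning to the global $_, which avoids overwriting $_ for any caller that happens to be using it. With an empty replacement list, tr leaves the string unchanged and simply returns the number of matching characters.

```perl
use strict;
use warnings;

# Hypothetical variant of Get_Depth: count "/" characters in a lexical copy.
sub get_depth_sketch {
    my $path = shift;
    return ($path =~ tr{/}{});   # returns the count; $path is not modified
}

print get_depth_sketch("/usr/local/apache/htdocs/"), "\n";        # 5
print get_depth_sketch("/usr/local/apache/htdocs/news/"), "\n";   # 6
```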

There are many ways to do anything, and this function is no exception. I initially wrote a more expanded, twelve-line version that essentially did the same thing, but decided that I should use a bit of Perl magic and make it a little shorter.

The Glue: the "Main" function

Extracting those three functions from the code cleans it up quite a bit (although there is still room for another function or two). So now we have to build a system that utilizes those functions to perform our four tasks:

  1. Obtain a list of directories
  2. For every directory, open the index.html file and extract the title of the page
  3. Print back the title, with a link to the page
  4. Dive into the directory, and back to Step 1

If you were wondering why I was concerned with "depth" back in the Functions section, here's where it will make sense. I use the depth of the folder to decide where in an unordered list the item belongs (how "deep" in the list). This gives our table of contents a more readable and understandable feel than just throwing out a flat list of links:

1: my $dir="/usr/local/apache/htdocs/";
2: my $url="http://mattwork.potsdam.edu/";
3: my $depth=Get_Depth("$dir")+1;
4: my @dirs=Get_Dirs("$dir");
5: print "Content-Type: text/html\n\n";
6: print "<html><head><title>Table of Contents</title></head><body>\n";
7: print "<ul>\n";
8: while(@dirs) {
9: my $currdir=shift(@dirs);
10: my $currdepth=Get_Depth("$currdir");
11: unshift(@dirs,Get_Dirs("$currdir"));
12: if($currdepth > $depth) { print "<ul>\n"; }
13: if($currdepth < $depth) {
14: my $diff=$depth - $currdepth;
15: print "</ul>\n" x $diff;
16: }
17: $depth=$currdepth;
18: my $path="$currdir" . "index.html";
19: unless(-e "$path") { next; }
20: my $title=Get_Title("$path");
21: $path =~ s/$dir/$url/i;
22: print "<li><a href=\"$path\">$title</a></li>\n";
23: }
24: print "</ul>\n";

Line 1 stuffs the local path of the root Web directory into the variable $dir. Line 2 stuffs the URL we will want to replace the local path with into the variable $url. Line 3 calls the Get_Depth function on our initial directory and adds 1 to it (adding 1 is important, because logically we'll never be in a directory at the same depth as the htdocs directory; the logs and conf directories, for example, are at that depth, and we won't be going there). Line 4 calls the Get_Dirs function to obtain a list of subdirectories in the root of our webspace. Line 5 sends the HTTP Content-Type header to the browser, and line 6 prints the opening HTML of the page. Line 7 starts the unordered list. Line 8 starts iterating over the @dirs array using a while loop. Line 9 removes the first item from the @dirs array, and stuffs it into the scalar $currdir (My rationale for doing this is scalability. By using a while loop instead of a for loop, and removing items with shift, it keeps the array from growing unnecessarily large and needlessly consuming resources. If I didn't remove the entries as I used them, at the end of this script the @dirs array would contain all of the directories in webspace, which is not necessary). Line 10 calls the Get_Depth function to obtain the depth of the current directory, and stuffs the value into the scalar $currdepth. Line 11 first calls the Get_Dirs function to obtain a list of subdirectories, and then prepends that list to the beginning of the @dirs array (making them the "next" to be iterated onto). Line 12 says, "if the current depth is greater than the previous depth, indent." Line 13 catches the occurrences when the current depth is less than the previous depth. Line 14 calculates the difference between the current and previous depths and stores the result in $diff. Line 15 prints the </ul> tag as many times as necessary according to the difference between the current and previous depths (note the x operator). Line 16 ends this if. Line 17 assigns $currdepth to $depth.
Line 18 concatenates the current directory path with index.html, and stores the value in $path. Line 19 says, "unless $path exists, skip to the next iteration" (which is the same as saying "if $path does not exist, skip to the next iteration"). This is to prevent directories that don't have an index.html file from making things icky. Line 20 calls the Get_Title function on $path, and assigns that value to the scalar $title. Line 21 contains a substitution regular expression that replaces the base directory part of the path we placed in line 1 with the base URL we provided in line 2. Line 22 prints the title of the page as a link to the page, as a list item. Line 23 ends this while loop. Line 24 prints the final </ul>.
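
The depth bookkeeping and the path-to-URL substitution can be exercised on their own, without a Web server or a real directory tree. The following is a minimal sketch under stated assumptions, not the script above: the helper names and the three sample paths are made up, the paths are assumed to arrive in the depth-first order the while loop produces, and titles are left out. It departs from the script in two places. The substitution wraps $dir in \Q...\E and anchors it with ^, so characters such as dots in the base path are matched literally. The last line closes every list still open, where the script prints a single closing tag.

```perl
use strict;
use warnings;

my $dir = "/usr/local/apache/htdocs/";
my $url = "http://mattwork.potsdam.edu/";

# Hypothetical helper: swap the leading base directory for the base URL.
sub path_to_url_sketch {
    my $path = shift;
    $path =~ s/^\Q$dir\E/$url/;   # \Q...\E treats $dir as literal text
    return $path;
}

# Hypothetical helper: build nested <ul> markup from depth-first paths.
sub nested_list_sketch {
    my @paths = @_;
    my $base  = ($dir =~ tr{/}{}) + 1;   # depth of a top-level subdirectory
    my $depth = $base;
    my $out   = "<ul>\n";
    for my $path (@paths) {
        my $currdepth = ($path =~ tr{/}{});
        $out .= "<ul>\n"  x ($currdepth - $depth) if $currdepth > $depth;
        $out .= "</ul>\n" x ($depth - $currdepth) if $currdepth < $depth;
        $depth = $currdepth;
        my $link = path_to_url_sketch($path . "index.html");
        $out .= "<li><a href=\"$link\">$link</a></li>\n";
    }
    # Close the lists still open, including the outermost one.
    $out .= "</ul>\n" x ($depth - $base + 1);
    return $out;
}

my @sample = (
    "${dir}news/",
    "${dir}news/2000/",
    "${dir}sports/",
);
print nested_list_sketch(@sample);
```

With the sample paths, the entry for news/2000/ lands one list level inside the entry for news/, and the markup ends with as many closing tags as it has opening tags.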

Summary Discussion

As always, this script is not the end-all of table-of-contents generators, but it is a good place to start. It is short and fairly memory efficient. It scales fairly well to sites with 5000 directories, and perhaps even beyond.

You don't even have to run this as a CGI. You can run it from the command line and output the results to an HTML file that you have in your webspace. This allows you to periodically generate this table of contents file, without having millions of users hammering your CGI. For example, you could have cron, or any other scheduling service, periodically execute the command perl toc.pl > /usr/local/apache/htdocs/toc.html, which will rebuild your table of contents. Pointing users to the toc.html page might not give them a real-time view of your webspace, but might be more feasible.

I'd also like to shamelessly plug Lincoln Stein's CGI.pm module, which is available on CPAN, and probably is already in your Perl distribution. I use this module to take a lot of the HTMLizing out of my hands. I wrote this example using no modules, so you could see what was going on, but I highly recommend using CGI.pm to do a lot of the HTML stuff for you.

Full Text of this Example

#!/usr/bin/perl -w
sub Get_Title {
    my $filename=shift;
    unless(-f "$filename") { return("NO INDEX"); }
    open(HTML,"<$filename");
    while(<HTML>){
        if($_=~ /<title>(.*)<\/title>/i) {
            close HTML;
            return "$1";
        }
    }
    close HTML;
    return "Untitled";
}
sub Get_Dirs {
    my $basedir=shift;
    opendir(GD,"$basedir") or return;
    my @DIRS;
    for(readdir(GD)) {
        my $temp="$_";
        if($temp =~ /^\./) { next; }
        if(-d "$basedir$temp") {
            push(@DIRS,"$basedir$temp/");
        }
    }
    closedir GD;
    return @DIRS;
}
sub Get_Depth {
    $_ = shift;
    return tr/\///;
}
my $dir="/home/html/mattwork/htdocs/";
my $url="http://mattwork.potsdam.edu/";
my $depth=Get_Depth("$dir")+1;
my @dirs=Get_Dirs("$dir");
print "Content-Type: text/html\n\n";
print "<html><head><title>Table of Contents</title></head><body>\n";
print "<ul>\n";
while(@dirs) {
    my $currdir=shift(@dirs);
    my $currdepth=Get_Depth("$currdir");
    unshift(@dirs,Get_Dirs("$currdir"));
    if($currdepth > $depth) { print "<ul>\n"; }
    if($currdepth < $depth) {
        my $diff=$depth - $currdepth;
        print "</ul>\n" x $diff;
    }
    $depth=$currdepth;
    my $path="$currdir" . "index.html";
    unless(-e "$path") { next; }
    my $title=Get_Title("$path");
    $path =~ s/$dir/$url/i;
    print "<li><a href=\"$path\">$title</a></li>\n";
}
print "</ul>\n";