Web Automation: Dynamic Directory Indexing Page 2
Line 1 places the filesystem name of the root folder we want to index into a
scalar variable. Line 2 places the URL we will (eventually) want to
substitute for the filesystem name into a second scalar variable.
Line 3 opens the directory and assigns a directory handle
named PRJD to it. Line 4 reads all the objects in that directory
into the array called @dirs. Line 5 simply closes the open
directory. From here, we have all of the directories (as well as files and
symbolic links) stored in the @dirs array. It may not make sense
yet why we keep both the filesystem path and the URL around, but that
will all be much clearer later on.
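To see what actually lands in @dirs, here is a minimal sketch of the same open/read/close sequence run against a throwaway directory. The $root variable and the file names are made up for the demonstration; only the PRJD handle and @dirs come from the walkthrough.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for the real projects folder: a temporary directory
# holding one subdirectory and one plain file.
my $root = tempdir(CLEANUP => 1);
mkdir "$root/Apache" or die "mkdir failed: $!";
open(my $fh, '>', "$root/notes.txt") or die "open failed: $!";
close $fh;

# The same three calls the walkthrough describes: open, read everything, close.
opendir(PRJD, $root) or die "Cannot open $root: $!";
my @dirs = readdir(PRJD);
closedir(PRJD);

# readdir returns bare names (no leading path), including "." and "..".
print join(", ", sort @dirs), "\n";
```

Note that the listing includes the "." and ".." entries alongside the directory and the plain file, which is why the next step has to filter.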
Step 2: For every directory, open the index.html file if it
exists
So now we need to iterate over our @dirs array. The simplest way
to do this is with the for statement. Every time the loop goes
around, a special scalar named $_ will contain the name of the
object (be it directory, file, or link), and the loop will terminate when all
objects have been processed. For sanity's sake, the code snippet below will
ignore any object that isn't a directory or any that starts with a
"." (period).
1: for (@dirs) {
2:   if ($_ =~ /^\./) { next; }
3:   unless (-d "$basedir$_") { next; }   # $basedir stands in for the Step 1 root-path variable
4:   my $file = "$basedir$_/index.html";
5:   my $title = Get_Title($file);
6:   # print the entry here
7: }
Line 1 starts the loop, iterating over our array-o-directory-objects. Line 2
says "if this object begins with a dot, then skip this object and cycle
the loop". Line 3 says "if this object is not a directory, then skip
this object and cycle the loop". Line 4 builds the full path to this
object's index.html file and places it into a scalar
variable. Line 5 calls a mystery function (that we will be writing very soon)
to extract the title from the webpage. Line 6 is a comment, holding the place
for some code we will shoe-horn in here during Step 4. Line 7 is the end of
this loop.
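As a rough illustration of the two filters, the sketch below runs the same loop shape over a temporary directory and simply collects the names that survive. The $root and @kept names and the sample entries are assumptions made for the demo. Because readdir hands back bare names, the -d test here is applied to the full path rather than to $_ alone.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical root folder with a trailing slash, plus three sample entries.
my $root = tempdir(CLEANUP => 1) . "/";
mkdir "${root}Apache"  or die "mkdir failed: $!";
mkdir "${root}.hidden" or die "mkdir failed: $!";
open(my $fh, '>', "${root}README") or die "open failed: $!";
close $fh;

opendir(PRJD, $root) or die "Cannot open $root: $!";
my @dirs = readdir(PRJD);
closedir(PRJD);

my @kept;
for (sort @dirs) {
    if ($_ =~ /^\./)      { next; }   # skips ".", ".." and ".hidden"
    unless (-d "$root$_") { next; }   # skips README, which is a plain file
    push @kept, $_;
}
print "@kept\n";   # prints "Apache"
```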
If you look at line 4 above, you'll notice that the path is glued together
from three pieces. The first is the variable we set up in Step 1 that
contains the filesystem name of the root directory we want to index. The
second is the $_ variable which, as I mentioned before, contains the name of
the object we're currently processing. The third, the trailing /, simply
appends a slash to the end of the directory name. So, if we were currently
processing the Apache object, the root variable would contain
/usr/local/apache/htdocs/projects/, $_ would contain
Apache, and the result would be the concatenation of all
three, which is /usr/local/apache/htdocs/projects/Apache/! What
wonderful magic.
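The same concatenation can be tried on its own. In this small sketch the path and the Apache name come from the example above, while $root and $dir are placeholder names chosen for illustration.

```perl
use strict;
use warnings;

my $root = "/usr/local/apache/htdocs/projects/";   # root path from the example
$_ = "Apache";                                     # the object being processed

# Double-quoted strings interpolate both variables, then add the literal slash.
my $dir = "$root$_/";
print "$dir\n";               # /usr/local/apache/htdocs/projects/Apache/
print "${dir}index.html\n";   # /usr/local/apache/htdocs/projects/Apache/index.html
```

Writing $root . $_ . "/" with the dot operator gives the same string; the interpolated form is just shorter.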
Step 3: For every index.html file, extract the title of the
page
As I mentioned when describing Line 5 of Step 2, we have to write a
Get_Title function that takes in the name of the file, processes
it, and returns the title of the page. Fortunately, titles are pretty easy to
extract:
1: sub Get_Title {
2:   my $file = shift;
3:   unless (-f $file) { return("NO INDEX"); }
4:   open(HTML, "<$file") or return("NO INDEX");
5:   while (<HTML>) {
6:     if ($_ =~ /<title>(.*)<\/title>/i) {
7:       close HTML;
8:       return $1;
9:     }
10:   }
11:   close HTML;
12:   return "Untitled";
13: }
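To try the function out, here is a self-contained sketch that exercises all three return paths against temporary files. It is a variation rather than the listing above: it uses a lexical filehandle, three-argument open, and a non-greedy match, and the file names and contents are invented for the test.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Variation on Get_Title: lexical filehandle and three-argument open.
sub Get_Title {
    my $file = shift;
    return "NO INDEX" unless -f $file;
    open(my $html, '<', $file) or return "NO INDEX";
    while (my $line = <$html>) {
        # Case-insensitive; only matches when both tags sit on the same line.
        if ($line =~ /<title>(.*?)<\/title>/i) {
            close $html;
            return $1;
        }
    }
    close $html;
    return "Untitled";
}

# Hypothetical sample pages: one with a title, one without.
my $dir = tempdir(CLEANUP => 1);
open(my $out, '>', "$dir/index.html") or die "open failed: $!";
print $out "<html><head>\n<TITLE>Apache Notes</TITLE>\n</head></html>\n";
close $out;
open($out, '>', "$dir/plain.html") or die "open failed: $!";
print $out "<html><body>no title here</body></html>\n";
close $out;

print Get_Title("$dir/index.html"), "\n";     # Apache Notes
print Get_Title("$dir/plain.html"), "\n";     # Untitled
print Get_Title("$dir/missing.html"), "\n";   # NO INDEX
```

Because the file is read one line at a time, a title whose opening and closing tags fall on different lines comes back as "Untitled".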

