Tuesday, July 03, 2007

[TAOREs] File or Directory from a listing

Welcome! This is the first article in my new column on regexes, where I will cover regexes that I have written or found (obviously giving credit where credit is due), along with lessons on writing them. I know that I am starting this column without even an introductory lesson, so if you are new to regexes and don't know much or anything about them, this will seem quite foreign, and I apologize for that. Please know that the lessons will follow shortly. Without further ado, let's jump in feet first.

I was working on a script a couple of weeks ago where the script had to FTP into a server, grab a directory listing from a specific directory, and then send an email to a distribution list if a file was found in the directory.

Seems pretty straightforward, but I quickly realized that this was an opportunity for me to get a little practice writing regexes. The remote server was Windows, so when the FTP client connected and ran the "dir" command, the output had the usual FTP banter, but also a line similar to the following when a file was found:

drwxr-xr-x 7 Administrators group 0 Jun 04 22:07 myfile.txt

This was the output format of the FTP server on that Windows machine. Yes, it is the same format as the long listing provided by "ls -l" on a Unix machine. Knowing that, I set out to write a regular expression that matches the whole line and captures just the "myfile.txt" file name from it.

Here is the regular expression that I came up with to match the file name in that line:


m/^.+\s+\d\s+\w+\s+\w+\s+\d+\s+\w+\s+\d+\s+\d{1,2}:\d+\s+(\w+\.\w+)$/
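
As a quick check, the expression can be applied to the sample line as follows. This is only a minimal sketch: the helper name_from_line and the variables $line and $name are made up for illustration, and the listing line is hard-coded rather than read from a real FTP session.

```perl
use strict;
use warnings;

# Sample listing line from above (hard-coded for illustration).
my $line = 'drwxr-xr-x 7 Administrators group 0 Jun 04 22:07 myfile.txt';

# Hypothetical helper: returns the captured name, or nothing (undef in
# scalar context) if the line does not look like a listing entry.
sub name_from_line {
    my ($text) = @_;
    if ($text =~ m/^.+\s+\d\s+\w+\s+\w+\s+\d+\s+\w+\s+\d+\s+\d{1,2}:\d+\s+(\w+\.\w+)$/) {
        return $1;
    }
    return;
}

my $name = name_from_line($line);
print defined $name ? "Found: $name\n" : "No match\n";
```

For the sample line this prints "Found: myfile.txt".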


This is pretty straightforward (if you know something about regexes). Let me break it down with the regex formatted slightly differently:


m/ # start the match
^ # start from the beginning of the line
. # match any single character
+ # match preceding element one or more times
\s # match a whitespace character
+ # match preceding element one or more times
\d # match a digit
\s # match a whitespace character
+ # match preceding element one or more times
\w # match a word character
+ # match preceding element one or more times
\s # match a whitespace character
+ # match preceding element one or more times
\w # match a word character
+ # match preceding element one or more times
\s # match a whitespace character
+ # match preceding element one or more times
\d # match a digit
+ # match preceding element one or more times
\s # match a whitespace character
+ # match preceding element one or more times
\w # match a word character
+ # match preceding element one or more times
\s # match a whitespace character
+ # match preceding element one or more times
\d # match a digit
+ # match preceding element one or more times
\s # match a whitespace character
+ # match preceding element one or more times
\d{1,2} # match a digit, at least once, but up to twice
: # match a colon
\d # match a digit
+ # match preceding element one or more times
\s # match a whitespace character
+ # match preceding element one or more times
( # Start subexpression group for capturing
\w # match a word character
+ # match preceding element one or more times
\. # match a period, yes, this needs to be escaped as the period is a regex character too
\w # match a word character
+ # match preceding element one or more times
) # End the subexpression group for capturing
$ # match the end of line
/x # end the regex; the x modifier makes Perl ignore the whitespace and comments above


I know, it seems a bit messier, but this is one way of writing a regular expression so that you can annotate what is happening. Personally, I like the first version better.
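
A middle ground between the two styles is to group the pieces by listing field instead of one token per line. The sketch below is my own arrangement, not a different regex: it relies on Perl's x modifier, the field labels in the comments are my reading of the sample line, and the variable names are hypothetical. It also checks that the annotated and one-line forms capture the same thing.

```perl
use strict;
use warnings;

# Same pattern as the one-line version, split up by listing field.
# The trailing x modifier makes Perl ignore the whitespace and comments.
my $annotated = qr/
    ^ .+ \s+            # first field, the permissions in the sample line
    \d \s+              # link count, a single digit
    \w+ \s+             # owner
    \w+ \s+             # group
    \d+ \s+             # size
    \w+ \s+             # month
    \d+ \s+             # day
    \d{1,2} : \d+ \s+   # time
    ( \w+ \. \w+ )      # captured file name
    $                   # end of line
/x;

# The one-line form, for comparison.
my $compact = qr/^.+\s+\d\s+\w+\s+\w+\s+\d+\s+\w+\s+\d+\s+\d{1,2}:\d+\s+(\w+\.\w+)$/;

my $line = 'drwxr-xr-x 7 Administrators group 0 Jun 04 22:07 myfile.txt';

# A match in list context returns the captured groups.
my ($from_annotated) = $line =~ $annotated;
my ($from_compact)   = $line =~ $compact;

print "annotated: $from_annotated, compact: $from_compact\n";
```

One thing to watch with this style: the comments sit inside the pattern, so they should not contain the regex delimiter (a forward slash here) or anything Perl would try to interpolate, such as a dollar sign or an at sign.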

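Now, with this regex, if your file name (or directory name) is in a different format than something like "myfile.txt", you will have to edit the regex to reflect that difference, or the match will fail. The capturing group \w+\.\w+ expects word characters, a single period, and then more word characters, so a name such as "README" (no period) or "my-file.txt" (a hyphen) will not match.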
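
One possible way to loosen things up is sketched below. To be clear, this is a variant I am offering as an assumption about what "editing the regex" might look like, not the regex from the script: the final group takes everything after the time, the link count may have more than one digit, and the owner and group fields accept any non-whitespace characters. The names $loose and loose_name are hypothetical.

```perl
use strict;
use warnings;

# Looser variant (not the original regex): the final group captures
# everything after the time instead of requiring word.word.
my $loose = qr/^\S+\s+\d+\s+\S+\s+\S+\s+\d+\s+\w+\s+\d+\s+\d{1,2}:\d+\s+(.+)$/;

# Hypothetical helper: returns the name, or undef if the line does not match.
sub loose_name {
    my ($text) = @_;
    $text =~ s/\r?\n\z//;    # drop a trailing line ending, if any
    return $text =~ $loose ? $1 : undef;
}

for my $line (
    '-rw-r--r-- 1 Administrators group 0 Jun 04 22:07 myfile.txt',
    '-rw-r--r-- 1 Administrators group 0 Jun 04 22:07 my-report.tar.gz',
    '-rw-r--r-- 12 Administrators group 0 Jun 04 22:07 README',
) {
    my $name = loose_name($line);
    print defined $name ? "name: $name\n" : "no match\n";
}
```

The trade-off is that (.+) accepts anything at all after the time, including names with spaces, so it validates less than the stricter \w+\.\w+ group does.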

In the regex above, though, the part that is enclosed in ( ) is placed in the special variable $1 when the match succeeds, so that its value can be referenced elsewhere in the script.
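
Here is a small sketch of how $1 might be used when scanning the whole FTP output, banter and all. The banter lines are invented for illustration and the array names are hypothetical; the real script's output and variable names may well have looked different.

```perl
use strict;
use warnings;

# Made-up FTP output: banter lines plus one listing entry.
my @output = (
    '200 PORT command successful.',
    '150 Opening ASCII mode data connection.',
    'drwxr-xr-x 7 Administrators group 0 Jun 04 22:07 myfile.txt',
    '226 Transfer complete.',
);

my @found;
for my $line (@output) {
    # Only read $1 inside the if block: after a failed match it can
    # still hold the value from an earlier successful match.
    if ($line =~ m/^.+\s+\d\s+\w+\s+\w+\s+\d+\s+\w+\s+\d+\s+\d{1,2}:\d+\s+(\w+\.\w+)$/) {
        push @found, $1;
    }
}

print "Files found: @found\n" if @found;
```

Copying $1 into a variable of your own right away (here, by pushing it onto @found) keeps the value safe from any later match that overwrites it.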

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.