Monday, June 11, 2007

The Art of the Automated Download

How many of us have found a site with a whole bunch of files for download that we really wanted? Well, I found a site just like that. Too many sites use PHP or something similar and refer to their documents by IDs and other codes instead of just putting direct links into their web pages. Once in a while, though, you do come across a site that DOES put the actual links to the files on the page.

In the case of the site that I am referring to, there were a little over 2000 files that I wanted to grab. The thing is, I didn't want to have to sit there and "right-click, save as" each and every one, as that would have taken days to complete. So, noticing that all of the files had actual URLs that led right to them, I looked at the page source. Then it hit me. Perl!

So, I copied the source and quickly drummed up a regular expression which grabs all the URLs of all of the pdf files on the page. After I grabbed the URLs, I put together some code which quickly went out and grabbed each and every one in turn and saved it to my hard drive.
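As a rough sketch of that extraction step (an illustration, not the exact expression from the script further down), a global match can pull every PDF link out of a page in one go. It assumes each link is written as a double-quoted href ending in .pdf; the sub name extract_pdf_urls, the sample HTML and the example.com URLs are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every double-quoted href value that ends in .pdf (case-insensitive).
# Assumes links look like <a href="...pdf">; other quoting styles are ignored.
sub extract_pdf_urls {
    my ($html) = @_;
    my @urls = $html =~ m/href="([^"]+\.pdf)"/gi;
    return @urls;
}

# Made-up sample page for illustration.
my $html = <<'HTML';
<ul>
  <li><a href="http://example.com/albums/usa1.pdf">USA 1</a></li>
  <li><a href="http://example.com/about.html">About</a></li>
  <li><a href="http://example.com/albums/canada.PDF">Canada</a></li>
</ul>
HTML

print "$_\n" for extract_pdf_urls($html);
```

A regex like this is fine for a page you have looked over by hand; for arbitrary HTML, a real parser is the safer choice.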

This sounds all simple and stuff, but for someone like me who still considers himself a novice in the Perl world, it did take a couple of hours of research. First I tried to use the "WWW::Mechanize" module and was able to retrieve a complete list of the pdf files and their paths, but not the actual files. I tried other packages and such, delving into LWP itself, but I could not for the life of me get this code to actually download the files.

I found "lwp-download" and gave it a shot. Wow! It looked like it was working, up until the 904th document, where it died. I couldn't figure it out or understand what was going on. Why was this dying at the same point every time? Well, I did eventually figure it out and was able to download ALL of the over 2000 files to my hard drive. I couldn't believe it, but the routine I used only took a few minutes to download all of the files (granted, I have 15 Mb FiOS for my internet, so please bear that in mind as well).

Just as an FYI (and you can get this from looking at the code), I used the LWP::Simple::getstore() routine to download the files. It was a lot easier than going through the process of figuring out why my WWW::Mechanize code wasn't working, believe me. I will figure that module out later, but for now, this did exactly what I wanted.
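One thing worth knowing about getstore() is that it returns the HTTP status code, so a long batch does not have to die on one bad file. Here is a minimal sketch of a loop that records the status of every download. The sub name download_all and its optional $fetch argument are my own inventions for illustration; $fetch only exists so the loop can be tried without touching the network, and it falls back to LWP::Simple::getstore when left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;

# Download each URL into $dir and return a hash of url => HTTP status code.
# $fetch is a hypothetical hook (url, file) -> status code; it defaults to
# LWP::Simple::getstore and exists only so the loop can be tried offline.
sub download_all {
    my ($urls, $dir, $fetch) = @_;
    $fetch ||= do { require LWP::Simple; \&LWP::Simple::getstore };

    my %status;
    for my $url (@$urls) {
        # Name the local file after the last path segment of the URL.
        my $file = File::Spec->catfile($dir, basename($url));
        $status{$url} = $fetch->($url, $file);

        # 2xx means the file was stored; anything else is reported, not fatal.
        warn "Failed ($status{$url}): $url\n"
            unless $status{$url} >= 200 && $status{$url} < 300;
    }
    return %status;
}

# Usage (needs network access and LWP installed):
#   my %result = download_all(\@urls, '/tmp/albumPages');
```

LWP::Simple also exports is_success() and is_error() for testing the returned code, which reads a little nicer than the numeric range check when the module is already loaded.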

This is probably a bit much and others would more than likely have a better way to do it, but here it is. Here is the code I used to parse HTML code for its links and download them.


#!/usr/bin/perl

use strict;
use warnings;
use File::Basename;
use LWP::Simple;

my $base     = "/home/jlk/development/perl/stampAlbums";
my $codeFile = "$base/code.txt";
my $urlFile  = "$base/files.txt";
my $localdir = "$base/albumPages/";    # must already exist

#############################################
# Read the entire HTML file into an array, line by line, so that we can
# parse out the information we need one line at a time.
#############################################
open(CODE, "<", $codeFile) or die "Cannot read $codeFile: $!";
my @code = <CODE>;
close(CODE);

#############################################
# Open the file in which to store all the URLs. Opening it for writing
# (">") empties any list left over from a previous run.
#############################################
open(FILES, ">", $urlFile) or die "Cannot write $urlFile: $!";

foreach my $line (@code)
{
    ############################################
    # The following code takes the site's HTML file (in this case, it
    # is the stampalbums.com download site) and parses out all of the
    # download URLs.
    # Assumes each download link sits on its own indented line as
    # <a href="...pdf">; adjust the pattern to the page's actual markup.
    ############################################
    if ($line =~ m/^\s+<a href="[^"]+\.pdf"/i)
    {
        my @splitLine = split(/\"/, $line);

        # Write each URL on a separate line.
        print FILES ("$splitLine[1]\n");
    }
}
close(FILES);

###########################################
# Open the file containing all of the URLs
###########################################
open(FILES, "<", $urlFile) or die "Cannot read $urlFile: $!";

###########################################
# Do the download of the files
###########################################
foreach my $url (<FILES>)
{
    chomp($url);
    $url =~ s/\s+$//;    # drop any stray trailing whitespace
    my($filename, $directories, $suffix) = fileparse($url);

    # Save into the album directory and report any failed download.
    my $status = getstore($url, "$localdir$filename");
    warn "Could not download $url (HTTP status $status)\n" unless is_success($status);
}
close(FILES);

Wednesday, June 06, 2007

Further IP Validation information

After yesterday's posting on one way (of many) to validate an IPv4 address, I have some more information for those interested.

I had posted a question at one point to thescripts.com Perl forum, regarding an issue I was having. One of the more knowledgeable users over there (he goes by Miller), enlightened me as to a CPAN module that does the same thing without re-inventing the wheel. Please don't think I regret writing my script. On the contrary, I am very happy I did as I learned a couple of lessons in doing so.

Anywho, the module mentioned above validates IP addresses just as my script from yesterday does; it just goes about it slightly differently.

I thank Miller for his gracious input and the link to the CPAN module!!

Tuesday, June 05, 2007

IPv4 Address validation

Well, in my quest to learn Perl and practice putting together regexes, I took on the task yesterday of putting together a script which checks whether an IP address is valid. Now, when you say "is valid", what exactly does that mean?

Well, there are two criteria that IP addresses need to really meet:

1) They need to have 4 octets, each octet containing from 1 to 3 digits.
2) Each octet's digits must make up a number between 0 and 255.

With that in mind, I went directly after the first of the two, validating the format of the IP address entered. The regex that I initially came up with is listed as #1, and the other two regexes came from the "Mastering Regular Expressions" book:

#1 m/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
#2 m/\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?/
#3 m/\d(\d\d?)?\.\d(\d\d?)?\.\d(\d\d?)?\.\d(\d\d?)?/

What is so great about Perl is TIMTOWTDI, which lets us have 3 different regexes that come to exactly the same solution.
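A quick way to convince yourself that the three really do agree is to run them side by side. The sketch below is only an illustration: the sub name match_flags and the sample strings are made up. It also shows one thing to watch for, which is that none of the three is anchored, so they happily match an address-shaped piece in the middle of a longer string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three patterns from above, compiled once.
my @patterns = (
    qr/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/,
    qr/\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?/,
    qr/\d(\d\d?)?\.\d(\d\d?)?\.\d(\d\d?)?\.\d(\d\d?)?/,
);

# Return one 1/0 flag per pattern for the given string.
sub match_flags {
    my ($text) = @_;
    return map { $text =~ $_ ? 1 : 0 } @patterns;
}

# Made-up samples; the last one matches only because nothing is anchored.
for my $sample ('192.168.1.10', '1.2.3', 'abc', '1234.5.6.7890') {
    printf "%-15s %s\n", $sample, join(' ', match_flags($sample));
}
```

Wrapping any of the patterns in ^ and $ makes it reject that last sample, which is why anchoring matters once the regex is used for real validation.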

You may be asking yourself, what do I do with the regex(es) above? Well, you can use them in your code to validate an entered IP address, or put them into a loop to check an entire file full of IPs.

Here is an example of how to incorporate the above regex(es) into some code. Please know that this is only how I do it and may differ considerably from how you implement it:


#!/usr/bin/perl
use strict;
use warnings;

print("What is the IP Address you would like to validate: ");
my $ipaddr = <STDIN>;
chomp($ipaddr);

# Anchor both ends so extra digits or trailing text are rejected.
if( $ipaddr =~ m/^(\d\d?\d?)\.(\d\d?\d?)\.(\d\d?\d?)\.(\d\d?\d?)$/ )
{
    print("IP Address $ipaddr --> VALID FORMAT! \n");
    if($1 <= 255 && $2 <= 255 && $3 <= 255 && $4 <= 255)
    {
        print("IP address: $1.$2.$3.$4 --> All octets within range\n");
    }
    else
    {
        print("One of the octets is out of range. All octets must contain a number between 0 and 255 \n");
    }
}
else
{
    print("IP Address $ipaddr --> NOT IN VALID FORMAT! \n");
}
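The same check can also be wrapped in a loop to run over a whole file of addresses, as mentioned above. This is a minimal sketch under a few assumptions of mine: one address per line, blank lines skipped, and made-up sample data read from an in-memory filehandle so it runs without any file on disk. The sub names is_valid_ipv4 and check_handle are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True when $ip is four dot-separated groups of 1-3 digits, each 0..255.
sub is_valid_ipv4 {
    my ($ip) = @_;
    my @octets = $ip =~ m/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return 0;
    my $too_big = grep { $_ > 255 } @octets;
    return $too_big ? 0 : 1;
}

# Read candidate addresses from a filehandle, one per line, and split them
# into valid and invalid lists. Blank lines are skipped.
sub check_handle {
    my ($fh) = @_;
    my (@valid, @invalid);
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '';
        if (is_valid_ipv4($line)) { push @valid,   $line }
        else                      { push @invalid, $line }
    }
    return (\@valid, \@invalid);
}

# Made-up sample data read from an in-memory filehandle; a real run would
# open a file of addresses instead.
my $sample = "10.0.0.1\n256.1.1.1\n192.168.1\n8.8.8.8\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my ($valid, $invalid) = check_handle($fh);
close $fh;

print "valid:   @$valid\n";
print "invalid: @$invalid\n";
```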

Well, there you go, IP address validation. No, this script obviously does not take into account IPs that are reserved for private networks, such as the 192.168.x.x or 10.x.x.x ranges, but you can test for whatever you need to, given the grouping in the regex that provides the $1, $2, $3 and $4 variables.
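As a sketch of what such an extra test might look like, the captured octets can be compared against the private ranges from RFC 1918 (10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16) plus the 127.0.0.0/8 loopback block. The sub name classify_ipv4 and its return labels are made up for this example, and "other" simply means "none of the ranges checked here", not "guaranteed public".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify a dotted-quad string as 'invalid', 'private', 'loopback' or 'other'.
# Only the RFC 1918 private ranges and 127.0.0.0/8 are checked here.
sub classify_ipv4 {
    my ($ip) = @_;
    my ($o1, $o2, $o3, $o4) =
        $ip =~ m/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return 'invalid';
    return 'invalid' if grep { $_ > 255 } $o1, $o2, $o3, $o4;

    return 'private'  if $o1 == 10;                             # 10.0.0.0/8
    return 'private'  if $o1 == 172 && $o2 >= 16 && $o2 <= 31;  # 172.16.0.0/12
    return 'private'  if $o1 == 192 && $o2 == 168;              # 192.168.0.0/16
    return 'loopback' if $o1 == 127;                            # 127.0.0.0/8
    return 'other';
}

# Made-up sample addresses for illustration.
for my $ip ('10.246.3.4', '172.32.0.1', '192.168.0.1', '8.8.8.8', '300.1.1.1') {
    printf "%-12s %s\n", $ip, classify_ipv4($ip);
}
```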

Happy Perl Coding!!!

Monday, June 04, 2007

Update on curing the hunger

You know, it is never a good idea to make rash judgments. They always tend to lead to something you didn't want or expect. That is one reason I said in my previous post that I would look into the Podcast idea.

Over the weekend, I listened to an older Perlcast that Josh McAdams made, where he interviewed Randal Schwartz. Whether that was made before or after Stonehenge started supporting Perlcast is for them to know. Either way, in the interview, Josh asked Randal his opinion on Perlcast, and he said that he actually really liked Perlcast and the idea of reporting the Perl community news for all to listen to. He also mentioned that it wasn't trying to do Perl instruction over an audio feed, which Randal did not find any use in.

Hearing that, I thought longer and harder about what I was leaning towards and realized that he is completely and totally correct. You have to see code to learn it. That is why there are so very many resources and documents on the web, and also so very many books, on the topic of Perl.

I am one of those people who, when I come across something I want to know more about or somehow think of an idea for a pet project, write it down somewhere. These scribblings are actually gathered into lists. Yes, I do eventually get to them, but sometimes it takes a while. Usually, if I need something and it's on a list, or if I get bored, I tend to reference my list(s) for something to do. I took a look this weekend at my list of pet projects and did see a couple on there that would afford me the opportunity to delve into some modules that I have been wanting to play with. So, that is where I will turn to cure my hunger pangs, instead of trying to create a podcast that wouldn't be all that useful. (Sure, a video podcast would do the trick, but I have neither the resources nor the hosting to hold the files, so that is definitely out of the question at the moment.)

Anywho, back to work at learning Perl. Happy Monday everyone!!!

Friday, June 01, 2007

The hunger and thirst for everything Perl

I have been playing with Perl for about 2 years. I wrote a Perl script at my last job that just set up some directories and created some log files, nothing extravagant about it. Unfortunately, my last job really left little time for coding, or learning to code Perl for that matter, given the insane and overwhelming workload that we were faced with. Even attempting to pick up some Perl while on lunch proved to be a daunting and sometimes impossible task. The only time I could really focus and learn some Perl was while at home after the kids were in bed asleep. Then and only then was I able to concentrate, provided I was awake enough to do so after being mentally fried from all of the day's events.

Thankfully, the new job that I started a couple of weeks ago has afforded me nothing but the opportunity to code in Perl. Most of what I do is Perl coding and I am happier than a clam. I have learned so much in the last two weeks that I am beside myself at my adjustment to the learning curve. Before starting the job I had obtained brian d foy's "Student Workbook" that accompanies the Llama book (Learning Perl). The Llama book has questions and exercises at the end of every chapter, but the Student Workbook contains more exercises to assist with getting you in a better Perl state of mind.

Well, having completely turned my 'learning' attentiveness towards the realm of Perl, I have been thirsting for everything that I can get my hands on to satisfy my burgeoning hunger for Perl knowledge. One of the resources I turned to in hopes of "data....input" was podcasts. Upon searching, though, I only found one true podcast related to Perl and that is "Perlcast". Don't get me wrong, Perlcast is absolutely AWESOME, even having a very prominent member of the Perl community, Randal Schwartz, as the roving reporter reporting all the Perl news that's fit to report. Other than reporting the latest Perl news, Perlcast also does interviews with members of the Perl community and segments from conferences.

For me, though, this just isn't enough. I want MORE! I have searched a fair amount of the web and unfortunately cannot find any other Perl-related podcasts. What was I hoping for? Believe it or not, I was hoping for a podcast on Perl that covered a different topic in each episode. Maybe go over topics such as:

  • Scalars
  • Arrays
  • Hashes
  • Modules (creating)
  • Modules (many episodes that each pick a different module and go over it in detail)
  • Regular Expressions
  • and so on, and so forth
There is such a multitude of possible topics, that the number of episodes could and would be immense.

Sure, some out there are probably saying, "So why aren't you producing these podcasts?" I have thought of that and, believe it or not, I am still thinking about it. My biggest issue would be where to host it, as I would need enough space to hold all of the podcast files. Yes, this area would be new to me and I don't have much expertise in creating these, so please bear with me. If anyone has any information, such as software recommendations for creating podcasts (OSS preferably) or a good place to host or share them, that would be great.

No, this is not an "I will do it"; this is more of an "I will look into the possibility". This would benefit me as well, as I am still up and coming. I guess there is no better way to cure your hunger than to make your own meal(s).
 
Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.