Thursday, May 15, 2014

How To Recursively Download A Website / Web Directory

On the web you can find plenty of open directories to browse, or directories where a company keeps its files for download.  Using Google's built-in advanced search operators, you can tell it exactly what you want to find.  For instance, to find open directories that contain jpg and gif images, you could use:

     -inurl:(htm|html|php) intitle:"index of" +"last modified" +"parent directory" +description +size +(jpg|gif)

If you are interested in what you can do with your Google search, check out Google's Search Operators page.  It is certainly enlightening for those wanting to go beyond the basic search.

The problem with browsing such a directory is that you have to click on each file to download it.  You could write a script in your favorite language to walk the directory and download every file of a given type.  Or, if you have the wget application on your computer, you can download the whole directory listing without clicking on a single file.  wget has a number of options for this job, and you should spend a little time reading its manpage to familiarize yourself with them.
For our purposes, we are going to use wget with the following options:

     wget -r --no-parent <url>

The '-r' option tells wget to retrieve files recursively.  There is also a '-l' (or '--level') option to specify the maximum recursion depth; the default is 5 levels, and we are sticking with that default here.
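If you want a shallower crawl, you can set the depth explicitly.  For example, this command (the URL is just a placeholder) stops the recursion two levels down:

     wget -r -l 2 --no-parent http://example.com/pub/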

The --no-parent option tells wget not to follow the parent directory link ( '..' ), which limits the recursion to the directory structure you are in: the place you started (or rather, the place you specified in the URL) becomes the top level.  The <url> is the address of the directory you want to download from.
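So, assuming an open image directory at a made-up address, the basic invocation would look like this:

     wget -r --no-parent http://example.com/pub/images/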

Now, wget obeys robots.txt by default, which means website owners can control where you can and cannot go, among other things.  Of course, you can turn off that obedience with the '-e robots=off' option.  Here is the above command, disobeying robots.txt:

     wget -r --no-parent -e robots=off <url>

Please know that using this option is rude and could get you (and/or your IP address) banned from a site, should the owner find out, especially if you overload their server with your downloading.  Regardless of your intentions, keep in mind that people who can't see you and don't know you have no idea whether you're a good guy or a bad guy.
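If you do go that route, wget at least lets you throttle yourself with the '--wait' (seconds between requests) and '--limit-rate' (bandwidth cap) options.  The values below are just sensible starting points, and the URL is again a placeholder:

     wget -r --no-parent -e robots=off --wait=2 --limit-rate=200k http://example.com/pub/images/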

wget is quite flexible and useful.  Read the manpage and it will become one of the many awesome tools in your arsenal.


Update:  One more thing...

If you are downloading the contents of an open directory structure and the download fails partway through, you will probably want to continue from where you left off without re-downloading everything.

You will want two other options for this: '-nc' and '-t 0'.  The '-nc' option stands for no-clobber: when wget sees that a file already exists locally, it will skip it rather than download it again.  Note that this is not an option for updating files; it simply avoids re-downloading when you don't care about newer copies.

The '-t 0' option sets the number of retries to infinite, so wget will keep re-requesting a file if the site shuts you out or gets bogged down.  If a download hangs outright, though, you may still have to kill the process and restart it.
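Combined with the earlier options, a restartable run might look like this (the URL is still a placeholder):

     wget -r --no-parent -nc -t 0 http://example.com/pub/images/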
