Thursday, May 22, 2014

Update: Python Virtualenv Setup Project

I have been doing a bit of work over the last day or so on my script for setting up a Python development environment, called "Python Virtualenv Setup".  When I first mentioned the project here several posts ago, I said I was planning to add support for auto-creating a Bitbucket repository during the script's run.

Well, I not only added support for creating a Bitbucket repository, I added support for GitHub as well, so both of the popular, competing services are now supported.  I even went so far as to have the script run a 'git init' in the project directory for you when you create one of the remote repos, and it also provides the command for setting your origin so you can push your code to your new remote repository.
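
The origin-setting command it hands you looks something like this sketch (assuming a hypothetical Bitbucket user 'username' and repository 'myproject'; a GitHub remote differs only in the host):

     git remote add origin git@bitbucket.org:username/myproject.git
     git push -u origin master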

I feel like Tim Robbins in Antitrust, "God, I love this stuff!".

Thursday, May 15, 2014

How To Recursively Download A Website / Web Directory

On the web you can find plenty of open directories to browse, or directories where a company keeps its files for download.  Google's built-in advanced search operators let you tell it exactly what you want to find.  For instance, to find open directories containing jpg and gif images, you could search:

     -inurl:(htm|html|php) intitle:"index of" +"last modified" +"parent directory" +description +size +(jpg|gif)

If you are interested in what you can do with your Google search, check out Google's Search Operators page.  It is certainly enlightening for those wanting to go beyond the basic search.

The problem with browsing such a directory is that you need to click on each file to download it.  You could write a script to walk the directory and download all files of a specific type.  Or, if you have the wget application on your computer, you can download the whole directory listing without clicking on every file.  wget has a number of options to do its job, and you should spend a little time reading its manpage to familiarize yourself with them.
For our purposes, we are going to use wget with the following options:

     wget -r --no-parent <url>

The '-r' tells wget to retrieve files recursively.  There is a '-l' (or '--level') option to specify the maximum recursion depth; the default is 5 levels, and we are sticking with that default.
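
If you want a different depth, you can pass it explicitly; the depth of 3 below is just for illustration:

     wget -r -l 3 --no-parent <url>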

The '--no-parent' option tells wget not to follow the parent directory link ('..'), which limits the recursion to the directory structure you are in, making the directory you specified the top level.  The <url> is the URL of the directory you want to download from.

Now, wget obeys robots.txt by default, which is one way website owners control where you can and cannot go, among other things.  Of course, you can turn off obeying robots.txt with the '-e robots=off' option.  Here is the above command, disobeying robots.txt:

     wget -r --no-parent -e robots=off <url>

Please know that using this option is rude and could get you (and/or your IP) banned from a site should the owner find out, especially if you overload the site with your downloading.  Regardless of your intentions, keep in mind that those who can't see you or don't know you have no idea whether you're a good guy or a bad guy.

wget is quite flexible and useful.  Read the manpage and it will become one of the many awesome tools in your arsenal.


Update:  One more thing.....

If you are attempting to download the contents of an open directory and your download fails, you will probably want to continue from where you left off without re-downloading everything.

You will want two other options for this: '-nc' and '-t 0'.  The '-nc' option means no-clobber: wget will not re-download a file if a copy already exists locally.  Note that this is not an option for updating files, just for skipping re-downloads when you don't care about updates.

The '-t 0' option tells wget to retry a download indefinitely, re-requesting the file until it comes through.  You may find that a site shuts you out or gets bogged down; this option keeps the requests going.  If a download hangs outright, you may have to kill the process and restart it.
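
Putting the pieces together, a recursive download you can safely restart after a failure would look like this:

     wget -r --no-parent -nc -t 0 <url>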

Monday, May 12, 2014

A Couple Of virtualenv Notes

A little while ago I posted about starting a Python project and linked to a Bitbucket project of mine that can be used to create a virtual environment (and have the environment auto-activate upon entering the project directory).  I wanted to share a couple of things I learned, one of which bit me when working on a recent Python script.

I was working on said script and kept getting errors that made no sense, telling me that the module I was working with was not installed.  I called 'bull', as I knew I had installed it in the environment using pip.  Well, I thought I had.  What bit me is that I had forgotten that when you are working in a Python virtual environment, you NEED to use the pip and python executables installed under the environment (in the environment's bin directory).  If you just call pip or python, you may get the system versions by default.
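
For instance, assuming the environment lives in an 'env' directory inside the project (that name is just an assumption; substitute your own), the difference looks like this:

     # May resolve to the system pip, installing outside the environment:
     pip install requests

     # Explicitly uses the environment's executables:
     ./env/bin/pip install requests
     ./env/bin/python myscript.py

('requests' and 'myscript.py' are just stand-ins.  Running 'source env/bin/activate' first puts the environment's bin directory at the front of your PATH, which accomplishes the same thing.)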

This was driven home when I ran 'pip freeze' to see what modules I had installed, and the list was looooooong.  I quickly realized I was not using the pip I thought I was.  So, always make sure you reference the correct executable.  One other gotcha to remember: once you develop your application, if you move it out of the virtual environment, you need to fix any paths that point to the environment's Python interpreter.
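
For instance, a script written inside the environment might carry a hypothetical shebang line like:

     #!/home/user/myproject/env/bin/python

Move the script elsewhere and that path is dead; pointing it at the system interpreter (or using '#!/usr/bin/env python') fixes it.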

Consider that a lesson learned for me.  Hopefully it will allow you to forego learning it too painfully (or just annoyingly).
 
This work is licensed under a Creative Commons Attribution 3.0 License.