Wednesday, August 01, 2007

Checking for duplicates

If there is one thing that I love about Perl, it is that there is always something new to learn. In my case, I like it to be a few things every day, but that is just me.

In my last post, I mentioned one-liners and that I was working with some code that was rather puzzling to figure out. Well, I figured it out with the help of Learning Perl, 3rd Edition. I have said it many times before and I will say it again: as much as the Camel book is famed as the "Bible of Perl", I tend to keep the Learning Perl book much closer to my keyboard.

The one-liner that I was working on figuring out was as follows:

perl -e '$count=0; while (<>) {if (! ($var{$_}++)) {print $_; $count++;}} warn "\n\nRead $. lines.\nTook union and removed duplicates, yielding $count lines.\n"' ./file1 ./file2.txt > ./combined.txt

This code is supposed to take in the two files (file1 and file2.txt) and combine them into one file (combined.txt), removing any duplicate lines along the way. What puzzled me was HOW IS IT DOING IT? Yes, if you are wondering, it does work. Any Perl gurus out there are already nodding their heads, as they probably already know how.

The magic of this code is in the "$var{$_}++". Here is what happens: the code reads the files line by line and uses each line as a key in the hash %var. The ++ is a post-increment, so the expression hands back the value the key held before the increment and only then adds one. The first time a line shows up, the key does not exist yet, so the old value is undef, which is false. The "!" flips that to true, so the line is printed and counted, and the increment leaves 1 stored under that key. When the same line turns up again, the old value is 1 (or higher), which is true, so the negated test is false and the line is not written to the output. It's a little confusing, I know, but it works and it is how it was designed to work. Personally, I think it's a great, short system for removing duplicates.
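
If it helps to see the one-liner stretched out, here is a rough sketch of the same idea as a small script. It is only an illustration: the sample lines are made up, and %seen, @unique and $old are names I picked for this example (they stand in for %var and the print in the original). The point is just to show what the post-increment returns each time around.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the lines read from the two files.
my @lines = ("apple\n", "banana\n", "apple\n", "cherry\n", "banana\n");

my %seen;      # plays the role of %var in the one-liner
my @unique;    # the lines the one-liner would have printed

for my $line (@lines) {
    # Post-increment returns the OLD value, then adds one.
    # First sighting: old value is undef (false), so the line is kept.
    # Later sightings: old value is 1 or more (true), so the line is skipped.
    my $old = $seen{$line}++;
    push @unique, $line unless $old;
}

# Unique lines go to STDOUT, the summary goes to STDERR (like warn in the one-liner).
print @unique;
warn sprintf("Read %d lines, kept %d unique lines.\n", scalar @lines, scalar @unique);
```

After the loop, %seen holds a count of how many times each line appeared, which is a handy side effect if you ever want to know which lines were the duplicates.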

If you still have questions, I recommend you look at the example on page 153 of Learning Perl, 3rd Edition. Yes, I know they are up to 4th Edition, but I have my 3rd edition copy with me at the moment.

Happy Coding!!

This work is licensed under a Creative Commons Attribution 3.0 License.