For a number of our projects we have to either clean data or transform it from a number of different sources and formats into subtly different formats. While it is possible to write custom Python, Perl, Java, or AWK scripts to do this, it is sometimes just a little too tedious to get to the bits we're really interested in. Luckily there's a great tool that makes it easy to work with messy and/or disparate data sources in a spreadsheet-style interface: Google Refine, which calls itself a 'power tool for working with messy data'.
While it looks deceptively like a spreadsheet, it is anything but. Spreadsheets are great for working with data, doing repeated calculations, and making pretty graphs. Refine is great at working with lots of data, applying lots of slightly different transformations, and understanding the breadth of your data. It isn't made for making graphs, though you can even do that if you really want to.
I recently used Google Refine to help populate some species (or taxon) pages in the BDRS. To do this, I used the taxon profile importer in the BDRS, which takes a list of LSIDs, queries the ALA, and populates species pages from that content. This function was created to help the ALA staff (and us!) help other groups get taxon pages up and running quickly, building on the significant investment that the ALA has already made.
To do that, I knew I could get a list of LSIDs from the ALA's name services, running at biodiversity.org.au. I searched for the class Insecta, and used Google Refine to parse the returned JSON and export the data in a format that I could copy and paste directly into the BDRS to populate the taxon pages.
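Outside Refine, the JSON-to-LSID extraction step can be sketched in a few lines of Python. The sample payload and field names below are illustrative assumptions only; the actual biodiversity.org.au response will differ in structure:

```python
import json

# Illustrative sample of a name-search response; the real
# biodiversity.org.au payload will have a different shape
# and different field names.
sample = """
{
  "results": [
    {"scientificName": "Apis mellifera",
     "lsid": "urn:lsid:biodiversity.org.au:afd.taxon:aaaa-1111"},
    {"scientificName": "Danaus plexippus",
     "lsid": "urn:lsid:biodiversity.org.au:afd.taxon:bbbb-2222"}
  ]
}
"""

def extract_lsids(payload: str) -> list[str]:
    """Parse the JSON string and pull out one LSID per result."""
    data = json.loads(payload)
    return [record["lsid"] for record in data["results"]]

# One LSID per line, ready to feed to an importer.
print("\n".join(extract_lsids(sample)))
```

In Refine the equivalent work is a couple of GREL transforms on a column fetched from the URL, but the logic is the same: parse the JSON once, then pluck the identifier out of each result record.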
The screencast is about 4 minutes long, and I've added annotations. It is a pretty accurate reflection of the time the process takes: you can go from a BDRS with no species pages to one with a whole bunch of them in around 5 minutes. While this doesn't flex Google Refine's muscles very much, it does make use of its ability to pull data from a URL, parse JSON, and export the resulting data in a specified format using templating.
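The templated export boils down to looping over rows and filling in a string template. Here is a minimal Python sketch of that idea, assuming a made-up two-column tab-separated layout; the exact format the BDRS taxon profile importer expects may differ:

```python
# Each record becomes one tab-separated line, mimicking a
# templating-style export. The column layout here (name, LSID)
# is a hypothetical example, not the BDRS importer's actual format.
records = [
    {"name": "Apis mellifera",
     "lsid": "urn:lsid:biodiversity.org.au:afd.taxon:aaaa-1111"},
    {"name": "Danaus plexippus",
     "lsid": "urn:lsid:biodiversity.org.au:afd.taxon:bbbb-2222"},
]

template = "{name}\t{lsid}"
exported = "\n".join(template.format(**record) for record in records)
print(exported)
```

The nice thing about Refine's templating export is that you define the per-row template once and it handles the looping for you, so the output can be pasted straight into whatever the downstream tool expects.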