For a number of our projects we have to either clean data or transform it from a variety of sources and formats into subtly different formats. While it’s possible to write custom Python, Perl, Java or AWK scripts to do this, it’s often just a little too tedious to get to the bits we’re really interested in. Luckily there’s a great tool that makes it easy to work with messy and/or disparate data sources in a spreadsheet-style interface: Google Refine, which calls itself a ‘power tool for working with messy data’. It fits the bill in a number of ways:
- It’s open source so fits in with our corporate policy of using FOSS wherever possible
- It’s cross platform
- While it is browser based, it keeps all the data locally, so we can ensure that we don’t unwittingly send our clients’ data somewhere they don’t want it to go
- It understands a broad range of file formats and web protocols, but doesn’t limit you to those
- It has unlimited undo history
- It has data templating so you can customise your export formats
- It clusters data intelligently making normalisation / deduplication a doddle
- It’s extensible – your transformations can be written in GREL (its own expression language), Jython or Clojure
- It can export in numerous formats and links well with Google Docs and Fusion Tables as well as being compatible with Microsoft Excel and OpenOffice/LibreOffice.
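The clustering mentioned above is worth a closer look. One of the key-collision methods Refine offers is a fingerprinting scheme: normalise each value, split it into tokens, sort and deduplicate them, and group values whose keys collide. A minimal Python sketch of that idea (the exact normalisation Refine applies differs in detail):

```python
import re
import string
from collections import defaultdict

def fingerprint(value):
    """Build a collision key: lowercase, strip punctuation,
    then sort and deduplicate the whitespace-separated tokens."""
    value = value.strip().lower()
    value = re.sub("[%s]" % re.escape(string.punctuation), "", value)
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values whose fingerprint keys collide."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [group for group in groups.values() if len(group) > 1]

names = ["Acme Ltd.", "acme ltd", "Ltd Acme", "Widget Co"]
print(cluster(names))  # the three Acme variants fall into one cluster
```

Because the key ignores case, punctuation and token order, near-duplicate entries like ‘Acme Ltd.’ and ‘Ltd Acme’ collapse into one candidate cluster for you to review and merge.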
While it looks deceptively like a spreadsheet, it is anything but. Spreadsheets are great for working with data, doing repeated calculations and making pretty graphs. Refine is great at working with lots of data, applying lots of slightly different transformations, and understanding the breadth of your data. It’s not made for making graphs, but you can even do that if you really want to.
I recently used Google Refine to help populate some species (or taxon) pages in the BDRS. To do this, I used the taxon profile importer in the BDRS, which takes a list of LSIDs, queries the ALA, and populates species pages from the returned content. This was a function created to help the ALA staff (and us!) help other groups get taxon pages up and running quickly, building on the significant investment that the ALA has made already.
To do that, I knew I could get a list of LSIDs from the ALA’s name services, running at biodiversity.org.au. I searched for the class Insecta, and used Google Refine to parse the JSON response and export the data in a format that I could copy and paste directly into the BDRS to populate the taxon pages.
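The parsing step above can be sketched outside Refine as well. The snippet below assumes a made-up response shape – the real payload from the name services at biodiversity.org.au will have different field names – but it shows the core move: parse the JSON, then pull the LSID out of each search result.

```python
import json

# Hypothetical response shape from a name-service search; the actual
# ALA payload differs, so treat the field names as illustrative only.
response = json.loads("""
{
  "searchResults": [
    {"scientificName": "Apis mellifera",
     "lsid": "urn:lsid:biodiversity.org.au:afd.taxon:0001"},
    {"scientificName": "Danaus plexippus",
     "lsid": "urn:lsid:biodiversity.org.au:afd.taxon:0002"}
  ]
}
""")

# Extract one LSID per result, ready to feed to the importer.
lsids = [result["lsid"] for result in response["searchResults"]]
for lsid in lsids:
    print(lsid)
```

In Refine itself the equivalent is a GREL `parseJson` expression on the fetched column, but the shape of the transformation is the same.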
The screencast is about 4 minutes long, and I’ve added annotations. This is a pretty accurate reflection of the time the process takes: you can go from a BDRS with no species pages to one with a whole bunch of them in around 5 minutes. While this doesn’t flex Google Refine’s muscles very much, it does make use of its ability to pull data from a URL, parse JSON, and export the resulting data in a specified format using templating.