A view from the trenches of Collection Management

When I started Gaia Resources, one of my biggest projects was supporting the Western Australian Museum’s collections databases.  This work has sent me to a range of the Museum’s sites, especially the unassuming Joobaitch House in Welshpool, the Museum’s Collections Research Centre (CRC).

The CRC in Welshpool on a grey winter’s day

From some of our old blog articles (like this one and this one) you’d know that we’ve been working with the Museum to migrate their Collection Management System from the old Microsoft Access / SQL Server database to an online system, using CollectiveAccess.

This week I’ve been on site in Welshpool again, and my job this week has been to start preparing the documentation and training materials for the Museum in this new system.  So Kehan, Ben and I thought it would be a good time to provide an update on the project while we’re all out here on site, in the trenches, so to speak, and to focus on the highlights of the project to date: the data mapping, plugins and unit tests that Ben and Kehan have been working on.  They have written up the details below (beware: technical content incoming…).

Data Mapping

The largest part of the project so far has involved mapping the Museum’s current collections data set into CollectiveAccess.  The current data set is widely varied, including data from other collection systems, Microsoft Access databases, FileMaker Pro data files, Microsoft Works databases and Excel spreadsheets.  We worked with the in-house database team at the Museum, who are processing each of the collections and delivering them into a SQL Server database instance that preserves the (often flat) structure of the original source data in each case, fixing certain anomalies in the source data along the way.  Although CollectiveAccess does support a range of data sources, this has given us a standard starting point for each of the Museum’s various collections.

Every field in each of these source databases is currently being inspected, and the correct mapping to CollectiveAccess is being determined and entered into a set of mapping spreadsheets.  Our installation profile, which describes the structure of our data and the fields associated with the various records, is being updated as well so that the database is prepared for the import process.  The source data will then be imported into CollectiveAccess using the installation profile, the mapping spreadsheets and a lot of processing time – creating the initial state of the production version of CollectiveAccess.

To provide some detail, there were many fields that we determined should be a list field (i.e. the user selects values from a list, possibly also being able to add values to the list).  In these cases, we inspected the source data using Google Refine.  Sometimes the data was clean, but often the values were inconsistent, usually as a result of the field being plain text rather than a list field in the source system.  For example, a field containing the sex determination of a specimen might contain in the source data “M”, “m”, “male”, “M.”, “man”, and so on.  The CollectiveAccess import framework allows a simple one-to-one value mapping for cases such as this (“m” => “M”, “male” => “M”, etc).
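The one-to-one value mapping above amounts to a simple lookup table.  As a language-neutral sketch (the real mapping lives in the CollectiveAccess import spreadsheets, and the field and value names here are just our example):

```python
# Hypothetical one-to-one value mapping for a "sex" field, as described
# above. Unknown values are passed through so they can be flagged later.
SEX_VALUE_MAP = {
    "m": "M", "m.": "M", "male": "M", "man": "M",
    "f": "F", "f.": "F", "female": "F", "woman": "F",
}

def normalise_sex(raw):
    """Map a free-text sex determination to a canonical list value."""
    key = raw.strip().lower()
    return SEX_VALUE_MAP.get(key, raw)  # leave unrecognised values untouched
```

Lower-casing and trimming before the lookup keeps the table small – “M.”, “m.” and “ M. ” all hit the same entry.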

Part of the Import Mapping for the Zoology Collection – using tools like Google Drive to collaborate

In more complex cases, some algorithm was needed to determine the correct value based on the input, and in some cases there wasn’t a straight one-to-one relationship between fields.  Many source fields have been used to populate a single field within CollectiveAccess, and vice versa.  In some more complex cases, such as with biological taxonomy, a hierarchy is defined by a flat set of fields in the source data and has to be processed to suit.  Other fields needed their values adjusted, such as truncating a repeated token, stripping control characters or applying regular expressions.
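The taxonomy case is the clearest example of a flat-to-hierarchical transform.  A rough sketch of the idea (the rank and column names are our assumptions, not the Museum’s actual schema):

```python
# Illustrative only: turn flat taxonomy columns from a source row into an
# ordered hierarchy path, skipping ranks the source left blank.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]

def taxonomy_path(row):
    """Return the non-empty ranks, in rank order, as a hierarchy path."""
    return [row[rank] for rank in RANKS if row.get(rank)]

row = {"phylum": "Chordata", "class": "Mammalia", "order": "",
       "family": "Macropodidae", "genus": "Macropus", "species": "rufus"}
# taxonomy_path(row) yields each ancestor in order, ready to be created
# (or matched) as parent records in the target hierarchy.
```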

Some other data cleaning was also carried out using routines in Pentaho Data Integration, an ETL suite that lets you create reusable pipelines for manipulating data.

CollectiveAccess has a powerful data import framework (see their wiki for details) that provides tools to deal with these situations.  There are many “refineries” provided, to perform common data mapping tasks, and a template engine which allows data to be constructed using values from any number of fields.  There is also an API to develop any refinery that you find lacking, or which suits the specific needs of your data migration.

As an example, we developed a refinery that reads a “date” field, in which the values are always fully formed dates, and a “date accuracy” field, which is a code indicating whether the date is accurate to the day, month or year level.  CollectiveAccess has a date data type that allows values that aren’t fully formed, such as “2014/05” or even just “2014”.  Our “Date Accuracy Joiner” takes the date and the accuracy and truncates the uncertain part of the date; for example, “2014/05/01” with “month” accuracy would result in “2014/05”.
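The core of that truncation logic is small; stripped of the CollectiveAccess refinery plumbing, it looks something like this (function and accuracy-code names are ours for illustration):

```python
# Sketch of the "Date Accuracy Joiner" logic: keep only the parts of a
# fully formed YYYY/MM/DD date that the accuracy code says are certain.
def join_date_accuracy(date, accuracy):
    """Truncate a YYYY/MM/DD date to 'year', 'month' or 'date' accuracy."""
    parts = date.split("/")                     # e.g. ["2014", "05", "01"]
    keep = {"year": 1, "month": 2, "date": 3}[accuracy]
    return "/".join(parts[:keep])
```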


We have some requirements to generate additional data based on user-entered data, to help reduce repetitive data entry and maintain consistency.  CollectiveAccess includes a plugin mechanism, which allows custom code to hook into various parts of the workflow, including when data is manipulated, when a page is loaded, and for periodic tasks.  We have developed two plugins:

  • A configurable title generator, which assigns a generated “preferred label” value to records (including objects, collections, events, etc) based on their type, assigned collection(s), and accession details.  CollectiveAccess has historically provided several “title generator” plugins, however they were not as configurable as our contribution and were much more customised to specific databases.
  • A configurable relationship generator, which automatically adds and removes relationships for an object to any other identifiable record (including objects, collections, events, etc).  This is highly generic and can be configured to base its decision-making process on any available data related to the record being saved.
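To give a feel for what “configurable decision-making” means for the relationship generator, here is a deliberately simplified, hypothetical sketch.  The real plugin hooks into CollectiveAccess’s save workflow and reads its rules from a configuration file; the rule format and record fields below are entirely made up:

```python
# Hypothetical rule-driven relationship generation: each rule pairs a
# predicate over the record's data with the related record to link to.
RULES = [
    (lambda rec: rec.get("collection") == "Zoology", "Zoology Collection"),
    (lambda rec: rec.get("accession_year", 9999) < 1950, "Historic Accessions"),
]

def generated_relationships(record):
    """Return the targets every matching rule says this record relates to."""
    return [target for predicate, target in RULES if predicate(record)]
```

Because the predicates can inspect any available data on the record being saved, the same mechanism covers very different collections without code changes.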

An interesting aspect of these two plugins is that the output of the relationship generator can be used as input by the title generator.  That is, a relationship might be generated that affects the generated title for a given object.  There is no mechanism for causing the plugins to fire in a particular sequence; however, we found that this did not matter because, upon adding the relationship to the object, the object is saved again, which causes all the plugins to fire again, so the title is updated regardless of the sequence of plugins.  However, this form of recursion means that plugins of this type (i.e. those that modify the database) need to be careful to call the `update()` method only when there is an actual change, otherwise an infinite loop will result.
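The guard against that infinite loop is a change check before the save.  A minimal sketch of the pattern, with records modelled as plain dicts and `update` standing in for the real save call:

```python
# Only persist when something actually changed: a plugin that re-triggers
# the save hook then becomes a no-op on the second pass instead of looping.
def maybe_update(record, new_values, update):
    """Apply new_values and call update() only if they change the record."""
    changed = {k: v for k, v in new_values.items() if record.get(k) != v}
    if changed:
        record.update(changed)
        update(record)   # fires the plugins again, but next pass changes nothing
        return True
    return False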

We will be developing a much more complex plugin in the coming weeks, so watch this space.  We have already contributed some of these back to the CollectiveAccess repositories, and we will continue to do so, so that they become freely available.

Unit Tests

To confirm the correctness and validity of any code, the current best practice is to write unit tests for that code.  Unit tests typically exercise small sections of code (“units”) by abstracting away complex or temporal fixtures like databases and other portions of code, providing known inputs to the code, and testing its output against known expectations.

Enough said, right?

Unit tests should cover the range of possible inputs, including valid and invalid inputs.  For example, a unit test for a function that adds lists of numbers together might include cases for positive, negative and zero numbers as inputs, lists of varying length including empty lists, and lists containing non-numeric items (“unhappy path testing”, where the unit is expected to fail with an appropriate exception or error message).  Unit tests are often contrasted with integration tests, which test the interaction between components within the software, or the system as a whole (“end-to-end tests”).
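The list-adding example above, written out as an actual test case (in Python’s unittest for illustration; the CollectiveAccess tests themselves use PHPUnit):

```python
import unittest

def sum_list(numbers):
    """Add a list of numbers; raise TypeError for non-numeric items."""
    total = 0
    for n in numbers:
        if not isinstance(n, (int, float)):
            raise TypeError(f"non-numeric item: {n!r}")
        total += n
    return total

class SumListTest(unittest.TestCase):
    def test_mixed_signs_and_zero(self):
        self.assertEqual(sum_list([3, -1, 0]), 2)

    def test_empty_list(self):
        self.assertEqual(sum_list([]), 0)

    def test_non_numeric_item_raises(self):
        # The "unhappy path": the unit must fail with a clear error.
        with self.assertRaises(TypeError):
            sum_list([1, "two", 3])
```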

CollectiveAccess is a software system with a long history, and has only a small number of existing unit tests, mostly covering the most fundamental classes and functions.  The unit tests are written using the well-known testing framework PHPUnit.  There was no existing mechanism for unit testing plugins, and in fact the tight coupling of much of the code with the database means that “true” unit tests were difficult or impossible to write for much of the plugin code.

To solve this problem, we have developed a pattern for writing tests for the plugins that are in some ways unit tests, but in many other ways are more correctly classified as integration tests.  These tests expect a working database, and will generate any data required for the test, inserting it into the database in the setup phase and removing it in the cleanup phase.  In this way, the test should leave the database in very close to the same state as it was before the test was run.  The exception is that sequence numbers (such as those for auto-generated id values) will have progressed, but that is not a concern in most situations.  This pattern is implemented as an abstract base class for plugin integration tests; tests for specific plugins should extend this base class and use the documented pattern to define the `setUp()` and `tearDown()` methods.
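Our base class is written for PHPUnit, but the pattern translates directly to any xUnit-style framework.  A self-contained sketch in Python’s unittest, with a dict standing in for the real database (all names here are illustrative):

```python
import itertools
import unittest

# Stand-in for the real database: just an id -> row mapping plus a sequence.
DATABASE = {}
_ids = itertools.count(1)

def insert_row(row):
    row_id = next(_ids)
    DATABASE[row_id] = row
    return row_id

def delete_row(row_id):
    del DATABASE[row_id]

class PluginIntegrationTestBase(unittest.TestCase):
    """Insert fixture rows before each test and remove them afterwards."""
    fixture_rows = []  # subclasses override with the rows they need

    def setUp(self):
        self.inserted_ids = [insert_row(r) for r in self.fixture_rows]

    def tearDown(self):
        # Remove everything we created; only advanced sequence numbers
        # remain - the one side effect the pattern accepts.
        for row_id in self.inserted_ids:
            delete_row(row_id)

class ExamplePluginTest(PluginIntegrationTestBase):
    fixture_rows = [{"title": "Specimen A"}]

    def test_fixture_is_visible(self):
        self.assertIn({"title": "Specimen A"}, DATABASE.values())
```

The point of the base class is that each concrete test only declares the data it needs; insertion and cleanup are inherited, so the database comes back clean even when a test fails mid-way.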

Again, we’ll be submitting these back to the CollectiveAccess community down the track.


So now you have a better idea of some of the work being done behind the scenes on this project.

We’re getting towards the crunch time, where we are going to be heading into a production instance soon.  Data is starting to flow in via the mappings, new functionality is appearing from the plugins, and we are starting to see what the system is actually going to look like.  The project is at that exciting time when it all starts to come together and we see changes happening readily and rapidly.

While I’m documenting the systems and coming up with training material, I’m also testing the system from the end user perspective (and may even go so far as to start doing some automated testing using Selenium).  So far I’m really impressed with the work and the documentation is going smoothly, and I’m sure that this is going to lead to a nice customised system for the curators and staff here at the WA Museum.

So for now from the trenches, we’ll sign off.  Tally ho, chaps!

Piers, Ben and Kehan

Leave a comment below, or drop us a line on the Gaia Resources twitter account or Facebook page.
