Data Science is a large and growing multidisciplinary field that employs scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It aims to unify data analysis, machine learning and related methods to understand the complexity of the world through large, often aggregated datasets. Data Analytics is the discovery, interpretation, and communication of meaningful patterns in data, and together the two disciplines are especially valuable in areas rich with recorded information. We’ve done a lot of work in both Data Science and Data Analytics at Gaia Resources over the years.
What prompted me to focus on Data Science and Analytics in this week’s blog is the imminent publication of a paper I contributed to – ‘AusTraits – a curated plant trait database for the Australian flora’ (Falster D et al., 2021 – in press). As the paper says:
“AusTraits synthesises data on 375 traits across 29230 taxa from field campaigns, published literature, taxonomic monographs, and individual taxa descriptions. Traits vary in scope from physiological measures of performance (e.g. photosynthetic gas exchange, water-use efficiency) to morphological parameters (e.g. leaf area, seed mass, plant height) which link to aspects of ecological variation. AusTraits contains curated and harmonised individual-, species- and genus-level observations coupled to, where available, contextual information on site properties. This data descriptor provides information on version 2.1.0 of AusTraits which contains data for 937243 trait-by-taxa combinations. We envision AusTraits as an ongoing collaborative initiative for easily archiving and sharing trait data to increase our collective understanding of the Australian flora.”
Colleagues from the Western Australian Herbarium and I were invited to contribute our data from the Descriptive Catalogue initiative, which provided a small number of observed traits for c. 12,000 WA plant taxa. To my mind, one key strategy for data science is to develop and maintain major datasets in a way that lets them feed into even larger integrative projects such as AusTraits for further analysis, as we outline in the paper:
“AusTraits version 2.1.0 was assembled from 351 distinct sources, including published papers, field campaigns, botanical collections, and taxonomic treatments. Initially, we identified a list of candidate traits of interest, then identified primary sources containing measurements for these traits. As the compilation grew, we expanded the list of traits considered to include any measurable quantity that had been quantified for a moderate number of taxa (n > 20). To harmonise each source into a common format, AusTraits applied a reproducible and transparent workflow – a custom workflow to clean and standardise taxonomic names using the latest and most comprehensive taxonomic resources for the Australian flora: the Australian Plant Census (APC) and the Australian Plant Names Index (APNI).”
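To give a feel for what that kind of harmonisation involves, here is a minimal Python sketch. It is not the AusTraits workflow itself: the tiny `REFERENCE_NAMES` lookup is a hypothetical stand-in for a taxonomic resource such as the APC, and the function names and record fields (`taxon`, `trait`, `value`) are assumptions made purely for illustration. It shows two of the ideas from the quote: mapping raw names to accepted names, and keeping only traits recorded for more than 20 taxa.

```python
from collections import defaultdict

# Hypothetical stand-in for a taxonomic reference such as the APC:
# maps a cleaned name (accepted or synonym) to its accepted name.
REFERENCE_NAMES = {
    "eucalyptus marginata": "Eucalyptus marginata",
    "banksia grandis": "Banksia grandis",
    "dryandra sessilis": "Banksia sessilis",  # synonym -> accepted name
}


def clean_name(raw):
    """Collapse runs of whitespace and lower-case a raw name for lookup."""
    return " ".join(raw.split()).lower()


def standardise(records, reference=REFERENCE_NAMES):
    """Split records into those matched to an accepted name and the rest."""
    matched, unmatched = [], []
    for rec in records:
        accepted = reference.get(clean_name(rec["taxon"]))
        if accepted is None:
            unmatched.append(rec)  # left for manual review
        else:
            matched.append({**rec, "taxon": accepted})
    return matched, unmatched


def traits_with_enough_taxa(records, min_taxa=20):
    """Return the traits recorded for more than min_taxa distinct taxa."""
    taxa_per_trait = defaultdict(set)
    for rec in records:
        taxa_per_trait[rec["trait"]].add(rec["taxon"])
    return {t for t, taxa in taxa_per_trait.items() if len(taxa) > min_taxa}
```

In a sketch like this, names that fail to match are set aside rather than guessed at, so they can be reviewed by someone who knows the taxonomy.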
The AusTraits project is hosted by the Australian Research Data Commons (ARDC), which was formed in July 2018. The ARDC is “a transformational initiative that aims to enable the Australian research community and provide industry access to nationally significant, leading-edge data-intensive eInfrastructure, platforms, skills and collections of high-quality data”. This hosting contributes to the maintenance aspect I mentioned above.
Full details on those processes will be available in the forthcoming publication, a link to which I’ll add when it becomes available. Meanwhile, if you’d like to know more about this project, or about what we can offer in the Data Science and Analytics areas, please drop me a line at email@example.com, or connect with us on Twitter, LinkedIn or Facebook.