Award winner uses data mining and machine learning to identify collectors and duplicated herbarium specimens


Nicky Nicolson, a PhD candidate from the United Kingdom at Brunel University London, is one of two recipients of the Global Biodiversity Information Facility (GBIF) Young Researchers Award for 2019.

As a staff member at the Royal Botanic Gardens, Kew, Nicolson is well known within the GBIF community. She recently moved from a technical software development role to that of senior research leader in biodiversity informatics while pursuing her graduate studies.

For centuries, naturalists collecting plants in the field have generally gathered at least five or six representative samples of any specimen. After drying and preparing the specimens and returning to the herbarium, they would then often split up these groups and distribute the individual specimens to other institutions. This strategy maximises researchers’ access to specimens, makes efficient use of collection space and safeguards knowledge against the catastrophic loss of any single collection. However, with collections pursuing independent digitisation processes, links between duplicate specimens remain hidden when shared through the GBIF infrastructure.

Nicolson’s research starts with clustering digital data from herbarium specimens shared through GBIF to identify the collectors responsible for gathering specimens in the field and the expeditions they conducted. Data-mining these implicit entities, which are often not formally managed but are recognised by researchers and specimen collection holders alike, allows the reshaping of data aggregations like GBIF. Effective summaries of large-scale data enable novel outputs such as data visualisations and network graph representations.

Representing this data as a network makes it possible to infer connections and reveal hidden or forgotten aspects of specimen collections. For example, reconciling information about collectors’ names, dates and locations can detect collecting trips and closely related specimens now dispersed and managed independently, even if the vouchers and digital records no longer explicitly record them as coming from the same collection event.
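The reconciliation idea described above can be illustrated with a minimal Python sketch. It is not Nicolson's method: her work uses clustering and machine learning, whereas this sketch swaps in simple name normalisation and exact-match grouping. The field names follow Darwin Core terms (`occurrenceID`, `recordedBy`, `recordNumber`, `eventDate`). The sample records, the helper functions and the seven-day gap threshold are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical specimen records from different herbaria (invented for illustration).
records = [
    {"occurrenceID": "K-1", "recordedBy": "Smith, J.", "recordNumber": "1023", "eventDate": date(1987, 3, 14)},
    {"occurrenceID": "E-7", "recordedBy": "J. Smith", "recordNumber": "1023", "eventDate": date(1987, 3, 14)},
    {"occurrenceID": "P-3", "recordedBy": "John Smith", "recordNumber": "1024", "eventDate": date(1987, 3, 16)},
    {"occurrenceID": "K-2", "recordedBy": "Smith, J.", "recordNumber": "2001", "eventDate": date(1990, 8, 2)},
    {"occurrenceID": "BM-9", "recordedBy": "Jones, A.", "recordNumber": "1023", "eventDate": date(1987, 3, 14)},
]


def normalise_collector(name):
    """Reduce a collector name to 'surname initial' in lower case (a crude heuristic)."""
    cleaned = name.replace(".", " ").strip()
    if "," in cleaned:
        # "Surname, Given" form
        surname, given = [part.strip() for part in cleaned.split(",", 1)]
    else:
        # "Given Surname" form: assume the last token is the surname
        parts = cleaned.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return f"{surname.lower()} {given[:1].lower()}".strip()


def find_candidate_duplicates(records):
    """Group records sharing collector, collector number and date; keep groups of 2+."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalise_collector(rec["recordedBy"]), rec["recordNumber"], rec["eventDate"])
        groups[key].append(rec["occurrenceID"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


def detect_trips(records, max_gap_days=7):
    """Split each collector's dates into runs whose consecutive gaps are <= max_gap_days."""
    dates_by_collector = defaultdict(set)
    for rec in records:
        dates_by_collector[normalise_collector(rec["recordedBy"])].add(rec["eventDate"])
    trips = {}
    for collector, dates in dates_by_collector.items():
        ordered = sorted(dates)
        runs, start, previous = [], ordered[0], ordered[0]
        for current in ordered[1:]:
            if (current - previous).days > max_gap_days:
                runs.append((start, previous))
                start = current
            previous = current
        runs.append((start, previous))
        trips[collector] = runs
    return trips


duplicates = find_candidate_duplicates(records)
trips = detect_trips(records)
```

With these sample records, `K-1` and `E-7` are grouped as candidate duplicates even though the collector's name is written differently in each, while the record from "Jones, A." with the same number and date stays separate. Real collector strings are far messier (teams, transliterations, missing dates), which is why a production approach would need fuzzier matching than this exact-key grouping.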

“Annotations held on specimens represent both the most costly part of the specimen digitisation process – georeferencing – and the most valuable kinds of usage – identifications, and the citation of type specimens formally attached to scientific names,” said Nicolson. “By identifying the collecting events from specimens pooled in GBIF, I hope to combine their histories and form a richer, shared basis of evidence.”

“The vast amount of digitised information held at Kew and mobilised through GBIF offers a rich resource to mine knowledge about botanical specimens, and how they are collected internationally,” said Dr Allan Tucker, senior lecturer in computer science at Brunel and head of the Intelligent Data Analysis Group. “By developing state-of-the-art algorithms, Nicky has enabled new inferences to be made about the botanical scientific process, revealing how science in the field has changed over time and undoubtedly informing future efforts.”

“Nicky’s research to reconcile specimen duplicates holds the promise of enabling shared curation effort between organisations worldwide, making the most efficient use of expert input,” said Dr Alan Paton, Head of Science Collections at the Royal Botanic Gardens, Kew. “Her work also has great potential for mining details of specimen use from literature and other data sources, demonstrating the value of digitised collections as a basic scientific infrastructure for addressing environmental challenges. Applying these new computational techniques to the collections data strengthens the data-level connections between institutions, helping us to scale up mass digitisation.”

Like the botanical technique of propagation, which grows new plants from diverse sources like seeds and cuttings, the computational techniques Nicolson has developed aim to cultivate connections between collections. Sharing metadata elements and annotations from related specimens held in different herbaria could enable the dissemination of better, more consistent data that efficiently enriches records for them all.

Meanwhile, researchers would benefit from improved standardisation, documentation and linkage of digital specimen information, and detection of collection patterns within individual expeditions. Collections could also see positive effects in reduced data management costs and improved data quality and data usage reporting. Finally, revealing the latent relationships between collections with shared specimen material could highlight institutions that stand to benefit from collaborations aimed at enabling community curation.

The award jury, led by GBIF science committee vice chair Anders G. Finstad of the Norwegian University of Science and Technology (NTNU), lauded Nicolson for her “highly original and innovative” approaches and her successful “use of data from GBIF to combine geographically distant collections using only minimal information on the specimen.”

The GBIF Science Committee selected Nicolson and Marcos Daniel Zárate, a PhD candidate from Argentina, from a pool of 11 candidates nominated by heads of delegation from seven GBIF Participant countries, including the United Kingdom, whose delegation nominated Nicolson for the award. Zárate and Nicolson will each receive a €5,000 award and recognition at the 26th GBIF Governing Board in Leiden, the Netherlands, in October 2019.

Reported by:

Press Office, Media Relations
+44 (0)1895 266867