Last modified: 2011-09-18
Abstract
The Entomology Department holds one of the world's largest entomological collections, estimated at 28 million specimens. Because of the size of the collection and our limited resources, we decided when digitisation began to create a species-level index rather than undertake time-consuming specimen-level digitisation. In the following years we have also made a significant, but less coordinated, effort to digitise at specimen level.
The species-level index
This is an index of all the valid taxa, plus invalid taxa where we hold type material. Initially the driving force for digitisation was to enable museum staff to manage the collections, e.g. to record what taxa we hold, the nature of that material and where it is located.
The creation and maintenance of the index is driven by a coordinated top-down approach in which databasing is an integral part of the daily work of collections staff. The index was initially developed in Paradox, which ran successfully between 1995 and 2007. The whole dataset was migrated to the collection management software KE-EMu in 2007.
Currently there are around 700,000 taxa in the index. The true total, however, is likely to be lower owing to data migration issues. The index has comprehensive coverage, with only a few small groups not yet digitised.
Key data fields in the index are taxon name, current name (where the collection taxon is invalid), preservation, type material details and location. Geographic distribution is also sometimes recorded.
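As an illustration only, the key fields listed above could be modelled as a simple record. This is a minimal sketch, not the KE-EMu schema: the class name, field names and example values (including the placeholder names "Aus bus" and "Aus cus") are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeciesIndexRecord:
    """One hypothetical entry in a species-level index (not the KE-EMu schema)."""

    taxon_name: str                      # name under which the material is held
    preservation: str                    # e.g. "pinned", "spirit", "slide"
    location: str                        # where the material is stored
    current_name: Optional[str] = None   # filled in only when the collection taxon is invalid
    type_material: Optional[str] = None  # details of any type material held
    distribution: Optional[str] = None   # geographic distribution, sometimes recorded

    def is_valid_taxon(self) -> bool:
        # Assumption: a record without a current name is a valid taxon.
        return self.current_name is None

    def name_for_lookup(self) -> str:
        # Prefer the current name when the collection taxon is invalid.
        return self.current_name or self.taxon_name


valid = SpeciesIndexRecord("Aus bus", "pinned", "Cabinet 1")
invalid = SpeciesIndexRecord(
    "Aus cus", "pinned", "Cabinet 1",
    current_name="Aus bus", type_material="holotype",
)
```

Keeping the record this small reflects the point made in the text: a limited set of core fields is what makes full coverage of the collection achievable.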
Specimen-level databasing
Specimen-level databasing sprang up from a number of bottom-up initiatives by individuals working on specific projects for a variety of purposes, e.g. imaging type specimens, harvesting label data and molecular studies.
Prior to the adoption of a single centralised data repository (KE-EMu) there was a wide variety of specimen-level databases and data models, often created for a specific need and sometimes poorly designed. All these individual datasets were migrated into KE-EMu, which currently holds around 250,000 specimen-level records.
Because much of the specimen-level digitisation effort is contributed by short-term staff and visiting researchers, it is not always practical to capture data directly into KE-EMu, owing to training and access issues. We have developed data entry templates to ensure that data captured outside KE-EMu can be imported into the repository.
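The idea of a data entry template can be sketched as a check run on a spreadsheet export before import. This is an assumption-laden illustration, not the department's actual templates or the KE-EMu import mechanism: the column names and the `validate_template` function are hypothetical.

```python
import csv
import io

# Hypothetical required columns; the real templates are not described in the text.
REQUIRED_COLUMNS = ["taxon_name", "collector", "locality"]


def validate_template(csv_text):
    """Split template rows into those ready for import and those needing attention."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("template is missing columns: " + ", ".join(missing))

    accepted, rejected = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        empty = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            rejected.append((line_no, empty))  # record which fields were blank
        else:
            accepted.append(row)
    return accepted, rejected


sample = (
    "taxon_name,collector,locality\n"
    "Aus bus,A. Collector,Site 1\n"
    ",B. Collector,Site 2\n"
)
accepted, rejected = validate_template(sample)
```

Rejecting rows with a line number and the names of the blank fields lets a visiting researcher correct the spreadsheet without needing access to, or training in, the repository itself.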
The two databasing approaches complement one another and we plan to continue with this dual-track strategy. The species-level index is crucial for the day-to-day management of the collection. With a limited number of core fields, the index makes it possible to attain high digitisation coverage of the collections, to focus on data quality and to keep the dataset maintainable in the long term. Specimen-level digitisation is an increasingly popular activity as images and label data are captured. There is inevitably less strategic control over this work, as it is more bottom-up and the effort is more dispersed. We are developing protocols to ensure these data meet the necessary standards and can be incorporated into the KE-EMu data repository, so that they can be fed to the wider bioinformatics community.