The first step in many image analysis tasks is to segment the objects of interest from a full image. This is the case for ZooScan images. The ZooScan is a waterproof flatbed scanner dedicated to digitizing zooplankton samples, for organisms of 300 µm and larger. The jar of plankton is poured onto the scanning window, objects are physically separated as well as possible, and the image is acquired. After background subtraction, the full grayscale image is segmented with a simple grey-intensity threshold and each segmented object is measured (area, transparency, etc.). These segments, usually called "vignettes", are then classified taxonomically, often with the help of machine learning applied to the measurements. The measurements also allow the size and volume of each object to be estimated.
Despite the care taken by operators, some of the 1000 to 2000 vignettes typically detected on a single scan frequently contain more than one object, which biases the measurements and the subsequent quantification of plankton concentration and biovolume. To avoid this, operators go back to the initial full frame and digitally separate touching objects by drawing white lines between them. This dataset contains ~14k vignettes with objects separated by white lines and ~5k vignettes of single, correctly detected objects, together with the binary masks of all of them. It can be used to train deep learning segmentation models, such as semantic, instance or panoptic segmenters. All images were acquired with a ZooScan, from samples collected with a WP2 net in various places around the world during the Tara Oceans cruise.
Data preprocessing
The full ZooScan image has its background subtracted by ZooProcess. Contiguous regions are then detected with a connected-component algorithm that treats pixels neighbouring each other along a diagonal as touching too (8-connectivity). The pre-processed (background-subtracted) scan and the mask resulting from manual separation with white lines are both cropped to the detected regions of interest.
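The region detection and cropping steps described above can be illustrated with a minimal sketch. It is not the ZooProcess implementation: the function names `find_regions` and `crop` are hypothetical, images are represented as plain lists of lists, and a breadth-first flood fill stands in for whichever connected-component algorithm is actually used. The only property taken from the description is that diagonal neighbours count as touching.

```python
from collections import deque

# Offsets to all 8 neighbours of a pixel, so diagonal contact counts as touching.
NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1),
                (0, -1),           (0, 1),
                (1, -1),  (1, 0),  (1, 1)]


def find_regions(mask):
    """Return one inclusive bounding box (top, left, bottom, right)
    per 8-connected region of truthy pixels in a 2D list of lists."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in NEIGHBOURS_8:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def crop(image, box):
    """Crop a 2D list of lists to an inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy example: the two upper-left pixels touch only diagonally, so they
# form a single region; the lower-right pixel is a separate region.
toy_mask = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
toy_boxes = find_regions(toy_mask)
toy_vignette = crop(toy_mask, toy_boxes[0])
```

With 4-connectivity the toy mask would yield three regions instead of two, which is why the choice of connectivity matters when reproducing the vignettes. The same bounding box can be applied to both the scan and the mask so that the two crops stay aligned.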
Data splitting
The dataset is split into a training set (~70%), a validation set (15%) and a test set (15%).
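The dataset is distributed already split, so no splitting is needed to use it. For readers who want to apply the same proportions to their own vignettes, a split of this kind could be sketched as follows. The function name `split_dataset`, the seed, the random shuffling strategy and the file names are all assumptions made for illustration; the procedure actually used to split this dataset is not described here.

```python
import random


def split_dataset(names, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle names reproducibly and cut them into train/valid/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        # The test set takes whatever remains, so no item is lost to rounding.
        "test": shuffled[n_train + n_valid:],
    }


# Hypothetical file names, used only to demonstrate the proportions.
example_names = [f"vignette_{i:03d}.png" for i in range(100)]
example_split = split_dataset(example_names)
```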
Classes, labels and annotations
All splits are organised in the same way: an images directory containing grayscale PNG images of objects, and a masks directory containing binary PNG masks of the objects to be detected.
When the binary mask contains only one region, the object is a single plankter.
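Given this layout, image and mask files can be paired with a few lines of standard-library code. The sketch below assumes that a mask carries the same file name as its image, which is a guess about the naming convention and should be checked against the actual files; the function name `list_pairs` is hypothetical.

```python
from pathlib import Path


def list_pairs(split_dir):
    """Return sorted (image_path, mask_path) pairs for one split directory.

    Assumption: each mask has the same file name as its image.
    """
    split_dir = Path(split_dir)
    pairs = []
    for image_path in sorted((split_dir / "images").glob("*.png")):
        mask_path = split_dir / "masks" / image_path.name
        if not mask_path.is_file():
            raise FileNotFoundError(f"no mask found for {image_path.name}")
        pairs.append((image_path, mask_path))
    return pairs
```

Calling `list_pairs` on the directory of one split returns the pairs in a stable order, and raising on a missing mask makes a wrong naming assumption visible immediately rather than silently dropping images.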
Parameters
The dataset does not contain or allow the computation of any standard variable. It is related to the computation of concentrations (http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/) and biovolume (http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/) of plankton.
Data sources
All images are taken with a ZooScan (http://vocab.nerc.ac.uk/collection/L22/current/TOOL1581/).
Data quality
The images cover a wide range of organism sizes. The minimum area of an organism is 358 pixels; the maximum is 11,206,650 pixels. The pixel size is 0.0106 mm. The smallest images can therefore look blurry and pixelated.
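To relate pixel counts to physical size, an area in pixels can be converted with the pixel size given above. The sketch below also computes the diameter of a circle of equal area, which is one common way to summarise object size; it is offered as an illustration and is not a measurement provided with the dataset. The function names are hypothetical.

```python
import math

PIXEL_SIZE_MM = 0.0106  # side length of one pixel, as stated above


def area_mm2(area_px):
    """Convert an area in pixels to square millimetres."""
    return area_px * PIXEL_SIZE_MM ** 2


def equivalent_diameter_mm(area_px):
    """Diameter of the circle that has the same area as the object."""
    return 2.0 * math.sqrt(area_mm2(area_px) / math.pi)


# Smallest object in the dataset (358 pixels).
smallest_area = area_mm2(358)               # about 0.040 mm^2
smallest_esd = equivalent_diameter_mm(358)  # about 0.23 mm
```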
Image resolution
Images range in size from 12 to 6020 px in width and from 6 to 13625 px in height.
Contact information
For more information on this dataset, please contact Jean-Olivier IRISSON (irisson@normalesup.org).