Deep-sea sediments of the global ocean mapped with Random Forest machine learning algorithm

The seafloor lithology of deep-sea sediments of the global ocean was spatially predicted. Five lithology classes were predicted: Calcareous sediment, Clay, Diatom ooze, Lithogenous sediment, and Radiolarian ooze. The dataset contains probability surfaces of the five seafloor lithologies, the probability of the most probable class (maximum probability), and the predicted seafloor lithology. The results are provided as geo-referenced floating-point TIFF files with a spatial resolution of 10 km, using the Wagner IV equal-area projection as the spatial reference.

Seafloor lithologies were mapped by building a predictive spatial model. This entails a two-step approach: first, the relationship between a set of predictor variables and a response variable is modelled from observations (samples). The established model is then used to predict the response variable at unsampled locations for which values of the predictor variables are known.

The response variable is seafloor lithology, a qualitative multinomial variable. Seafloor lithology data were sourced from Dutkiewicz et al. (2015) and pre-processed as follows: only samples deeper than 500 m were used, and duplicates were removed from the original sample dataset. This reduced the number of records from 14,400 to 10,438.

The original classification with 13 classes (Dutkiewicz et al., 2015) was reduced to 5 classes. Clay, Diatom ooze, and Radiolarian ooze were retained. Calcareous ooze and Fine-grained calcareous sediment were grouped together as Calcareous sediment. Gravel and coarser, Sand, and Silt were grouped together as Lithogenous sediment. The rare classes (Ash and volcanic sand/gravel, Sponge spicules, and Shells and coral fragments) and the mixed classes Fine-grained calcareous sediment and siliceous mud were removed.

The choice of predictor variables was initially informed by the current understanding of the controls on the distribution of deep-sea sediments and by the availability of data with full coverage of the deep sea at a reasonable resolution. Predictor variable raster layers from Bio-ORACLE (Assis et al., 2018; Tyberghein et al., 2012) and MARSPEC (Sbrocco and Barber, 2013) were used. Where available, statistics of each variable other than the mean were downloaded: the minimum, the maximum, and the range (maximum – minimum). The raster layers were stacked, limited to water depths greater than 500 m, and projected to the Wagner IV global equal-area projection with a pixel resolution of 10 km by 10 km.

A variable selection wrapper algorithm (Kursa and Rudnicki, 2010) was used to identify important predictor variables. Subsequently, the set of variables was reduced to those that were not strongly correlated with each other (|r| < 0.5). The selected predictor variables, in decreasing order of importance, were sea-surface maximum salinity, bathymetry, sea-floor minimum temperature, sea-surface minimum silicate, sea-surface maximum primary productivity, sea-surface temperature range, distance to shore, and sea-surface salinity range.

A Random Forest (Breiman, 2001) classification model was trained, and model accuracy was assessed with a spatial leave-one-out cross-validation scheme. A balanced version of Random Forest was used to account for imbalances in the input dataset. Initial tuning of the number of trees in the forest and the number of variables considered at each split had a very limited impact on model performance, while the tuning process itself was very time-consuming. The default parameter values were therefore used.
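
The sample pre-processing described above (depth filter, duplicate removal, regrouping of lithology classes) can be illustrated with a minimal Python sketch. It is not the code used to produce the dataset. The record layout (longitude, latitude, depth in metres, lithology name), the function name `preprocess`, and the definition of a duplicate as an identical record are assumptions made for illustration. All steps are applied in a single pass here, whereas the description reports the record count after the depth filter and duplicate removal.

```python
# Mapping from original class names to the five merged classes, as described
# in the abstract. Classes absent from this mapping (rare and mixed classes)
# are dropped.
REGROUP = {
    "Clay": "Clay",
    "Diatom ooze": "Diatom ooze",
    "Radiolarian ooze": "Radiolarian ooze",
    "Calcareous ooze": "Calcareous sediment",
    "Fine-grained calcareous sediment": "Calcareous sediment",
    "Gravel and coarser": "Lithogenous sediment",
    "Sand": "Lithogenous sediment",
    "Silt": "Lithogenous sediment",
}


def preprocess(samples, min_depth_m=500.0):
    """Filter and regroup samples.

    samples: iterable of (lon, lat, depth_m, lithology) tuples, with depth
    given as a positive number of metres (assumed layout).
    """
    seen = set()
    kept = []
    for lon, lat, depth_m, lithology in samples:
        if depth_m <= min_depth_m:
            continue  # keep only samples deeper than 500 m
        merged = REGROUP.get(lithology)
        if merged is None:
            continue  # rare or mixed class, removed
        key = (lon, lat, depth_m, lithology)
        if key in seen:
            continue  # assumption: a duplicate is an identical record
        seen.add(key)
        kept.append((lon, lat, depth_m, merged))
    return kept
```

Keeping the mapping in a single dictionary makes the regrouping easy to audit against the class list in the description.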

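The relationship between the five probability surfaces, the maximum probability, and the predicted seafloor lithology can be sketched per pixel as follows. This is an illustrative sketch only: the class order in `CLASSES`, the function name `most_probable`, and the tie-breaking rule (first class wins) are assumptions, and the files in the dataset define the actual layer order and class coding.

```python
# Assumed class order (alphabetical); check the dataset files for the real one.
CLASSES = [
    "Calcareous sediment",
    "Clay",
    "Diatom ooze",
    "Lithogenous sediment",
    "Radiolarian ooze",
]


def most_probable(probabilities):
    """Return (predicted class, maximum probability) for one pixel.

    probabilities: five class probabilities in CLASSES order.
    """
    if len(probabilities) != len(CLASSES):
        raise ValueError("expected one probability per class")
    # Index of the largest probability; ties resolve to the first class.
    index = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    return CLASSES[index], probabilities[index]


# Example pixel: Clay is the most probable class.
label, p_max = most_probable([0.10, 0.60, 0.10, 0.15, 0.05])
```

A low maximum probability marks pixels where the predicted class is less certain, which is why that surface is useful alongside the class map.
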
Identifier
DOI https://doi.org/10.1594/PANGAEA.911692
Related Identifier https://doi.org/10.1111/geb.12693
Related Identifier https://doi.org/10.1023/A:1010933404324
Related Identifier https://doi.org/10.1130/G36883.1
Related Identifier https://doi.org/10.18637/jss.v036.i11
Related Identifier https://doi.org/10.1890/12-1358.1
Related Identifier https://doi.org/10.1111/j.1466-8238.2011.00656.x
Metadata Access https://ws.pangaea.de/oai/provider?verb=GetRecord&metadataPrefix=datacite4&identifier=oai:pangaea.de:doi:10.1594/PANGAEA.911692
Provenance
Creator Diesing, Markus
Publisher PANGAEA
Publication Year 2020
Rights Creative Commons Attribution 4.0 International; https://creativecommons.org/licenses/by/4.0/
OpenAccess true
Representation
Resource Type Dataset
Format application/zip
Size 38.8 MBytes
Discipline Earth System Research