Harvesting of Metadata

Harvesting is the process of automatically fetching remote metadata. This section describes how B2FIND harvests metadata records from data provider sites. OAI-PMH, the de facto standard for metadata harvesting, is preferred, but B2FIND also supports other APIs, as described in the section Harvesting channels. Once one of these transfer methods has been implemented successfully, B2FIND first harvests a few sample records to analyse their content, as described in the section Initial uptake of a new data provider. As soon as the harvesting and mapping have been consolidated and the data provider gives their consent, the metadata are published in the B2FIND database, and a stable operational ingestion process is established (see section Synchronous and Operational Ingestion).

Harvesting channels

OAI-PMH

OAI-PMH is B2FIND’s preferred metadata harvesting protocol. It can be used to fetch metadata directly from the data providers within research communities. The simplicity of the protocol allows a controlled and easy-to-manage transfer of metadata. Very little information must be provided to enable B2FIND to perform the harvesting process using this protocol:

  • OAI endpoint: This is the URL of the OAI provider server at the data provider's site, which must be open for OAI-PMH read requests.
  • OAI mdprefix: This is the OAI acronym for the metadata schema in which the provided XML records are encoded.
  • OAI sets (optional): It is recommended to group your records in subsets, because this simplifies the controlled harvesting.

Example:

To harvest all Dublin Core records (OAI mdprefix is oai_dc) from the subset ANDS.CENTRE-1 of the DataCite OAI provider (oai.datacite.org/oai), we submit an HTTP request with the verb ListRecords and the following OAI options set: https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=ANDS.CENTRE-1
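The request above can be assembled programmatically. The following is a minimal sketch in Python, not B2FIND's actual harvester: the function names are illustrative assumptions, and the resumptionToken handling reflects the flow control defined by the OAI-PMH specification, which this page does not describe.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(endpoint, metadata_prefix, oai_set=None):
    """Build the initial ListRecords request (function name is illustrative)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if oai_set:
        params["set"] = oai_set
    return endpoint + "?" + urlencode(params)


def resumption_url(endpoint, token):
    """Build a follow-up request; it carries only the verb and the token."""
    params = {"verb": "ListRecords", "resumptionToken": token}
    return endpoint + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken of a ListRecords response, or None if the list is complete."""
    root = ET.fromstring(response_xml)
    element = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()


url = list_records_url("https://oai.datacite.org/oai", "oai_dc", "ANDS.CENTRE-1")
```

A harvester would fetch `url`, store the returned records, and keep requesting `resumption_url(...)` for as long as `next_token(...)` returns a value.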

If necessary, EUDAT will help data providers to enable OAI-PMH harvesting of their metadata. Please also check module 02 of the B2FIND training materials, where you will find a step-by-step guide to setting up and configuring an OAI server. For detailed documentation of the OAI-PMH protocol, see http://www.openarchives.org/OAI/openarchivesprotocol.html.

JSON-API

Some data providers offer their metadata encoded as JSON records, which can be retrieved, queried and browsed via a REST API. Such an API is generally RESTful and returns its results in JSON, following the JSONAPI specification.

Example:
The community GBIF (see gbif.org) provides its metadata via a JSON API at the base URL http://api.gbif.org/v1. The following request retrieves the first 100 JSON records from the repository: http://api.gbif.org/v1/dataset?offset=0&limit=100
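Retrieving more than the first page means repeating the request with an increasing offset. The sketch below illustrates this under stated assumptions: the function names are hypothetical, and the response fields "results" and "endOfRecords" are assumed to be present in each page, so they should be checked against the provider's API documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def page_url(base_url, resource, offset, limit):
    """Build the URL of one page, e.g. .../dataset?offset=0&limit=100."""
    return f"{base_url}/{resource}?" + urlencode({"offset": offset, "limit": limit})


def fetch_page(base_url, resource, offset, limit):
    """Fetch and decode one page of JSON records over HTTP."""
    with urlopen(page_url(base_url, resource, offset, limit)) as response:
        return json.load(response)


def iter_records(get_page, limit=100):
    """Yield all records by calling get_page(offset, limit) until the last page.

    Assumes each page is a dict with a "results" list and an
    "endOfRecords" flag; both field names are assumptions.
    """
    offset = 0
    while True:
        page = get_page(offset, limit)
        results = page.get("results", [])
        yield from results
        if page.get("endOfRecords", True) or not results:
            break
        offset += limit
```

For the example above, the pager could be wired up as `iter_records(lambda offset, limit: fetch_page("http://api.gbif.org/v1", "dataset", offset, limit))`. Passing the page getter in as a function keeps the paging logic testable without network access.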

CSW

Catalog Service for the Web (CSW) is a standard for exposing a catalogue of geospatial records in XML on the Internet (over HTTP). The catalogue is made up of records that describe geospatial data and services. B2FIND uses a CSW 2.0 implementation to harvest XML records from so-called GeoNetwork portals.

Example:

The community Seadatanet (see seadatanet.org) exposes georeferenced metadata via the base GeoNetwork portal with URL endpoint http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET. To retrieve the ISO19139 XML records (namespace specification gmd:MD_Metadata), B2FIND submits a GetRecords request as follows: http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET?SERVICE=CSW&REQUEST=GetRecords&VERSION=2.0.2&typeNames=gmd:MD_Metadata
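The GetRecords request above can also be built in code. This is a minimal sketch with an illustrative function name. The optional startPosition and maxRecords arguments are paging parameters of the CSW 2.0.2 GetRecords operation; they do not appear in the request above, and whether a given portal honours them is an assumption.

```python
from urllib.parse import urlencode


def get_records_url(endpoint, type_names="gmd:MD_Metadata",
                    start_position=None, max_records=None):
    """Build a CSW GetRecords request in key-value-pair (HTTP GET) form."""
    params = {
        "SERVICE": "CSW",
        "REQUEST": "GetRecords",
        "VERSION": "2.0.2",
        "typeNames": type_names,
    }
    # Optional paging parameters (assumed to be supported by the portal).
    if start_position is not None:
        params["startPosition"] = start_position
    if max_records is not None:
        params["maxRecords"] = max_records
    # Keep the colon of the namespace prefix unescaped, as in the example URL.
    return endpoint + "?" + urlencode(params, safe=":")


url = get_records_url("http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET")
```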

Initial uptake of a new data provider

Once one of the harvesting methods has been deployed successfully and is working, B2FIND starts with an initial harvest of a few metadata records. These samples are analysed, their metadata elements are mapped to the correct database indices, and the metadata records are uploaded to a B2FIND test/development server.

When both the harvesting and mapping are at least functional, some of the issues already mentioned in the section on ‘Providing Metadata’ have to be negotiated with the data provider:

  • Scope and extent: Limit the metadata exposed to B2FIND to those which refer to research data. The best practice would be to gather all records designated to be published in B2FIND in dedicated subsets.
  • Grouping and partitioning: Choose the subsets (e.g. OAI sets) which should be harvested (in some cases whole subsets can be assigned to a ‘Discipline’ or can be grouped into ‘sub-communities’ in the B2FIND portal).
  • Selection, assigning and mapping: Check the quality of the mapping of your specific fields to the B2FIND metadata schema (see section ‘Mapping onto EUDAT-B2FIND Schema’).

Synchronous and Operational Ingestion

In the long term, it is important not only to establish a reliable and sustainable harvesting mechanism but also to implement a frequent harvesting schedule. This ensures that the provider (community) database and the EUDAT service (B2FIND) stay sufficiently synchronised. With OAI-PMH, the parameter from can be used to harvest only those records that were created or changed during a given period. If, for instance, an update interval of once a week is agreed on, B2FIND establishes a cron job that is triggered weekly, with the option from set to a date one week earlier.
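The weekly incremental request can be sketched as follows. This is an illustration rather than B2FIND's actual job: the function name and the endpoint https://example.org/oai are hypothetical, and the date is written in the day granularity (YYYY-MM-DD) that the OAI-PMH specification requires all repositories to support.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def incremental_url(endpoint, metadata_prefix, today, interval_days=7, oai_set=None):
    """Build a ListRecords request for records created or changed since the last run."""
    since = today - timedelta(days=interval_days)
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since.isoformat(),  # e.g. 2024-01-08
    }
    if oai_set:
        params["set"] = oai_set
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint; a scheduled job would pass date.today() instead.
url = incremental_url("https://example.org/oai", "oai_dc", date(2024, 1, 15))
```

A cron entry that runs such a script once a week would complete the setup; the exact schedule and script name depend on the agreement with the data provider.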