Harvesting of Metadata


   Harvesting channels
    OAI-PMH
    JSON-API
    CSW2.0
   Initial uptake of a new data provider
   Synchronous and operational Ingestion

Harvesting channels

Harvesting is the process of automatically fetching remote metadata. This paragraph describes how EUDAT-B2FIND harvests metadata records from data provider sites. While OAI-PMH as the de facto standard for metadata harvesting is preferred, B2FIND supports also other APIs as described in the section Harvesting channels . Once one of these transfer methods has been successfully implemented, B2FIND performs first an initial uptake of a few test samples to analyse their content, as described in the section

OAI-PMH

OAI-PMH is the metadata harvesting protocol preferably used by EUDAT-B2FIND to fetch metadata directly from the data providers within research communities. The simplicity of the protocol allows a controlled and easy to manage transfer of metadata and only little information must be provided to enable B2FIND to perform the harvesting process :
  • OAI endpoint : This is the URL of the OAI provider server on data provider site, which must be open for OAI-PMH read requests.
  • OAI mdprefix : This is the OAI acronym for the metadata schema in which the provided XML records are coded in.
  • OAI sets (optional) : It is recommended to group your records in subsets, because this simplifies the controlled harvesting.

Example

To harvest all Dublincore (OAI mdprefix is oai_dc) from the subset ANDS-Centre_1 of the OAI provider of DataCite (oai.datacite.org/oai),we submit a HTTP request with the ‘verb’ ListRecords and the following OAI options set : https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=ANDS.CENTRE-1
  • The community Seadatanet ( see seadatanet.org ) exposes georeferenced metadata via the base geonetwork portal with URL endpoint http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET . To retrieve the ISO19139 XML records (namespace specification gmd:MD_Metadata ) B2FIND submits a GetRecords request as follows : http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET?SERVICE=CSW&REQUEST=GetRecords&VERSION=2.0.2&typeNames=gmd:MD_Metadata If necessary, EUDAT will help the data providers to enable OAI-PMH harvesting of their metadata. Please check also module 02 of the B2FIND Training materials , where you find a step by step guide to setup and configure an OAI server. For a detailed documentation of the OAI-PMH protocoal we refer to http://www.openarchives.org/OAI/openarchivesprotocol.html .

    JSON-API

    Some data providers offer their metadata encoded as JSON records, which can be retrieved, queried and browsed via a REST API. The API is generally RESTFUL and returns results in JSON, as the API follows the JSONAPI specification.

    Example

    The community GBIF ( see gbif.org ) provides there metadata via the JSON-API at the base URL http://api.gbif.org/v1. By the following request the first 100 JSON records are retrieved from the repository. http://api.gbif.org/v1/dataset?offset=0&limit=100

    CSW

    Catalog Service for the Web (CSW) is a standard for exposing a catalogue of geospatial records in XML on the Internet (over HTTP). The catalogue is made up of records that describe geospatial data and services. B2FIND uses a CSW 2.0 implementation to harvest XML records from so called GEO network portals.

    Example

    The community Seadatanet ( see seadatanet.org ) exposes georeferenced metadata via the base geonetwork portal with URL endpoint http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET . To retrieve the ISO19139 XML records (namespace specification gmd:MD_Metadata ) B2FIND submits a GetRecords request as follows : http://sextant.ifremer.fr/geonetwork/srv/fre/csw-SEADATANET?SERVICE=CSW&REQUEST=GetRecords&VERSION=2.0.2&typeNames=gmd:MD_Metadata

    Initial uptake of a new data provider

    Once one of the harvesting methods has been deployed successfully and is working, B2FIND starts with an initial harvesting of a few metadata records. This samples are analysed, metadata elements mapped to correct database indices, and the metadata records are uploaded to a B2FIND test / development server.

    When both - harvesting and mapping - is at least functional working, some of the issues already mentioned in the paragraph ‘ProvidingMetadata’ has to been negotiated with the data provider :

    • Scope and extent : Shrink the metadata exposed to B2FIND to those which refer to ‘research data’. Best practice would be to gather all records, which are foreseen to get published in B2FIND, in dedicated subsets.
    • Grouping and partitioning : Choose the subsets (e.g. OAI sets) which should be harvested ( in some cases whole subsets can be assigned to a ‘Discipline’ or can be grouped to ‘sub-communities’ in the B2FIND portal)
    • Selection, assigning and mapping : Check the quality of the mapping of your specific fields to the B2FIND metadata schema (see paragraph ‘Mapping onto EUDAT-B2FIND Schema’ ).

    Synchronous and Operational Ingestion

    In long term it is not only important to have a reliable and sustainable harvesting mechanism established, but also to implement a frequent harvesting schedule. This will guarantee a sufficient synchronicity between the provider (community) database and EUDAT service (B2FIND). With OAI-PMH the parameter ‘from’ can be used to harvest only records which are newly created or changed during a given period. If for instance an update interval of once a week is agreed, B2FIND establishes a cronjob that is triggered on a weekly basis and with the option ‘from’ set to a date one week earlier.