Providing Metadata for EUDAT-B2FIND

This section describes the actions a data provider site must take in order to publish metadata in the EUDAT-B2FIND catalogue. There are a few mandatory requirements and some 'best practices' or recommendations. The second section of the document concentrates on metadata encodings and schemas relevant for B2FIND. The 'granularity' of data collections is addressed in the section on aggregation levels. In the metadata quality section, we explain how the quality of metadata provided and published in B2FIND can be improved and assured.

Each data provider should complete three main steps:

  1. Analyse and describe the content of the available metadata.
  2. Migrate, if necessary, the existing metadata into a standard format and schema.
  3. Structure, order and group the metadata records into appropriate levels of aggregation.

These topics may need to be discussed iteratively between the domain experts and data managers on the data provider side, and the B2FIND team on the service provider side.

Requirements and Recommendations

There are only a few obligatory preconditions for the metadata that is to be published on B2FIND. Here we describe these requirements, as well as additional best practices that help assure and improve the quality of the metadata.

Requirements

In order to join EUDAT-B2FIND, the data provider must meet a few requirements:

  • The data provider must agree with the licensing principles of EUDAT-B2FIND (see the Terms of Use for EUDAT-B2FIND) and DKRZ (see the Legal Notice of DKRZ).
  • In particular, the provider must consent to the provided metadata being made publicly available and openly accessible under the Creative Commons Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Note: This access licence only applies to the metadata records published and visible in the B2FIND portal, but not necessarily to the underlying data collections referred to and described by the B2FIND datasets.
  • The data provider agrees to the metadata being made available for free in B2FIND and also for it to be harvested by and re-distributed to other metadata aggregators. No confidential metadata should be provided (although the described research data sets themselves may have access limitations). Copyright-protected metadata can only be published if there is a licence agreement between the data provider and EUDAT that meets the B2FIND requirements.
  • Metadata should be stable and 'good enough'. This is described in more detail in the Metadata Quality section and will be determined with each data provider individually if and when necessary. Some of the central issues include:
    • Metadata records should be as complete as possible and describe the entire dataset, not only a component or part of it.
    • Metadata shall not be encrypted or obfuscated.
    • In addition to the metadata, the data provider must provide the documentation needed for successful loading of the records, such as descriptions of the structure, syntax and semantics of the metadata records (see the section on metadata encodings and schemas).
    • Two metadata elements are mandatory:
      • The 'Title' of the data set, i.e. a unique and unambiguous name or heading by which the referred resource is known (avoid referencing two different data collections by the same title).
      • At least one identifier, which has two roles: to identify the described resource, and to facilitate a persistent link to the research data set itself, which should be available on the Web. If the identifier is not persistent (actionable/resolvable), an HTTP URI of the described resource must be provided as well. The URI should be as persistent as reasonably achievable.
    • Metadata must use Unicode with UTF-8 encoding. If the metadata is in a non-Latin script such as Chinese, a version transliterated or transcribed to Latin characters should be provided as well.
  • An interface to retrieve metadata must be available, accessible and usable (see Harvesting Metadata for more information about such interfaces).
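
The technical requirements above (a title, at least one identifier, UTF-8 encoding) can be checked mechanically before a record is offered for harvesting. The following is a minimal sketch, not the B2FIND validation code: the function name `check_mandatory` and the dictionary keys `title` and `identifiers` are assumptions made for illustration.

```python
from typing import List, Optional


def check_mandatory(record: dict, raw: Optional[bytes] = None) -> List[str]:
    """Return a list of problems; an empty list means the basic checks pass.

    `record` is a hypothetical in-memory form of one metadata record,
    `raw` is the optional serialized record as bytes.
    """
    problems = []

    # A non-empty title is mandatory.
    title = record.get("title") or ""
    if not title.strip():
        problems.append("title is missing")

    # At least one non-empty identifier is mandatory.
    identifiers = [
        i for i in record.get("identifiers") or []
        if isinstance(i, str) and i.strip()
    ]
    if not identifiers:
        problems.append("no identifier given")

    # The serialized record must be valid UTF-8.
    if raw is not None:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            problems.append("record is not valid UTF-8")

    return problems
```

A data provider could run such a check over an export and review every record for which the returned list is not empty.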

Recommendations

In addition to these mandatory requirements, we highly recommend following these 'best practices' as well:

  • More detailed information about granularity should be provided when necessary, as elaborated in the section on aggregation levels below.
  • Metadata should describe research data sets or related resources such as code books. Metadata describing e.g. publications should not be made available for harvesting, since the users do not expect to find such resources from B2FIND.
  • Metadata should be as rich as possible, so as to allow creation of sufficiently rich citations. For instance the 'Creator' and the 'Publisher' of the described data resource should be included.
  • All links should be based on persistent identifiers (PIDs). URLs which act as deep links should be avoided, since they are not likely to be persistent. DOI (Digital Object Identifier) is the preferred PID system, but other PIDs (Handle, URN, ARK) may also be used.
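
To see which links in a metadata export are already based on one of the PID systems named above, identifiers can be classified by their textual form. This is a rough sketch: the regular expressions are simplified heuristics chosen for illustration, not the official syntax definitions of DOI, Handle, URN or ARK, and the name `pid_scheme` is hypothetical.

```python
import re

# Simplified, illustrative patterns (assumptions, not normative syntax rules).
_PID_PATTERNS = [
    ("DOI", re.compile(r"^(?:doi:|https?://(?:dx\.)?doi\.org/)?10\.\d{4,9}/\S+$", re.I)),
    ("Handle", re.compile(r"^(?:hdl:|https?://hdl\.handle\.net/)\S+/\S+$", re.I)),
    ("URN", re.compile(r"^urn:[a-z0-9][a-z0-9-]{0,31}:\S+$", re.I)),
    ("ARK", re.compile(r"^(?:https?://\S+/)?ark:/?\d{5,}/\S+$", re.I)),
]


def pid_scheme(identifier: str):
    """Return the name of the PID system the identifier looks like, or None."""
    value = identifier.strip()
    for name, pattern in _PID_PATTERNS:
        if pattern.match(value):
            return name
    return None
```

Identifiers for which `pid_scheme` returns `None`, such as plain deep-link URLs, are candidates for replacement by a PID.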

Metadata Encodings and Schemas

As stated already in the requirements above, the provided metadata should be available as formatted records that follow a defined schema. In the following, we list and describe only the metadata formats and schemas that B2FIND already supports.

If you use standards that are not listed below, please contact the B2FIND maintenance team. We are open and keen to extend B2FIND to further formats and schemas.

For a detailed and comprehensive overview of metadata specifications and standards, see:

Metadata Format List

Metadata formats provide rules for the coding (i.e., the machine-readable representation) and the syntax (i.e., the marking and arrangement) of metadata. They are required for browsers, search engines, and other programs to process the metadata reliably.

| Name | Specification | Description |
|---|---|---|
| XML | https://www.w3.org/XML | Extensible Markup Language, used for message encoding in many protocols including OAI-PMH |
| MarcXML | http://www.loc.gov/standards/marcxml | XML-encoded version of libraries' MARC 21 metadata format (MARC 21 provides the metadata 'schema') |
| JSON | http://www.json.org/index.html | JavaScript Object Notation: collection of key/value pairs |

Please contact B2FIND support if none of these formats is suitable for encoding your metadata.

Metadata Schema List

A metadata schema is a metadata set with a prescribed syntax and encoding and with fixed, semantically unambiguous element designations. It provides rules for which element is allowed in which context, and lays down agreements about allowed data types (e.g., text, numeric, date, etc.) and value ranges (numeric ranges, spatial or temporal coverage, controlled vocabularies, etc.).

EUDAT supports multiple discipline-specific metadata schemas. Some of them are listed in the following table.

| Name | Specification | Description |
|---|---|---|
| DataCite | https://schema.datacite.org/meta/kernel-4.0/ | The DataCite Metadata Schema is a list of core metadata properties chosen for an accurate and consistent identification of a resource for citation and retrieval purposes, along with recommended use instructions. Widely used schema, defined by DataCite to provide Digital Object Identifiers (DOIs) to help research communities to locate, identify, and cite research data with confidence. |
| Dublin Core | http://dublincore.org | Simple, easy-to-understand, and very widespread metadata standard (mandatory for OAI-PMH). The most used schema for OAI-PMH (as OAI metadata prefix 'oai_dc'). |
| ISO 19115 | https://www.iso.org/standard/53798.html | ISO 19115-1:2014 defines the schema required for describing geographic information and services by means of metadata. It provides information about the identification, the extent, the quality, the spatial and temporal aspects, the content, the spatial reference, the portrayal, distribution, and other properties of digital geographic data and services. |
| CMDI | https://www.clarin.eu/content/component-metadata | The Component MetaData Infrastructure (CMDI) provides a framework to describe and reuse metadata blueprints. Description building blocks ("components", which include field definitions) can be grouped into a ready-made description format (a "profile"). |
| DDI | http://www.ddialliance.org/ | The Data Documentation Initiative (DDI) is an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences. |

Example

Here we give an example of a well-defined and validated metadata record, encoded as a Dublin Core XML file with semantics based on the DataCite schema. The values are listed in the order in which they appear in the record.

Record header:
  identifier: oai:oai.datacite.org:870045
  datestamp: 2016-07-08T02:15:17Z
  setSpec: ANDS
  setSpec: ANDS.CENTRE-1

Metadata values:
  Clinical and Community Practice Innovation Data Collection
  Kendall, Elizabeth
  Sunderland, Naomi
  Research Centre for Clinical and Community Practice Innovation, Griffith University
  Griffith University
  2011
  doi:10.4225/01/4F8E14EA3179E
  Other
  The Research Centre for Clinical and Community Practice Innovation consists of researchers from nursing, social work, human services, rehabilitation, psychology, physiotherapy, pharmacy and public health. This PUBLIC collection contains information that will help transform health and community services via the development of collaborative, innovative an…
  Collection
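
To make the encoding more concrete, the following sketch builds a small Dublin Core record in the `oai_dc` container and reads the title back, using only the Python standard library. It uses the title, the two personal names and the DOI from the example above; assigning these values to the elements `dc:title`, `dc:creator` and `dc:identifier` is an assumption for illustration, and the helper name `build_oai_dc` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces of the oai_dc container and of the Dublin Core elements.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Use the conventional prefixes when serializing.
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def build_oai_dc(fields):
    """Serialize (element name, value) pairs as one oai_dc record."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_oai_dc([
    ("title", "Clinical and Community Practice Innovation Data Collection"),
    ("creator", "Kendall, Elizabeth"),
    ("creator", "Sunderland, Naomi"),
    ("identifier", "doi:10.4225/01/4F8E14EA3179E"),
])

# Parse the record again and extract the mandatory title element.
parsed = ET.fromstring(xml_text)
titles = [element.text for element in parsed.findall(f"{{{DC}}}title")]
```

Repeatable elements such as `dc:creator` are simply written several times, which is why the fields are passed as a list of pairs rather than as a dictionary.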

Aggregation levels

Data providers should analyse the structure and the granularity of their data sets and the files the research data consist of. Based on the analysis, the data provider can decide how to structure the research data set in order to enable reuse and proper citation and annotation. Sometimes the data will be composed of various files or items that belong to a single collection, but the data may also need to be segregated across a number of smaller subsets and datasets. B2FIND is a discovery portal for cited, inter-disciplinary data, and therefore metadata on the finest level of granularity (level 4 below) is out of scope. Metadata should preferably be on the level of citable entities (level 2) or at least on the level of resolvable entities (level 3). B2FIND has defined four levels of aggregation. The table below illustrates them with the help of existing use cases:

| Level | Description | Specification w.r.t. B2FIND | CMDI Model | DKRZ LTA |
|---|---|---|---|---|
| 1 | Community/Project level | B2FIND Community | Complete Corpora/Collection | CERA Project |
| 2 | Experiment/Study level | Referenced by B2FIND record as citable entity (DOI recommended) | Subcorpora, e.g. a speech corpus (represented as a 'Virtual collection') | CERA Experiment |
| 3 | Subgroup or subset level | Referenced by B2FIND record as resolvable entity (PID, or better DOI recommended) | Subcorpora or corpus components, represented as a 'Virtual collection' | CERA Dataset group |
| 4 | Digital entity (dataset/file) | Often no related B2FIND dataset available, but access redirected to landing page (level 2 or 3) | (Recording) sessions (e.g. recording of a dialogue) or individual resources (e.g. text file) | CERA Dataset (a file or a fragment within a file) |
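
The four levels and the scope rule stated above (level 2 preferred, level 3 acceptable, level 4 out of scope) can be written down compactly. The sketch below is only an illustration of that rule; the names `AggregationLevel` and `b2find_scope` are hypothetical and not part of any B2FIND software.

```python
from enum import IntEnum


class AggregationLevel(IntEnum):
    COMMUNITY = 1        # community/project level
    STUDY = 2            # experiment/study level, citable entity
    SUBSET = 3           # subgroup or subset level, resolvable entity
    DIGITAL_ENTITY = 4   # single dataset or file


def b2find_scope(level: AggregationLevel) -> str:
    """Say how a metadata record of the given level fits into B2FIND."""
    if level == AggregationLevel.STUDY:
        return "preferred"
    if level == AggregationLevel.SUBSET:
        return "acceptable"
    if level == AggregationLevel.DIGITAL_ENTITY:
        return "out of scope"
    # Level 1 corresponds to a whole B2FIND community, not to a single record.
    return "community"
```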

The use cases

CLARIN - CMDI structure (Component Metadata Infrastructure)

CMDI was developed by members of CLARIN (Common Language Resources and Technology Infrastructure) to establish metadata exchange within the CLARIN infrastructure. It defines different levels of description (granularity): from complete corpora, subcorpora or corpus components, to individual resources, e.g. a recording of a dialogue (sound file + transcript).

For illustration, we show the CMDI hierarchy as used in The Language Archive (see figure 1) and as applied in the CMDI profiles of the JASMIN project (see figure 2).

Fig. 1 Hierarchy in the browser of The Language Archive (see https://corpus1.mpi.nl/ds/asv/?4#)
Fig. 2 Hierarchy and granularity in JASMIN's CMDI profiles (see https://www.researchgate.net/figure/228732331_fig2_Figure-2-Hierarchy-and-granularity-in-JASMIN-profiles)

DKRZ - LTA (Long Term Archive)

Data in the long-term archive of the DKRZ (DKRZ-LTA) is organized in multiple layers with a hierarchical structure. These layers are defined in the associated CERA2 model as Projects, Experiments, Dataset groups and Datasets. The relationships between these different layers are shown as a hierarchical tree structure in Figure 3.

Fig. 3 CERA2 Hierarchical Structure

Among other climate data, the DKRZ-LTA archive stores the experiments of the model intercomparison project CMIP5 [1]. This data and its associated metadata are kept in, and accessible via, the CERA database. B2FIND harvests metadata records corresponding to CMIP5 data collections from the CERA database, but only records to which a DOI is assigned.

This means that the CMIP5 experiments correspond to entities of aggregation level 2 in B2FIND, because they are citable aggregations of datasets and each metadata record refers, via a DOI, directly to the landing page of the associated experiment in the DKRZ-LTA database CERA. There, a user can browse through the datasets of the experiment and download those they are interested in.

These examples illustrate the importance of supplying sufficient information about the structure and granularity of the research data to B2FIND. Please let us know about the structure and granularity of your data.

Metadata Quality

Integrity

There are no initial changes to the metadata semantics during the B2FIND ingestion workflow. The harvested metadata may, however, be restructured and reformatted before it is indexed, to allow search and discovery. B2FIND is in the process of improving the quality of metadata via a sophisticated validation and quality check process. We do this in close cooperation with the data providers, taking into consideration their feedback.

Content

Metadata should only describe research data sets or related resources, such as code books, not other resources, e.g. publications.

Metadata may describe research data collections and component parts of research data sets. If so, there should be a two-way link between the collection description and metadata records belonging to the collection, or description of the data set and descriptions of component parts. Links should be based on persistent identifiers. URLs which act as deep links should be avoided, since they are not likely to be persistent.
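
Whether the links between a collection and its parts really go both ways can be verified before the records are exposed for harvesting. The following sketch assumes a hypothetical in-memory structure in which each record, keyed by its identifier, may list its `parts` and the collections it is `part_of`; neither the field names nor the function name come from B2FIND.

```python
def missing_backlinks(records: dict) -> list:
    """Return (record, target) pairs where `record` lacks a link to `target`.

    `records` maps an identifier to a dict with optional lists
    'parts' (identifiers of components) and 'part_of' (identifiers of parents).
    """
    problems = []
    for rec_id, rec in records.items():
        # Every listed part must point back to this collection.
        for part in rec.get("parts", []):
            if rec_id not in records.get(part, {}).get("part_of", []):
                problems.append((part, rec_id))
        # Every listed parent must list this record among its parts.
        for parent in rec.get("part_of", []):
            if rec_id not in records.get(parent, {}).get("parts", []):
                problems.append((parent, rec_id))
    return problems
```

An empty result means that every collection-to-part link has a matching link in the opposite direction.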

Metadata records should be as complete as possible. This means that they should not be “dumbed down” as a result of the harvesting process. For instance, if the data set description is a rich DDI record, it should not be migrated into a simple Dublin Core record.

Each record in B2FIND has a unique and persistent identifier of the described resource assigned by the data provider. If the data provider sends an update to a record that has already been harvested to B2FIND, the updated record should contain the same identifier as the original record, so that the original record can be easily replaced by the new one.
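
The reason for keeping the identifier stable is easiest to see when the catalogue is thought of as a mapping keyed by that identifier. The sketch below is a simplified model, not the B2FIND ingestion code; `ingest` and the `identifier` key are hypothetical names.

```python
def ingest(catalogue: dict, record: dict) -> str:
    """Add a record, or replace the existing one with the same identifier."""
    key = record["identifier"]
    action = "replaced" if key in catalogue else "added"
    catalogue[key] = record
    return action


catalogue = {}
first = ingest(catalogue, {"identifier": "doi:10.4225/01/4F8E14EA3179E", "title": "Old title"})
second = ingest(catalogue, {"identifier": "doi:10.4225/01/4F8E14EA3179E", "title": "New title"})
```

If the update arrived under a different identifier, the model would contain two entries for the same resource instead of one updated entry.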

Validity

Metadata about bibliographic resources (books, articles, maps etc.) or archival materials is not relevant in B2FIND since users do not expect to find such information on the system. Therefore bibliographic metadata should not be made available for harvesting, and EUDAT may remove such metadata from the B2FIND database or from harvested metadata.

Metadata confidentiality

Any confidential metadata elements should be deleted before the metadata is made available for harvesting. Note that the research data set may have access limitations. Metadata must not be encrypted or obfuscated.

Scope

Metadata should describe the entire dataset, not only a component or part of it, unless the description of the entire dataset is also available and the two metadata records are interlinked.

Structure

The metadata elements title and identifier are mandatory. The identifier must not be repeated, unless the metadata encoding indicates which identifier is the primary one.

Persistent links

Harvested metadata records should have PIDs or other links that connect them to the described data resources. B2FIND shall periodically check whether the links still work. Discontinued data providers will be challenged and their data will be removed if they fail to provide valid links within a specified time frame (the first one of these is The European Library, which is no longer maintained and may be closed for good at any time).
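
Data providers can run a similar check on their own links before B2FIND does. The sketch below treats a link as broken if it cannot be reached or answers with an HTTP status of 400 or above; this criterion, the use of HEAD requests and the function names are assumptions for illustration, not a description of how B2FIND performs its checks.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0):
    """Return the HTTP status of a HEAD request, or None if unreachable."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # Network failure, timeout, or a malformed URL.
        return None


def broken_links(urls, fetch=http_status):
    """Return the URLs that are unreachable or answer with status >= 400."""
    broken = []
    for url in urls:
        status = fetch(url)
        if status is None or status >= 400:
            broken.append(url)
    return broken
```

The `fetch` parameter allows the network call to be replaced, for example by a lookup in previously recorded results, so that the reporting logic can be exercised without network access. Some servers do not support HEAD requests, in which case a GET request may be needed instead.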

1. For more details about CMIP5 see http://cmip-pcmdi.llnl.gov/cmip5/design_overview.html