Providing Metadata for EUDAT-B2FIND

This section describes the actions a data provider must take in order to publish metadata in the EUDAT-B2FIND catalogue. There are a few mandatory Requirements and some 'best practices' or Recommendations. The second section of the document concentrates on the Metadata Encodings and Schemas relevant for B2FIND. The 'granularity' of data collections is addressed in the section Aggregation Levels. The section Metadata Quality explains in more detail what can be done to assure and improve the quality of the metadata to be sent to and published in B2FIND.

There are three main steps which each data provider should complete:

  1. Analyse and describe the quality of the available metadata.
  2. Migrate, if necessary, the existing metadata into a standard format and schema.
  3. Structure, order and group the metadata records into appropriate levels of aggregation.

These topics may have to be discussed iteratively between the domain experts and data managers at the data provider side and the B2FIND team at the service provider side.


   Requirements and Recommendations
   Metadata Encodings and Schemas
   Aggregation Levels
   Metadata Quality

Requirements and Recommendations

While there are only a few obligatory preconditions, we recommend additional 'best practices', which help to assure and improve the quality of the metadata.

Requirements

In order to join EUDAT-B2FIND, data providers must meet a few requirements:

  • The data provider has to agree with the licensing principles of EUDAT-B2FIND (see the Terms of Use for EUDAT-B2FIND) and DKRZ (see the Legal Notice of DKRZ).
  • In particular, the data provider must consent to the provided metadata being made publicly available and openly accessible under the licence Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). Note: This licence applies only to the metadata records published and visible in the B2FIND portal, and not necessarily to the underlying data collections referred to and described by the B2FIND datasets.
  • The data provider agrees that the metadata will be made available for free in B2FIND and that it can also be harvested by and re-distributed to other metadata aggregators. No confidential metadata should be made available (although the described research data sets themselves may have access limitations). Copyright-protected metadata can only be sent if there is a licence agreement between the data provider and EUDAT which meets the B2FIND requirements.
  • Metadata should be stable and 'good enough'. This is described in more detail in the section Metadata Quality and will be determined with each data provider individually if and when necessary. Some of the central issues include:
    • Metadata records should be as complete as possible and describe the entire dataset, and not only a component or part of it
    • Metadata shall not be encrypted or obfuscated
    • In addition to the metadata, the data provider must provide the documentation needed for successful loading of the records, such as a description of the structure, syntax and semantics of the metadata records (see the section Metadata Encodings and Schemas).
    • Two metadata elements are mandatory:
      • The ‘Title’ of the data set, i.e. a name or heading by which the referred resource is known and which should be unique and unambiguous (avoid the same title referencing two different data collections)
      • At least one identifier, which has two roles: to identify the described resource, and to facilitate a persistent link to the research data set itself, which should be available on the Web. If the identifier is not persistent (actionable / resolvable), an HTTP URI of the described resource must be provided as well. The URI should be as persistent as reasonably achievable.
    • Metadata must use Unicode with UTF-8 encoding. If the metadata is in non-Latin script such as Chinese, a version transliterated or transcribed to Latin characters should be provided as well.
  • An interface to retrieve metadata must be available, accessible and usable (see Harvesting Metadata for more information about these interfaces).
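The mandatory checks above can be sketched as a small pre-flight test that a data provider might run before offering records for harvesting. This is a minimal sketch under stated assumptions: the record layout (a dict with the keys "title", "identifiers" and "uri"), the function names and the prefix heuristic for recognising persistent identifiers are all hypothetical and are not part of any B2FIND tooling.

```python
import re

# Heuristic prefixes for persistent identifiers (DOI, Handle, URN, ARK).
# The pattern is an assumption of this sketch, not a rule published by B2FIND.
PID_PATTERN = re.compile(
    r"^(?:doi:|hdl:|urn:|ark:|https?://(?:dx\.)?doi\.org/"
    r"|https?://hdl\.handle\.net/|10\.\d{4,9}/)",
    re.IGNORECASE,
)


def looks_persistent(identifier):
    """Guess whether an identifier string belongs to a PID system."""
    return bool(PID_PATTERN.match(identifier))


def is_valid_utf8(raw):
    """Return True if the raw bytes of a record decode as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_requirements(record):
    """Return a list of problems found in a record (empty if none)."""
    problems = []
    title = record.get("title") or ""
    if not title.strip():
        problems.append("missing mandatory element: title")
    identifiers = record.get("identifiers") or []
    if not identifiers:
        problems.append("missing mandatory element: identifier")
    elif not any(looks_persistent(i) for i in identifiers):
        # Without a persistent identifier, an HTTP URI must be supplied too.
        uri = record.get("uri") or ""
        if not uri.lower().startswith(("http://", "https://")):
            problems.append("no persistent identifier and no HTTP URI")
    return problems
```

A record with a title and a DOI passes without findings, while a record whose only identifier is a local key is reported unless an HTTP URI accompanies it.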

Recommendations

In addition to these mandatory requirements, we make the following ‘best practice’ recommendations:

  • More detailed information about granularity should be provided when necessary, as explained in more detail in the section Aggregation Levels below.
  • Metadata should describe research data sets or related resources such as code books. Metadata describing e.g. publications should not be made available for harvesting, since users do not expect to find such resources in B2FIND.
  • Metadata should be as rich as possible, so as to allow the creation of sufficiently rich citations. For instance, the 'Creator' and the 'Publisher' of the described data resource should be included.
  • All links should be based on persistent identifiers (PIDs). URLs which act as deep links should be avoided, since they are not likely to be persistent. DOI (Digital Object Identifier) is the preferred PID system, but other PIDs (Handle, URN, ARK) may also be used.
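To illustrate why 'Creator', 'Publisher' and a PID matter for citations, the following sketch assembles a citation string from metadata fields. It assumes a citation pattern of the form "Creator (Year): Title. Publisher. Identifier"; this pattern, the function name and its parameters are assumptions chosen for illustration and are not prescribed by B2FIND.

```python
def format_citation(creators, year, title, publisher, identifier):
    """Build 'Creator (Year): Title. Publisher. Identifier'.

    Raises ValueError when a field needed for the citation is missing,
    which shows why sparse metadata yields poor citations.
    """
    required = [
        ("creator", creators),
        ("title", title),
        ("publisher", publisher),
        ("identifier", identifier),
    ]
    missing = [name for name, value in required if not value]
    if missing:
        raise ValueError("cannot build citation, missing: " + ", ".join(missing))
    creator_part = "; ".join(creators)
    year_part = f" ({year})" if year else ""
    return f"{creator_part}{year_part}: {title}. {publisher}. {identifier}"
```

Called with two creators, a year, a title, a publisher and a DOI link, the function returns one line that can be pasted into a reference list; called without a publisher it fails early instead of producing an incomplete citation.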

Metadata Encodings and Schemas

As already stated in the requirements above, the provided metadata should be available as formatted records that follow a defined schema. In the following we list and describe only those metadata formats and schemas that are already supported by B2FIND.

If you use standards that are not listed below, please contact the B2FIND maintenance team. We are open and keen to extend B2FIND to further formats and schemas.

For a detailed and comprehensive overview of metadata specifications and standards, see:

Metadata Format List

Metadata formats provide rules for coding (i.e., the machine readable representation), and the syntax (i.e., the marking and arrangement) of metadata and are required for browsers, search engines, and other programs to reliably process the metadata.

Name | Specification | Description
XML | https://www.w3.org/XML | Extensible Markup Language, used for message encoding in many protocols including OAI-PMH
MarcXML | http://www.loc.gov/standards/marcxml | XML-encoded version of the libraries’ MARC 21 metadata format (MARC 21 provides the metadata 'schema')
JSON | http://www.json.org/index.html | JavaScript Object Notation: collection of key/value pairs

Please contact B2FIND support if, for some reason, none of these formats is suitable for encoding your metadata.

Metadata Schema List

A metadata schema is a metadata set with a prescribed syntax and encoding and with fixed, semantically unambiguous element designations. It provides rules on which element is allowed in which context and lays down agreements about allowed data types (e.g., text, numeric, date) and value ranges (numeric ranges, spatial or temporal coverage, controlled vocabularies, etc.).

EUDAT supports multiple discipline-specific metadata schemas; some of them are listed in the following table.

Name | Specification | Description
DataCite | https://schema.datacite.org/meta/kernel-4.0/ | The DataCite Metadata Schema is a list of core metadata properties chosen for an accurate and consistent identification of a resource for citation and retrieval purposes, along with recommended use instructions. It is a widely used schema, defined by DataCite to provide Digital Object Identifiers (DOIs) that help research communities to locate, identify, and cite research data with confidence.
Dublin Core | http://dublincore.org | Simple, easy-to-understand, and very widespread metadata standard (mandatory for OAI-PMH). The most used schema for OAI-PMH (as OAI metadata prefix 'oai_dc').
ISO 19115 | https://www.iso.org/standard/53798.html | ISO 19115-1:2014 defines the schema required for describing geographic information and services by means of metadata. It provides information about the identification, the extent, the quality, the spatial and temporal aspects, the content, the spatial reference, the portrayal, distribution, and other properties of digital geographic data and services.
CMDI | https://www.clarin.eu/content/component-metadata | The Component MetaData Infrastructure (CMDI) provides a framework to describe and reuse metadata blueprints. Description building blocks (“components”, which include field definitions) can be grouped into a ready-made description format (a “profile”).
DDI | http://www.ddialliance.org/ | The Data Documentation Initiative (DDI) is an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences.
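Since Dublin Core is exposed over OAI-PMH under the metadata prefix 'oai_dc', a harvesting request for such records can be sketched as follows. The request parameters `verb`, `metadataPrefix` and `set` are defined by the OAI-PMH protocol; the function name and the base URL used in the usage note are hypothetical placeholders, and the sketch does not describe how B2FIND itself issues requests.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a given endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting of one OAI set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

For an endpoint at the placeholder address https://repository.example.org/oai, the call `list_records_url("https://repository.example.org/oai")` yields a URL that a provider can open in a browser to see what a harvester would receive.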

Example

Here is an example of well-defined and validated metadata, encoded as a Dublin Core XML file with semantics based on the DataCite schema:
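A record of this kind can be sketched in Python with the standard library. The two namespace URIs are the ones used by OAI-PMH for 'oai_dc' and by the Dublin Core element set; everything else (the helper name, the selection of elements, and all values including the DOI and the publisher) is an invented placeholder for illustration only and is not a record taken from B2FIND.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_oai_dc(fields):
    """Serialise (element name, value) pairs as an oai_dc XML record."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values; the elements loosely follow DataCite's core
# properties (title, creator, publisher, publication year, identifier).
record_xml = build_oai_dc([
    ("title", "Example ocean temperature series 1990-2010"),
    ("creator", "Doe, Jane"),
    ("publisher", "Example Data Centre"),
    ("date", "2016"),
    ("identifier", "https://doi.org/10.1234/example-dataset"),
    ("subject", "Oceanography"),
])
print(record_xml)
```

The record carries both mandatory elements (one title and exactly one identifier) plus the creator and publisher recommended for citations.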

Aggregation Levels

A data provider should analyze the structure and granularity of their data sets and of the files the research data consist of. It is also important to investigate the metadata which describes the data collections. Based on this analysis, the data provider can decide how to structure the research data set in order to enable reuse and proper citation and annotation. Sometimes the data will be composed of different files or items that belong to one collection, but the data may also need to be split across a number of smaller subsets and datasets. B2FIND is a discovery portal for cited, inter-disciplinary data, and therefore metadata on the finest level of granularity (level 4 below) is out of scope. Metadata should preferably be on the level of 'citable entities' (level 2) or at least on the level of 'resolvable entities' (level 3). B2FIND has defined four ‘Levels of aggregation’. The table below illustrates them using existing use cases:

# | Level Description | Specification w.r.t. B2FIND | CMDI Model | DKRZ LTA
1 | Community/Project level | B2FIND Community | Complete Corpora/Collection | Project
2 | Experiment/Study level | Referenced by B2FIND record as citable entity (DOI recommended) | Sub Corpora / e.g. Speech corpus. Represented as a 'Virtual collection' | Experiment
3 | Sub group or sub set level | Referenced by B2FIND record as resolvable entity (PID, or better DOI recommended) | Sub corpora or corpus components, represented as a 'Virtual collection' | Datasetgroup
4 | Digital entity (dataset/file) | Often no related B2FIND dataset available, but access redirected to landing page (level 2 or 3) | (Recording) sessions (e.g. recording of a dialogue) or individual resources (e.g. text file) | Dataset
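The four levels and their treatment in B2FIND can be sketched as a small lookup. The class name, the member names and the returned labels are hypothetical and chosen for illustration; only the numbering and the scope rules follow the text and table above.

```python
from enum import IntEnum


class AggregationLevel(IntEnum):
    COMMUNITY = 1       # community / project, shown as a B2FIND community
    STUDY = 2           # experiment / study, citable entity
    SUBSET = 3          # sub group or sub set, resolvable entity
    DIGITAL_ENTITY = 4  # single dataset or file


def b2find_scope(level):
    """Classify how records of a given aggregation level fit into B2FIND."""
    level = AggregationLevel(level)  # raises ValueError for unknown levels
    if level is AggregationLevel.STUDY:
        return "preferred"
    if level is AggregationLevel.SUBSET:
        return "acceptable"
    if level is AggregationLevel.DIGITAL_ENTITY:
        return "out of scope"
    return "community"
```

A provider whose repository exposes one record per file (level 4) would therefore group records upwards to level 3 or 2 before offering them for harvesting.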

The use cases

CLARIN - CMDI structure (Component Metadata Infrastructure)

CMDI was developed by members of CLARIN (Common Language ... Infrastructure) to establish metadata exchange within the CLARIN infrastructure. It defines different levels of description (granularity): from complete corpora, through sub corpora or corpus components, to individual resources, e.g. a recording of a dialogue (sound file + transcript).

For illustration, we show the CMDI hierarchy as used in The Language Archive (see Fig. 1) and the CMDI profiles as applied in the JASMIN project (see Fig. 2).

Fig. 1: Hierarchy in the browser of The Language Archive (see https://corpus1.mpi.nl/ds/asv/?4#)
Fig. 2: Hierarchy and granularity in JASMIN's CMDI profiles (see https://www.researchgate.net/figure/228732331_fig2_Figure-2-Hierarchy-and-granularity-in-JASMIN-profiles)

DKRZ - LTA (Long Term Archive)

Data in the long-term archive of the DKRZ (DKRZ-LTA) is organized in multiple layers with a hierarchical structure. These layers are defined in the associated CERA2 model as Projects, Experiments, Dataset groups and Datasets. The relationships between these layers are shown as a hierarchical tree structure in Fig. 3.

Fig. 3: CERA2 Hierarchical Structure

Among other climate data, the DKRZ-LTA archive stores the experiments of the model intercomparison project CMIP5 [1]. The associated metadata are kept in and accessible via the CERA database. From the CERA database, B2FIND harvests only those metadata records that belong to CMIP5 data collections to which a DOI has been assigned.

That is, the CMIP5 experiments correspond to entities of aggregation level 2 in B2FIND, because they are citable aggregations of datasets, and each metadata record refers via a DOI directly to the landing page of the associated experiment in the DKRZ-LTA database CERA. There, a user can browse through the datasets of the experiment and download the datasets of interest.

These examples illustrate the importance of supplying sufficient information about the structure and granularity of the research data to B2FIND. Please let us know about the structure and granularity of your data.

Metadata Quality

Integrity

There are no initial changes to the metadata semantics during the B2FIND ingestion workflow. The harvested metadata may, however, be restructured and reformatted before it is indexed to allow search and discovery. B2FIND is in the process of improving the quality of metadata via a sophisticated validation and quality check process and through close cooperation with and feedback from the data providers.

Content

Metadata should describe just research data sets or related resources such as code books, not other resources such as publications.

Metadata may describe research data collections and component parts of research data sets. If so, there should be a two-way link between the collection description and metadata records belonging to the collection, or description of the data set and descriptions of component parts. Links should be based on persistent identifiers. URLs which act as deep links should be avoided, since they are not likely to be persistent.

Metadata records should be as complete as possible. That is, they should not be “dumbed down” as a result of the harvesting process. For instance, if the data set description is a rich DDI record, it should not be migrated into a simple Dublin Core record.

Each record in B2FIND has a unique and persistent identifier of the described resource, assigned by the data provider. If the provider sends an update to a record that has already been harvested to B2FIND, the updated record should contain the same identifier as the original record, so that the original record can easily be replaced with the new one.
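The role of a stable identifier in updates can be sketched with a catalogue keyed by identifier. The dict-based catalogue, the record layout and the function name are assumptions of this sketch; it illustrates the principle only and does not describe the actual B2FIND ingestion code.

```python
def upsert_record(catalogue, record):
    """Insert a record, or replace the stored one with the same identifier.

    Returns "created" for a new identifier and "updated" when an existing
    record was replaced.
    """
    identifier = record["identifier"]
    action = "updated" if identifier in catalogue else "created"
    catalogue[identifier] = record
    return action
```

If an updated record arrived with a changed identifier, it would be stored as a second entry next to the outdated one, which is why the identifier should stay the same across updates.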

Validity

Metadata about bibliographic resources (books, articles, maps, etc.) or archival materials is not relevant in B2FIND, since users do not expect to find such information in the system. Therefore bibliographic metadata should not be made available for harvesting, and EUDAT may remove such metadata from the B2FIND database or from harvested metadata.

Metadata confidentiality

Any confidential metadata elements should be deleted before the metadata is made available for harvesting. Note that the research data set itself may have access limitations.

Metadata must not be encrypted or obfuscated.

Scope

Metadata should describe the entire dataset, and not only a component part of it, unless the description of the entire dataset is also available and the two metadata records are interlinked.

Structure

Mandatory metadata elements (title, identifier) must not be missing. The identifier must not be repeated, unless the metadata encoding indicates which identifier is the primary one.
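The rule on repeated identifiers can be sketched as follows. The sketch assumes that identifiers have already been parsed into dicts with a "value" key and an optional boolean "primary" key; this layout and the function name are hypothetical, since how a primary identifier is marked depends on the metadata encoding in use.

```python
def primary_identifier(identifiers):
    """Return the identifier to use for a record, or raise ValueError.

    A single identifier is accepted as is. Repeated identifiers are only
    accepted when exactly one of them is marked as primary.
    """
    if not identifiers:
        raise ValueError("mandatory element 'identifier' is missing")
    if len(identifiers) == 1:
        return identifiers[0]["value"]
    flagged = [item["value"] for item in identifiers if item.get("primary")]
    if len(flagged) != 1:
        raise ValueError("repeated identifiers need exactly one marked as primary")
    return flagged[0]
```

A record carrying both a DOI and a local key is therefore usable only if one of the two, typically the DOI, is flagged as primary.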

Persistent links

Harvested metadata records should have PID-based or other links to the described data resources. … B2FIND intends to check periodically that the links are still operational. Discontinued data providers will be a challenge (the first one of these is The European Library, which is no longer maintained and may be closed for good at any time).
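A periodic link check of this kind could look roughly like the sketch below, which sends an HTTP HEAD request and treats any success or redirect status as operational. It is a minimal sketch under stated assumptions: the function names, the timeout and the `opener` parameter (added so the check can be exercised without network access) are hypothetical, it is not the procedure B2FIND uses, and some servers reject HEAD requests, so a real check may need to fall back to GET.

```python
import urllib.request


def link_is_operational(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the URL answers a HEAD request with a 2xx/3xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError (4xx/5xx) and timeouts;
        # ValueError is raised for strings that are not valid URLs.
        return False


def broken_links(urls, check=link_is_operational):
    """Return the subset of URLs that failed the check, in input order."""
    return [url for url in urls if not check(url)]
```

Run on a schedule over the links of all harvested records, `broken_links` would produce the list of records to report back to the data provider.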

1. [CMIP5 is the Climate ... project, see ... for more information]