Providing Metadata for EUDAT-B2FIND
This section describes the actions a data provider must take in order to publish metadata in the EUDAT-B2FIND catalogue. There are a few mandatory Requirements and some good practices or Recommendations. The second section of the document concentrates on Metadata Encodings and Schemas relevant for B2FIND. The granularity of data collections is addressed in the section Aggregation Levels. In the Metadata Quality section, we explain how the quality of metadata provided and published in B2FIND can be improved and assured.
Each data provider should complete three main steps:
- Analyse and describe the content of the available metadata.
- Migrate, if necessary, the existing metadata into a standard format and schema.
- Structure, order and group the metadata records into appropriate levels of aggregation.
These topics may need to be discussed iteratively between the domain experts and data managers on the data provider side, and the B2FIND team on the service provider side.
There are only a few obligatory preconditions for the metadata to be published on B2FIND. Here we describe these requirements, as well as additional good practices that help assure and improve the quality of the metadata.
In order to join EUDAT-B2FIND, the data provider must meet a few requirements:
- The provider must consent to the provided metadata being made publicly available and openly accessible under CC-BY International v.4.0 or a subsequent version, without any restrictions on reuse in original and derivative forms. Note: This open access licence only applies to the metadata records published and visible in the B2FIND portal, not to the underlying data collections referred to and described by the B2FIND datasets.
- The data provider agrees to the metadata being made available for free in B2FIND and also for it to be harvested by and re-distributed to other metadata aggregators. No confidential metadata should be provided (although the described research data sets themselves may have access limitations). Copyright-protected metadata can only be published if there is a licence agreement between the data provider and EUDAT that meets the B2FIND requirements.
- An interface to retrieve metadata must be available, accessible and usable (see Harvesting Metadata for more information about such interfaces).
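As an illustration of what a harvestable interface can look like, the sketch below builds a request URL for the OAI-PMH `ListRecords` verb. That the interface is OAI-PMH is an assumption here, suggested only by the OAI namespaces in the example record further down; the endpoint URL and the function name are hypothetical, and no network request is made.

```python
from typing import Optional
from urllib.parse import urlencode


def build_list_records_url(base_url: str,
                           metadata_prefix: str = "oai_dc",
                           set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by OAI-PMH set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only for illustration.
print(build_list_records_url("https://repository.example.org/oai"))
```

A provider could paste such a URL into a browser to confirm that the endpoint responds with metadata records before contacting the B2FIND team.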
The metadata provided should be stable and 'good enough'. This is described in more detail in the Metadata Quality section and will be determined with each data provider individually if and when necessary. Some of the central issues include:
- Metadata records should be as complete as possible.
- Metadata shall not be encrypted or obfuscated.
- Metadata must use Unicode with UTF-8 encoding. If the metadata is in a non-Latin script such as Chinese, a version transliterated or transcribed to Latin characters should be provided as well.
- In addition to the metadata, the data provider must provide documentation needed for successful loading of the records, such as descriptions of the structure, syntax and semantics of the metadata records (see section Metadata Encodings and Schemas).
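The encoding requirement above can be checked before records are offered for harvesting. The following is a minimal sketch, not part of any B2FIND tooling: it tests whether raw bytes decode as UTF-8 and flags text containing letters outside the Latin script, which would call for an additional transliterated version. Both function names are hypothetical.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def contains_non_latin_letters(text: str) -> bool:
    """Return True if any letter in the text is not from the Latin script.

    The script is inferred from the Unicode character name, which is a
    heuristic rather than a full script lookup.
    """
    for ch in text:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False


print(is_valid_utf8("Données climatiques".encode("utf-8")))
print(contains_non_latin_letters("气候数据"))
```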
Six metadata elements are mandatory:
- the name of the research <Community> or data provider B2FIND harvests from. This ensures the visibility of the data providers and research infrastructures.
- the <Title> of the data set, i.e. a unique and unambiguous name or heading by which the referred resource is known (avoid referencing two different data collections by the same title).
- at least one <Identifier>, which has two roles: to identify the described resource, and to facilitate a persistent link to the research data set itself, which should be available on the web. If the identifier is not persistent (actionable/resolvable), an HTTP URI of the described resource must be provided as well. The URI should be as persistent as reasonably achievable.
- the research <Discipline(s)> the metadata adhere to (chosen from b2find_disciplines.json). This list is under constant revision, so missing disciplines can be added.
- <Publisher> and <PublicationYear>
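A data provider can test records against this list before harvesting. The sketch below assumes a record is held as a plain Python dictionary keyed by element name; that representation, the function names and the sample values are illustrative assumptions, not a B2FIND format.

```python
MANDATORY_ELEMENTS = (
    "Community", "Title", "Identifier",
    "Discipline", "Publisher", "PublicationYear",
)


def _is_empty(value) -> bool:
    """Treat None, blank strings and empty lists or tuples as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple)):
        return len(value) == 0
    return False


def missing_mandatory(record: dict) -> list:
    """Return the names of mandatory elements that are absent or empty."""
    return [name for name in MANDATORY_ELEMENTS if _is_empty(record.get(name))]


# Hypothetical record with a blank title and no publisher.
record = {
    "Community": "Example Community",
    "Title": "   ",
    "Identifier": ["doi:10.1234/example"],
    "Discipline": ["Linguistics"],
    "PublicationYear": "2015",
}
print(missing_mandatory(record))
```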
In addition to these mandatory requirements, we highly recommend following these 'good practices' as well:
- More detailed information about granularity should be provided when necessary, as elaborated in the section on Aggregation Levels below.
- The provided metadata should describe research data or related resources such as code books only. Metadata describing e.g. publications should not be made available for harvesting, since users do not expect to find such resources in B2FIND.
- Metadata should be as rich as possible, so as to allow the creation of sufficiently rich citations. For instance, the <Creator> and the <Description> of the data resource should be included.
- All links should be based on persistent identifiers (PIDs). URLs which act as deep links should be avoided, since they are not likely to be persistent. DOI (Digital Object Identifier) is the preferred PID system, but other PIDs (Handle, URN, ARK) may also be used.
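To see which kinds of identifiers a metadata set actually contains, the identifiers can be sorted roughly by the PID systems named above. The sketch below relies only on common prefixes and resolver hosts; it is a heuristic, it does not check whether an identifier resolves, and the function name and sample identifiers are hypothetical.

```python
def classify_identifier(identifier: str) -> str:
    """Roughly classify an identifier string by its prefix (heuristic only)."""
    value = identifier.strip().lower()
    doi_prefixes = ("doi:", "https://doi.org/", "http://doi.org/",
                    "https://dx.doi.org/", "http://dx.doi.org/", "10.")
    handle_prefixes = ("hdl:", "https://hdl.handle.net/", "http://hdl.handle.net/")
    if value.startswith(doi_prefixes):
        return "DOI"
    if value.startswith(handle_prefixes):
        return "Handle"
    if value.startswith("urn:"):
        return "URN"
    if value.startswith("ark:") or "/ark:" in value:
        return "ARK"
    if value.startswith(("http://", "https://")):
        # A plain URL, possibly a deep link that is unlikely to be persistent.
        return "URL"
    return "unknown"


for sample in ("doi:10.1234/abcd", "hdl:11304/xyz", "https://example.org/data/files/42"):
    print(sample, "->", classify_identifier(sample))
```

Records classified as "URL" or "unknown" are the ones worth reviewing against the recommendation on persistent identifiers.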
As stated in the requirements above, the provided metadata should be available as formatted records which follow a defined schema. In the following, we list and describe only the metadata formats and schemas that are already supported by B2FIND.
If you use standards which are not listed below, please contact the B2FIND team. We are open and keen to extend B2FIND to further formats and schemas.
For a detailed and comprehensive overview of metadata specifications and standards, see:
Metadata formats provide rules for coding (i.e., the machine readable representation), and the syntax (i.e., the marking and arrangement) of metadata and are required for browsers, search engines, and other programs to reliably process the metadata.
Please contact B2FIND support if none of these formats is suitable for encoding your metadata.
A metadata schema is a metadata set with a prescribed syntax and encoding with fixed and semantically unambiguous element designations. It provides rules that determine which element is allowed in which context and lays down agreements about allowed data types (e.g., text, numeric, date, etc.) and value ranges (numeric ranges, spatial or temporal coverage, controlled vocabularies, etc.).
EUDAT supports multiple discipline-specific metadata schemas. Some of them are listed in the following table.
Example
Here we give an example of a well-defined and validated metadata record, encoded as a Dublin Core XML file:
<record xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Clinical and Community Practice Innovation</dc:title>
<dc:creator>Research Centre for Clinical and Community Practice Innovation, Griffith University</dc:creator>
<dc:description>The Research Centre for Clinical and Community Practice Innovation consists of researchers from nursing, social work, human services, rehabilitation, psychology, physiotherapy, pharmacy and public health. This PUBLIC collection contains information that will help transform health and community services via the development of collaborative, innovative an…</dc:description>
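A record of this kind can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` to collect all Dublin Core elements from a record; the sample is a shortened, closed version of the record above that keeps only the title and creator, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

DC_URI = "http://purl.org/dc/elements/1.1/"

# Shortened version of the example record (title and creator only).
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
    <dc:title>Clinical and Community Practice Innovation</dc:title>
    <dc:creator>Research Centre for Clinical and Community Practice Innovation, Griffith University</dc:creator>
  </oai_dc:dc>
</record>"""


def extract_dc_fields(xml_text: str) -> dict:
    """Map each Dublin Core element name to the list of its text values."""
    root = ET.fromstring(xml_text)
    fields = {}
    prefix = "{" + DC_URI + "}"
    for element in root.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


print(extract_dc_fields(SAMPLE))
```

Values are kept as lists because Dublin Core elements are repeatable.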
Data providers should analyse the structure and the granularity of their datasets and the files the research data consist of. Based on the analysis, the data provider can decide how to structure the research data set in order to enable reuse and proper citation and annotation. Sometimes the data will be composed of various files or items that belong to a single collection, but the data may also need to be segregated across a number of smaller subsets and datasets.
B2FIND has defined four levels of aggregation. The table below illustrates them with the help of existing use cases. Because B2FIND is a discovery portal for cited, interdisciplinary data, metadata on the finest level of granularity (level 4 below) is out of scope.
Provided metadata should preferably be on the level of citable entities (level 2) or at least on the level of resolvable entities (level 3).
|#||Level Description||Specification w.r.t. B2FIND||CMDI Model||DKRZ LTA|
The Use Cases
CLARIN - CMDI structure (Component Metadata Infrastructure)
CMDI was developed by members of CLARIN (Common Language Resources and Technology Infrastructure) to establish metadata exchange within the CLARIN infrastructure. It defines different levels of description (granularity): from complete corpora, subcorpora or corpus components, to individual resources, e.g. a recording of a dialogue (sound file + transcript).
For illustration we show the CMDI hierarchy schema as used in The Language Archive (see figure 1) and as the CMDI profile is applied in the project JASMIN (see figure 2).
DKRZ - LTA (Long Term Archive)
Data in the long-term archive of the DKRZ (DKRZ-LTA) is organized in multiple layers with a hierarchical structure. These layers are defined in the associated CERA2 model as Projects, Experiments, Dataset groups and Datasets. The relationships between these different layers are shown as a hierarchical tree structure in Figure 3.
Among other climate data, the DKRZ-LTA archive stores the experiments of the model intercomparison project CMIP5. This data and its associated metadata are kept in, and are accessible via, the CERA database. B2FIND harvests metadata records corresponding to CMIP5 data collections from the CERA database, but only records for which a DOI is assigned.
This means that the CMIP5 experiments are related to entities of aggregation level 3 in B2FIND, because they are citable aggregations of datasets and each metadata record refers, via a DOI, exactly to the landing page of the associated experiment in the DKRZ-LTA database CERA. There a user can browse through the datasets of the experiment and download the datasets they are interested in.
These examples illustrate the importance of supplying sufficient information about the structure and granularity of the research data to B2FIND. Please let us know about the structure and granularity of your data.
There are no initial changes to the metadata semantics during the B2FIND ingestion workflow. The harvested metadata may, however, be restructured and reformatted before it is indexed, to allow optimal searchability and findability. B2FIND is in the process of improving the quality of metadata via a sophisticated validation and quality check process. We do this in close cooperation with the data providers, taking their feedback into consideration.
Metadata should only describe research datasets or related resources, such as code books, not other resources, e.g. publications.
Metadata may describe research data collections and component parts of research datasets. If so, there should be a two-way link between the collection description and metadata records belonging to the collection, or description of the data set and descriptions of component parts (e.g. via the <RelatedIdentifier> element).
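The two-way linking rule can be verified mechanically. The sketch below assumes that the <RelatedIdentifier> values of each record have already been extracted into a dictionary mapping a record identifier to the identifiers it points to; that structure, the function name and the sample identifiers are illustrative assumptions.

```python
def find_one_way_links(related: dict) -> list:
    """Return (source, target) pairs where the target does not link back."""
    one_way = []
    for source, targets in related.items():
        for target in targets:
            if source not in related.get(target, ()):
                one_way.append((source, target))
    return sorted(one_way)


# Hypothetical collection with two parts; "part-2" lacks the back link.
related = {
    "collection": ["part-1", "part-2"],
    "part-1": ["collection"],
    "part-2": [],
}
print(find_one_way_links(related))
```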
Links should be based on persistent identifiers. URLs which act as deep links should be avoided, since they are not likely to be persistent.
Metadata records should be as complete as possible. This means that they should not be “dumbed down” as a result of the harvesting process. For instance, if the data set description is a rich DDI record, it should not be migrated into a simple Dublin Core record.
Each record in B2FIND has a unique and persistent identifier of the described resource, assigned by the data provider. If the data provider sends an update to a record that has already been harvested by B2FIND, the updated record should contain the same identifier as the original record, so that the original record can be easily replaced by the new one.
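The reason for keeping the identifier stable can be shown with a small sketch. It models the catalogue as a dictionary keyed by identifier, which is a simplification for illustration and not a description of how B2FIND stores records; the function name and sample values are hypothetical.

```python
def upsert_record(catalogue: dict, record: dict) -> str:
    """Add a record, or replace the existing one with the same identifier."""
    identifier = record["Identifier"]
    action = "replaced" if identifier in catalogue else "added"
    catalogue[identifier] = record
    return action


catalogue = {}
print(upsert_record(catalogue, {"Identifier": "doi:10.1234/example", "Title": "First version"}))
print(upsert_record(catalogue, {"Identifier": "doi:10.1234/example", "Title": "Corrected title"}))
print(len(catalogue))
```

If the update arrived with a different identifier, the catalogue would hold two entries for the same resource instead of one corrected entry.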
Metadata about bibliographic resources (books, articles, maps etc.) or archival materials is not relevant in B2FIND since users do not expect to find such information on the system. Therefore bibliographic metadata should not be made available for harvesting, and EUDAT may remove such metadata from the B2FIND database or from harvested metadata.
Any confidential metadata elements should be deleted before the metadata is made available for harvesting. Note that the research datasets may have access limitations. Metadata must not be encrypted or obfuscated.
Metadata should describe the entire dataset, not only a component or part of it, unless the description of the entire dataset is also available and the two metadata records are interlinked.
The metadata elements <Community>, <Title>, <Identifier>, <Discipline>, <Publisher> and <PublicationYear> are mandatory. The <Identifier> must not be repeated, unless the metadata encoding indicates which identifier is the primary one.
Harvested metadata records should have PIDs or other resolvable links that connect them to the described data resources. During the ingestion process, B2FIND shall check whether the links still work. Discontinued data providers will be contacted, and their data will be removed if they fail to provide valid links within a specified time frame.