Mapping onto EUDAT-B2FIND Metadata Schema
The provided metadata must be mapped to the B2FIND schema in a meaningful way. Currently this is done in close cooperation between the data provider and the B2FIND team. A suitable mapping is worked out for each case through iterative discussion.
- Specification of community metadata
- Homogenisation and semantic mapping
- EUDAT-B2FIND metadata schema
- Concordance with other standards
The implementation of the mapping, as described in the following subsection, is based on a detailed specification and documentation of the community-specific metadata. We have designed a spreadsheet template for gathering the required data. The Excel template can be requested via the support form or by email, or downloaded from Google Drive as Community-B2FIND_template.xlsx. This template is divided into several parts, each in its own tab:
- General Information: In this tab, data providers should provide information about the contact persons and the community.
- Metadata Specification: Please give us more detailed information about the specific metadata formats, schemas and structure used.
- Harvesting: Here the 'harvesting endpoints' (e.g. OAI URLs) should be provided, as well as the protocols and APIs used, and the subsets, if available.
- Mapping: In this table, the mapping of the community properties to the B2FIND schema and coverage information should be laid out. This is iteratively discussed and developed with the data provider during the initial intake process.
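For an OAI-PMH endpoint, the information requested in the Harvesting tab (endpoint URL, metadata format, optional subset) is what a harvester needs to build its requests. The following is a minimal sketch, not the B2FIND harvester itself: the function names and the example.org endpoint are hypothetical, while the query parameters (`verb`, `metadataPrefix`, `set`, `identifier`) are the standard OAI-PMH ones.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", subset=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if subset:
        # OAI-PMH calls a subset a "set"
        params["set"] = subset
    return f"{base_url}?{urlencode(params)}"


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL for a single record."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return f"{base_url}?{urlencode(params)}"


# Example with a made-up endpoint and subset name
print(list_records_url("https://example.org/oai", subset="climate"))
print(get_record_url("https://example.org/oai", "oai:example.org:123"))
```

A GetRecord URL of this kind is also what the MetaDataAccess field of the schema is meant to hold.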
To transform and reformat the harvested raw metadata records to datasets, which can be uploaded to the B2FIND catalogue and indexed and displayed in the B2FIND portal, the following processing steps must be carried out:
- Select entries from the XML records, based on XPATH rules that depend on community-specific metadata formats (see providing metadata).
- Parse through the selected values and assign them to the keys specified in the XPATH rules, i.e. fields of the B2FIND schema.
- Store the resulting key-value pairs in JSON dictionaries.
- Check and validate these JSON records before uploading to the B2FIND repository.
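The first three steps can be illustrated with a short Python sketch. It assumes the harvested record is in the Dublin Core (oai_dc) format and uses the limited XPath support of the standard library. The mapping rules, the `MULTI_VALUED` set and the sample record are invented for illustration and are not the rules used in production; the final validation step is left out here.

```python
import json
import xml.etree.ElementTree as ET

# Namespaces of the OAI Dublin Core format
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical mapping rules: B2FIND field -> XPath into the community record
MAPPING_RULES = {
    "Title": ".//dc:title",
    "Description": ".//dc:description",
    "Tags": ".//dc:subject",
    "Creator": ".//dc:creator",
    "Source": ".//dc:identifier",
}

# Fields stored as lists; all others keep only the first match (assumption)
MULTI_VALUED = {"Tags", "Creator"}


def map_record(xml_text):
    """Select values by XPath and assign them to B2FIND field names."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, xpath in MAPPING_RULES.items():
        values = [
            el.text.strip()
            for el in root.findall(xpath, NS)
            if el.text and el.text.strip()
        ]
        if not values:
            continue  # nothing selected, leave the field out
        record[field] = values if field in MULTI_VALUED else values[0]
    return record


SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sea surface temperature 2010</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>oceanography</dc:subject>
  <dc:subject>temperature</dc:subject>
  <dc:identifier>https://example.org/dataset/42</dc:identifier>
</oai_dc:dc>"""

record = map_record(SAMPLE)
# Store the key-value pairs as a JSON dictionary
print(json.dumps(record, indent=2))
```

Keeping the rules in a plain dictionary means that a new community format only needs a new rule set, not new parsing code.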
The B2FIND Metadata Schema 1.0 is the current version and was released on August 12, 2017. The associated XSD file can be downloaded from b2find_schema_0.1.xsd .
Currently the schema comprises 19 fields or facets as listed in the following table with their description, allowed values and references to the associated properties in the DataCite Metadata Schema 4.1.
|Metadata Type||B2FIND Name||Description||Allowed values||DataCite 4.0 reference||Obligation||Occurrence||Comments and Issues|
|General Information||Title||A name or a title by which a resource is known||Textual||3. Title||Mandatory||1||Coding must be UTF-8 (unicode)|
|Description||Additional information describing the content of the resource. Could be an abstract, a summary or a table of contents.||Textual||17.Description||Recommended||0-1||Coding should be UTF-8 (unicode)|
|Tags||A subject, keyword, classification code, or key phrase describing the content.||List of strings, filter out 'non nouns' by using 'stop words'||6.Subject||Optional||1||Try to use keyword thesauri from communities|
|Identifier||DOI||A persistent, citable identifier (registered at DataCite) that uniquely identifies a resource.||Must be resolvable URL, registered at DataCite as DOI||1.Identifier 1.1. identifierType = DOI||Mandatory (at least one resource identifier is mandatory)||1-3|
|PID||A persistent identifier (implemented as a handle in a Handleserver) that uniquely identifies a resource.||Must be resolvable URL and registered at a handle server||1.Identifier|
|Source||An identifier (URL) that uniquely identifies a resource.||Should be resolvable URL||1.Identifier|
|MetaDataAccess||Link to the original harvested metadata record (GetRecord request)||Should be resolvable URL||N/A||Recommended||0-1|
|Provenance||Creator||The main researchers involved in producing the data, or the authors of the publication, in priority order.||List of names||2. Creator||Recommended||0-n|
|Publisher||The name of the entity that holds, archives, publishes, prints, distributes, releases, issues, or produces the resource. This property will be used to formulate the citation, so consider the prominence of the role.||List of names||4. Publisher||Recommended||0-n|
|PublicationYear||The year when the data was or will be made publicly available.||UTC Year format (YYYY)||5. PublicationYear||Recommended||0-1|
|Rights||Any rights information for this resource.||Textual||16. Rights||Optional||0-1|
|Contact||Any contact information for this resource.||List of Names||[ may be 7. Contributor]||Optional||0-n|
|Representation||Language||The primary language of the resource.||Allowed values are taken from ISO 639‐1 language codes.||9. Language||Optional||0-1||Examples: English, German, French|
|ResourceType||A description of the resource||Textual||10. ResourceType||Recommended||0-1|
|Format||Technical format of the resource||Textual||14. Format||Optional||0-1|
|Checksum||Checksum of the underlying data resource||MD5 checksum||N/A||Optional||0-1|
|Coverage||Discipline||The scientific disciplines linked with the resource.||Controlled vocabulary, see b2find_disciplines.json||N/A [ sometimes information in 6. Subject ]||Recommended||0-n|
|Spatial Coverage||A geolocation where the research data was gathered and/or on which the data is focused. The content of this category is displayed in plain text. If longitude/latitude information is given, it is displayed on the map.||Textual geospatial description (spatial region or named place (geonames)); if longitude/latitude information is given, it is displayed on the map.||18. Geolocation||Optional||0-1|
|Temporal Coverage||Period of time the research data itself is related to. Could be a date format or plain text.||Date-time representation||8. Date / [8.1 dateType = Collected?]||Optional||0-1||Not really provided by DataCite in the sense of coverage|
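The obligations and allowed values in the table above translate fairly directly into checks on a mapped record. The sketch below is one possible reading of the table, not the validation code used by B2FIND: the function names are hypothetical, a missing mandatory field or a badly formed value is reported as an error, and a missing recommended field only as a warning. The URL check is purely syntactic and does not test whether the URL actually resolves.

```python
import re
from urllib.parse import urlparse

# Field groups taken from the schema table
IDENTIFIERS = ("DOI", "PID", "Source")  # at least one is mandatory
URL_FIELDS = IDENTIFIERS + ("MetaDataAccess",)
RECOMMENDED = (
    "Description", "MetaDataAccess", "Creator", "Publisher",
    "PublicationYear", "ResourceType", "Discipline",
)


def looks_like_url(value):
    """Syntactic check only; resolvability is not tested."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate(record):
    """Return (errors, warnings) for a dictionary of B2FIND fields."""
    errors, warnings = [], []
    if not record.get("Title"):
        errors.append("Title is mandatory")
    if not any(record.get(name) for name in IDENTIFIERS):
        errors.append("at least one of DOI, PID or Source is mandatory")
    for name in URL_FIELDS:
        value = record.get(name)
        if value and not looks_like_url(value):
            errors.append(f"{name} is not an http(s) URL")
    year = record.get("PublicationYear")
    if year and not re.fullmatch(r"\d{4}", str(year)):
        errors.append("PublicationYear must have the form YYYY")
    checksum = record.get("Checksum")
    if checksum and not re.fullmatch(r"[0-9a-fA-F]{32}", checksum):
        errors.append("Checksum is not a 32-digit hexadecimal MD5 value")
    for name in RECOMMENDED:
        if not record.get(name):
            warnings.append(f"{name} is recommended but missing")
    return errors, warnings


errors, warnings = validate({
    "Title": "Sea surface temperature 2010",
    "Source": "https://example.org/dataset/42",
    "PublicationYear": "2017",
})
print(errors)
print(warnings)
```

Separating errors from warnings mirrors the distinction between mandatory and recommended fields: a record with warnings could still be uploaded, while one with errors would be held back.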
As said before the EUDAT-B2FIND schema is compatible with other widely used standards. In the following table the compatibility with the core schema of EUDAT-B2SHARE and the open access initiative OpenAIRE is shown by referring to the DataCite schema. The obligation is specified for each field, where M stands for mandatory, R for recommended and O for optional.
|DataCite #||DataCite 4.1||B2FIND||B2SHARE||OpenAIRE||Comments and Issues|
|1||Identifier(M) (+ 1.1. identifierType=[DOI])||[Source(URL) | DOI | PID] (M)||PID(M),DOI,URL||Identifier(M) (+ 1.1. identifierType=[DOI , ...])||While for B2SHARE a PID is always provided, B2FIND requires at least a URL linked to the underlying data resource|
|2||Creator(M)||Creator(R)||Creator(R)||Creator(M)|
|6||Subject(R)||Tags and Discipline(R)||Keywords and Discipline(R)||Subject(O)|
|7||Contributor||[ --> Contact]||Contributors||Contributor (MA/O)|
|8||Date||[ --> Temporal Coverage]||The DataCite definition is very vague here (*Different dates relevant to the work*). B2FIND instead has *PublicationYear*, i.e. the year the dataset is published, and *TemporalCoverage*, i.e. the time interval the data covers, with a powerful 'Filter by time' associated.|
|13||Size||N/A||Size per data object (file)||Size(O)|
|15||Version||N/A [ --> checksum]||Version(O)|
|18||GeoLocation(R)||SpatialCoverage(O)||GeoLocation(O)||In B2FIND *SpatialCoverage*, i.e. the geo spatial coverage, is associated with a 'Filter by location' interface.|