X-SRL Dataset and mBERT Word Aligner

Dataset

This code contains a method to automatically align words from parallel sentences using pre-trained multilingual BERT (mBERT) embeddings. The alignments can be used to transfer source-side annotations (for example, labeled English sentences) to the target side (for example, a German translation of the sentence) by copying each label to the best-aligned target word. The newly labeled data can then be used to train multilingual state-of-the-art models, improving performance especially for lower-resource languages.
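
The alignment and label transfer described above can be illustrated with a minimal sketch. It is not the released implementation: it assumes each word already has an embedding vector, uses hand-made 3-dimensional toy vectors in place of real mBERT embeddings, and picks the best-aligned target word by a simple cosine-similarity argmax. The function names (`cosine`, `align`, `project_labels`), the example sentence pair, and the labels are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs):
    """For each source word, return the index of the most similar target word."""
    return [
        max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
        for s in src_vecs
    ]


def project_labels(src_labels, alignment, tgt_len, empty="_"):
    """Copy each non-empty source label onto its best-aligned target word."""
    tgt_labels = [empty] * tgt_len
    for i, label in enumerate(src_labels):
        if label != empty:
            tgt_labels[alignment[i]] = label
    return tgt_labels


# Toy vectors standing in for mBERT embeddings (made up for illustration).
src_words = ["She", "reads", "books"]
src_labels = ["A0", "V", "A1"]
src_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]

# Target sentence with a different word order than the source.
tgt_words = ["Bücher", "liest", "sie"]
tgt_vecs = [[0.0, 0.1, 0.9], [0.1, 0.9, 0.1], [0.9, 0.2, 0.1]]

alignment = align(src_vecs, tgt_vecs)
tgt_labels = project_labels(src_labels, alignment, len(tgt_words))
print(list(zip(tgt_words, tgt_labels)))
```

In this sketch two source words may map to the same target word, in which case the later label overwrites the earlier one; a real pipeline would need an explicit policy for such conflicts.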

Identifier
DOI	https://doi.org/10.11588/data/HVXXIJ
Metadata Access	https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/HVXXIJ

Provenance
Creator	Daza, Angel
Publisher	heiDATA
Contributor	Daza, Angel
Publication Year	2021
Rights	info:eu-repo/semantics/openAccess
OpenAccess	true
Contact	Daza, Angel (Leibniz Institute for the German Language / Department of Computational Linguistics, Heidelberg University)

Representation
Resource Type	program source code; Dataset
Format	text/markdown; application/zip
Size	6131; 38643
Version	1.0
Discipline	Humanities
Spatial Coverage	Leibniz Institute for the German Language