X-SRL Dataset and mBERT Word Aligner

This code provides a method to automatically align words in parallel sentences using pre-trained multilingual BERT (mBERT) embeddings. The alignments can be used to transfer source annotations (for example, labeled English sentences) to the target side (for example, a German translation of the sentence) by projecting each label onto the best-aligned target word. The newly labeled data can then be used to train multilingual state-of-the-art models and improve their performance, especially for lower-resource languages.
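The label-transfer idea described above can be illustrated with a minimal sketch. It is not the released implementation: the function names (`align_words`, `project_labels`), the choice of cosine similarity with a per-word argmax, and the tiny hand-written vectors standing in for mBERT embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_words(src_vecs, tgt_vecs):
    """For each source word, return the index of the most similar target word.

    Assumption: greedy argmax over cosine similarity; the actual aligner
    may use a different similarity measure or extra filtering.
    """
    alignment = []
    for u in src_vecs:
        sims = [cosine(u, v) for v in tgt_vecs]
        alignment.append(max(range(len(sims)), key=sims.__getitem__))
    return alignment


def project_labels(src_labels, alignment, tgt_len, empty="_"):
    """Copy each non-empty source label onto its best-aligned target word.

    If two labeled source words align to the same target word, the later
    one overwrites the earlier one in this simplified sketch.
    """
    tgt_labels = [empty] * tgt_len
    for src_idx, label in enumerate(src_labels):
        if label != empty:
            tgt_labels[alignment[src_idx]] = label
    return tgt_labels


# Toy vectors standing in for contextual mBERT word embeddings (hypothetical).
src_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.1, 0.8, 0.1]]
src_labels = ["A0", "V", "A1"]

alignment = align_words(src_vecs, tgt_vecs)
tgt_labels = project_labels(src_labels, alignment, len(tgt_vecs))
print(alignment)   # [0, 2, 1]
print(tgt_labels)  # ['A0', 'A1', 'V']
```

In the toy example the second and third source words swap positions on the target side, so their labels are projected in swapped order. With real data, the vectors would come from an mBERT encoder rather than being written by hand.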

Identifier
DOI https://doi.org/10.11588/data/HVXXIJ
Metadata Access https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/HVXXIJ
Provenance
Creator Daza, Angel (Leibniz Institute for the German Language / Department of Computational Linguistics, Heidelberg University)
Publisher heiDATA
Contributor Daza, Angel
Publication Year 2021
Rights 'X-SRL Dataset and mBERT Word Aligner' is licensed under the Apache-2.0 License (http://www.apache.org/licenses/LICENSE-2.0). Please note the licenses of required components: HuggingFace Transformers (https://github.com/huggingface/transformers), Apache-2.0 License (http://www.apache.org/licenses/LICENSE-2.0); Python 3.6.3 (https://www.python.org/downloads/release/python-363/), Open Source; SpaCy (https://spacy.io/), MIT License; info:eu-repo/semantics/openAccess
OpenAccess true
Contact Daza, Angel (Leibniz Institute for the German Language / Department of Computational Linguistics, Heidelberg University)
Representation
Resource Type program source code; Dataset
Format text/markdown; application/zip
Size 6131; 38643
Version 1.0
Discipline Humanities
Spatial Coverage Leibniz Institute for the German Language