Parallel corpus of idiomatic text ParaDiom 1.0

PID

ParaDiom is a parallel corpus with sentences sampled from existing corpora. The corpus contains 1,000 Slovene sentences with their English translation and 1,000 English sentences with their Slovene translations. The sampled sentences contain idioms, similes, and proverbs, which are annotated in the corpus. Sentences were sampled based on a selection of 100 Slovene and 92 English idioms and similes by searching through sentences in the corpora ccGigafida (http://hdl.handle.net/11356/1035), ParlaMint (http://hdl.handle.net/11356/1431), and The Corpus of Late Modern English Texts (http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html). All sampled sentences were tagged with MULTEXT-East MSD tags, Universal Dependencies morphological features and lemmas using Stanza (https://github.com/stanfordnlp/stanza) for English and CLASSLA for Slovene (https://github.com/clarinsi/classla) sentences. Some idioms were found as part of proverbs, which were also annotated. Half of the sampled sentences were translated by hand, and the other half were translated using machine translation and post-editing. We used the Q-CAT annotation tool (http://hdl.handle.net/11356/1262) to annotate the idiomatic expressions. The annotated noun, adjective and adverbial idioms were given the label MWE ID (‘idiomatic multiword expression’), verb idioms MWE VID (‘verbal idiomatic multiword expression’), similes MWE SIM (‘simile’), and proverbs MWE P (‘proverb’).

Identifier
PID http://hdl.handle.net/11356/1714
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1714
Provenance
Creator Donaj, Gregor; Antloga, Špela
Publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
Publication Year 2022
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); https://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene; English
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics