Parallel corpus of idiomatic text ParaDiom 1.0

Dataset

ParaDiom is a parallel corpus of sentences sampled from existing corpora. It contains 1,000 Slovene sentences with their English translations and 1,000 English sentences with their Slovene translations. The sampled sentences contain idioms, similes, and proverbs, which are annotated in the corpus. Sentences were sampled on the basis of a selection of 100 Slovene and 92 English idioms and similes, by searching the corpora ccGigafida (http://hdl.handle.net/11356/1035), ParlaMint (http://hdl.handle.net/11356/1431), and The Corpus of Late Modern English Texts (http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html). All sampled sentences were tagged with MULTEXT-East MSD tags, Universal Dependencies morphological features, and lemmas, using Stanza (https://github.com/stanfordnlp/stanza) for the English sentences and CLASSLA (https://github.com/clarinsi/classla) for the Slovene sentences. Half of the sampled sentences were translated by hand; the other half were machine-translated and then post-edited. The idiomatic expressions were annotated with the Q-CAT annotation tool (http://hdl.handle.net/11356/1262). Some idioms occurred as part of proverbs, and these proverbs were annotated as well. Annotated noun, adjective, and adverbial idioms received the label MWE ID (‘idiomatic multiword expression’), verb idioms MWE VID (‘verbal idiomatic multiword expression’), similes MWE SIM (‘simile’), and proverbs MWE P (‘proverb’).
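
The label inventory described above can be written down as a small lookup table. This is only an illustrative sketch: the four labels and their glosses come from the description, but the names `MWE_LABELS` and `count_labels` are hypothetical, and the exact spelling of the labels inside the distributed files (for example, a space versus an underscore) is an assumption that should be checked against the corpus itself.

```python
from collections import Counter

# Labels and glosses as given in the corpus description.
# ASSUMPTION: the label strings in the corpus files are spelled this way.
MWE_LABELS = {
    "MWE ID": "idiomatic multiword expression (noun, adjective, adverbial idioms)",
    "MWE VID": "verbal idiomatic multiword expression (verb idioms)",
    "MWE SIM": "simile",
    "MWE P": "proverb",
}


def count_labels(labels):
    """Count occurrences of each known label; unknown labels are ignored."""
    counts = Counter(label for label in labels if label in MWE_LABELS)
    # Report every known label, including those that never occur.
    return {label: counts.get(label, 0) for label in MWE_LABELS}
```

For instance, `count_labels(["MWE VID", "MWE P", "MWE VID"])` would report two verbal idioms and one proverb, with zero for the other two labels.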

Identifier
PID	http://hdl.handle.net/11356/1714
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1714
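
The Metadata Access URL above is an OAI-PMH GetRecord request for a Dublin Core (oai_dc) record. As a minimal sketch, the snippet below rebuilds that URL from its parts and pulls Dublin Core values out of a response body. The function names `build_get_record_url` and `dc_values` are hypothetical helpers introduced here, and the snippet assumes the response uses the standard Dublin Core element namespace; it does not perform the network request itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint taken from the Metadata Access field of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
# Standard Dublin Core element namespace (assumed to be used in the response).
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def build_get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for the given record identifier."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown in the record.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":/")


def dc_values(xml_text, element):
    """Return the text of every Dublin Core <element> found in an XML response."""
    root = ET.fromstring(xml_text)
    tag = "{%s}%s" % (DC_NAMESPACE, element)
    return [node.text for node in root.iter(tag) if node.text]
```

Calling `build_get_record_url("oai:www.clarin.si:11356/1714")` reproduces the Metadata Access URL listed above. The response body could then be fetched with `urllib.request.urlopen` and passed to `dc_values(body, "title")` or `dc_values(body, "creator")`.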

Provenance
Creator	Donaj, Gregor; Antloga, Špela
Publisher	Faculty of Electrical Engineering and Computer Science, University of Maribor
Publication Year	2022
Rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); https://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene; English
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics