ESIC 1.1 -- Europarl Simultaneous Interpreting Corpus (2024-02-05)

Dataset

ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations. The corpus contains the source English video and audio. The interpreters' voices are not published within the corpus, but there is a tool that downloads them from the website of the European Parliament, where they are publicly available. The transcripts are punctuated, carry word-level timestamps, and are annotated with metadata (disfluencies, mixing of voices and languages, read or spontaneous speech, etc.). The speeches come from European Parliament plenary sessions held in 2008-2011. Most of the speakers are MEPs, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speeches (date, topic, read or spontaneous). ESIC has validation and evaluation parts. The current version, ESIC v1.1, extends v1.0 with manual sentence alignment of the tri-parallel texts and with bi-parallel sentence alignment of the English original transcripts and the German interpreting.

Identifier
PID	http://hdl.handle.net/11234/1-5415
Related Identifier	https://www.isca-speech.org/archive/pdfs/interspeech_2021/machacek21_interspeech.pdf
Related Identifier	http://hdl.handle.net/11234/1-3719
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-5415

Provenance
Creator	Macháček, Dominik; Žilinec, Matúš; Bojar, Ondřej
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year	2024
Funding Reference	info:eu-repo/grantAgreement/EC/H2020/825460
Rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); http://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	English; Czech; German
Resource Type	corpus
Format	application/zip; application/octet-stream; downloadable_files_count: 1
Discipline	Linguistics