Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 3.0

Dataset

ParlaMint-en 3.0 comprises the linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0 (http://hdl.handle.net/11356/1488), which were machine translated to English, with the translation then linguistically annotated.

Apart from the translation to English, small changes in the metadata, and the absence of the British parliament corpus, the corpora included in this entry are in all respects identical to the source language corpora, i.e. the entry comprises the same 26 European parliamentary corpora, with over 1.1 billion words in total.

The translation to English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) with OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done at the sentence level and covers both speeches and transcriber notes, including headings. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation, was done with Stanza (https://stanfordnlp.github.io/stanza/), using the English language model. For NER, the conll03 model with 4 NE classes was used.

Note that the automatically produced translation to English contains errors typical of neural machine translation, including factual errors even where a high level of fluency is achieved; any manual or automatic use of this corpus should take these machine translation limitations into account. Note also that some metadata errors were noticed after the source 3.0 corpora were released and were corrected for the machine-translated corpus, so there are slight differences in the metadata between the two.

The files associated with this entry include the linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata. In contrast to the source language corpora, the CoNLL-U files are not derived from the TEI encoded corpus but are the ones output by the machine translation and linguistic annotation pipeline, as these also contain word-alignment information, which is not present in the TEI version. Also included is the ParlaMint-en-3.0 release of the scripts and samples available at the GitHub repository of the ParlaMint project.
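As an illustration of working with the CoNLL-U files, the following is a minimal sketch of a reader that relies only on the general CoNLL-U conventions (ten tab-separated columns, `#` comment lines, blank lines between sentences, `Key=Value` pairs joined by `|` in the MISC column). The sample sentence, its `sent_id`, and the MISC keys shown are invented for the example and are assumptions; the actual keys used in this corpus, including those carrying the word-alignment information, should be checked against the released files.

```python
from typing import Dict, Iterable, Iterator, List

# Standard CoNLL-U column names, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_misc(value: str) -> Dict[str, str]:
    """Split a MISC cell such as 'NER=O|SpaceAfter=No' into a dict."""
    if value == "_":
        return {}
    pairs: Dict[str, str] = {}
    for item in value.split("|"):
        key, _, val = item.partition("=")
        pairs[key] = val
    return pairs


def read_conllu(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per sentence with its comments and tokens."""
    comments: Dict[str, str] = {}
    tokens: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Sentence-level comment, e.g. '# text = ...'.
            key, _, val = line[1:].partition("=")
            comments[key.strip()] = val.strip()
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                continue  # skip malformed rows in this sketch
            token = dict(zip(CONLLU_FIELDS, cols))
            token["misc"] = parse_misc(token["misc"])
            tokens.append(token)
    if tokens:
        # Handle a final sentence with no trailing blank line.
        yield {"comments": comments, "tokens": tokens}


# Invented sample; HEAD and DEPREL are left empty here because no
# dependency parsing is listed among the annotation layers.
SAMPLE = (
    "# sent_id = example.s1\n"
    "# text = Thank you.\n"
    "1\tThank\tthank\tVERB\tVBP\t_\t_\t_\t_\tNER=O\n"
    "2\tyou\tyou\tPRON\tPRP\t_\t_\t_\t_\tNER=O|SpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t_\t_\t_\tNER=O\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for sentence in sentences:
    print(sentence["comments"].get("text"))
    for token in sentence["tokens"]:
        print(token["form"], token["lemma"], token["upos"], token["misc"])
```

To read a real file, the same function could be passed an open file handle, e.g. `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the distributed CoNLL-U files.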

Identifier
PID	http://hdl.handle.net/11356/1810
Related Identifier	http://hdl.handle.net/11356/1864
Related Identifier	https://www.clarin.eu/parlamint
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1810

Provenance
Creator	Kuzman, Taja; Ljubešić, Nikola; Erjavec, Tomaž; Kopp, Matyáš; Ogrodniczuk, Maciej; Osenova, Petya; Fišer, Darja; Pirker, Hannes; Wissik, Tanja; Schopper, Daniel; Kirnbauer, Martin; Mochtak, Michal; Rupnik, Peter; Pol, Henk van der; Depoorter, Griet; de Does, Jesse; Simov, Kiril; Grigorova, Vladislava; Grigorov, Ilko; Jongejan, Bart; Haltrup Hansen, Dorte; Navarretta, Costanza; Mölder, Martin; Kahusk, Neeme; Vider, Kadri; Bel, Nuria; Antiba-Cartazo, Iván; Pisani, Marilina; Zevallos, Rodolfo; Regueira, Xosé Luís; Vladu, Adina Ioana; Magariños, Carmen; Bardanca, Daniel; Barcala, Mario; Garcia, Marcos; Pérez Lago, María; García Louzao, Pedro; Vivel Couso, Ainhoa; Vázquez Abuín, Marta; García Díaz, Noelia; Vidal Miguéns, Adrián; Fernández Rei, Elisa; Diwersy, Sascha; Luxardo, Giancarlo; Coole, Matthew; Rayson, Paul; Nwadukwe, Amanda; Gkoumas, Dimitris; Papavassiliou, Vassilis; Prokopidis, Prokopis; Gavriilidou, Maria; Piperidis, Stelios; Ligeti-Nagy, Noémi; Jelencsik-Mátyus, Kinga; Varga, Zsófia; Dodé, Réka; Barkarson, Starkaður; Agnoloni, Tommaso; Bartolini, Roberto; Frontini, Francesca; Montemagni, Simonetta; Quochi, Valeria; Venturi, Giulia; Ruisi, Manuela; Marchetti, Carlo; Battistoni, Roberto; Darģis, Roberts; van Heusden, Ruben; Marx, Maarten; Depuydt, Katrien; Tungland, Lars Magne; Rudolf, Michał; Nitoń, Bartłomiej; Aires, José; Mendes, Amália; Cardoso, Aida; Pereira, Rui; Yrjänäinen, Väinö; Norén, Fredrik Mohammadi; Magnusson, Måns; Jarlbrink, Johan; Meden, Katja; Pančur, Andrej; Ojsteršek, Mihael; Çöltekin, Çağrı; Kryvenko, Anna
Publisher	CLARIN ERIC
Publication Year	2023
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	English
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; application/gzip; downloadable_files_count: 26
Discipline	Linguistics