Multilingual comparable corpora of parliamentary debates ParlaMint 4.1

ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words.

The transcriptions are divided by day, with information on the term, session and meeting, and contain speeches marked with the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation) and on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for the release 4.1 at https://github.com/clarin-eric/ParlaMint/tree/v4.1/Build/Metadata.
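
The subcorpus date boundaries given above can be illustrated with a minimal Python sketch. The function name `subcorpora_for` is hypothetical, and the sketch assumes that the "covid" and "war" ranges are open-ended, so that a sitting held on or after 2022-02-24 carries both labels; the actual marking in the corpora may differ and should be checked against the encoding guidelines.

```python
from datetime import date

# Boundaries as stated in the description. Treating "covid" and "war" as
# open-ended (and therefore overlapping) is an assumption of this sketch.
COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpora_for(day: date) -> list[str]:
    """Return the subcorpus labels whose stated date range includes `day`."""
    if day < COVID_START:
        # Everything up to and including 2020-01-30 is "reference".
        return ["reference"]
    labels = ["covid"]
    if day >= WAR_START:
        labels.append("war")
    return labels


if __name__ == "__main__":
    for d in (date(2019, 5, 1), date(2021, 3, 15), date(2022, 6, 1)):
        print(d.isoformat(), subcorpora_for(d))
```

Returning a list rather than a single label keeps the overlap assumption explicit and easy to change.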

The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution).

This entry contains the ParlaMint TEI-encoded corpora and their derived plain text versions along with TSV metadata of the speeches. Also included is the 4.1 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint.
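
As a rough illustration of how the TSV speech metadata might be used, the following Python sketch counts speeches by the value of one metadata column. The column names and the sample rows (`ID`, `Speaker_role`, `Speaker_gender`) are hypothetical placeholders and are not taken from the distribution; the header row of the actual TSV files gives the real names.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_text: str, column: str) -> Counter:
    """Count rows of a tab-separated table by the value in `column`."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    return Counter(row[column] for row in reader)


# Hypothetical sample data; the column names are assumptions of this sketch.
SAMPLE = (
    "ID\tSpeaker_role\tSpeaker_gender\n"
    "u1\tChairperson\tF\n"
    "u2\tRegular\tM\n"
    "u3\tRegular\tF\n"
)

if __name__ == "__main__":
    print(count_by_column(SAMPLE, "Speaker_role"))
```

For a real file, the text could be read with `open(path, encoding="utf-8", newline="")` and passed to the same function.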

Note that a linguistically marked-up version of the ParlaMint 4.1 corpus (http://hdl.handle.net/11356/1911) is also available, as is a version machine-translated into English (http://hdl.handle.net/11356/1910). Both are linked with CLARIN.SI concordancers for on-line analysis.

Compared with the previous version, 4.0, this version fixes a number of bugs and restructures the ParlaMint GitHub repository. The DK corpus now also has speeches marked with topics. The PT corpus has been extended to 2024-03 and the UA corpus to 2023-11; the UA corpus also has improved language marking (uk vs. ru) on segments.

Identifier
PID http://hdl.handle.net/11356/1912
Related Identifier https://doi.org/10.1007/s10579-024-09798-w
Related Identifier http://hdl.handle.net/11356/1859
Related Identifier https://www.clarin.eu/content/parlamint
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1912
Provenance
Creator Erjavec, Tomaž; Kopp, Matyáš; Ogrodniczuk, Maciej; Osenova, Petya; Agirrezabal, Manex; Agnoloni, Tommaso; Aires, José; Albini, Monica; Alkorta, Jon; Antiba-Cartazo, Iván; Arrieta, Ekain; Barcala, Mario; Bardanca, Daniel; Barkarson, Starkaður; Bartolini, Roberto; Battistoni, Roberto; Bel, Nuria; Bonet Ramos, Maria del Mar; Calzada Pérez, María; Cardoso, Aida; Çöltekin, Çağrı; Coole, Matthew; Darģis, Roberts; de Libano, Ruben; Depoorter, Griet; Diwersy, Sascha; Dodé, Réka; Fernandez, Kike; Fernández Rei, Elisa; Frontini, Francesca; Garcia, Marcos; García Díaz, Noelia; García Louzao, Pedro; Gavriilidou, Maria; Gkoumas, Dimitris; Grigorov, Ilko; Grigorova, Vladislava; Haltrup Hansen, Dorte; Iruskieta, Mikel; Jarlbrink, Johan; Jelencsik-Mátyus, Kinga; Jongejan, Bart; Kahusk, Neeme; Kirnbauer, Martin; Kryvenko, Anna; Ligeti-Nagy, Noémi; Ljubešić, Nikola; Luxardo, Giancarlo; Magariños, Carmen; Magnusson, Måns; Marchetti, Carlo; Marx, Maarten; Meden, Katja; Mendes, Amália; Mochtak, Michal; Mölder, Martin; Montemagni, Simonetta; Navarretta, Costanza; Nitoń, Bartłomiej; Norén, Fredrik Mohammadi; Nwadukwe, Amanda; Ojsteršek, Mihael; Pančur, Andrej; Papavassiliou, Vassilis; Pereira, Rui; Pérez Lago, María; Piperidis, Stelios; Pirker, Hannes; Pisani, Marilina; Pol, Henk van der; Prokopidis, Prokopis; Quochi, Valeria; Rayson, Paul; Regueira, Xosé Luís; Rii, Andriana; Rudolf, Michał; Ruisi, Manuela; Rupnik, Peter; Schopper, Daniel; Simov, Kiril; Sinikallio, Laura; Skubic, Jure; Tungland, Lars Magne; Tuominen, Jouni; van Heusden, Ruben; Varga, Zsófia; Vázquez Abuín, Marta; Venturi, Giulia; Vidal Miguéns, Adrián; Vider, Kadri; Vivel Couso, Ainhoa; Vladu, Adina Ioana; Wissik, Tanja; Yrjänäinen, Väinö; Zevallos, Rodolfo; Fišer, Darja
Publisher CLARIN ERIC
Publication Year 2024
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); PUB; https://creativecommons.org/licenses/by/4.0/
OpenAccess true
Contact info(at)clarin.si
Representation
Language Bulgarian; Croatian; Polish; Slovenian; Slovene; Czech; Icelandic; French; Dutch; Flemish; Danish; Spanish; Castilian; Turkish; English; Italian; Hungarian; Latvian; Bosnian; Catalan; Valencian; German; Greek, Modern (1453-); Greek; Estonian; Portuguese; Serbian; Swedish; Ukrainian; Norwegian; Galician; Russian; Finnish; Basque
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 30
Discipline Linguistics