The SlovParl corpus contains the minutes of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after Slovenia became an independent country in 1991. The corpus comprises 232 sessions, 58,813 speeches and 10.8 million words. It includes extensive metadata (about the speakers, a typology of sessions, etc.) as well as structural and editorial annotations.
This item comprises three datasets:
- the corpus in TEI (module Transcriptions of speech);
- the corpus in TEI with added automatic linguistic annotation: tokenisation, MSD tagging and lemmatisation;
- the corpus in vertical format used by various concordancers, e.g. CWB and Sketch Engine; this format is simpler and smaller but does not contain all the information from the source TEI.
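As an illustration of why the vertical format is simpler to process than the TEI source, the following is a minimal sketch of a reader for it. It assumes only the general convention of CWB-style vertical files: one token per line with tab-separated attributes, and structural tags on lines of their own. The sample data, the tag names (`speech`, `s`), the `who` attribute, and the column order (word, lemma, tag) are invented for the example and may not match the actual SlovParl files.

```python
from typing import Iterable, Iterator, List, Tuple

# Invented sample; the real SlovParl columns and tag names may differ.
SAMPLE = (
    '<speech who="hypothetical_speaker">\n'
    "<s>\n"
    "Spoštovani\tspoštovan\tA\n"
    "poslanci\tposlanec\tN\n"
    ".\t.\tZ\n"
    "</s>\n"
    "</speech>\n"
)


def read_vertical(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield ("tag", [raw_line]) for structural lines and
    ("token", columns) for token lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Heuristic: a line that starts with "<" and ends with ">" is a tag.
        if line.startswith("<") and line.endswith(">"):
            yield ("tag", [line])
        else:
            yield ("token", line.split("\t"))


items = list(read_vertical(SAMPLE.splitlines()))
tags = [cols[0] for kind, cols in items if kind == "tag"]
# Assumption: the first column holds the word form.
words = [cols[0] for kind, cols in items if kind == "token"]
print(len(words), "tokens,", len(tags), "structural tags")
```

To use this on a real file, pass an open file handle instead of `SAMPLE.splitlines()` and check the column order against the documentation shipped with the dataset.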
The SlovParl data originally come from https://github.com/SIstory/SlovParl, but have been converted to use TEI elements for speech.
The first version of this resource is presented in the paper:
Pančur, Andrej. "Označevanje zbirke zapisnikov sej slovenskega parlamenta s smernicami TEI." In Proceedings of the Conference on Language Technologies & Digital Humanities, edited by Tomaž Erjavec and Darja Fišer, 142-148. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani, 2016.