Carniolan Provincial Assembly corpus Kranjska 1.0

The corpus contains meeting proceedings of the Carniolan Provincial Assembly from 1861 to 1913 (Obravnave deželnega zbora kranjskega / Bericht über die Verhandlungen des krainischen Landtages). The corpus comprises 694 sessions (15353 pages, approximately 10 million words).

The source data (scanned and OCR-processed PDF documents) come from the Digital Library of Slovenia dLib.si (http://www.dlib.si) and the History of Slovenia - SIstory (https://www.sistory.si) portals. The documents are bilingual: each speech is in Slovenian or German, depending on the speaker. German was initially typeset in Gothic script and later in Latin script.

The documents were processed automatically to extract the following data: titles, agenda, attendees, start and end of the session, speakers, and comments. Language was detected at the sentence level using Lingua (https://github.com/pemistahl/lingua-py); roughly 58% of sentences are in Slovenian and 42% in German. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added for both Slovenian and German using Trankit (https://github.com/nlp-uoregon/trankit).

The documents are in the Parla-CLARIN (https://github.com/clarin-eric/parla-clarin) compliant TEI XML format, with each session stored in a separate file.
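
As a rough illustration of how one session file might be read, the sketch below parses a tiny TEI fragment with Python's standard library and tallies sentences per language and tokens per speaker. The sample XML is hypothetical: it assumes utterances are encoded as `u` elements with a `who` attribute, sentences as `s` elements carrying `xml:lang`, and tokens as `w` elements with a `lemma` attribute, which is common in Parla-CLARIN corpora but should be checked against the actual files. The function name `summarise_session` is also an assumption made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical miniature session; real corpus files are much larger and
# their exact element layout should be verified before relying on this.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u who="#speaker1">
          <s xml:lang="sl"><w lemma="seja">Seja</w> <w lemma="biti">je</w> <w lemma="odprt">odprta</w><pc>.</pc></s>
        </u>
        <u who="#speaker2">
          <s xml:lang="de"><w lemma="ich">Ich</w> <w lemma="bitten">bitte</w> <w lemma="um">um</w> <w lemma="der">das</w> <w lemma="Wort">Wort</w><pc>.</pc></s>
          <s xml:lang="sl"><w lemma="hvala">Hvala</w><pc>.</pc></s>
        </u>
      </div>
    </body>
  </text>
</TEI>"""


def summarise_session(xml_text):
    """Return (sentences per language, word tokens per speaker)."""
    root = ET.fromstring(xml_text)
    lang_counts = Counter()
    speaker_tokens = Counter()
    for u in root.iter(f"{{{TEI_NS}}}u"):
        who = u.get("who", "unknown")
        for s in u.iter(f"{{{TEI_NS}}}s"):
            # xml:lang lives in the reserved XML namespace
            lang = s.get(f"{{{XML_NS}}}lang", "und")
            lang_counts[lang] += 1
            # count only word tokens, not punctuation (<pc>)
            speaker_tokens[who] += len(s.findall(f"{{{TEI_NS}}}w"))
    return lang_counts, speaker_tokens


lang_counts, speaker_tokens = summarise_session(SAMPLE)
print(dict(lang_counts))
print(dict(speaker_tokens))
```

For a real session, the file contents could be passed to the same function in place of `SAMPLE`; running it over all files would give a per-corpus language split comparable to the 58%/42% figure above, assuming the language is recorded on the sentence elements.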

Identifier
PID http://hdl.handle.net/11356/1824
Related Identifier https://www.inz.si/en/Digital-humanities-resources-tools-and-methods/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1824
Provenance
Creator Kavčič, Alenka; Mundjar, Aleksander; Marolt, Matija
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2023
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene; German
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics
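
The Metadata Access entry above is an OAI-PMH `GetRecord` request. A minimal sketch of how such a URL can be assembled from the handle is shown below; the helper name `get_record_url` is hypothetical, and the endpoint and identifier pattern are simply taken from the URL listed in this record.

```python
from urllib.parse import urlencode

# Endpoint copied from the Metadata Access entry of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"


def get_record_url(handle, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for a CLARIN.SI handle suffix."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the listed URL.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = get_record_url("11356/1824")
print(url)
```

The resulting URL can then be fetched with any HTTP client (for example `urllib.request.urlopen`) to obtain the Dublin Core record as XML.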