Collection of Slovenian legal texts COLESLAW 1.0

COLESLAW 1.0 is a large-scale collection of Slovenian legal texts compiled from authoritative public sources. The corpus covers legislative, judicial, and governmental legal documents and is designed to support research in legal NLP, information retrieval, contradiction detection, legal reasoning, and domain adaptation of language models.

COLESLAW 1.0 consists of 547,799 unique documents, totalling 771.93 million words, encoded in 12 files. The corpus aggregates documents from four primary domains:

  • PISRS, Legal Information System of the Republic of Slovenia (pisrs.si)
  • SodnaPraksa, Lower Court Decisions (sodnapraksa.si)
  • USRS, Constitutional Court of the Republic of Slovenia (www.us-rs.si)
  • Uradni List, Official Gazette of the Republic of Slovenia (www.uradni-list.si)

PISRS, Uradni List and SodnaPraksa are further divided into subcollections:

  • PISRS: enacted laws, repealed laws, legislative proposals, general and individual acts, register of regulations, records of normative authorities
  • Uradni List: regulatory and announcement sections
  • SodnaPraksa: lower court decisions and non-pecuniary damage claims

Each domain has a corresponding README.txt file that details the contents of its files and describes the keys and metadata used in that domain.

Documents are stored in structured JSONL format and include unique identifiers, full cleaned text, and source-specific metadata. Legislative files typically contain regulation identifiers and procedural references; judicial files include structured components such as headnotes, operative parts and reasoning; gazette publications contain issue identifiers and publication metadata. A complete specification of keys, field definitions, and subcollection-level statistics is provided in the respective README.txt files for each domain.
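Because the documents are stored as JSONL (one JSON object per line), a file can be streamed without loading it into memory. The following is a minimal sketch of that idea, not part of the corpus distribution. The function names and the file name in the usage note are hypothetical. No particular key names are assumed: the sketch only counts whichever top-level keys it finds, and the authoritative key list is the one given in each domain's README.txt.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_documents(path: Path) -> Iterator[dict]:
    """Yield one document (a dict) per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


def key_frequencies(path: Path) -> Counter:
    """Count how often each top-level key occurs across all documents."""
    counts: Counter = Counter()
    for document in iter_documents(path):
        counts.update(document.keys())
    return counts
```

For example, key_frequencies(Path("some_domain_file.jsonl")) (a placeholder file name) would return a Counter that can be compared against the keys documented in the corresponding README.txt.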

Identifier
PID http://hdl.handle.net/11356/2095
Related Identifier https://journals.uni-lj.si/slovenscina2/issue/view/1698
Related Identifier https://www.cjvt.si/llm4dh/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/2095
Provenance
Creator Malenšek, Miha; Žitnik, Slavko; Završnik, Aleš; Krajnc, Saša; Križnar, Primož; Bajec, Marko
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2024
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics