Collection of Slovenian legal texts COLESLAW 1.0

Dataset

COLESLAW 1.0 is a large-scale collection of Slovenian legal texts compiled from authoritative public sources. The corpus covers legislative, judicial, and governmental legal documents and is designed to support research in legal NLP, information retrieval, contradiction detection, legal reasoning, and domain adaptation of language models.

COLESLAW 1.0 consists of 547,799 unique documents, totalling 771.93 million words, encoded in 12 files. The corpus aggregates documents from four primary domains:

PISRS, Legal Information System of the Republic of Slovenia (pisrs.si)
SodnaPraksa, Lower Court Decisions (sodnapraksa.si)
USRS, Constitutional Court of the Republic of Slovenia (www.us-rs.si)
Uradni List, Official Gazette of the Republic of Slovenia (www.uradni-list.si)

Three of these sources, PISRS, Uradni List and SodnaPraksa, are further divided into subcollections:

PISRS: enacted laws, repealed laws, legislative proposals, general and individual acts, register of regulations, records of normative authorities
Uradni List: regulatory and announcement publications
SodnaPraksa: lower court decisions and non-pecuniary damage claims

Each domain has a corresponding README.txt file that details the contents of its files and describes the keys and metadata present in that domain.

Documents are stored in structured JSONL format and include unique identifiers, full cleaned text, and source-specific metadata. Legislative files typically contain regulation identifiers and procedural references; judicial files include structured components such as headnotes, operative parts and reasoning; gazette publications contain issue identifiers and publication metadata. The complete specification of keys, field definitions, and subcollection-level statistics is given in the README.txt file of each domain.
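
As a minimal sketch of how one such JSONL file might be inspected, the Python snippet below reads a file line by line and reports the number of documents, a rough whitespace-based word count, and how often each key occurs. The key name "text" is an assumption made for illustration only; the actual keys differ by domain and are listed in the respective README.txt files. The word count is only approximate and is not claimed to reproduce the corpus statistics above.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one document (a dict) per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path: Path, text_key: str = "text") -> dict:
    """Count documents, whitespace-separated words, and key occurrences.

    `text_key` is a hypothetical field name; check the domain README.txt
    for the key that actually holds the cleaned text.
    """
    n_docs = 0
    n_words = 0
    key_counts = Counter()
    for doc in iter_jsonl(path):
        n_docs += 1
        key_counts.update(doc.keys())
        text = doc.get(text_key) or ""  # a missing or null text counts as empty
        n_words += len(str(text).split())
    return {"documents": n_docs, "words": n_words, "keys": dict(key_counts)}
```

A call such as `summarize(Path("example.jsonl"))`, where `example.jsonl` is a placeholder file name, returns a dictionary with the three summaries. Listing the keys first is a quick way to confirm which fields a given subcollection provides before building anything on top of them.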

Identifier
PID	http://hdl.handle.net/11356/2095
Related Identifier	https://journals.uni-lj.si/slovenscina2/issue/view/1698
Related Identifier	https://www.cjvt.si/llm4dh/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/2095

Provenance
Creator	Malenšek, Miha; Žitnik, Slavko; Završnik, Aleš; Krajnc, Saša; Križnar, Primož; Bajec, Marko
Publisher	Faculty of Computer and Information Science, University of Ljubljana
Publication Year	2024
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics