COLESLAW 1.0 is a large-scale collection of Slovenian legal texts compiled from authoritative public sources. The corpus covers legislative, judicial, and governmental legal documents and is designed to support research in legal NLP, information retrieval, contradiction detection, legal reasoning, and domain adaptation of language models.
COLESLAW 1.0 consists of 547,799 unique documents, totalling 771.93 million words, encoded in 12 files. The corpus aggregates documents from four primary domains:
- PISRS, Legal Information System of the Republic of Slovenia (pisrs.si)
- SodnaPraksa, Lower Court Decisions (sodnapraksa.si)
- USRS, Constitutional Court of the Republic of Slovenia (www.us-rs.si)
- Uradni List, Official Gazette of the Republic of Slovenia (www.uradni-list.si)
Three of these sources, PISRS, Uradni List and SodnaPraksa, are further divided into subcollections:
- PISRS: enacted laws, repealed laws, legislative proposals, general and individual acts, register of regulations, records of normative authorities
- Uradni List: regulatory and announcement parts
- SodnaPraksa: lower court decisions and non-pecuniary damage claims
Each domain has a corresponding README.txt file that details the contents of its files and describes the keys and metadata present in that domain.
Documents are stored in structured JSONL format, one document per line, and include unique identifiers, full cleaned text, and source-specific metadata. Legislative files typically contain regulation identifiers and procedural references; judicial files include structured components such as headnotes, operative parts and reasoning; gazette publications contain issue identifiers and publication metadata. A complete specification of keys, field definitions, and subcollection-level statistics is provided in the respective README.txt files for each domain.
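
As an illustration of working with the JSONL files, the following minimal Python sketch streams one file line by line and counts documents and whitespace-delimited words. The key name `text` and the helper names are assumptions made for this example only; the actual keys differ per domain and are listed in the respective README.txt, and the word count produced this way may not match the counting method used for the corpus statistics.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def iter_documents(path: Path) -> Iterator[dict]:
    """Yield one document (a dict) per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_words(docs: Iterable[dict], text_key: str = "text") -> Tuple[int, int]:
    """Return (number of documents, number of whitespace-delimited words).

    `text_key` is a hypothetical key name; check the domain README.txt
    for the key that actually holds the cleaned text.
    """
    n_docs = 0
    n_words = 0
    for doc in docs:
        n_docs += 1
        n_words += len(str(doc.get(text_key, "")).split())
    return n_docs, n_words
```

Calling `count_words(iter_documents(Path("some_file.jsonl")))`, with a placeholder file name, processes the file as a stream, so even the largest files do not need to fit in memory at once.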