Corpus of term-annotated texts RSDO5 1.1

Dataset

The RSDO5 corpus was compiled to serve as a training set for automatic term identification. It consists of 12 texts containing 250,000 words and almost 38,000 manually annotated terms, each marked as either in-domain or out-domain. The texts were published between 2000 and 2019. They comprise PhD theses (3), a scientific book based on a PhD thesis (1), graduate-level textbooks (4), and journal articles (4), and belong to the fields of biomechanics (3), linguistics (3), chemistry (3), and veterinary science (3).

In addition to the manually annotated terms, the corpus carries automatically produced Universal Dependencies annotations, i.e. tokenisation, sentence segmentation, lemmatisation, morphological features and dependency syntax.

Unlike the previous version, this version adds in- and out-domain marking of terms in the TEI and vertical files.

Identifier
PID	http://hdl.handle.net/11356/1470
Related Identifier	http://hdl.handle.net/11356/1400
Related Identifier	https://rsdo.slovenscina.eu/terminoloski-portal
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1470
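
The Metadata Access link above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be assembled from the handle suffix, assuming the record identifier is always `oai:www.clarin.si:` followed by that suffix, as it is in the link above. The function name `oai_getrecord_url` is hypothetical and used for illustration only; no request is sent.

```python
from urllib.parse import urlencode

# Base endpoint and parameter values are copied from the Metadata Access link.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"


def oai_getrecord_url(handle_suffix: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a handle suffix such as '11356/1470'.

    Assumption: the OAI identifier is 'oai:www.clarin.si:' + handle suffix,
    which matches the single example given in this record.
    """
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle_suffix}",
    }
    # Keep ':' and '/' unescaped so the result matches the link as published.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


if __name__ == "__main__":
    print(oai_getrecord_url("11356/1470"))
```

Passing `11356/1400` instead would build the corresponding request for the related identifier listed above, under the same assumption.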

Provenance
Creator	Jemec Tomazin, Mateja; Trojar, Mitja; Atelšek, Simon; Fajfar, Tanja; Erjavec, Tomaž; Žagar Karer, Mojca
Publisher	ZRC SAZU
Publication Year	2021
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 4
Discipline	Linguistics