Corpus of term-annotated texts RSDO5 1.1

The RSDO5 corpus was compiled to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually annotated terms, each marked as either in- or out-domain. The texts were published between 2000 and 2019. By type, they are PhD theses (3), a scientific book based on a PhD thesis (1), graduate-level textbooks (4), and journal articles (4); by field, they belong to biomechanics (3), linguistics (3), chemistry (3), and veterinary science (3).

In addition to the manually annotated terms, the corpus carries automatically produced Universal Dependencies annotations, i.e. tokenisation, sentence segmentation, lemmatisation, morphological features and dependency syntax.

Unlike the previous version, this version adds in- and out-domain marking of terms in the TEI and vertical files.

Identifier
PID http://hdl.handle.net/11356/1470
Related Identifier http://hdl.handle.net/11356/1400
Related Identifier https://rsdo.slovenscina.eu/terminoloski-portal
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1470
Provenance
Creator Jemec Tomazin, Mateja; Trojar, Mitja; Atelšek, Simon; Fajfar, Tanja; Erjavec, Tomaž; Žagar Karer, Mojca
Publisher ZRC SAZU
Publication Year 2021
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 4
Discipline Linguistics
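
The Metadata Access line above is an OAI-PMH GetRecord request. As a rough illustration only, the sketch below rebuilds that URL from the handle suffix 11356/1470 and parses it back. The helper name oai_get_record_url is hypothetical, and the assumption that other records follow the same oai:www.clarin.si:<handle suffix> pattern is not confirmed by this record.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base endpoint copied from the Metadata Access line of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"


def oai_get_record_url(handle_suffix: str, prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL (hypothetical helper, not a CLARIN.SI API)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": prefix,
        # Assumed identifier pattern, inferred from the single URL in this record.
        "identifier": f"oai:www.clarin.si:{handle_suffix}",
    }
    # Keep ':' and '/' unescaped so the result matches the form shown in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = oai_get_record_url("11356/1470")
query = parse_qs(urlparse(url).query)  # maps each parameter name to a list of values
print(url)
print(query["identifier"][0])
```

Only the Python standard library is used and no network request is made; fetching the record itself would need an HTTP client on top of this.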