Terminology identification dataset KAS-term 1.0


The dataset contains 22,950 term candidates extracted from 15 Slovenian PhD theses. The term candidates are of length 1 to 4 and were extracted using morphosyntactic patterns and a frequency threshold of 3. The PhD theses come from the areas of chemistry, computer science and political science. Each term candidate is annotated by four annotators as (1) an in-domain term, (2) an out-of-domain term, (3) a general academic term or (4) not a term. Each term candidate is also annotated with its frequency in the PhD thesis and 7 statistical measures. The resource can serve as a training set for supervised learning of term extraction and for benchmarking terminology extraction tools.
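Since each candidate carries four annotator judgements, using the resource as a training set typically requires collapsing them into a single label. The following is a minimal sketch of one possible approach (plurality voting, with ties left unresolved), not the procedure used by the dataset authors. The row layout, the field names `candidate` and `votes`, and the sample rows are hypothetical and made up for illustration; only the four label codes come from the description above.

```python
from collections import Counter

# Label codes as listed in the dataset description.
LABELS = {
    1: "in-domain term",
    2: "out-of-domain term",
    3: "general academic term",
    4: "not a term",
}


def majority_label(votes):
    """Return the label with strictly the most votes, or None on a tie."""
    counts = Counter(votes).most_common()
    # most_common() sorts by count, so a tie for first place shows up
    # as equal counts in the first two entries.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical rows: placeholder candidates with invented votes,
# NOT taken from the actual dataset files.
rows = [
    {"candidate": "placeholder candidate A", "votes": [1, 1, 1, 4]},
    {"candidate": "placeholder candidate B", "votes": [3, 3, 4, 4]},
    {"candidate": "placeholder candidate C", "votes": [2, 4, 4, 4]},
]

resolved = {row["candidate"]: majority_label(row["votes"]) for row in rows}

for candidate, label in resolved.items():
    name = LABELS[label] if label is not None else "unresolved (tie)"
    print(f"{candidate}: {name}")
```

With four annotators, 2-2 splits are possible, so the sketch returns `None` rather than picking a side; whether to drop such candidates or resolve them some other way is a design choice left to the user.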

PID http://hdl.handle.net/11356/1198
Related Identifier http://nl.ijs.si/kas/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1198
Creator Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Arhar Holdt, Špela; Bren, Urban; Robnik-Šikonja, Marko; Udovič, Boštjan
Publisher Jožef Stefan Institute
Publication Year 2018
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Language Slovenian; Slovene
Resource Type lexicalConceptualResource
Format application/octet-stream; text/csv; text/plain; application/pdf; text/plain; charset=utf-8; downloadable_files_count: 4
Discipline Linguistics