Bilingual terminology extraction dataset KAS-biterm 1.0


The KAS-biterm bilingual term extraction dataset contains complete sentences selected from PhD theses in the KAS corpus of Slovene academic writing. Only sentences with a high chance of containing both a term in its original language and its Slovene translation were selected, using three CQL patterns in noSketch Engine. These sentences are manually annotated for (1) terms, (2) partial terms and (3) abbreviations in (a) Slovene, (b) English, or (c) another language. Links between the Slovene terms and their equivalents in the other languages, as well as their abbreviations, are also encoded. The resource can serve as a training set for supervised bilingual term extraction tools and as a benchmark for evaluating them.
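
The annotation scheme described above (three span types, three language categories, and links from Slovene terms to their equivalents and abbreviations) could be held in memory roughly as follows. This is only a sketch: the class names, field names, label strings, and the example sentence are all hypothetical, and the actual file format of the distributed dataset is not described in this record.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical label sets mirroring the categories named in the description.
KINDS = {"term", "partial_term", "abbreviation"}
LANGS = {"sl", "en", "other"}


@dataclass(frozen=True)
class Span:
    """One annotated span over a sentence's tokens (hypothetical layout)."""
    start: int  # index of the first token, inclusive
    end: int    # index one past the last token, exclusive
    kind: str   # one of KINDS
    lang: str   # one of LANGS

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown span kind: {self.kind}")
        if self.lang not in LANGS:
            raise ValueError(f"unknown language label: {self.lang}")


@dataclass
class Sentence:
    tokens: List[str]
    spans: List[Span]
    # Each link is (index of a Slovene span, index of its linked span).
    links: List[Tuple[int, int]]

    def text(self, span: Span) -> str:
        return " ".join(self.tokens[span.start:span.end])

    def term_pairs(self) -> List[Tuple[str, str]]:
        """Return (Slovene text, linked text) for every encoded link."""
        return [
            (self.text(self.spans[i]), self.text(self.spans[j]))
            for i, j in self.links
        ]


# Invented example sentence, not taken from the dataset.
example = Sentence(
    tokens=["Strojno", "učenje", "(", "angl.", "machine", "learning",
            ",", "ML", ")", "je", "področje", "."],
    spans=[
        Span(0, 2, "term", "sl"),
        Span(4, 6, "term", "en"),
        Span(7, 8, "abbreviation", "en"),
    ],
    links=[(0, 1), (0, 2)],
)
pairs = example.term_pairs()
```

Pairs extracted this way could then be compared against the output of a term extraction tool, for instance as sets of (Slovene, foreign) pairs when computing precision and recall.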

Creator Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Bitenc, Maja
Publisher Jožef Stefan Institute
Publication Year 2018
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB
OpenAccess true
Contact info(at)
Language Slovene; English
Resource Type corpus
Format application/zip; application/pdf; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics