-
Digital library and corpus of historical Slovene IMP 1.1
The IMP digital library contains historical Slovene books and other publications, a total of 658 texts with over 45,000 pages from the period 1584-1919. Each text contains... -
Morphological Lexicon of Slovene Sloleks 3.1
Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their... -
Morphological lexicon Sloleks 3.0
Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their... -
Morphological lexicon Sloleks 2.0
Sloleks is the reference morphological lexicon for the Slovenian language, developed to be used in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains... -
Morphological lexicon Sloleks 1.2
Sloleks is the reference morphological lexicon for the Slovenian language, developed to be used in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains... -
CMC training corpus Janes-Tag 2.1
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
Die Erstellung von Fachgebärdenlexika am Institut für Deutsche Gebärdensprach...
Detailed description of how six corpus-based German – German Sign Language (DGS) LSP dictionaries were produced, including elicitation methods, annotation and... -
Transkriptionskonventionen im Vergleich
Synopsis of transcription conventions used in six international sign language research projects including annotation tool and tiers in transcripts, divided into conventional... -
Synergies between transcription and lexical database building: The case of Ge...
Building a lemmatised corpus of German Sign Language (DGS) using iLex, a relational database and annotation tool; consistent token-type matching (lemmatisation) and quality... -
How Much Top-Down and Bottom-Up do We Need to Build a Lemmatized Corpus?
Building a lemmatised corpus of German Sign Language (DGS) using iLex; lemmatisation as top-down and lexicon building as bottom-up process; lemma revision -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0
ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Reference corpus of historical Slovene goo300k 1.2
goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899. Each text... -
Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0
ReLDI-NormTagNER-sr 2.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
CMC training corpus Janes-Tag 1.0
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
Serbian Twitter training corpus ReLDI-NormTag-sr 1.0
ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word... -
The CLASSLA-Stanza model for lemmatisation of standard Croatian 2.1
The model for lemmatisation of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus... -
Spoken Torlak dialect corpus 1.0 (transcription)
The Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local... -
The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian 1.1
The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training... -
ReLDI tag+lemma web service for WebLicht
WebLicht (https://weblicht.sfs.uni-tuebingen.de/) registry entry for a web service comprising tokenisation, PoS tagging, and lemmatisation. -
CMC training corpus Janes-Tag 2.0
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
