Dataset - B2FIND

The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.0

This model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus...

The CLASSLA-Stanza model for JOS dependency parsing of standard Slovenian 2.0

This model for JOS dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus...

The CLASSLA-Stanza model for UD dependency parsing of standard Croatian 2.1

The model for UD dependency parsing of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the UD-parsed portion of the...

Trankit model for SST 2.15

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank...

Training corpus ssj500k 2.3

The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....

Training corpus SETimes.SR 1.0

The SETimes.SR training corpus contains 86 726 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic...

The CLASSLA-Stanza model for UD dependency parsing of standard Bulgarian 2.1

The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the UD-parsed portion of...

ENIAMtoolkit (2017-03-06)

ENIAMtoolkit is a collection of libraries that:
- perform tokenization, lemmatization, part of speech tagging;
- detect MWE and abbreviations;
- split text into sentences;
- LCG...

48 datasets found