The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of
tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels.
The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, https://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, (3) the Janes annotation guidelines for named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, (4) the PARSEME guidelines for annotating multi-word expressions, https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/, and (5) the semantic role labelling annotation protocol for Slovenian and Croatian, https://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Gantar-et-al_Towards-Semantic-Role-Labeling-in-Slovene-and-Croatian.pdf.
Unlike the previous version of the dataset, this version is encoded in the CoNLL-U Plus (conllup) format, as are the other linguistic training datasets for Croatian and Serbian. The PARSEME multi-word expression annotation layer has also been added, together with numerous corrections of labels on all available levels.
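As a rough illustration of how a conllup file can be read, the Python sketch below relies only on the general CoNLL-U Plus convention that a "# global.columns = ..." comment declares the column names and that token lines are tab-separated. The function name read_conllup, the five-column layout and the sample sentence are assumptions made for this example; they are not taken from the hr500k files, whose actual column inventory should be checked in the file header.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts keyed by the declared columns."""
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # The header comment names the columns used in this file.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Other comments (sentence ids, raw text) are skipped here.
            continue
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            # Token line: pair each tab-separated value with its column name.
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Invented sample, NOT an excerpt from hr500k; the column set is hypothetical.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS\n"
    "# text = Ana spava.\n"
    "1\tAna\tAna\tPROPN\tNpfsn\n"
    "2\tspava\tspavati\tVERB\tVmr3s\n"
    "3\t.\t.\tPUNCT\tZ\n"
    "\n"
)

sentences = list(read_conllup(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["FORM"], token["LEMMA"], token["XPOS"])
```

Reading the column names from the header rather than hard-coding them keeps the sketch usable if a file declares additional annotation layers as extra columns.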
The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).