The model for lemmatisation of non-standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus (http://hdl.handle.net/11356/1792) and the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1793), using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~94.23.
The difference to the previous version of the model is that this version is trained on a combination of two corpora (hr500k, ReLDI-NormTagNER-hr).