-
Deep Sequoia corpus - PARSEME-FR corpus - FrSemCor
The Sequoia corpus is a set of 3,099 linguistically-annotated French sentences, originating from four sources (Europarl, European Agency Reports, French regional journal L'Est... -
Parallel Global Voices, Czech-English NER+NEL
Annotation of named entities to the existing source Parallel Global Voices, ces-eng language pair. The named entity annotations distinguish four classes: Person, Organization,... -
Czech Court Decisions Dataset
We present the Czech Court Decisions Dataset (CCDD) -- a dataset of 300 manually annotated court decisions published by The Supreme Court of the Czech Republic and the... -
Czech Legal Text Treebank 2.0
The Czech Legal Text Treebank 2.0 (CLTT 2.0) annotates the same texts as the CLTT 1.0. These texts come from the legal domain and they are manually syntactically annotated. The... -
Archaeological entities and timespans extracted from all archaeology document...
We trained a BERT language model for Dutch Archaeology, and fine-tuned it to perform Named Entity Recognition for 6 categories of entity: artefacts, materials, time periods,... -
ACTER (Annotated Corpora for Term Extraction Research) v1.5
ACTER (Annotated Corpora for Term Extraction Research) is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains... -
ACTER (Annotated Corpora for Term Extraction Research) v1.4
The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised... -
ACTER (Annotated Corpora for Term Extraction Research) v1.3
The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised... -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0
ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0
ReLDI-NormTagNER-sr 2.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Croatian linguistic training corpus hr500k 2.0
The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and... -
Wikipedia talk corpus Janes-Wiki 1.0
Janes-Wiki is an annotated corpus of discussion pages from the Slovene Wikipedia from the period 2003-08 to 2017-06. The corpus contains page and user talks and is structured... -
Training corpus ssj500k 1.3
The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from... -
CMC training corpus Janes-Tag 2.0
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
Serbian linguistic training corpus SETimes.SR 2.0
The SETimes.SR training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation,... -
Slovenian parliamentary corpus ParlaMeter-sl 1.0
The ParlaMeter-sl corpus contains minutes of the National Assembly of the Republic of Slovenia and currently covers its VIIth mandate (2014-08-01 to 2018-06-22). The corpus... -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0
ReLDI-NormTagNER-hr 2.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
CMC training corpus Janes-Tag 3.0
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,... -
Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0
This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised and sentence... -
Training corpus ssj500k 2.0
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
