-
Training corpus ssj500k 2.0
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
xLiMe Twitter Corpus XTC 1.0.1
The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,... -
Training corpus ssj500k 1.3
The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from... -
ACTER (Annotated Corpora for Term Extraction Research) v1.5
ACTER (Annotated Corpora for Term Extraction Research) is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains... -
ACTER (Annotated Corpora for Term Extraction Research) v1.4
The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised... -
ACTER (Annotated Corpora for Term Extraction Research) v1.3
The ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised... -
KPWr annotation guidelines - named entities
Named entity annotation guidelines describing the process of manually annotating documents in the Polish Corpus of Wrocław University of Technology (KPWr) -
Extended dictionary of named entities NELexicon connected with Linked Open Data
This resource contains Polish named entities connected with terminology from available resources within Linked Open Data (e.g. WordNet, DBpedia, Wikipedia). -
KPWr dump r240
Dump of the Polish Corpus of Wrocław University of Technology (KPWr) containing a set of documents annotated with named entities and keywords. -
PolEval 2019 Task 2: Lemmatization of proper names and multi-word phrases — t...
The task consists in developing a tool for the lemmatization of proper names and multi-word phrases. The generated lemmas should follow the KPWr guidelines... -
PolEval 2019 Task 2: Lemmatization of proper names and multi-word phrases — t...
The task consists in developing a tool for the lemmatization of proper names and multi-word phrases. The generated lemmas should follow the KPWr guidelines... -
Brand-Product Relation Extraction Corpora
Brand-Product Relation Extraction Using Heterogeneous Vector Space Representations, Janz, A., Piasecki, M., Kopociński, Ł., Pluwak, A. -
Slovenian parliamentary corpus ParlaMeter-sl 1.0
The ParlaMeter-sl corpus contains minutes of the National Assembly of the Republic of Slovenia and currently covers its VIIth mandate (2014-08-01 to 2018-06-22). The corpus... -
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0
ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
ReLDI token+tag+lemma+NER web service for WebLicht
WebLicht (https://weblicht.sfs.uni-tuebingen.de/) registry entry for a web service comprising tokenisation, PoS tagging and Named Entity Recognition. Tool source files are... -
Training corpus ssj500k 1.4
The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, named... -
Croatian parliamentary corpus ParlaMeter-hr 1.0
The ParlaMeter-hr corpus contains minutes of the National Assembly of the Republic of Croatia and currently covers its VIth mandate (2016-11-15 to 2018-11-21). The corpus... -
Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0
This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised and sentence... -
Wikipedia Infobox Mapping PL
Mapping between the infobox attributes used in Polish Wikipedia and the KPWr named entity schema.