SumeCzech-NER

Dataset

SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).

Format

The dataset is split into four files in jsonl format, with one JSON object per line. The most important fields of each JSON object are:

dataset: split the article belongs to, one of train, dev, test, oodtest
ne_abstract: list of named entity annotations of article's abstract
ne_headline: list of named entity annotations of article's headline
ne_text: list of named entity annotations of article's text
url: article's URL that can be used to match article across SumeCzech and SumeCzech-NER
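
A minimal sketch of reading such a file and indexing records by url, so that they can be matched against SumeCzech articles. The sample lines, the example URLs, the tag strings, and the helper names read_jsonl and index_by_url are illustrative assumptions, not part of the dataset; the exact shape of the ne_* values should be checked against the real files.

```python
import json
from typing import Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def index_by_url(records: Iterable[dict]) -> Dict[str, dict]:
    """Map each article URL to its record, for joining with SumeCzech."""
    return {record["url"]: record for record in records}


# Hypothetical sample lines; real files hold more fields and real values.
sample = [
    '{"dataset": "train", "url": "http://example.com/a", "ne_headline": ["B-P", "I-P", "O"]}',
    '{"dataset": "dev", "url": "http://example.com/b", "ne_headline": ["O", "B-G"]}',
]

ner_by_url = index_by_url(read_jsonl(sample))
```

With a real file, the same functions can be fed an open file handle, for example `with open(path, encoding="utf-8") as f: ner_by_url = index_by_url(read_jsonl(f))`, where path is whichever of the four files is needed.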

Annotations

We used spaCy's NER model trained on the CoNLL-based extended CNEC 2.0. The model achieved an F-score of 78.45 on that dataset's test set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.
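
In IOB2, every entity starts with a B- tag, continues with I- tags of the same type, and tokens outside entities are tagged O. The sketch below groups such tags into (start, end, label) spans. It assumes the annotations are available as a flat list of tag strings aligned with the tokens, and the labels "P" and "G" are placeholders chosen for illustration; the actual label strings in the files may differ.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB2 tags into (start, end_exclusive, label) spans."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # The current span continues only on an I- tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # B- always opens a span; a stray I- is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Placeholder labels: "P" and "G" are assumptions, not confirmed tag names.
example_tags = ["B-P", "I-P", "O", "B-G", "B-G", "I-G"]
example_spans = iob2_to_spans(example_tags)
# example_spans == [(0, 2, "P"), (3, 4, "G"), (4, 6, "G")]
```

Two adjacent entities of the same type stay separate because the second one begins with its own B- tag, which is the property that distinguishes IOB2 from the older IOB1 scheme.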

Tokenization

We used the following Python code for tokenization:

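from typing import List
from nltk.tokenize import word_tokenize

def tokenize(text: str) -> List[str]:
    # Pad selected punctuation with spaces so each mark becomes its own token.
    for mark in ('.', ',', '?', '!', '-', '–', '/'):
        text = text.replace(mark, f' {mark} ')
    # word_tokenize requires NLTK's tokenizer data to be installed.
    tokens = word_tokenize(text)
    return tokens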
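
The effect of the punctuation padding step can be seen without NLTK. The sketch below repeats only that step and then splits on whitespace; the whitespace split is a stand-in for word_tokenize and will not reproduce its output in every case, and the names pad_punctuation and rough_tokens are hypothetical helpers, not part of the original pipeline.

```python
from typing import List

MARKS = ('.', ',', '?', '!', '-', '–', '/')


def pad_punctuation(text: str) -> str:
    """Surround each listed mark with spaces, as in the tokenizer above."""
    for mark in MARKS:
        text = text.replace(mark, f' {mark} ')
    return text


def rough_tokens(text: str) -> List[str]:
    # Whitespace split is only an approximation of nltk's word_tokenize.
    return pad_punctuation(text).split()


tokens = rough_tokens("Praha-západ, 1.5.2018")
# tokens == ['Praha', '-', 'západ', ',', '1', '.', '5', '.', '2018']
```

Hyphenated names and dates are therefore broken into several tokens, which matters when aligning the annotations with text tokenized in any other way.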

Identifier
PID	http://hdl.handle.net/11234/1-3505
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-3505

Provenance
Creator	Marek, Petr; Müller, Štěpán
Publisher	Czech Technical University in Prague
Publication Year	2021
Rights	Mozilla Public License 2.0; http://opensource.org/licenses/MPL-2.0; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Czech
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 4
Discipline	Linguistics