55 datasets found

Keywords: B2SHARE

  • NoticIA

    We present NoticIA, a dataset consisting of 850 Spanish news articles featuring prominent clickbait headlines, each paired with high-quality, single-sentence generative...
  • Psycholinguistic Experiment Video

    This is a video recording that is being used in psycholinguistic experiments.
  • Laburpen corpusa: The Basque Summaries Corpus

    School summaries obtained from Unai Atutxa's thesis (Atutxa, 2022) are available under the CC BY-NC 4.0 license. A total of 1676 extractions and abstractions have been...
  • SemMdf - Semantic Database for Moksha

    This SQLite database contains Moksha lemmas and their frequencies in a big corpus. The lemmas are linked to each other based on the syntactic relations they have had in the...
  • SemKpv - Semantic Database for Komi-Zyrian

    This SQLite database contains Komi-Zyrian lemmas and their frequencies in a big corpus. The lemmas are linked to each other based on the syntactic relations they have had in the...
  • Skolt Sami - North Sami Cognates

    A human-curated list of Skolt Sami (sms) - North Sami (sme) cognates found with an automatic method described in: Hämäläinen, M., & Rueter, J. (2019). Finding Sami Cognates...
  • Celebrities and Famous People, and their Properties

    Context: This dataset is based on the work presented in the following publication; please cite it if you use the data in an academic publication: Alnajjar, K., Hämäläinen, M.,...
  • UralicNLP - The NLP library for Uralic languages

    UralicNLP is a natural language processing library targeted mainly for Uralic languages. UralicNLP can produce morphological analysis, generate morphological forms, lemmatize...
  • El mejor conjunto de datos para identificación del sarcasmo

    This corpus contains all the utterances from two episodes of South Park (Latin American dub) and two episodes of Archer (European Spanish dub). Each utterance has been annotated...
  • s.morfcorpus.6ec19594.20131227-2309

    WMT 2013 Crawled News monolingual corpus, Czech, segmented by Morfessor
  • Exploring genealogical blends_Online Corpus

    The online corpus supplement to the paper "Exploring genealogical blends: the Surinamese Creole Cluster and the Virgin Islands Dutch Creole Cluster", published in the CLARIN...
  • Movie Title Puns

    Context: The data is based on the following paper on pun generation: Hämäläinen, M., & Alnajjar, K. (2019). Modelling the Socialization of Creative Agents in a...
  • SemFi: Finnish Semantics with Syntactic Relations

    Context: This dataset is covered in detail in the following publication: Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost...
  • Finnish Dialect Normalization Model

    This is an OpenNMT-py model for normalizing spoken Finnish text into written Finnish. For usage, please see https://github.com/mikahama/murre/. This model has been produced in...
  • SentiLex-PT 02

    SentiLex-PT is a sentiment lexicon for Portuguese, made up of 7,014 lemmas and 82,347 inflected forms. In detail, the lexicon describes: 4,779 (16,863) adjectives, 1,081...
You can also access this registry using the API (see API Docs).