-
TALN - archives : articles des conférences TALN et RECITAL
French-language digital archive of research articles in natural language processing (Traitement Automatique des Langues) published by the Association pour le Traitement Automatique des LAngues (ATALA) during... -
Gebärdensprach-Korpora im Überblick
Survey on sign language corpora with detailed information on project, raw, meta, primary data, references and contact information -
Tools for Harmonizing Available Annotations to a Common Format
This deliverable D6.6 provides tools for harmonizing available annotations to a common interchange format. It defines the interchange format and provides example implementations... -
Sign Language Corpora Survey
Survey on sign language corpora with detailed information on project, raw, meta, primary data, references and contact information -
Establishment of the Infrastructure to Automatically Analyse other Datasets
This deliverable D9.6 documents the installation of a tool chain for processing sign language data external to the project, mostly meant to be run on a high performance... -
GER_SET: Situation Entity Type labelled corpus for German
Semantic clause types, also called Situation Entity (SE) types (Smith, 2003), are linguistic characterizations of aspectual properties shown to be useful for tasks like... -
BiHCorp: Corpus of Parliamentary Debates in Bosnia and Herzegovina
The most recent version of this study is available at: https://doi.org/10.5281/zenodo.6517697... -
CROCorp: Corpus of Parliamentary Debates in Croatia
The most recent version of this study is available at: https://doi.org/10.5281/zenodo.6521372... -
SRBCorp: Corpus of Parliamentary Debates in Serbia
The most recent version of this study is available at: https://doi.org/10.5281/zenodo.6521648... -
BiHCorp: Corpus of Parliamentary Debates in Bosnia and Herzegovina
The repository contains a cleaned and pre-processed corpus of parliamentary debates from the Parliamentary Assembly of Bosnia and Herzegovina. The corpus is accompanied by the... -
The Multilingual Emotional Football Corpus (MEmoFC)
The Multilingual Emotional Football Corpus (MEmoFC) has been manually collected from English, German, and Dutch websites of individual football clubs to investigate the way... -
Replication Data for: When modality and tense meet. The future marker budet ‘...
Dataset description: This is a study of examples of Russian impersonal constructions with the modal word možno ‘can, be possible’ with and without the future copula budet ‘will... -
Replication Data for: Russian verbal borrowings in Udmurt
This is the dataset used in a study of Russian verbal loans in Udmurt. The files contain lists of Russian verbs found in the Udmurt social media corpus... -
Replication Data for: A network of allostructions: quantified subject constru...
Data and R code are provided for statistical analysis of approximately 39,000 corpus examples of predicate agreement in constructions with quantified subjects in Russian. The... -
Replication Data for: Less is More: Why All Paradigms are Defective, and Why ...
Only a fraction of lexemes are encountered in all their paradigm forms in any corpus or even in the lifetime of any speaker. This raises a question as to how it is that native... -
SYN v4: large corpus of written Czech
Corpus of contemporary written (printed) Czech, 3.6 gigawords in size (i.e. 4.3 billion tokens). It covers mostly the period of 1990–2014 and it is a traditional corpus (as opposed to the... -
Cleaned Polish Oscar corpus (64M lines)
Cleaned Polish Oscar corpus (part: 64M lines, 3.45 GB). Data was prepared with a few cleaning heuristics: - remove sentences shorter than - remove non-Polish sentences... -
Big Data language model - subword - BPE - ARPA
Big data language model over subword units obtained with byte pair encoding, in ARPA format -
MWE Kuncewiczowa
Maria Kuncewiczowa -
CEN
Corpus of Economic News (CEN) contains 797 documents from Polish Wikipedia annotated with 65 categories of proper names in ccl format....