CLARIN - Repositories

Language Technology Research Bibliography for Lithuanian 2016-2020

A language technology bibliography for the Lithuanian language covering the period 2016-2020. The resource is in BibTeX format and contains: 1) 91 references of research...

JABLONSKIS tagset v2

JABLONSKIS VERSION 2 is a standard Lithuanian morphological tagset that is based on the abbreviations of parts of speech and other grammatical categories commonly used in...

Lithuanian Treebank ALKSNIS (2019-10-24)

ALKSNIS v3.0 consists of 3,643 syntactically annotated sentences in the PML (Prague Mark-up Language) format. The format allows researchers to visualise and edit...

TED-ELH Parallel Corpus

The corpus contains aligned parallel scripts of TED Talks in English, Lithuanian, and Hebrew. It contains spoken language data.

Survey Data on Preferences of Lithuanian Cybersecurity Terminology

The data is provided in two files: one containing the questionnaire data and the other containing the respondents' data. The questionnaire data is in a TXT file, which includes...

MariTerm v.1.2

This is an enriched version of the MariTerm maritime ontology, containing plug-ins to corresponding synsets inside IWN. The resource was created within the collaboration of the...

SELEXINI corpus

We present here a large automatically annotated corpus for French. This corpus is divided into two parts: the first from BigScience, and the second from HPLT. The annotated...

Parole+ (2017-10-16)

The Swedish PAROLE Lexicon - A language technology resource with access to syntactic information, connected to SALDO senses.

Annotated Route Description

This file set, consisting of a video stream, an audio stream and a multimodal annotation file, is frequently used as a showcase of how to do complex multimodal annotations with the...

Model for Normalizing Historical English

This is an OpenNMT-py model for normalizing historical English into modern spelling. For usage, please see: https://github.com/mikahama/natas This has been described in the...

Gustav Vasa's letter production (2015-05-26) Gustav Vasas brevproduktion (20...

King Gustav I's registry (Konung Gustaf den förstes registratur)

Wikipedia paths

Wikipedia category embedding starting at the top category Biology for English, French and Czech. English data are not complete.

Finnish Semantic Relatedness Model

This semantic model captures the relatedness of Finnish words as word vectors. It can be used in various tasks such as metaphor interpretation. For...

Sign Language Recording

This is a Sign Language Recording made for scientific purposes.

Murre - Normalize non-standard Finnish and dialectalize standard Finnish

A Python library for normalizing dialectal Finnish and dialectalizing standard Finnish. Normalization: Niko Partanen, Mika Hämäläinen, and Khalid Alnajjar. 2019. Dialect Text...

Creative Dialog Generation for Fallout 4

Mika Hämäläinen and Khalid Alnajjar. 2019. Creative contextual dialog adaptation in an open world RPG. In Proceedings of the 14th International Conference on the Foundations of...

Cases of Complements of Finnish Verbs

Context Cases of the complements of Finnish verbs. The data is useful for natural language generation (NLG). The data is described in the following paper, which should also be...

HABE-IXA Basque written test corpus (HABE-IXA euskarazko idazmen proben corpusa)

This corpus contains essays written in official HABE exams for assessing students' knowledge of the Basque language. We have collected 120 essays in each of the B1, B2, C1 and...

CLIN26-Bracmat-poster.pdf

Linguistic and algebraic expressions can be analysed with similar pattern matching (PM) methods, suggesting a trove of useful methods for Natural Language Processing (NLP). For...

TXM_0.7.7_Win64.exe

TXM 0.7.7 setup file for 64-bit Windows. TXM is a free and open-source (GPL v3) textual corpora analysis platform. It combines five key components: a) the ability to import and...

4,930 datasets found