CLARIN - Repositories

English-Lithuanian Parallel Cybersecurity Corpus - DVITAS v2.0

English-Lithuanian parallel corpus DVITAS v2 includes original English texts on cybersecurity and their Lithuanian translations aligned on the sentence level. Version 1 of the...

DIGIRES COVID-19 ML Dataset v.1

DIGIRES COVID-19 ML dataset v.1 is a tab-separated (.tsv) file prepared for training machine learning algorithms. The training dataset was compiled from various internet public...

Lithuanian keyboard for macOS users

This keyboard driver allows easy access to Lithuanian letters via the conventional keyboard layout, a.k.a. „Lithuanian letters instead of numbers“. An essential new feature of this...

Colloc -- A Tool for Automatic Identification of Multiword Expressions

Colloc -- a tool for automatic identification of multiword expressions (MWE) is freely available for online use at http://resursai.mwe.lt/atpazintuvas. As material for training...

ORVELIT v3

ORVELIT v3 (Lith. Originalios ir Vertimų Lietuvių Kalbos Tekstynas) is a comparable monolingual corpus of original and translated Lithuanian consisting of four sub-corpora of...

Corpus of the Contemporary Lithuanian Language

The Corpus of the Contemporary Lithuanian Language, which comprises 208 million words, is a collection of texts designed to represent current Lithuanian. The corpus has been...

Pedagogic Corpus of Lithuanian

The Pedagogic Corpus of Lithuanian is a monolingual specialized corpus, prepared for learning and teaching Lithuanian in a foreign language classroom. The pedagogic corpus...

Lithuanian morphologically annotated corpus - MATAS

MATAS v0.2 - Morphologically Annotated Lithuanian Corpus (manually checked). Contains 4 parts: Documents (21%), Fiction (19%), Periodicals (36%), Scientific texts (24%). Wordform...

Language Technology Research Bibliography for Lithuanian 2016-2020

A language technology bibliography for the Lithuanian language covering the period 2016-2020. The resource is in BibTeX format and contains: 1) 91 references of research...

JABLONSKIS tagset v2

JABLONSKIS VERSION 2 is a standard Lithuanian morphological tagset that is based on the abbreviations of parts of speech and other grammatical categories commonly used in...

Lithuanian Treebank ALKSNIS (2019-10-24)

ALKSNIS v3.0. ALKSNIS v3.0 consists of 3,643 syntactically annotated sentences in the PML (Prague Mark-up Language) format. The format allows researchers to visualise and edit...

TED-ELH Parallel Corpus

The corpus contains parallel, aligned scripts of TED Talks in English, Lithuanian, and Hebrew. It contains spoken language data.

Survey Data on Preferences of Lithuanian Cybersecurity Terminology

The data is provided in two files: one containing questionnaire data and the other containing the respondents' data. The questionnaire data is in a TXT file, which includes...

MariTerm v.1.2

This is an enriched version of the MariTerm maritime ontology, containing plug-ins to corresponding synsets inside IWN. The resource was created within the collaboration of the...

SELEXINI corpus

We present here a large automatically annotated corpus for French. This corpus is divided into two parts: the first from BigScience, and the second from HPLT. The annotated...

Parole+ (2017-10-16)

The Swedish PAROLE Lexicon - A language technology resource with access to syntactic information, connected to SALDO senses.

Annotated Route Description

This file set, consisting of a video stream, an audio stream, and a multimodal annotation file, is frequently used as a showcase of how to do complex multimodal annotations with the...

Model for Normalizing Historical English

This is an OpenNMT-py model for normalizing historical English into modern spelling. For usage, please see https://github.com/mikahama/natas. This has been described in the...

Gustav Vasa's letter production (2015-05-26) Gustav Vasas brevproduktion (20...

King Gustav I's registry (Konung Gustaf den förstes registratur)

Wikipedia paths

Wikipedia category embedding starting at the top category Biology for English, French and Czech. English data are not complete.
