Dataset - B2FIND

Chromosomal rearrangements but no change of genes and transposable elements r...

We present a new de novo whole-genome assembly, obtained by high-quality DNA extraction and Nanopore long-read sequencing of an isolate sampled in the...

MIMIC2: Murine Intestinal Microbiota Integrated Catalog v2

Dataset overview. The MIMIC2 dataset provides: a non-redundant, high-quality catalog of 5.0 million genes; 6,967 Metagenome-Assembled Genomes (MAGs); 1,252 Metagenomic Species...

High density genotyping dataset associated to the paper entitled « The geneti...

This dataset consists of genotypes of 528 animals belonging to 13 cattle populations for 680338 variants (SNPs) in plink format. The compressed archive contains two files: (i) a...

SNP4OrphanSpecies: A bio-informatic pipeline to isolate robust molecular mark...

This pipeline performs a de novo genome assembly, identifies single-copy genes in the genome, and designs several pools of primer pairs for amplifying regions of these genes...

MetaChick: characterization of the chicken caecal metagenome by deep shotgun ...

Dataset overview. This dataset provides: a non-redundant, high-quality catalog of 13.6 million genes; 30,031 Metagenome-Assembled Genomes (MAGs); 2,420 Metagenomic Species...

Czech HS Contracts Dataset (CHSC) 1.0

The Czech Contracts dataset was created as part of the thesis Low-resource Text Classification (2021) by A. Szabó, MFF UK. The contracts were obtained from the Hlídač Státu web portal....

Multilingual static embeddings for Verbal Multiword Expressions trained on PA...

This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs) in different languages (German, Greek, Basque, French, Irish, Hebrew, Hindi,...

Indonesian web corpus (idWac)

Indonesian text corpus from the web. Crawled by SpiderLing in 2017. Filtered by JusText and Onion (see http://corpus.tools/ for details). Tagged and lemmatized by MorphInd...

Large Corpus of Czech Parliament Plenary Hearings

We present a large corpus of Czech parliament plenary sessions. The corpus consists of approximately 444 hours of speech data and corresponding text transcriptions. The whole...

SQAD 3.2

Simple question answering database version 3.2 (SQAD v3.2) created from Czech Wikipedia. The new version consists of more than 16000 records. Each record of SQAD consists of...

skTenTen

Slovak large web corpus skTenTen, comprising 876,003,720 tokens.

sqad 3.0

Simple question answering database version 3 (SQAD v3) created from Czech Wikipedia. The new version consists of 13477 records. Each record of SQAD consists of multiple files -...

Czech OOV Inflection Dataset

Czech OOV Inflection Dataset is a Czech inflection dataset of nouns, focused on evaluation in out-of-vocabulary (OOV) conditions. It consists of two parts: a standard...

CoNLL 2018 Shared Task - UDPipe Baseline Models and Supplementary Materials

Baseline UDPipe models for the CoNLL 2018 Shared Task in UD Parsing, and supplementary material. The models require UDPipe version 1.2 or later and are evaluated using the official...

MorfFlex SK 170914

Slovak morphological dictionary modeled after the Czech one. It consists of (word form, lemma, POS tag) triples, reusing the Czech morphological system for POS tags and lemma...

MorfFlex CZ 161115

Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for...

VIADAT-SEARCH

VIADAT-SEARCH in connection with VIADAT-REPO enables searching transcripts of oral history recordings. Language analysis has been used to preprocess the recordings, which makes...

CoNLL 2017 Shared Task - UDPipe Baseline Models and Supplementary Materials

Baseline UDPipe models for the CoNLL 2017 Shared Task in UD Parsing, and supplementary material. The models require UDPipe version 1.1 or later and are evaluated using the official...

SYN v9: large corpus of written Czech

Corpus of contemporary written (printed) Czech sized 4.7 GW (i.e. 5.7 billion tokens). It covers mostly the 1990-2019 period and features rich metadata including detailed...

MorfFlex CZ

Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for...

103 datasets found