SynEst (English-to-Estonian) Synthetic Estonian Parallel Corpus

A synthetic parallel corpus built from original English texts that were machine-translated into Estonian and then filtered.

Original English text sources:
- NewsCrawl (https://data.statmt.org/news-crawl) up to year 2021
- ParaCrawl v9 (https://paracrawl.eu): the English side of parallel corpora between English and German, Spanish, Finnish, French, Lithuanian, Latvian, Russian, Swedish, Ukrainian and Chinese
- United Nations Parallel Corpus (https://conferences.unite.un.org/uncorpus)
- OpenSubtitles (https://opus.nlpl.eu) monolingual English texts

Additional unfiltered data (not included in count):
- Reddit data in English (downloaded via https://github.com/microsoft/DialoGPT)

Identifier
DOI https://doi.org/10.15155/5R1E-6R35
Metadata Access https://metashare.ut.ee/oai_pmh/?verb=GetRecord&metadataPrefix=olac&identifier=2c7256ade14b11ee8822cf5d819ab78b9cb61168200644faa3b9aae58f95c3c0
Provenance
Publisher CLARIN
Contributor Mark Fišel, fishel[at]ut.ee, Tartu Ülikool
Publication Year 2024
Rights CC-BY
OpenAccess true
Contact info(at)keeleressursid.ee
Representation
Language Estonian; English
Resource Type Text
Size 768250602 sentences
Discipline Linguistics