SynEst (English-to-Estonian) Synthetic Estonian Parallel Corpus

A synthetic parallel corpus: original English texts machine-translated into Estonian, then filtered.

Original English text sources:
- NewsCrawl (https://data.statmt.org/news-crawl) up to year 2021
- ParaCrawl v9 (https://paracrawl.eu): the English side of parallel corpora between English and German, Spanish, Finnish, French, Lithuanian, Latvian, Russian, Swedish, Ukrainian and Chinese
- United Nations Parallel Corpus (https://conferences.unite.un.org/uncorpus)
- OpenSubtitles (https://opus.nlpl.eu) monolingual English texts

Additional unfiltered data (not included in count):
- Reddit data (downloaded via https://github.com/microsoft/DialoGPT) in English

Identifier
DOI	https://doi.org/10.15155/5R1E-6R35
Metadata Access	https://metashare.ut.ee/oai_pmh/?verb=GetRecord&metadataPrefix=olac&identifier=2c7256ade14b11ee8822cf5d819ab78b9cb61168200644faa3b9aae58f95c3c0

Provenance
Publisher	CLARIN
Contributor	Mark Fišel, fishel[at]ut.ee, Tartu Ülikool
Publication Year	2024
Rights	CC-BY
OpenAccess	true
Contact	info(at)keeleressursid.ee