Description
Estonian corpus of written texts. It consists of the Estonian Reference Corpus (1990–2008), contemporary and old literature, Estonian Web (2013, 2017, 2019, 2021, 2023), timestamped Estonian corpora (2014–2021, 2020–2023), Estonian Wikipedia (articles: 2023, talk pages: 2017) and Estonian academic writing (2020–2023). The texts are cleaned and deduplicated. Text type annotation covers topics and genres.
ENCODING: UTF-8
== Comparison to the ENC 2021 corpus
Balanced Corpus 1990–2008 ................. kept without changes
Reference Corpus 1990–2008 ................ kept without changes
Literature Old 1864–1945 .................. updated according to the source
Literature Contemporary 2000–2023 ......... updated according to the source (licensed under CLARIN ACA)
Web 2013 .................................. kept without changes
Web 2017 .................................. kept without changes
Wikipedia Talk 2017 ....................... kept without changes
Academic Texts (formerly DOAJ) up to 2023 . updated with new data
Web 2019 .................................. kept without changes
Web 2021 .................................. kept without changes
Wikipedia 2023 ............................ replacing Wikipedia 2021
Feeds (JSI) 2014–2021 ..................... kept without changes
Feeds (LC) 2020–2023 ...................... updated with new data
Web 2023 .................................. new