MULTEXT-East "1984" annotated corpus 4.0


The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel, sentence-aligned corpus contains the novel in the English original (about 100,000 words) and its translations into a number of languages.

This version of the corpus contains the linguistically annotated texts, with each word tagged by its lemma and its MULTEXT(-East) morphosyntactic description (MSD, i.e., a fine-grained feature-structure based PoS tag).
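As a rough illustration of how a positional MSD can be read, the sketch below decodes a noun tag such as "Ncmsn" into attribute-value pairs. The function name decode_msd and the small NOUN_ATTRIBUTES table are assumptions made for this example; the table covers only a handful of noun attributes, whereas the actual attributes and values are defined per language in the MULTEXT-East morphosyntactic specifications.

```python
# Illustrative subset only (an assumption for this sketch): the real
# attribute tables are language-specific and much larger.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("Case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
]

# The first character of an MSD names the category (part of speech).
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into a feature dictionary."""
    if not msd:
        raise ValueError("empty MSD")
    if msd[0] not in CATEGORIES:
        raise KeyError(f"category {msd[0]!r} is not covered by this sketch")
    name, attributes = CATEGORIES[msd[0]]
    features = {"Category": name}
    # Each following position holds the value code of one attribute.
    for code, (attribute, values) in zip(msd[1:], attributes):
        if code == "-":  # a hyphen marks an attribute that does not apply
            continue
        # Unknown codes are kept verbatim rather than guessed at.
        features[attribute] = values.get(code, code)
    return features


print(decode_msd("Ncmsn"))
```

Here "Ncmsn" reads as a common noun that is masculine, singular and nominative; a tag such as "Nc-pg" simply skips the gender position.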

The structurally annotated texts are a separate submission, which also covers a somewhat different set of languages.

Creator Erjavec, Tomaž; Barbu, Ana-Maria; Derzhanski, Ivan; Dimitrova, Ludmila; Garabík, Radovan; Ide, Nancy; Kaalep, Heiki-Jaan; Kotsyba, Natalia; Krstev, Cvetana; Oravecz, Csaba; Petkevič, Vladimír; Priest-Dorman, Greg; QasemiZadeh, Behrang; Radziszewski, Adam; Simov, Kiril; Tufiş, Dan; Zdravkova, Katerina
Publisher Jožef Stefan Institute
Publication Year 2010
Funding Reference info:eu-repo/grantAgreement/EC/FP7/211938
Rights MULTEXT-East licence; ACA
OpenAccess true
Contact info(at)
Language Bulgarian; Czech; English; Estonian; Persian; Farsi; Hungarian; Macedonian; Polish; Romanian; Moldavian; Moldovan; Slovak; Slovenian; Slovene; Serbian
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics