The Lithuanian-English Parallel Cybersecurity Corpus consists of official cybersecurity documents of the Republic of Lithuania and their English translations, dating from 2014 to 2024. The documents were obtained from the legal act repositories of the Republic of Lithuania (e-seimas.lrs.lt; e-tar.lt) and from the official website of the National Cyber Security Centre under the Ministry of National Defence of the Republic of Lithuania (nksc.lt).
The total size of the corpus is 216,213 words (LT: 96,085 words; EN: 120,128 words). The texts are aligned at the sentence level; the corpus contains 6,417 aligned segments (sentences).
The dataset consists of 152 files encoded in UTF-8. The files are arranged in the following archives:
1) LT-EN_Parallel_Cybersecurity_Corpus_TMX.zip - 30 files in TMX (Translation Memory eXchange) format:
- 29 separate LT-EN TMX files with aligned texts,
- 1 combined file consolidating all 29 LT-EN TMX files.
2) LT-EN_Parallel_Cybersecurity_Corpus_VERT.zip - 60 files in VERT (vertical text) format:
- 29 separate LT VERT files with morphological annotation,
- 1 combined file consolidating all 29 LT VERT files,
- 29 separate EN VERT files with morphological annotation,
- 1 combined file consolidating all 29 EN VERT files.
SENTENCE ALIGNMENT:
Each block in VERT files corresponds to a translation unit in TMX files.
MORPHOLOGICAL ANNOTATION STRUCTURE:
LT: wordform | lempos (lemma with part-of-speech suffix) | tag (LT MULTEXT-East)
EN: wordform | tag | lempos (EN TreeTagger)
TAGSET REFERENCES:
https://www.sketchengine.eu/lithuanian-multext-east-part-of-speech-tagset/
https://www.sketchengine.eu/english-treetagger-pipeline-2/
3) LT-EN_Parallel_Cybersecurity_Corpus_TXT.zip - 60 files in TXT (plain text) format:
- 29 separate LT TXT files,
- 1 combined file consolidating all 29 LT TXT files,
- 29 separate EN TXT files,
- 1 combined file consolidating all 29 EN TXT files.
4) LT-EN_Parallel_Cybersecurity_Corpus_CSV(Metadata).zip - 2 files with metadata in CSV (comma-separated values) format:
- 1 LT CSV file with metadata,
- 1 EN CSV file with metadata.
Metadata categories:
File names, Type of document, Title of document, Author of document, Year, Words, Source, URL.
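EXAMPLE (illustrative only):
As a rough illustration of how the TMX and VERT files described above might be read, the following Python sketch uses only the standard library. It assumes that the TMX files follow the usual tu/tuv/seg layout with language codes beginning with "lt" and "en", that VERT columns are tab-separated, and that each aligned block is wrapped in an <s> element. None of these details is stated in this description, so the function names, the block_tag and sep parameters, and the language codes are hypothetical and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Column order per language, as given under MORPHOLOGICAL ANNOTATION STRUCTURE.
VERT_COLUMNS = {
    "lt": ("wordform", "lempos", "tag"),
    "en": ("wordform", "tag", "lempos"),
}

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="lt", tgt_lang="en"):
    """Yield (source_text, target_text) for each translation unit.

    `source` is a file path or a binary file object. The language codes
    are assumptions; region suffixes such as "lt-LT" are reduced to "lt".
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release memory for large combined files


def read_vert_blocks(lines, lang, block_tag="s", sep="\t"):
    """Yield one list of token dicts per block of a VERT file.

    Assumptions: columns are separated by `sep`, and each block is
    delimited by <block_tag ...> and </block_tag> lines.
    """
    columns = VERT_COLUMNS[lang]
    block = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Structural markup lines look like <...> and carry no columns.
        if line.startswith("<") and line.endswith(">") and sep not in line:
            inner = line[1:-1].strip()
            if inner == "/" + block_tag:
                if block is not None:
                    yield block
                block = None
            elif inner.split()[:1] == [block_tag]:
                block = []
            continue
        if block is not None and line:
            fields = line.split(sep)
            # Skip lines that do not match the expected column count.
            if len(fields) == len(columns):
                block.append(dict(zip(columns, fields)))
```

If the assumptions hold, list(read_tmx_pairs(path)) on the combined TMX file should return one pair per aligned segment, and the n-th block returned by read_vert_blocks for the matching LT and EN VERT files should correspond to the n-th translation unit.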