ELMo embeddings models for seven languages

Dataset

This resource contains ELMo language models (https://github.com/allenai/bilm-tf) for producing contextual word embeddings, trained on large monolingual corpora for seven languages: Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish. Each language's model was trained for approximately 10 epochs. The training corpora range in size from over 270 M tokens for Latvian to almost 2 B tokens for Croatian. For each language model, roughly the 1 million most frequent tokens were provided as the vocabulary during training. The models can also produce embeddings for out-of-vocabulary (OOV) words, since the neural network input is at the character level.

Each model is in its own .tar.gz archive, which contains two files: PyTorch weights (.hdf5) and options (.json). Both files are needed for model inference with the allennlp (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) Python library.
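
The following is a minimal sketch of how one archive might be unpacked and used for inference, not a definitive recipe. It assumes that the archive holds exactly one .json and one .hdf5 file (their names inside the archive are not stated here, so they are located by extension), that the archive name in the usage comment is hypothetical, and that the installed allennlp version provides `allennlp.modules.elmo.Elmo` and `batch_to_ids`; the API may differ between allennlp releases.

```python
import tarfile
from pathlib import Path


def unpack_elmo_archive(archive_path, target_dir):
    """Extract a model archive and return (options_path, weights_path).

    Assumption: the archive contains exactly one .json options file and
    one .hdf5 weights file; both are found by extension.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target)

    options = sorted(target.rglob("*.json"))
    weights = sorted(target.rglob("*.hdf5"))
    if len(options) != 1 or len(weights) != 1:
        raise ValueError(
            f"expected one .json and one .hdf5 file, "
            f"found {len(options)} and {len(weights)}"
        )
    return options[0], weights[0]


def embed_sentences(options_path, weights_path, sentences):
    """Return contextual embeddings for a batch of tokenized sentences.

    `sentences` is a list of token lists. Input is built from characters,
    so tokens outside the training vocabulary are embedded as well.
    """
    # Imported lazily so that unpacking works without torch/allennlp installed.
    import torch
    from allennlp.modules.elmo import Elmo, batch_to_ids

    elmo = Elmo(
        str(options_path),
        str(weights_path),
        num_output_representations=1,
        dropout=0.0,
    )
    elmo.eval()
    character_ids = batch_to_ids(sentences)
    with torch.no_grad():
        output = elmo(character_ids)
    # Tensor of shape (batch_size, max_sentence_length, embedding_dim).
    return output["elmo_representations"][0]


# Usage (hypothetical archive name; requires torch and allennlp):
# options, weights = unpack_elmo_archive("slovenian-elmo.tar.gz", "slovenian-elmo")
# vectors = embed_sentences(options, weights, [["Dober", "dan", "!"]])
```

With `num_output_representations=1`, the returned tensor is a single learned weighted combination of the ELMo layers; downstream tasks that need the individual layers would have to request them differently.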

Identifier
PID	http://hdl.handle.net/11356/1277
Related Identifier	https://arxiv.org/abs/1911.10049
Related Identifier	http://hdl.handle.net/11356/1257
Related Identifier	http://embeddia.eu
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1277

Provenance
Creator	Ulčar, Matej
Publisher	Faculty of Computer and Information Science, University of Ljubljana
Publication Year	2019
Funding Reference	info:eu-repo/grantAgreement/EC/H2020/825153
Rights	Apache License 2.0; PUB; https://opensource.org/licenses/Apache-2.0
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene; Croatian; Finnish; Estonian; Latvian; Lithuanian; Swedish
Resource Type	toolService
Format	application/gzip; text/plain; charset=utf-8; downloadable_files_count: 7
Discipline	Linguistics