Multilingual static embeddings for Verbal Multiword Expressions trained on PARSEME raw corpora

This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs) in different languages (German, Greek, Basque, French, Irish, Hebrew, Hindi, Italian, Polish, Brazilian Portuguese, Romanian, Swedish, Turkish, Chinese). The vectors were trained with the skip-gram variant of the Word2Vec algorithm on the PARSEME raw corpora, which are automatically annotated for morpho-syntax (http://hdl.handle.net/11234/1-3367). VMWEs in these corpora were annotated by Seen2Seen, a rule-based VMWE identifier and one of the leading tools of edition 1.2 of the PARSEME shared task. The tokens of each VMWE were merged into a single token. The vector space files use the binary format of the original Word2Vec implementation by Mikolov et al. (2013) and are compressed with bzip2.
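
As an illustration of the file format described above, the following is a minimal sketch of a reader for bzip2-compressed Word2Vec binary files, using only the Python standard library. It assumes the layout written by the original Word2Vec tool: a text header line with the vocabulary size and the dimension, then for each entry the token, a space, and the vector as 32-bit floats (assumed little-endian here). The function names `read_word2vec_binary` and `load_vectors` and the file name in the usage comment are hypothetical and not part of the resource.

```python
import bz2
import io
import struct


def read_word2vec_binary(stream):
    """Yield (token, vector) pairs from a binary stream in Word2Vec format."""
    # Header line: "<vocabulary size> <dimension>\n"
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    for _ in range(vocab_size):
        token = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of data while reading a token")
            if ch == b" ":
                break  # the space separates the token from its vector
            if ch != b"\n":  # skip the newline that closes the previous entry
                token.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        # Assumption: 32-bit floats stored in little-endian byte order.
        vector = struct.unpack(f"<{dim}f", raw)
        yield token.decode("utf-8", errors="replace"), vector


def load_vectors(path):
    """Load a bzip2-compressed Word2Vec binary file into a dict."""
    with bz2.open(path, "rb") as stream:
        return dict(read_word2vec_binary(stream))


# Usage (hypothetical file name):
# vectors = load_vectors("vectors.bin.bz2")
```

Reading one byte at a time keeps the sketch short but is slow on large vocabularies. Libraries such as gensim provide `KeyedVectors.load_word2vec_format(..., binary=True)` for the same format, which is likely a better fit for real use.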

Identifier
PID http://hdl.handle.net/11234/1-5528
Related Identifier https://gitlab.com/parseme/corpora
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-5528
Provenance
Creator Estève, Louis Clément; Savary, Agata; Lavergne, Thomas
Publisher Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique
Publication Year 2024
Rights PARSEME Shared Task Raw Corpus Data (v. 1.2) Agreement; https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.2-raw; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language German; Greek, Modern (1453-); Greek; Basque; French; Irish; Hebrew; Hindi; Italian; Polish; Portuguese; Romanian; Moldavian; Moldovan; Swedish; Turkish; Chinese
Resource Type lexicalConceptualResource
Format text/plain; charset=utf-8; application/octet-stream; application/x-xz; text/plain; downloadable_files_count: 22
Discipline Linguistics