Multilingual static embeddings for Verbal Multiword Expressions trained on PARSEME raw corpora

Dataset

This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs), one per language: German, Greek, Basque, French, Irish, Hebrew, Hindi, Italian, Polish, Brazilian Portuguese, Romanian, Swedish, Turkish and Chinese. The vectors were trained with the skip-gram version of the Word2Vec algorithm on the PARSEME raw corpora, which are automatically annotated for morpho-syntax (http://hdl.handle.net/11234/1-3367). Before training, VMWEs in these corpora were annotated with Seen2Seen, a rule-based VMWE identifier that was one of the leading tools of the PARSEME shared task version 1.2, and the tokens of each identified VMWE were merged into a single token. The vector space files use the binary format of the original Word2Vec implementation by Mikolov et al. (2013) and are compressed with bzip2.
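As an illustration of the file format described above, the following is a minimal sketch of reading a bzip2-compressed Word2Vec binary file with the Python standard library only. It assumes the usual layout of the original implementation (a text header with vocabulary size and dimension, then each token followed by a space and little-endian 32-bit floats). The function names, the file path argument and the sample tokens are hypothetical, and this record does not state how merged VMWE tokens are spelled, so the token `prendre_part` below is invented for the demonstration.

```python
import bz2
import io
import struct


def read_word2vec_binary(stream):
    """Parse the word2vec binary layout from a binary file-like object.

    Assumed layout: b"<vocab_size> <dim>\n", then for each entry the UTF-8
    token, one space, and <dim> little-endian float32 values, optionally
    followed by a newline.
    """
    vocab_size, dim = (int(field) for field in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        token = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of file while reading a token")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline left after the previous vector
                token.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        vectors[token.decode("utf-8")] = struct.unpack(f"<{dim}f", raw)
    return vectors


def load_compressed_vectors(path_or_file):
    """Open a bzip2-compressed vector file and parse it (hypothetical helper)."""
    with bz2.open(path_or_file, "rb") as stream:
        return read_word2vec_binary(stream)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real file; the tokens are made up.
    payload = (
        b"2 3\n"
        + b"chat " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
        + b"prendre_part " + struct.pack("<3f", 0.5, -0.5, 0.25) + b"\n"
    )
    demo = load_compressed_vectors(io.BytesIO(bz2.compress(payload)))
    print(sorted(demo))
```

The Format field of this record also lists application/x-xz; if a downloaded file turns out to be xz-compressed, `lzma.open` from the standard library could replace `bz2.open` in the sketch. For actual use, an established loader such as gensim's `KeyedVectors.load_word2vec_format(path, binary=True)` may be more convenient than a hand-written parser.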

Identifier
PID	http://hdl.handle.net/11234/1-5528
Related Identifier	https://gitlab.com/parseme/corpora
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-5528

Provenance
Creator	Estève, Louis Clément; Savary, Agata; Lavergne, Thomas
Publisher	Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique
Publication Year	2024
Rights	PARSEME Shared Task Raw Corpus Data (v. 1.2) Agreement; https://lindat.mff.cuni.cz/repository/static/licence-mwe-1.2-raw.html; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	German; Greek, Modern (1453-); Basque; French; Irish; Hebrew; Hindi; Italian; Polish; Portuguese; Romanian; Swedish; Turkish; Chinese
Resource Type	lexicalConceptualResource
Format	application/octet-stream; application/x-xz; text/plain; downloadable_files_count: 22
Discipline	Linguistics