W2C – Web to Corpus – Corpora


A set of corpora for 120 languages, automatically collected from Wikipedia and the web.

Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1

Identifier
PID http://hdl.handle.net/11858/00-097C-0000-0022-6133-9
Related Identifier http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-6133-9
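The Metadata Access link above is an OAI-PMH GetRecord request for the Dublin Core (oai_dc) form of this record. As a minimal sketch, the request URL can be rebuilt from its parts and the returned XML reduced to a field dictionary. The function names build_getrecord_url and dublin_core_fields are hypothetical helpers introduced here for illustration, and the exact set of fields the server returns is not stated in this record, so the parser simply collects whatever Dublin Core elements are present.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint and identifier copied from the Metadata Access line of this record.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"
RECORD_ID = "oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-6133-9"

# Standard Dublin Core element namespace used by the oai_dc metadata format.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_getrecord_url(endpoint, identifier, metadata_prefix="oai_dc"):
    """Assemble an OAI-PMH GetRecord URL (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown in the record.
    return endpoint + "?" + urlencode(params, safe=":/")


def dublin_core_fields(xml_text):
    """Collect every non-empty Dublin Core element into {name: [values]}."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"
    fields = {}
    for element in root.iter():
        if element.tag.startswith(prefix) and element.text and element.text.strip():
            name = element.tag[len(prefix):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields
```

To use it against the live repository, the response body could be read with urllib.request.urlopen(build_getrecord_url(OAI_ENDPOINT, RECORD_ID)).read() and passed to dublin_core_fields; repeated elements such as multiple languages end up as several values under one key.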
Provenance
Creator Majliš, Martin
Publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year 2011
Rights Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0); http://creativecommons.org/licenses/by-sa/3.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language Afrikaans; Amharic; Arabic; Aragonese; Asturian; Bable; Leonese; Asturleonese; Azerbaijani; Belarusian; Bengali; Bangla; Bosnian; Breton; Buginese; Bulgarian; Catalan; Valencian; Cebuano; Czech; Chuvash; Corsican; Welsh; Danish; German; Greek, Modern (1453-); Greek; English; Esperanto; Estonian; Basque; Faroese; Persian; Farsi; Finnish; French; Western Frisian; Gaelic; Scottish Gaelic; Irish; Galician; Gujarati; Haitian; Haitian Creole; Hebrew; Hindi; Croatian; Upper Sorbian; Hungarian; Armenian; Ido; Interlingua (International Auxiliary Language Association); Indonesian; Icelandic; Italian; Javanese; Japanese; Kannada; Georgian; Kazakh; Korean; Kurdish; Latin; Latvian; Limburgan; Limburger; Limburgish; Lithuanian; Luxembourgish; Letzeburgesch; Malayalam; Marathi; Marāṭhī; Macedonian; Malagasy; Mongolian; Maori; Māori; Malay; Burmese; Neapolitan; Low German; Low Saxon; German, Low; Saxon, Low; Nepali; Nepal Bhasa; Newari; Dutch; Flemish; Norwegian Nynorsk; Nynorsk, Norwegian; Norwegian; Occitan (post 1500); Provençal; Ossetian; Ossetic; Pampanga; Kapampangan; Polish; Portuguese; Quechua; Romanian; Moldavian; Moldovan; Russian; Yakut; Sicilian; Scots; Slovak; Slovenian; Slovene; Spanish; Castilian; Albanian; Serbian; Sundanese; Swahili; Swedish; Tamil; Tatar; Telugu; Tajik; Tagalog; Thai; Turkish; Ukrainian; Urdu; Uzbek; Vietnamese; Volapük; Waray; Walloon; Yiddish; Yoruba; Chinese
Resource Type corpus
Format application/x-gzip; text/plain; charset=utf-8; downloadable_files_count: 122
Discipline Linguistics