C4Corpus (CC BY-NC-ND part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family. It was extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.

Identifier
PID http://hdl.handle.net/11372/LRT-2205
Related Identifier http://www.lrec-conf.org/proceedings/lrec2016/pdf/388_Paper.pdf
Related Identifier https://dkpro.github.io/dkpro-c4corpus/
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11372/LRT-2205
Provenance
Creator Gurevych, Iryna; Habernal, Ivan; Zayed, Omnia
Publisher Technische Universität Darmstadt
Publication Year 2016
Rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); http://creativecommons.org/licenses/by-nc-nd/4.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language Afrikaans; Arabic; Bengali; Bangla; Bulgarian; Czech; Danish; German; Greek, Modern (1453-); Greek; English; Estonian; Persian; Farsi; Finnish; French; Gujarati; Hebrew; Hindi; Croatian; Hungarian; Indonesian; Italian; Japanese; Kannada; Korean; Latvian; Lithuanian; Malayalam; Marathi; Marāṭhī; Macedonian; Nepali; Dutch; Flemish; Norwegian; Polish; Portuguese; Romanian; Moldavian; Moldovan; Russian; Slovak; Slovenian; Slovene; Somali; Spanish; Castilian; Albanian; Swahili; Swedish; Tamil; Telugu; Tagalog; Thai; Turkish; Ukrainian; Undetermined; Urdu; Vietnamese; Chinese
Resource Type corpus
Format text/plain; application/x-gzip; downloadable_files_count: 56
Discipline Linguistics
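
The Metadata Access entry above is an OAI-PMH GetRecord request that returns this record as Dublin Core (oai_dc). The following is a minimal sketch of how such a request could be assembled and its response read, using only the Python standard library. The function names and SAMPLE_XML are illustrative assumptions: the sample is a hand-written fragment built from the fields listed above, not the server's actual response, and the network call in fetch_record is shown but not exercised here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Values taken from the Metadata Access entry of this record.
OAI_BASE = "http://lindat.mff.cuni.cz/repository/oai/request"
RECORD_ID = "oai:lindat.mff.cuni.cz:11372/LRT-2205"

# Namespace of the Dublin Core element set used by the oai_dc format.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hand-written illustration (an assumption), not the real server response.
SAMPLE_XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>C4Corpus (CC BY-NC-ND part)</dc:title>
      <dc:creator>Habernal, Ivan</dc:creator>
      <dc:creator>Zayed, Omnia</dc:creator>
      <dc:creator>Gurevych, Iryna</dc:creator>
      <dc:publisher>Technische Universität Darmstadt</dc:publisher>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""


def build_get_record_url(base, identifier, metadata_prefix="oai_dc"):
    """Assemble an OAI-PMH GetRecord URL (hypothetical helper)."""
    params = [
        ("verb", "GetRecord"),
        ("metadataPrefix", metadata_prefix),
        ("identifier", identifier),
    ]
    # Keep ':' and '/' unescaped so the URL matches the form shown in the record.
    return base + "?" + urlencode(params, safe=":/")


def parse_oai_dc(xml_data):
    """Collect Dublin Core elements into {local_name: [values, ...]}."""
    root = ET.fromstring(xml_data)
    prefix = "{" + DC_NS + "}"
    fields = {}
    for elem in root.iter():
        if elem.tag.startswith(prefix):
            name = elem.tag[len(prefix):]
            fields.setdefault(name, []).append((elem.text or "").strip())
    return fields


def fetch_record(url, timeout=30):
    """Download and parse a record; requires network access."""
    with urlopen(url, timeout=timeout) as response:
        return parse_oai_dc(response.read())


if __name__ == "__main__":
    print(build_get_record_url(OAI_BASE, RECORD_ID))
    print(parse_oai_dc(SAMPLE_XML))
```

Calling fetch_record(build_get_record_url(OAI_BASE, RECORD_ID)) would query the repository directly; the fields actually returned depend on the server and may differ from the sample.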