Urdu Monolingual Corpus

Dataset

We release a sizeable monolingual Urdu corpus automatically annotated with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release both the plain and the tagged corpora.
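
The voting step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not specify the exact voting scheme, so the code assumes plain majority voting over three taggers, with a fallback to one designated tagger when no tag wins outright. The function names, the fallback rule, and the tag labels in the usage note are all hypothetical.

```python
from collections import Counter


def majority_vote(proposals, fallback=0):
    """Pick one tag for a single token from the tags proposed by each tagger.

    Assumption: simple majority voting. When the top tags are tied
    (for example, all three taggers disagree), the tag proposed by the
    tagger at index `fallback` is returned instead.
    """
    ranked = Counter(proposals).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return proposals[fallback]
    return ranked[0][0]


def tag_sentence(tagger_outputs, fallback=0):
    """Combine per-tagger tag sequences for one sentence, token by token.

    `tagger_outputs` holds one tag sequence per tagger; all sequences
    are assumed to have the same length (same tokenization).
    """
    return [majority_vote(list(column), fallback) for column in zip(*tagger_outputs)]
```

For example, given three hypothetical tag sequences `["NN", "VB", "PSP"]`, `["NN", "NN", "PSP"]`, and `["PN", "VB", "CC"]` for the same three tokens, `tag_sentence` keeps the tag that at least two taggers agree on at each position.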

Identifier
PID	http://hdl.handle.net/11858/00-097C-0000-0023-65A9-5
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-65A9-5

Provenance
Creator	Jawaid, Bushra; Kamran, Amir; Bojar, Ondřej
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year	2014
Funding Reference	info:eu-repo/grantAgreement/EC/FP7/288487
Rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0); http://creativecommons.org/licenses/by-nc-sa/3.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Urdu
Resource Type	lexicalConceptualResource
Format	application/x-gzip; application/pdf; application/octet-stream; downloadable_files_count: 4
Discipline	Linguistics