Engineering job ads corpus

PID

The corpus presented consists of job ads in Spanish related to Engineering positions in Peru. The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks.

The corpus is divided in two components: - POS tagging/ NER training data: Consisting of 800 job ads, each one tokenized and manually annotated with POS tag information (EAGLE format) and Entity Label in BIO format. - Topic modeling training data: containing 9000 documents stripped from stopwords. Comes in two formats: * Whole text documents: containing all the information originally posted in the ad. * Extracted chunks documents: containing chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in Improving Topic Coherence Using Entity Extraction Denoising (to appear)

Identifier
PID http://hdl.handle.net/11234/1-2673
Related Identifier https://github.com/ronaldahmed/labor-market-demand-analysis
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-2673
Provenance
Creator Cardenas Acosta, Ronald; Bello Medina, Kevin; Coronado, Alberto; Villota, Elizabeth
Publisher National University of Engineering, Peru
Publication Year 2016
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); http://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language Spanish; Castilian
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics