Engineering job ads corpus

Dataset

The corpus consists of Spanish-language job ads for engineering positions in Peru. The documents were preprocessed and annotated for part-of-speech (POS) tagging, named entity recognition (NER), and topic modeling tasks.

The corpus is divided into two components:
- POS tagging/NER training data: 800 job ads, each tokenized and manually annotated with POS tags (EAGLE format) and entity labels in BIO format.
- Topic modeling training data: 9000 documents with stopwords removed, provided in two formats:
  * Whole-text documents: contain all the information originally posted in the ad.
  * Extracted-chunk documents: contain the chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in "Improving Topic Coherence Using Entity Extraction Denoising" (to appear).
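
To illustrate the BIO entity labels mentioned above, the following is a minimal sketch of how token-level BIO tags could be grouped back into entity chunks. It is not taken from the corpus tooling: the function name `bio_to_spans`, the label `SKILL`, and the example sentence are hypothetical, and the on-disk layout of the annotated files is not described in this record, so the tokens and tags are assumed to be already loaded as two parallel lists.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity chunks.

    "B-X" opens a chunk of type X, "I-X" continues an open chunk of the
    same type, and anything else (including "O") closes the open chunk.
    """
    spans: List[Tuple[str, str]] = []
    label = None           # label of the chunk currently being built
    chunk: List[str] = []  # tokens of the chunk currently being built

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new chunk starts; flush the previous one if it is still open.
            if label is not None:
                spans.append((label, " ".join(chunk)))
            label, chunk = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the open chunk.
            chunk.append(token)
        else:
            # "O", or an I- tag with no matching open chunk (dropped here):
            # close whatever is open.
            if label is not None:
                spans.append((label, " ".join(chunk)))
            label, chunk = None, []

    # Flush a chunk that runs to the end of the sentence.
    if label is not None:
        spans.append((label, " ".join(chunk)))
    return spans


# Hypothetical example; SKILL is an assumed label name, not the corpus's own.
tokens = ["Experiencia", "en", "AutoCAD", "y", "diseño", "mecánico"]
tags = ["O", "O", "B-SKILL", "O", "B-SKILL", "I-SKILL"]
spans = bio_to_spans(tokens, tags)
```

With these inputs, `spans` holds two chunks, `AutoCAD` and `diseño mecánico`, both labeled `SKILL`. Chunks of this kind correspond to what the extracted-chunk topic modeling documents are described as containing, although the actual extraction there was done by the authors' custom NER models.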

Identifier
PID	http://hdl.handle.net/11234/1-2673
Related Identifier	https://github.com/ronaldahmed/labor-market-demand-analysis
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-2673

Provenance
Creator	Cardenas Acosta, Ronald; Bello Medina, Kevin; Coronado, Alberto; Villota, Elizabeth
Publisher	National University of Engineering, Peru
Publication Year	2016
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); http://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Spanish; Castilian
Resource Type	corpus
Format	application/zip; application/octet-stream; downloadable_files_count: 1
Discipline	Linguistics