Engineering job ads corpus


This corpus consists of Spanish-language job ads for engineering positions in Peru. The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks.

The corpus is divided into two components:
- POS tagging / NER training data: 800 job ads, each tokenized and manually annotated with POS tags (EAGLES format) and entity labels in BIO format.
- Topic modeling training data: 9000 documents with stopwords removed. It comes in two formats:
  * Whole-text documents: contain all the information originally posted in the ad.
  * Extracted-chunk documents: contain the chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in Improving Topic Coherence Using Entity Extraction Denoising (to appear).
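As an illustration of how BIO entity labels relate to the extracted chunks, the following minimal sketch groups BIO-tagged tokens into entity spans. The record does not specify the corpus file layout or the label inventory, so the function name `bio_to_spans`, the labels `MAJOR` and `SKILL`, and the sample sentence are all hypothetical.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def close_entity() -> None:
        # Store the open entity, if any, and reset the state.
        nonlocal current_label, current_tokens
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        current_label = None
        current_tokens = []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            close_entity()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open entity of the same label.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not match the open entity,
            # closes the entity; the stray token itself is ignored.
            close_entity()
    close_entity()
    return spans


# Hypothetical example; labels and sentence are not taken from the corpus.
tokens = ["Se", "requiere", "ingeniero", "civil", "con", "manejo", "de", "AutoCAD"]
tags = ["O", "O", "B-MAJOR", "I-MAJOR", "O", "O", "O", "B-SKILL"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Joining the tokens of each span with spaces is a simplification; the extracted-chunk documents in the corpus may have been built differently.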

Creator Cardenas Acosta, Ronald; Bello Medina, Kevin; Coronado, Alberto; Villota, Elizabeth
Publisher National University of Engineering, Peru
Publication Year 2016
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); PUB
OpenAccess true
Contact lindat-help(at)
Language Spanish; Castilian
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics