StarwarsNER French Italian Corpus - sample

The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain.

It supports research in:
- Information extraction
- Relation extraction
- Entity linking

The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology.

For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations.


Resource Creation

  1. French corpus
     - Collected from reports, regulations, and local media texts.
     - Manually annotated according to the STARWARS schema.

  2. Italian corpus
     - Produced via machine translation of the French texts.
     - Reviewed and corrected by bilingual translation students and expert hydrologists.

  3. Annotation process
     - Conducted with the INCEpTION annotation platform.
     - Ensured consistent alignment between French and Italian.

For details, please refer to the publication:
F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco.


Contents of this Package

  • Texts: Provided in plain text.
  • Annotations: Provided in CONLL 2003 format, as exported from INCEpTION.
  • Annotation guidelines: Included in both French and Italian, as used by annotators.
Identifier
PID http://hdl.handle.net/20.500.11752/ILC-1052
Related Identifier https://arxiv.org/abs/2506.01938
Related Identifier https://sites.google.com/view/horizoneurope2020-starwars/
Metadata Access http://dspace-clarin-it.ilc.cnr.it/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:dspace-clarin-it.ilc.cnr.it:20.500.11752/ILC-1052
Provenance
Creator Frontini, Francesca; Chahinian, Nanée; Aelami, Mitra; Cardillo, Franco Alberto; Conrad, Serge; Debole, Franca
Publisher Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR); Institute of Information Science and Technologies "Alessandro Faedo" - National Research Council of Italy (ISTI CNR); Institut de Recherche pour le Développement; Université de Montpellier
Publication Year 2025
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0; PUB
OpenAccess true
Contact dspace-clarin-it-ilc-help(at)ilc.cnr.it
Representation
Language Italian; French
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics