The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain.
It supports research in:
- Information extraction
- Relation extraction
- Entity linking
The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology.
For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations.
Resource Creation
- French corpus
  - Collected from reports, regulations, and local media texts.
  - Manually annotated according to the STARWARS schema.
- Italian corpus
  - Produced via machine translation of the French texts.
  - Reviewed and corrected by bilingual translation students and expert hydrologists.
- Annotation process
  - Conducted with the INCEpTION annotation platform.
  - Ensured consistent alignment between French and Italian.
For details, please refer to the publication:
F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco.
Contents of this Package
- Texts: Provided in plain text.
- Annotations: Provided in CoNLL-2003 format, as exported from INCEpTION.
- Annotation guidelines: Included in both French and Italian, as used by annotators.
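As a starting point for working with the annotation files, here is a minimal Python sketch that reads CoNLL-style data and groups BIO-tagged tokens into entities. It rests on assumptions not stated in this description: columns are whitespace-separated, the token is in the first column and the NER tag in the last, and blank lines separate sentences. The function names `read_conll` and `extract_entities` and the labels `FAC` and `LOC` are made up for illustration and are not taken from the STARWARS schema; check the actual files and the annotation guidelines before relying on it.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, tag) pairs.

    Assumption: token in the first column, NER tag in the last column,
    blank lines (or -DOCSTART- lines) mark sentence boundaries.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from BIO-style tags."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_label != label):
            # A new entity starts; close the one that was open, if any.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag_label
        elif prefix == "I":
            tokens.append(token)
        else:
            # "O" (or an unrecognised tag) closes any open entity.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


if __name__ == "__main__":
    # Invented sample lines; the labels are placeholders, not schema labels.
    sample = [
        "La O",
        "station B-FAC",
        "d'épuration I-FAC",
        "de O",
        "Montpellier B-LOC",
        "",
    ]
    for sent in read_conll(sample):
        print(extract_entities(sent))
```

To read a real file, pass an open file object instead of the list, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to one of the annotation files in this package. Because the French and Italian documents are aligned at the sentence level, comparing the number of sentences yielded for each side of a document pair is a simple sanity check.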