WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia


WikiCLIR is a large-scale German-English dataset for Cross-Language Information Retrieval (CLIR). It contains 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments over 1,226,741 English Wikipedia articles, which serve as the documents. Queries are well-formed natural-language sentences, which allows large-scale training of (translation-based) ranking models. The corpus is split randomly at the query level into training, development, and test subsets. Relevance judgments are constructed from the inter-language links between German and English Wikipedia articles. A relevance level of 3 is assigned to the English cross-lingual mate, and a level of 2 to every other English article that both links to the mate and is linked from the mate. The intuition behind level 2 is that articles in a bidirectional link relation with the mate are likely either to define similar concepts or to be instances of the concept defined by the mate. For a more detailed description of the corpus construction process, see the above publication.
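
The relevance-level rule described above can be illustrated with a minimal sketch. This is not the authors' extraction code: the function name `relevance_judgments`, the `out_links` dictionary, and the toy article titles are all hypothetical, and the sketch assumes the English link graph is already available as a mapping from each article to the set of articles it links to.

```python
from typing import Dict, Set


def relevance_judgments(mate: str, out_links: Dict[str, Set[str]]) -> Dict[str, int]:
    """Return relevance levels for one query, given its English mate article.

    Level 3 goes to the mate itself. Level 2 goes to every other article
    that is linked from the mate and also links back to the mate.
    """
    judgments = {mate: 3}
    for target in out_links.get(mate, set()):
        # Keep only bidirectional links: mate -> target and target -> mate.
        if target != mate and mate in out_links.get(target, set()):
            judgments[target] = 2
    return judgments


# Hypothetical toy link graph (article -> articles it links to).
links = {
    "Apple": {"Fruit", "Tree", "Cider"},
    "Fruit": {"Apple"},
    "Tree": {"Forest"},           # linked from Apple, but no link back
    "Cider": {"Apple", "Fruit"},
    "Orchard": {"Apple"},         # links to Apple, but Apple does not link to it
}

judgments = relevance_judgments("Apple", links)
print(judgments)
```

In this toy graph only "Fruit" and "Cider" are in a bidirectional link relation with the mate "Apple", so they receive level 2, while "Tree" and "Orchard" receive no judgment.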

Identifier
DOI https://doi.org/10.11588/data/10003
Metadata Access https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/10003
Provenance
Creator Hieber, Felix (Department of Computational Linguistics); Schamoni, Shigehiko (Department of Computational Linguistics); Sokolov, Artem (Department of Computational Linguistics); Riezler, Stefan (Department of Computational Linguistics)
Publisher heiDATA
Contributor Prof. Dr. Stefan Riezler; Hieber, Felix; HeiDATA: Heidelberg Research Data Repository
Publication Year 2014
Rights WikiCLIR is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/); info:eu-repo/semantics/openAccess
OpenAccess true
Contact Prof. Dr. Stefan Riezler (Department of Computational Linguistics)
Representation
Resource Type Dataset
Format text/plain; charset=US-ASCII; application/x-gzip
Size 1858; 887887912
Version 1.1
Discipline Other
Spatial Coverage Heidelberg, Germany