DK-CLARIN Rapid Aligned Corpus 1993-2011 (da-en, da-de)


The aligned corpus consists of press releases from the European Commission Press Release Database (Rapid), harvested in 2009 and 2011 (http://europa.eu/rapid/search.htm).

The corpus comprises 5330 + 2200 press releases (files) for each of the languages Danish, English and German, with approximately 5,000,000 words per language and 260,000 to 270,000 aligned sentences for the language pairs Danish-English and Danish-German.

All documents were processed with Uplug (https://bitbucket.org/tiedemann/uplug/wiki/Home) and aligned with HunAlign. Files with more than 10 % negative alignments have been removed, as have all 0-alignments. The documents are provided in txt format for each language and in tmx format for the aligned language pairs (da-en and da-de).
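
For readers who want to work with the aligned pairs, the following is a minimal sketch of reading sentence pairs from a tmx file with the Python standard library. It assumes the files follow the usual TMX layout (tu elements containing tuv elements with an xml:lang attribute and a seg child); the function name read_tmx_pairs and the embedded SAMPLE document are hypothetical and only illustrate the idea, they are not taken from the corpus.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-sentence TMX document, used only for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="da" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="da"><seg>God morgen.</seg></tuv>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="da"><seg>Tak.</seg></tuv>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(source, src_lang="da", tgt_lang="en"):
    """Yield (source, target) sentence pairs from a TMX path or file object."""
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Accept both xml:lang (TMX 1.4) and the older plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag, e.g. "da-DK" -> "da".
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        # Skip translation units that lack either side of the pair.
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]


if __name__ == "__main__":
    for da, en in read_tmx_pairs(io.BytesIO(SAMPLE)):
        print(da, "|", en)
```

To read one of the downloaded files instead of the sample, pass its path as the first argument and set tgt_lang to "de" for the da-de pair.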

Identifier
PID http://hdl.handle.net/20.500.12115/30
Metadata Access http://repository.clarin.dk/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:repository.clarin.dk:20.500.12115/30
Provenance
Creator Haltrup Hansen, Dorte; Offersgaard, Lene
Publisher Centre for Language Technology, NorS, University of Copenhagen; European Commission
Publication Year 2012
Rights CLARIN-ACA-NC; ACA; https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/ClarinEulaAca?ID=1&AFFIL=EDU&BY=1&NC=1&NORED=1
OpenAccess true
Contact info(at)clarin.dk
Representation
Language Danish; English; German
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; text/plain; downloadable_files_count: 3
Discipline Linguistics