PatTR: Patent Translation Resource


PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections, as well as 5 million German-French sentence pairs from patent titles, abstracts and claims. The corpus is organized by language pair and by the text sections of a patent document, namely title, abstract, claims and description.

Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO) and the World Intellectual Property Organization (WIPO) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract. Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus to English documents from the United States Patent and Trademark Office (USPTO) corpus, following Utiyama, Masao and Isahara, Hitoshi: A Japanese-English patent parallel corpus. MT Summit XI (2007), 475-482. Preprocessing was done automatically: sentence boundaries were detected using the Europarl processing tools, and all sections were sentence-aligned using the Gargantua aligner. For a detailed description of the corpus construction process, please see the publications above.
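
As a rough illustration of the patent-family step described above, the following sketch pairs German or French EPO documents with English USPTO documents that share a family identifier. It is only a toy under stated assumptions: the `PatentDoc` record, its field names and the `pair_by_family` function are hypothetical and do not reflect the actual MAREC format or the tooling used to build PatTR.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class PatentDoc(NamedTuple):
    """Hypothetical record; the field names are assumptions for illustration."""
    doc_id: str     # document identifier
    family_id: str  # patent-family identifier shared by related filings
    office: str     # "EPO" or "USPTO"
    lang: str       # "de", "fr" or "en"


def pair_by_family(docs: Iterable[PatentDoc], src_lang: str) -> List[Tuple[str, str]]:
    """Pair EPO documents in src_lang with English USPTO documents of the same family."""
    epo = defaultdict(list)
    uspto = defaultdict(list)
    for doc in docs:
        if doc.office == "EPO" and doc.lang == src_lang:
            epo[doc.family_id].append(doc.doc_id)
        elif doc.office == "USPTO" and doc.lang == "en":
            uspto[doc.family_id].append(doc.doc_id)

    pairs = []
    # Only families present on both sides yield document pairs.
    for family in sorted(epo.keys() & uspto.keys()):
        for src_id in epo[family]:
            for tgt_id in uspto[family]:
                pairs.append((src_id, tgt_id))
    return pairs
```

The resulting document pairs would then still need sentence splitting and sentence alignment before they could contribute parallel sentences to the description section.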

Metadata
Creator Wäschle, Katharina (Department of Computational Linguistics); Riezler, Stefan (Department of Computational Linguistics)
Publisher heiDATA
Contributor Prof. Dr. Stefan Riezler; Wäschle, Katharina
Publication Year 2014
Rights info:eu-repo/semantics/openAccess
OpenAccess true
Contact Prof. Dr. Stefan Riezler (Department of Computational Linguistics)
Resource Type Dataset
Format application/x-gzip; text/plain
Size 245640661; 1348412107; 1403883854; 702207465; 1113706581; 658863580; 676894791; 4505
Version 3.1
Discipline Other
Spatial Coverage Heidelberg, Germany