Corpus from the Aozora Bunko Library

Dataset


This corpus contains a subset of the texts available from the Aozora Bunko public library project, which hosts works of mostly older literature in Japanese. It was compiled from the project's official GitHub repository with a custom Python script written to meet specific requirements. The script excluded any text that is not currently freely available in the public domain and organized the output into text files of approximately equal size. The files also carry XML markup: tags delimit individual documents (books) and provide basic bibliographic information about each one's author, year, and title.
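
The XML structure described above could be read along the following lines. This is a minimal sketch, not the corpus's documented schema: the tag name `doc`, the attribute names `author`, `year`, and `title`, and the sample values are all assumptions made for illustration, and the helper `iter_documents` is hypothetical. A regular expression is used instead of a strict XML parser because a file holding several top-level document elements would not be a single well-formed XML tree.

```python
import re
from typing import Dict, Iterator, Tuple

# Assumed layout: <doc author="..." year="..." title="...">text</doc>.
# The tag and attribute names are hypothetical, not taken from the corpus.
DOC_RE = re.compile(r"<doc\b([^>]*)>(.*?)</doc>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def iter_documents(text: str) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (metadata, body) for each <doc ...>...</doc> block in text."""
    for match in DOC_RE.finditer(text):
        metadata = dict(ATTR_RE.findall(match.group(1)))
        yield metadata, match.group(2).strip()


# Placeholder sample, invented for this sketch.
SAMPLE = (
    '<doc author="Author A" year="1900" title="Title A">\n'
    "本文一\n"
    "</doc>\n"
    '<doc author="Author B" year="1910" title="Title B">\n'
    "本文二\n"
    "</doc>\n"
)

documents = list(iter_documents(SAMPLE))
for metadata, body in documents:
    print(metadata["author"], metadata["year"], metadata["title"], len(body))
```

For a downloaded file, the text would first have to be decompressed, for example with `gzip.open(path, "rt", encoding="utf-8")`, assuming the archive is a plain gzip stream of UTF-8 text.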

Identifier
PID	http://hdl.handle.net/11234/1-5682
Related Identifier	https://nlp.fi.muni.cz/projekty/aozora
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-5682
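
The Metadata Access address above is an OAI-PMH `GetRecord` request. As a small sketch of how such a URL is composed, the snippet below rebuilds it from its three query parameters; the function name `build_get_record_url` is hypothetical, and no network request is made. Note that `urlencode` percent-encodes the `:` and `/` in the identifier, which yields an equivalent URL rather than a character-for-character copy.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base endpoint taken from the Metadata Access field of this record.
BASE_URL = "http://lindat.mff.cuni.cz/repository/oai/request"


def build_get_record_url(identifier: str, metadata_prefix: str = "oai_dc") -> str:
    """Compose an OAI-PMH GetRecord URL for one record identifier."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        }
    )
    return f"{BASE_URL}?{query}"


url = build_get_record_url("oai:lindat.mff.cuni.cz:11234/1-5682")
# Decode the query string again to confirm the parameters survive encoding.
params = parse_qs(urlsplit(url).query)
print(url)
```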

Provenance
Creator	Rohacek, Jakub
Publisher	Masaryk University, NLP Centre
Publication Year	2024
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); http://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Japanese
Resource Type	corpus
Format	application/x-gzip; application/octet-stream; downloadable_files_count: 1
Discipline	Linguistics