This corpus collects and annotates an extensive and valuable diachronic collection of Slovenian proverbs, compiled over more than 50 years at the ZRC SAZU Institute of Slovenian Ethnology. It comprises 2,515 structured bibliographical items (1578-2010) that served as the sources of the proverbs (printed books, journals, calendars, collecting campaigns in various journals, folklore fieldwork, personal notes, etc.), together with the collection of paremiological units. Each unit is represented in two ways: as a diplomatic transcription from the source collection (owing to technical difficulties faced by the transcribers and to human error, the transcription of older texts is inconsistent) and as a critical transcription, which normalizes the alphabet.
The words of the critical transcriptions have also been automatically modernised to contemporary spelling, and the modernised words have been further annotated with lemmas, MULTEXT-East morphosyntactic descriptions (MSDs) and Universal Dependencies using the CLASSLA toolchain.
The canonical encoding of the corpus is TEI, but the corpus is also distributed in two derived encodings. The first provides the bibliography and the sayings as two TSV files; the second is a vertical file, as used by CQP-type concordancers such as Sketch Engine.
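
For readers unfamiliar with the vertical format, the following is a minimal Python sketch of how such a file could be read. It relies only on the general convention of CQP-type vertical files: one token per line with tab-separated attributes, and structures marked by XML-like tags on their own lines. The column order (`word`, `lemma`, `msd`), the structure name `s`, the function name `parse_vertical` and the sample lines are all assumptions made for illustration; they are not taken from the corpus documentation, so the actual column layout should be checked against the distributed files.

```python
def parse_vertical(lines, columns=("word", "lemma", "msd"), unit="s"):
    """Group the token lines of a vertical file into units.

    `columns` and `unit` are hypothetical defaults, not the documented
    layout of this corpus. Returns a list of units, each a list of
    dicts mapping column name to value.
    """
    units = []
    current = None  # tokens of the unit currently open, or None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # Heuristic: a structural tag is a tab-free line wrapped in <...>.
        is_tag = line.startswith("<") and line.endswith(">") and "\t" not in line
        if is_tag:
            parts = line.strip("<>/").split()
            name = parts[0] if parts else ""
            if name != unit:
                continue  # ignore structures other than the requested one
            if line.startswith("</"):
                if current is not None:
                    units.append(current)
                current = None
            else:
                current = []
            continue
        if current is not None:
            # Pair each tab-separated value with its assumed column name.
            current.append(dict(zip(columns, line.split("\t"))))
    return units


# Invented sample lines for illustration only; "TAG" is a placeholder,
# not a real MSD from the corpus.
sample = [
    '<s id="demo.1">',
    "Rana\tran\tTAG",
    "ura\tura\tTAG",
    "</s>",
]
result = parse_vertical(sample)
```

With a real file, the same function could be called on an open file handle, for example `parse_vertical(open(path, encoding="utf-8"))`, after adjusting `columns` and `unit` to match the actual distribution.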