goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899.
Each text contains extensive metadata and per-page links to facsimiles, while the word tokens in the texts are annotated with their modernised word-form, lemma, part-of-speech and, for archaic words, their nearest modern synonyms or a short explanation.
The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers. Note that the vertical format does not contain all the information from the source TEI.
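As a rough illustration of how the vertical format might be read, the following Python sketch assumes the usual concordancer convention: one token per line with tab-separated annotation columns, and structural tags on lines of their own. The column order in COLUMNS, the function name read_vertical and the sample lines are assumptions made for this example and are not taken from the corpus itself; check the distributed files for the actual layout.

```python
from typing import Iterable, Iterator, NamedTuple

# Hypothetical column order, chosen only for illustration.
# The real vertical file may order or name its columns differently.
COLUMNS = ("word", "modernised", "lemma", "pos", "gloss")


class Token(NamedTuple):
    word: str
    modernised: str
    lemma: str
    pos: str
    gloss: str


def read_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per token line, skipping blank lines and structural tags."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        # Heuristic: a line that is entirely one <...> tag is structure, not a token.
        if line.startswith("<") and line.endswith(">"):
            continue
        fields = line.split("\t")
        # Pad short rows so that missing trailing columns become empty strings.
        fields += [""] * (len(COLUMNS) - len(fields))
        yield Token(*fields[: len(COLUMNS)])


# Invented placeholder lines, not real corpus data.
sample = [
    '<text id="example">\n',
    "oldword1\tmodern1\tlemma1\tNoun\tgloss1\n",
    "oldword2\tmodern2\tlemma2\tVerb\n",
    "</text>\n",
]
tokens = list(read_vertical(sample))
```

With a real file, the same function can be applied to an open file handle, for example read_vertical(open(path, encoding="utf-8")), where path is whatever location the vertical file was saved to.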