Corpus of combined Slovenian corpora MetaFida 0.1


Slovenia has a large number of diverse corpora available for online analysis via the CLARIN.SI concordancers. However, users who want to run the same queries across different corpora have to search each corpus separately and then combine the results manually, which is time-consuming and prone to analysis errors. An additional problem is that corpora typically have different metadata and may be annotated at different linguistic levels, which further complicates identical searches across corpora.

For these reasons we combined a number of existing corpora of Slovenian available through the CLARIN.SI concordancers into the MetaFida corpus. To do so, it was first necessary to unify the metadata, harmonize the linguistic and structural annotations between the corpora, and write conversions of the individual corpora from their vertical formats, which are used as input by the CLARIN.SI concordancers, into the MetaFida vertical format. As the source corpora are not completely distinct, MetaFida is deduplicated at the level of paragraphs.
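Paragraph-level deduplication of this kind can be illustrated with a minimal sketch. This record does not state how duplicates were detected, so the criterion used here (exact match after whitespace and case normalization, first occurrence wins) is an assumption, and the function names `paragraph_key` and `deduplicate` are hypothetical.

```python
import hashlib


def paragraph_key(text):
    """Return a hash of the paragraph after collapsing whitespace and lowercasing.

    Assumption: two paragraphs count as duplicates if they match after this
    normalization. The actual MetaFida criterion may differ.
    """
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(corpora):
    """Yield (corpus_id, paragraph) pairs, keeping the first copy of each paragraph.

    `corpora` is an iterable of (corpus_id, list_of_paragraphs), processed in
    the order given, so earlier corpora take priority over later ones.
    """
    seen = set()
    for corpus_id, paragraphs in corpora:
        for paragraph in paragraphs:
            key = paragraph_key(paragraph)
            if key in seen:
                continue  # already kept from this or an earlier corpus
            seen.add(key)
            yield corpus_id, paragraph


# Invented sample data: the second corpus repeats one paragraph with extra spacing.
sample = [
    ("gfida20_dedup", ["Dober dan.", "To je primer."]),
    ("slwac", ["To  je primer.", "Nov odstavek."]),
]
kept = list(deduplicate(sample))
```

In this sketch the order of the input decides which corpus a shared paragraph is attributed to, so a real pipeline would need to fix that order deliberately.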

In the MetaFida corpus, we kept only the information that is common to most of the selected corpora. The structure is nested very shallowly, which makes it easier to create subcorpora or to limit a search to individual text types. All MetaFida positional attributes are considered to have multiple values, separated by a space. Multiple values are needed because some corpora have normalized words (older Slovenian, user-generated content), where one original word can be mapped to several normalized ones or vice versa.
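To make the multi-valued attributes concrete, the sketch below reads token lines in a vertical format, assuming one token per line with tab-separated positional attributes and treating any line that starts with `<` as a structural tag. The attribute names (`word`, `norm`, `lemma`), their order, and the sample lines are illustrative assumptions, not the actual MetaFida attribute set.

```python
def parse_token_line(line, attributes=("word", "norm", "lemma")):
    """Split one vertical-format token line into multi-valued attributes.

    Columns are separated by tabs; the values inside one column are
    separated by single spaces, as described for MetaFida.
    """
    columns = line.rstrip("\n").split("\t")
    return {name: value.split(" ") for name, value in zip(attributes, columns)}


def parse_vertical(lines, attributes=("word", "norm", "lemma")):
    """Return the token lines as dicts, skipping empty lines and structural tags."""
    tokens = []
    for line in lines:
        if not line.strip() or line.startswith("<"):
            continue  # assumption: lines starting with "<" are structure, not tokens
        tokens.append(parse_token_line(line, attributes))
    return tokens


# Invented example: one original word mapped to two normalized words.
sample_lines = [
    '<text id="example">',
    "<p>",
    "nebom\tne bom\tne biti",
    "prišel\tprišel\tpriti",
    "</p>",
    "</text>",
]
tokens = parse_vertical(sample_lines)
```

Here `tokens[0]["norm"]` holds two values while `tokens[0]["word"]` holds one, which is the one-to-many case the description mentions.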

There are 34 corpora included in this version of MetaFida:

* classlawiki_sl, CLASSLAWiki-sl (Slovenian Wikipedia), 54,608,642 tokens
* dgt15_sl, EU DGT 2015: Slovene, 62,303,744 tokens
* dsi, DSI (informatics), 5,245,073 tokens
* eltec_slv, ELTeC-slv (100 novels), 6,901,534 tokens
* filmi, FILMI (film reviews), 936,446 tokens
* gfida20_dedup, Gigafida v2.0 (reference, deduplicated), 1,333,360,653 tokens
* gos_vl42, GosVL 4.2 (spoken, VideoLectures), 179,063 tokens
* gos11, Gos 1.1.1 (reference, speech), 1,063,861 tokens
* imp, IMP (older texts), 17,723,874 tokens
* ispac_sl, ISPAC: Slovenian, 1,432,798 tokens
* janes_blog, Janes Blog (blogs with comments), 34,534,431 tokens
* janes_forum, Janes Forum (web forums), 47,066,575 tokens
* janes_news, Janes News (news comments), 14,838,074 tokens
* janes_tweet, Janes Tweet (tweets 2013-2017), 151,457,091 tokens
* janes_wiki, Janes Wiki (Wikipedia comments), 5,008,067 tokens
* jaslo_sl, jaSlo: Slovenian, 532,395 tokens
* kas_dipl, KAS Dipl (diplomas), 1,101,796,659 tokens
* kas_dr, KAS Dr (PhD theses), 101,473,395 tokens
* kas_mag, KAS Mag (master theses), 495,827,656 tokens
* konji, Konji (equestrianism), 469,894 tokens
* korp, KoRP (public relations), 2,194,130 tokens
* lemonde_sl, LeMonde: Slovenian, 615,617 tokens
* maj68, Maj68 (May 1968 in literature), 794,382 tokens
* maks, MAKS (youth literature), 12,072,273 tokens
* prilit, PriLit (older narrative prose), 1,275,209 tokens
* rsdo5, RSDO5 (term-annotated texts), 310,588 tokens
* sbsj, SBSJ (school texts), 1,836,810 tokens
* siparl20, siParl 2.0 (parliament 1990-2018), 239,749,733 tokens
* slwac, slWaC (Slovene Web), 895,903,321 tokens
* solar, Šolar v2 Clear (school essays), 1,907,731 tokens
* suss, ŠUSS (FAQ on Slovenian language), 365,371 tokens
* trans5_sl, TRANS5: Slovenian, 1,594,120 tokens
* tweet_sl, Tweet-sl (older tweets), 6,291,820 tokens
* vayna, VAYNA (attacks on the YNA), 300,666 tokens

Σ 34 corpora, 4,601,971,696 tokens

Related Identifier
Metadata Access
Creator Erjavec, Tomaž
Publisher Jožef Stefan Institute
Publication Year 2021
OpenAccess true
Contact info(at)
Language Slovenian; Slovene
Resource Type corpus
Format downloadable_files_count: 0
Discipline Linguistics