Dataset - B2FIND

Big Data language model - subword - BPE - ARPA

Big data language model based on subword units obtained with byte pair encoding (BPE), in ARPA format.

Cyfry

A small spoken digits corpus in Polish. It contains 488 recordings of 25 speakers reading 20 digits (0-9) each, amounting to around 76 minutes of recordings. Split into train (~72%),...

Big data language model with part of speech tags stemmed in ARPA format

Big Data language model in Word2Vec CBOW format.

Big Data language model with grammatical groups - RAW

Big Data Language model tagged with grammatical groups in RAW format.

Big Data language model - subword - SYLLABED - ARPA

Big data language model based on syllables in ARPA format.

Big data language model stemmed in ARPA format

Big data language model stemmed in ARPA format.

Big data language model with part of speech tags stemmed in RAW format

Big data language model stemmed with BPE in ARPA format

Korpus nagrań radiowych

A collection of 192 radio recordings, with around 200 speakers, each recording no longer than 40 minutes. Audio is saved as RAW 16-bit samples at a 16 kHz sampling frequency.
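
Because RAW audio has no header, the sample width and rate given above must be supplied by whoever reads the files. The following is a minimal sketch of decoding such data, not the corpus authors' tooling. It assumes signed, mono, little-endian samples, none of which the description states; the function names are hypothetical.

```python
import array
import sys

SAMPLE_RATE_HZ = 16_000   # from the description
BYTES_PER_SAMPLE = 2      # 16-bit samples, from the description


def decode_raw_pcm(data: bytes, little_endian: bool = True) -> array.array:
    """Decode headerless 16-bit PCM bytes into signed integer samples.

    Assumption: samples are signed and mono. Byte order is also an
    assumption (little-endian by default).
    """
    # Ignore a trailing odd byte, which cannot form a full sample.
    usable = len(data) - (len(data) % BYTES_PER_SAMPLE)
    samples = array.array("h")  # "h" = signed 16-bit integer
    samples.frombytes(data[:usable])
    # array uses the host byte order; swap if the data differs from it.
    if (sys.byteorder == "little") != little_endian:
        samples.byteswap()
    return samples


def duration_seconds(samples: array.array) -> float:
    """Length of a mono recording in seconds at the stated sample rate."""
    return len(samples) / SAMPLE_RATE_HZ
```

For a file on disk, the bytes could be obtained with `open(path, "rb").read()` and passed to `decode_raw_pcm`. If the decoded audio sounds like noise, the byte-order or signedness assumption is the first thing to check.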

Big Data language model with grammatical groups - ARPA

Big Data language model tagged with grammatical groups, in ARPA format.

Big Data language model tagged with POS - RAW.

Big data language model tagged with POS - RAW

Big Data language model in FastText CBOW format

Speech Recognition System for Polish: Parliamentary Speech

This resource contains dockerized models and scripts of an automatic speech recognition system for Polish trained on Polish Parliament speeches. The system is based on the Kaldi...

Speech Recognition System for Polish: Polish Film Chronicles

This resource contains dockerized models and scripts of an automatic speech recognition system for Polish trained on recordings of the Polish Film Chronicles. The system is based...

Big Data language model - STEMMED - RAW data

Big data language model stemmed in RAW format

Transcriptions of the Polish Film Chronicles (Polska Kronika Filmowa) - years...

This is the orthographic transcription of the audio of the Polish Film Chronicles (Polska Kronika Filmowa - PKF) between the years 1945-1962. The transcription is mostly...

Transkrypcja fonetyczna Kronik RP

This is a phonetic transcription of the "Kroniki RP" data set using the G2P tool available at mowa.clarin-pl.eu.

Big Data language model - subword - BPE - RAW

Big data language model based on subword units obtained with byte pair encoding (BPE), in RAW format.

Big Data language model in FastText Skip-gram format.
