Automated Speech Recognition service for NeMo Conformer CTC BPE E2E models. For more details about building such models, see the official NVIDIA NeMo documentation (https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html) and NVIDIA NeMo GitHub (https://github.com/NVIDIA/NeMo). A model for automated speech recognition of Slovene speech can be downloaded from http://hdl.handle.net/11356/1740.
The service accepts as input audio files in WAV 16kHz, 16bit PCM, mono format. The maximal accepted audio duration is 300s. Note that transcription of one 300s audio file on cpu will take advantage of all available cores, consume up to 16GB RAM and may take ~180s (on a system with 24 vCPU). See the service README.md for further details.