4 datasets found

Keywords: under resourced languages

  • Amharic WIC Corpus

    A substantially cleaned version of the existing morphologically annotated WIC Corpus.
  • Tigrinya Web Corpus

    Tigrinya web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
  • AlbNER Named Entity Recognition in Albanian

    AlbNER is a Named Entity Recognition corpus of Wikipedia sentences in Albanian, consisting of 900 records. The sentence tokens are manually labeled in compliance with the CoNLL-2003...
  • Somali Web Corpus

    Somali web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
You can also access this registry using the API (see API Docs).