62 datasets found

Keywords: web corpus

  • Albanian web corpus MaCoCu-sq 1.0

    The Albanian web corpus MaCoCu-sq 1.0 was built by crawling the ".al" internet top-level domain in 2022, extending the crawl dynamically to other domains as well. The crawler is...
  • Albanian-English parallel corpus MaCoCu-sq-en 1.0

    The Albanian-English parallel corpus MaCoCu-sq-en 1.0 was built by crawling the “.al” internet top-level domain in 2022, extending the crawl dynamically to other domains as...
  • Text collection for training the BERTić transformer model BERTić-data

    The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to...
  • Bosnian-English parallel corpus MaCoCu-bs-en 1.0

    The Bosnian-English parallel corpus MaCoCu-bs-en 1.0 was built by crawling the “.ba” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains...
  • Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0

    The Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0 was built by crawling the “.me” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other...
  • Slovene-English parallel corpus MaCoCu-sl-en 2.0

    The Slovene-English parallel corpus MaCoCu-sl-en 2.0 was built by crawling the “.si” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains...
  • Bosnian web corpus MaCoCu-bs 1.0

    The Bosnian web corpus MaCoCu-bs 1.0 was built by crawling the ".ba" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The...
  • Bulgarian-English parallel corpus MaCoCu-bg-en 2.0

    The Bulgarian-English parallel corpus MaCoCu-bg-en 2.0 was built by crawling the “.bg” and “.бг” internet top-level domains in 2021, extending the crawl dynamically to other...
  • Slovene web corpus MaCoCu-sl 1.0

    The Slovene web corpus MaCoCu-sl 1.0 was built by crawling the ".si" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is...
  • Macedonian-English parallel corpus MaCoCu-mk-en 2.0

    The Macedonian-English parallel corpus MaCoCu-mk-en 2.0 was built by crawling the “.mk” and “.мкд” internet top-level domains in 2021, extending the crawl dynamically to other...
  • Turkish-English parallel corpus MaCoCu-tr-en 2.0

    The Turkish-English parallel corpus MaCoCu-tr-en 2.0 was built by crawling the “.tr” and “.cy” internet top-level domains in 2021, extending the crawl dynamically to other...
  • Turkish web corpus MaCoCu-tr 2.0

    The Turkish web corpus MaCoCu-tr 2.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The...
  • Maltese-English parallel corpus MaCoCu-mt-en 2.0

    The Maltese-English parallel corpus MaCoCu-mt-en 2.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....
  • Slovene-English parallel corpus slenWaC 1.0

    The slenWaC corpus version 1.0 consists of parallel Slovene-English texts crawled from the .si top-level domain for Slovenia. The corpus was built with Spidextor...
  • Slovene web corpus MaCoCu-sl 2.0

    The Slovene web corpus MaCoCu-sl 2.0 was built by crawling the ".si" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The...
  • Serbian-English parallel corpus srenWaC 1.0

    The srenWaC corpus consists of sentence-aligned parallel Serbian-English texts crawled from the .rs top-level domain for Serbia. The corpus was built with Spidextor...
  • Croatian-English parallel corpus MaCoCu-hr-en 1.0

    The Croatian-English parallel corpus MaCoCu-hr-en 1.0 was built by crawling the ".hr" internet top-level domain in 2021, extending the crawl dynamically to other domains as...
  • Croatian web corpus MaCoCu-hr 1.0

    The Croatian web corpus MaCoCu-hr 1.0 was built by crawling the ".hr" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is...
  • Macedonian-English parallel corpus MaCoCu-mk-en 1.0

    The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other...
  • Maltese web corpus MaCoCu-mt 1.0

    The Maltese web corpus MaCoCu-mt 1.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is...