Twitter corpus Janes-Tweet 1.0

Janes-Tweet is an annotated corpus of almost 10 million tweets posted between June 2013 and June 2017 by approximately 9,000 users who tweet mostly in Slovene. The corpus is structured into individual tweets, each with its metadata. The tweets are tokenised, sentence-segmented, word-normalised, morphosyntactically tagged, lemmatised, and annotated with named entities. Because of Twitter's terms of service, the corpus is distributed in an encoded form. To decode it, use the included tweetpub program (also available and documented at https://github.com/clarinsi/tweetpub), which fetches the original tweets and applies a diff operation to the distributed corpus. Note that the decoded corpus may contain fewer tweets than the distributed version if some have since been removed from Twitter by their authors.
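
The diff-based decoding idea can be illustrated with a small Python sketch. This is not the tweetpub implementation, and it does not reproduce tweetpub's file format or command-line interface, which are documented in its repository. The function names (encode, decode, decode_corpus), the edit-script layout, and the sample annotation string are all hypothetical. The sketch only shows the general principle: an edit script stores what was added plus index ranges into the original text, so the annotated text can be rebuilt only once the original tweet has been fetched, and tweets that are no longer available are skipped.

```python
from difflib import SequenceMatcher


def encode(original: str, annotated: str) -> list:
    """Build an edit script that turns `original` into `annotated`.

    Hypothetical format: spans shared with the original are stored as
    index ranges only, so the script does not carry the original text.
    """
    ops = []
    matcher = SequenceMatcher(a=original, b=annotated, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reference into the original
        elif tag in ("replace", "insert"):
            ops.append(("add", annotated[j1:j2]))  # new material only
        # "delete": nothing is emitted, the span is simply never copied
    return ops


def decode(original: str, ops: list) -> str:
    """Rebuild the annotated text from the fetched original and the script."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(original[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


def decode_corpus(encoded: dict, fetched: dict) -> dict:
    """Decode every tweet whose original text could still be fetched.

    Tweets missing from `fetched` (for example, deleted by their authors)
    are skipped, so the result may be smaller than `encoded`.
    """
    return {
        tweet_id: decode(fetched[tweet_id], ops)
        for tweet_id, ops in encoded.items()
        if tweet_id in fetched
    }
```

In this sketch, a corpus encoded from two tweets and decoded after one of them has been deleted yields a single decoded tweet, mirroring the note above about the retrieved corpus being smaller than the distributed one.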

Identifier
PID http://hdl.handle.net/11356/1142
Related Identifier https://revije.ff.uni-lj.si/slovenscina2/article/view/7003
Related Identifier https://nl.ijs.si/janes/viri/avtomatsko-oznaceni-korpusi/#Janes-Tweet
Related Identifier https://doi.org/10.1007/s10579-018-9425-z
Related Identifier https://nl.ijs.si/janes/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1142
Provenance
Creator Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja
Publisher Jožef Stefan Institute
Publication Year 2017
Rights Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0); https://creativecommons.org/licenses/by-nc/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics