The Brazilian corpus on urban violence

The newspaper articles included in the Brazilian Corpus on Urban Violence were collected from Factiva, a news aggregator service that provides full-text access to newspapers, newswires, business journals, market research and analyst reports, and web sites from 118 countries. We focused on articles published between 01/Jan/2014 and 31/Dec/2014 by the following Brazilian newspapers: Zero Hora, Pioneiro, Folha de São Paulo, and O Estado de São Paulo. These are daily broadsheet papers with wide circulation in the states where they are based. The first two (Zero Hora and Pioneiro) are based in Brazil’s southern state of Rio Grande do Sul, where the Brazilian researchers in this project are based; this state is therefore the focus of our study. The other two newspapers are published in São Paulo, the wealthiest and most populous state in Brazil. They were included in the corpus to allow comparison of the discourse around urban violence in different regions of Brazil. Overall, the corpus contains 5,127 texts (1,778,282 words).
Brazil's current social and political situation gives rise to a particular breed of urban violence aimed at individuals and characterized by its continual presence. The average Brazilian citizen has to contend with this violence on a daily basis. This creates a general state of fear and insecurity among the population but, at the same time, may promote a sense of empathy with the less privileged classes in Brazil on the part of more socially aware individuals. The influence of the media contributes to this scenario. Daily news reports highlight violent acts carried out by individuals or groups from all social classes. The impact of violence on people's everyday lives is thus amplified by the media. This fosters beliefs, attitudes and values related to violence, which may or may not be consistent with the actual incidence, forms and causes of violence.
The partners will investigate the linguistic representation of urban violence in Brazil by applying the techniques of Corpus Linguistics to two datasets, or 'corpora': 1. The existing transcripts of two focus groups on living with urban violence conducted in Fortaleza, Brazil in 2010, for a total of approximately 20,000 words (Focus Groups Corpus); 2. A 2-million-word corpus of news reports in the Brazilian press, to be constructed as part of the partnership (News Reports Corpus).

To select individual texts, our initial approach was to apply Gabrielatos’ (2007) method, which is especially useful for determining query words or phrases that favour the retrieval of a wide range of relevant texts from a restricted-access database. Briefly, Gabrielatos (2007) suggests using a core query consisting of two or three words/phrases as a starting point to compile a pilot corpus. This pilot corpus is then used to identify additional relevant query words/phrases. These are words/phrases that tend to occur in texts where the core terms are also used, and are thus, at least in principle, closely associated with the core terms in a significant number of contexts. The ultimate purpose of applying Gabrielatos’ (2007) method is to identify words/phrases that would return articles on the topic under investigation even when the core terms themselves are not used in them. At the same time, these additional terms should not create undue noise; that is, useful additional terms are those that retrieve a sufficient number of articles which do not contain the core terms but are still relevant. Given the restricted time period examined in this study (2014 only), we opted to compile an initial corpus using all articles published in the four chosen newspapers (Folha de São Paulo, O Estado de São Paulo, Zero Hora and Pioneiro) over the entire period (Jan-Dec/2014). This initial corpus would then be used to identify additional relevant query words/phrases as suggested by Gabrielatos (2007). Our first attempt was to use the Portuguese equivalents of urban violence (violência urbana) and violence in cities/towns (violência na(s) cidade(s)) as our core query terms. However, these two terms did not retrieve as many texts as one would expect in a country where urban violence is a major issue. Overall, urban violence (violência urbana) appeared in 66 articles and violence in cities/towns (violência na(s) cidade(s)) in 10 articles.
Neither was violence in the street(s) (violência na(s) rua(s)) frequently used: 22 articles in total. In an attempt to identify search terms that would lead to a higher number of texts on urban violence, we then searched for urban security (segurança urbana) and public security (segurança pública). Urban security (segurança urbana) is not frequently used in Brazilian newspapers either: 50 articles in total. Public security (segurança pública), on the other hand, is frequently mentioned: 1,809 articles in total. Violência urbana (urban violence) and segurança pública (public security) were then used to compile a pilot corpus so that Gabrielatos’ method could be applied to identify additional search terms. The method pointed to three additional terms: criminalidade (criminality), homicídio (homicide), and roubo (robbery/theft). While relevant, using homicídio (homicide) and roubo (robbery/theft) as query terms would result in a biased selection that would inevitably favour texts about these two crimes specifically. This would not allow us to obtain a clear picture of which crimes are most frequently mentioned in Brazilian newspapers, the project’s research question #1. Our decision was therefore to complement the list of query terms with crime names mentioned in official government statistics, as well as other crimes the researchers intuitively deemed important. Also, in an attempt to gather as many relevant texts as possible, we opted to expand the collection of texts to all word forms related to the selected crime names. Thus, for example, rather than using roubo (robbery/theft) as a query term, we used roub*, which retrieves texts containing roubo as well as roubos (plural form), roubar (to rob/steal), roubou (robbed/stole), roubado (robbed/stolen), etc. While useful for identifying texts related to urban violence in Brazil, using crime-related words as query terms nevertheless introduced some undue noise.
A number of texts in which these terms appeared referred to violence and crimes in other parts of the world, rather than in Brazil: murders in Iraq, kidnapping in Nigeria, homicides in war zones and so on. In addition, a large number of texts referred to issues other than urban violence, such as corruption, internet crimes and labour issues, in Brazil and elsewhere, as well as articles related to cinema (especially thrillers) and crime fiction. To make matters more complicated, one cannot ignore the metaphorical nature of language. There was also a large number of texts in which our query terms were used metaphorically and were not at all related to urban violence: roubar a cena (steal the scene), roubar meu lugar (take over my place), furtar-se a fazer alguma coisa (avoid doing something), etc. To minimize such noise, we discarded a wide range of topics in the actual retrieval of texts from the Factiva news aggregator. The topics discarded are shown under the label “subjects” in Figure 1. They were identified on the basis of a random analysis of the texts within such categories. We also discarded texts containing one or more of the following words/phrases: comissão da verdade (truth commission, a committee established in 2012 to investigate violations of human rights by the Brazilian government between 18/Sep/1946 and 05/Oct/1988), Bolsonaro (a Brazilian congressman, infamous for his controversial comments on rape and human rights), Petrobrás or Petrobras (Brazilian oil company at the centre of a corruption scandal), ditadura (dictatorship), ditador (dictator), Al-Quaeda. These words are shown under “None of these words” in Figure 1. Also, within the Factiva search options, we chose to discard identical duplicates as well as republished news, recurring pricing and market data, obituaries, sports, and calendars. All texts meeting the criteria above were retrieved in full, including their headline(s).
This means that there was no filtering according to the section of the newspaper in which the text was published. In other words, the corpus contains news reports as well as editorials, opinions, interviews, or any other text type. It is also important to stress that texts were selected irrespective of the number of query words/phrases they contained and the frequency of those words/phrases within each text. This means that the texts included in the Brazilian Corpus on Urban Violence vary in the extent to which urban violence is discussed. Here, any reference to urban violence is considered relevant, even if urban violence is not the main topic discussed in the text. This enables us to look both at texts discussing urban violence issues in detail and at those in which urban violence issues are mentioned in relation to another topic. Such an approach broadens the scope of the analysis and enables us to examine situational contexts which are directly or indirectly associated with urban violence.
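The selection logic described above (wildcard query terms such as roub*, combined with a list of excluded words/phrases) can be approximated outside Factiva for illustration. The following Python sketch is not the Factiva query itself and does not reproduce Factiva's matching rules or its subject filters. The function names (normalize, term_to_regex, is_selected) and the short list of included terms are assumptions made for this example; the excluded terms are the ones listed in the description.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase and strip accents, so that 'Petrobrás' and 'Petrobras' compare equal."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def term_to_regex(term):
    """Build a whole-word pattern; a trailing * stands for any word ending."""
    stem = normalize(term)
    if stem.endswith("*"):
        return re.compile(r"\b" + re.escape(stem[:-1]) + r"\w*")
    return re.compile(r"\b" + re.escape(stem) + r"\b")


# Illustrative subset only: the full list of query terms is not given here.
INCLUDE_TERMS = ["violência urbana", "segurança pública", "roub*"]

# Excluded words/phrases as listed in the description.
EXCLUDE_TERMS = ["comissão da verdade", "Bolsonaro", "Petrobrás", "Petrobras",
                 "ditadura", "ditador", "Al-Quaeda"]

INCLUDE_PATTERNS = [term_to_regex(t) for t in INCLUDE_TERMS]
EXCLUDE_PATTERNS = [term_to_regex(t) for t in EXCLUDE_TERMS]


def is_selected(text):
    """Keep a text if it has at least one query term and no excluded term."""
    norm = normalize(text)
    if any(p.search(norm) for p in EXCLUDE_PATTERNS):
        return False
    return any(p.search(norm) for p in INCLUDE_PATTERNS)


if __name__ == "__main__":
    samples = [
        "Dois homens roubaram um carro em Porto Alegre.",
        "A Petrobrás foi citada em um caso de roubo.",
        "O ator conseguiu roubar a cena.",
    ]
    for sample in samples:
        print(is_selected(sample), sample)
```

Note that a metaphorical use such as roubar a cena still passes this kind of word-based filter, which is the noise problem discussed above; in the actual retrieval this was reduced by discarding whole subject categories rather than by word matching alone.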

Identifier
DOI https://doi.org/10.5255/UKDA-SN-852226
Metadata Access https://datacatalogue.cessda.eu/oai-pmh/v0/oai?verb=GetRecord&metadataPrefix=oai_ddi25&identifier=293171e8bf2fe21dbdace1e609f5a5022ead8b2f16f467d77f28468368358e20
Provenance
Creator Semino, E, Lancaster University; Carmen, D, Lancaster University
Publisher UK Data Service
Publication Year 2016
Funding Reference Economic and Social Research Council
Rights Elena Semino, Lancaster University; The Data Collection only consists of metadata and documentation as the data could not be archived due to legal, ethical or commercial constraints. For further information, please contact the contact person for this data collection.
OpenAccess true
Representation
Language English
Resource Type Numeric
Discipline Jurisprudence; Law; Social and Behavioural Sciences
Spatial Coverage United Kingdom; Brazil