Nuclear deterrence, education and transformational change

Dataset

Dataset Description (EN)

Each observation in the dataset is a document published in the 
Public Register of Documents of the European Parliament 
(https://www.europarl.europa.eu/RegistreWeb/home/welcome.htm). 
Each document is described by its title, document type, year of publication, 
document date, responsible authorities, key concepts (Eurovoc), directory codes, and other descriptive metadata. 
The dataset also includes a tokenized version of each text and keywords extracted with 
Natural Language Processing (NLP) techniques.

Temporal Coverage

Reference Period: 1994 - 2024
Initial Dataset: 24,502 documents
Filtered Dataset: 13,759 documents (published after 2014, available in English)

Source: Public Register of Documents (European Parliament)

Selection Criteria

Keywords used in the search: 
    “nuclear deterrence”, NPT (Non-Proliferation Treaty), TPNW (Treaty on the Prohibition of Nuclear Weapons), 
    “nuclear weapons”.

Language: English
Publication Dates: After 2014
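
The selection criteria above can be sketched as a simple filter. This is only an illustration: the original search was presumably run through the register's own search interface, and the function name, the language code "en", and the reading of "after 2014" as 2015 onwards are all assumptions.

```python
import re

# Search terms from the selection criteria. Acronyms are matched as whole
# words so that "npt" is not found inside unrelated words.
SEARCH_PATTERNS = [
    re.compile(r"nuclear\s+deterrence", re.IGNORECASE),
    re.compile(r"nuclear\s+weapons", re.IGNORECASE),
    re.compile(r"\bNPT\b", re.IGNORECASE),
    re.compile(r"\bTPNW\b", re.IGNORECASE),
]


def matches_criteria(text: str, year: int, language: str) -> bool:
    """Return True if a document passes the (assumed) selection filter."""
    if language.lower() != "en":  # hypothetical language code
        return False
    if year <= 2014:  # "after 2014" read here as 2015 onwards
        return False
    return any(pattern.search(text) for pattern in SEARCH_PATTERNS)


print(matches_criteria("Resolution on nuclear weapons", 2019, "en"))
```

The listed fields do not include a language column, so in practice the language check would have to rely on information from the register itself.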

Dataset Structure

Main Fields

Title: Title of the document.
Register Reference: Unique code of the document in the register.
Document Type: Type of document (reports, amendments, motions, etc.).
Year: Year of publication.
Document Date: Date of the document.
Authorities: Entities responsible for the document.
Eurovoc Concept: Thematic concepts classified using the Eurovoc thesaurus.
Directory Codes and Subject Headings: Specific classifications from the European Parliament.
File: Link or name of the file associated with the document.
Text, Text_OG, and Text_tokenized: Full text, original version, and tokenized version of the content, respectively.
Keywords: Keywords generated using the KeyBERT and TF-IDF algorithms.

Dataset Size: 13,759 filtered documents (approximately 13,800).
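
Since the record lists the format as tab-separated values, the fields above could be read along these lines. This is a minimal sketch: the exact column headers in the published file are assumed from the field list, and the sample row is invented purely for illustration.

```python
import csv
import io

# Tiny in-memory stand-in for the real tab-separated file. Column names are
# assumed from the field list; the row itself is made up.
SAMPLE_TSV = (
    "Title\tRegister Reference\tDocument Type\tYear\tKeywords\n"
    "Example title\tREF-0001\tReport\t2019\tnuclear; treaty\n"
)


def load_records(handle):
    """Parse a tab-separated file into dictionaries, converting Year to int."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["Year"] = int(row["Year"])
        records.append(row)
    return records


records = load_records(io.StringIO(SAMPLE_TSV))
print(records[0]["Title"], records[0]["Year"])
```

For the real file, the full-text columns may be longer than the csv module's default field size limit, in which case `csv.field_size_limit` would need to be raised before reading.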

Methodology

Processing

Data extraction via web scraping.
Text preprocessing and cleaning.
Tokenization for linguistic analysis.
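
The cleaning and tokenization steps could look roughly like the following. The record does not say which tools or rules were used, so the lowercasing, the character filter, and the whitespace tokenizer here are assumptions chosen for simplicity.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Split the cleaned text on whitespace."""
    return clean_text(text).split()


print(tokenize("Non-Proliferation Treaty (NPT), 2020!"))
```

A dedicated NLP library would normally handle hyphenation, stop words, and lemmatization more carefully than this regular-expression approach.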

Analysis

Keyword identification using KeyBERT and TF-IDF.
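
To make the TF-IDF step concrete, here is a small pure-Python sketch. The record does not specify which TF-IDF variant or library was used; this version assumes term frequency normalized by document length and an unsmoothed logarithmic inverse document frequency, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_n=3):
    """docs is a list of token lists; return the top_n TF-IDF terms per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    results = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


docs = [
    ["nuclear", "deterrence", "policy", "nuclear"],
    ["nuclear", "treaty", "ratification"],
    ["fisheries", "policy", "quota"],
]
print(tfidf_keywords(docs, top_n=2))
```

Terms that occur in many documents, such as "nuclear" in this toy corpus, are pushed down the ranking, which is why TF-IDF tends to surface what distinguishes one document from the rest.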

Identifier
DOI	https://doi.org/10.34810/data1870
Metadata Access	https://dataverse.csuc.cat/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34810/data1870
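
The Metadata Access link is an OAI-PMH GetRecord request. As a sketch, the same URL can be assembled from its parts; the helper name is hypothetical, and the endpoint and parameters are taken from the link above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

OAI_ENDPOINT = "https://dataverse.csuc.cat/oai"


def build_getrecord_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Build an OAI-PMH GetRecord URL for a DOI-identified record."""
    query = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    })
    return f"{OAI_ENDPOINT}?{query}"


url = build_getrecord_url("10.34810/data1870")
print(url)
```

`urlencode` percent-encodes the colon and slash in the identifier, so the printed URL looks slightly different from the link above but decodes to the same query parameters.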

Provenance
Creator	Rajmil, Daniel ; Morales, Lucía (ORCID: 0000-0002-9111-813X); García Juanatey, Ana ; Carbonell Mateo, David
Publisher	CORA. Repositori de Dades de Recerca
Contributor	Carbonell Mateo, David; Universitat Oberta de Catalunya; Rajmil, Daniel
Publication Year	2024
Funding Reference	Institut Català Internacional per la Pau (ICIP) ICI023/23/000009
Rights	CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess	true
Contact	Carbonell Mateo, David (Universitat Oberta de Catalunya)

Representation
Resource Type	Textual data; Dataset
Format	text/tab-separated-values; text/plain
Size	923668016; 3278; 3504
Version	1.0
Discipline	Agriculture, Forestry, Horticulture, Aquaculture; Agriculture, Forestry, Horticulture, Aquaculture and Veterinary Medicine; Life Sciences; Social Sciences; Social and Behavioural Sciences; Soil Sciences