Full-population web crawl of .gov.uk web domain, 2014

This dataset is the result of a full-population crawl of the .gov.uk web domain, aiming to capture a full picture of the scope of public-facing government activity online and the links between different government bodies. Local governments have been developing online services, aiming to better serve the public and reduce administrative costs. However, the impact of this work, and the links between governments’ online and offline activities, remain uncertain. The overall research question is whether local e-government has met these expectations, both those of Digital Era Governance and those of its practitioners. The aim was to analyse the structure and content of government online directly. The research shows that recent digital-centric public administration theories, typified by the Digital Era Governance quasi-paradigm, are not empirically supported by the UK local government experience. The data consist of a file of individual Uniform Resource Locators (URLs) fetched during the crawl, and a further file containing pairs of URLs reflecting the Hypertext Markup Language (HTML) links between them. In addition, a GraphML format file is presented for a version of the data reduced to third-level domains, with accompanying attribute data for the publishing government organisations and calculated webometric statistics based on the third-level-domain link network.
This project engages with the Digital Era Governance (DEG) work of Dunleavy et al. and draws upon new empirical methods to explore local government and its use of Internet-related technology. It challenges the existing literature, arguing that e-government benefits have been oversold, particularly for transactional services; it updates DEG with insights from local government. The distinctive methodological approach is to use full-population datasets and large-scale web data to provide an empirical foundation for theoretical development, and to test existing theorists’ claims.
A new full-population web crawl of .gov.uk is used to analyse the shape and structure of online government using webometrics. Tools from computer science, such as automated classification, are used to enrich our understanding of the dataset. A new full-population panel dataset is constructed covering council performance, cost, web quality, and satisfaction. The local government web shows a wide scope of provision but only limited evidence in support of the existing rhetorics of Internet-enabled service delivery. In addition, no evidence is found of a link between web development and performance, cost, or satisfaction. DEG is challenged and developed in light of these findings. The project adds value by developing new methods for the use of big data in public administration, by empirically challenging long-held assumptions on the value of the web for government, and by building a foundation of knowledge about local government online to be built on by further research. This is an ESRC-funded DPhil research project.

A web crawl was carried out with Heritrix, the Internet Archive's web crawler. A list of all registered domains in .gov.uk (and their www.x.gov.uk equivalents) was used as a set of start seeds. Sites outside .gov.uk were excluded; robots.txt files were respected, with the consequence that some .gov.uk sites (and some parts of other .gov.uk sites) were not fetched. Certain other areas were manually excluded, particularly crawling traps (e.g. calendars which will serve infinite numbers of pages in the past and future, and websites returning different URLs for each browser session) and the contents of certain large peripheral databases such as online local authority library catalogues. The full set of regular expressions used to filter the URLs fetched is included in the archive. On completion of the crawl, the page URLs and link data were extracted from the output WARC files. The page URLs were manually examined and re-filtered to handle various broken web servers and to reduce duplication of content where multiple views were presented onto the same content (for example, where a site was presented at both http://organisation.gov.uk/ and http://www.organisation.gov.uk/ without HTTP redirection between the two). Finally, the link list was filtered against the URL list to remove bogus links, and both lists were map/reduced to a single set of files.
Also included in this data release is a derived dataset more useful for high-level work. This is a GraphML file containing all the link and page information reduced to third-level-domain level (so darlington.gov.uk is considered as a single node, not a large set of pages) and with the links binarised to present/not present between each node. Each graph node also has various attributes, including the name of the registering organisation and various webometric measures including PageRank, indegree and betweenness centrality.
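The reduction of page-level links to a binarised third-level-domain network, as described above, can be illustrated with a minimal Python sketch. It is not the code used to build the dataset. The function names, the sample URLs (including the hypothetical example.gov.uk) and the choice to drop links within a single domain are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def third_level_domain(url):
    """Return the third-level domain under .gov.uk, or None if the URL is outside it."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    if len(labels) < 3 or labels[-2:] != ["gov", "uk"]:
        return None
    # www.darlington.gov.uk and darlington.gov.uk both map to darlington.gov.uk
    return ".".join(labels[-3:])


def reduce_links(url_pairs):
    """Collapse (source URL, target URL) pairs into a set of domain-level edges.

    Using a set binarises the links: an edge is either present or absent,
    however many page-level links exist between the two domains.
    Assumption: links within one domain (self-loops) are dropped.
    """
    edges = set()
    for source, target in url_pairs:
        s = third_level_domain(source)
        t = third_level_domain(target)
        if s and t and s != t:
            edges.add((s, t))
    return edges


def indegree(edges):
    """Count, for each domain, the number of distinct domains linking to it."""
    return Counter(target for _, target in edges)


# Hypothetical page-level links, for illustration only.
pairs = [
    ("http://www.darlington.gov.uk/a", "http://www.example.gov.uk/x"),
    ("http://darlington.gov.uk/b", "http://example.gov.uk/y"),
    ("http://www.example.gov.uk/x", "http://www.example.gov.uk/z"),
    ("http://www.example.gov.uk/x", "http://example.com/"),
]
edges = reduce_links(pairs)
print(sorted(edges))
print(indegree(edges))
```

In this sketch the two page-level links from darlington.gov.uk collapse into one edge, while the link within example.gov.uk and the link leaving .gov.uk are discarded. The PageRank and betweenness centrality attributes in the released GraphML file would need a graph library or a separate implementation and are not reproduced here.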

Identifier
DOI https://doi.org/10.5255/UKDA-SN-852205
Metadata Access https://datacatalogue.cessda.eu/oai-pmh/v0/oai?verb=GetRecord&metadataPrefix=oai_ddi25&identifier=86a4a8312cf2fbdc30551f4ba89eadef512cf34abcbd3920db6a86dee9e06498
Provenance
Creator Nicholls, T, Oxford Internet Institute, University of Oxford
Publisher UK Data Service
Publication Year 2019
Funding Reference Economic and Social Research Council
Rights Tom Nicholls, Oxford Internet Institute, University of Oxford; The Data Collection is available to any user without the requirement for registration for download/access.
OpenAccess true
Representation
Resource Type Numeric
Discipline Social Sciences
Spatial Coverage The .gov.uk second level domain; United Kingdom