Beyond the Lab: An In-Depth Analysis of Real-World Practices in Government-to-Citizen Software User Documentation
Welcome to the replication package of the paper "Beyond the Lab: An In-Depth Analysis of Real-World Practices in Government-to-Citizen Software User Documentation". This repository contains a suite of tools and scripts used in the research project aimed at analyzing government-to-citizen software user documentation, particularly focusing on the Swiss Digital Strategy program. The project extends existing literature by conducting a large-scale, real-world analysis of user guides. Nearly 600 user guides were scrutinized, identified from about 5,000 links on the websites of 18 German-speaking Swiss cantons. This analysis correlates the presence of key guide features with socio-economic factors of the cantons.
Compliance with the EU's Platform-to-Business (P2B) Regulation is challenging for online platforms, and the assessment of their compliance is difficult for public authorities. This is partly due to the lack of automated tools for assessing the information platforms provide in their terms and conditions (i.e., software documentation), in relation to ranking transparency. That gap also creates uncertainty regarding the usefulness of such documentation for end-users. Our study tackles this issue in two ways. First, we empirically evaluate the compliance of six major platforms, revealing substantial differences in their documentation. Second, we introduce and test automated compliance assessment tools based on ChatGPT and information retrieval technology. These tools are evaluated against human judgments, showing promising results as reliable proxies for compliance assessments. Our findings could help enhance regulatory compliance and align with the United Nations Sustainable Development Goal 10.3, which seeks to reduce inequality, including business disparities on these platforms.
- 0_link_scraping: Contains scripts for scraping links from official websites.
- 1_guide_identification: Stores processed results from ChatGPT and their manual corrections, related to guide identification.
- 2_web_classification: Includes scripts for classifying web content.
  - statistical_analysis.py: Analyzes classification results of web content.
  - content_classifier.py: Classifies web content based on predefined criteria.
- 3_pdf_classification: Scripts for classifying PDF content.
  - statistical_analysis.py: Performs statistical analysis of the classified PDF content.
  - content_classifier.py: Classifies the content of PDF documents.
- 4_web+pdf_classification: Combined analysis of web and PDF content.
  - statistical_analysis.py: Merges and analyzes data from both web and PDF content.
- 5_correlation_analysis: Correlation analysis scripts.
  - analysis_script_guides.py: Analyzes correlations within user guides.
  - analysis_script_scraped_links.py: Analyzes correlations within scraped links.
- requirements.txt: List of Python packages required to run the scripts.
- Ensure Python 3.x is installed on your system.
- Clone this repository to your local machine.
- Navigate to the cloned directory and install dependencies:
pip install -r requirements.txt
Follow the directory numbering when executing scripts, beginning with link scraping and proceeding through guide identification, classification, and correlation analysis.
- Link Scraping:
  - Navigate to 0_link_scraping.
  - Run link_data_generator.py to collect links. Results are stored in the same directory.
- Guide Identification:
  - Review the content in 1_guide_identification, containing outputs from ChatGPT and manual corrections.
- Web and PDF Classification:
  - For web content classification:
    - Navigate to 2_web_classification.
    - Run content_classifier.py first, followed by statistical_analysis.py. The latter script generates visualizations, with results saved in the same directory.
  - For PDF content classification:
    - Navigate to 3_pdf_classification.
    - Run content_classifier.py first, followed by statistical_analysis.py. Results, including visualizations, are stored in the same directory.
- Combined Web and PDF Analysis:
  - Navigate to 4_web+pdf_classification.
  - Run statistical_analysis.py for a comprehensive analysis. Results are saved in the same directory.
- Correlation Analysis:
  - Navigate to 5_correlation_analysis.
  - Run analysis_script_guides.py and analysis_script_scraped_links.py. Both scripts produce visualizations and store results in the same directory as PDF files.
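The execution order described in the steps above can be captured in a small driver script. The following is only a sketch and is not part of the replication package: the names PIPELINE, build_commands, and run_pipeline are hypothetical, and it assumes that each script takes no command-line arguments and expects to be started from inside its own directory. The guide identification step is left out because it consists of reviewing stored outputs, not running a script.

```python
import subprocess
import sys
from pathlib import Path

# Execution order taken from the usage steps: (directory, script).
# Assumption: every script runs without command-line arguments.
PIPELINE = [
    ("0_link_scraping", "link_data_generator.py"),
    ("2_web_classification", "content_classifier.py"),
    ("2_web_classification", "statistical_analysis.py"),
    ("3_pdf_classification", "content_classifier.py"),
    ("3_pdf_classification", "statistical_analysis.py"),
    ("4_web+pdf_classification", "statistical_analysis.py"),
    ("5_correlation_analysis", "analysis_script_guides.py"),
    ("5_correlation_analysis", "analysis_script_scraped_links.py"),
]


def build_commands(repo_root, python=sys.executable):
    """Return (working_directory, argv) pairs in pipeline order."""
    root = Path(repo_root)
    return [(root / directory, [python, script]) for directory, script in PIPELINE]


def run_pipeline(repo_root="."):
    """Run each script from inside its own directory.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit code, so the pipeline stops at the first failing script.
    """
    for cwd, argv in build_commands(repo_root):
        print(f"Running {argv[1]} in {cwd}")
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling run_pipeline(".") from the root of the cloned repository would run all eight scripts in sequence. Running the scripts by hand, one directory at a time, remains the documented procedure.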
Results are saved within the same directories as the scripts. Look for output files in formats such as PDF, CSV, JSON, or specific directories mentioned in the scripts' documentation.
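To locate the generated files after a run, a short helper can list everything with one of the output extensions named above. This is a sketch under assumptions, not part of the package: OUTPUT_SUFFIXES and find_outputs are hypothetical names, and the helper cannot tell generated results apart from input data that happens to share the same extension (for example, stored files in 1_guide_identification).

```python
from pathlib import Path

# Output formats mentioned in this README; extend the set if scripts write others.
OUTPUT_SUFFIXES = {".pdf", ".csv", ".json"}


def find_outputs(repo_root="."):
    """Map each top-level directory name to the output-like files found in it."""
    outputs = {}
    directories = sorted(
        p for p in Path(repo_root).iterdir()
        if p.is_dir() and not p.name.startswith(".")  # skip hidden dirs such as .git
    )
    for directory in directories:
        files = sorted(
            f.relative_to(directory).as_posix()
            for f in directory.rglob("*")
            if f.is_file() and f.suffix.lower() in OUTPUT_SUFFIXES
        )
        if files:  # omit directories without matching files
            outputs[directory.name] = files
    return outputs
```

Calling find_outputs(".") from the repository root returns a dictionary keyed by directory name, which can be printed or compared before and after a run to see which files a script created.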
Contributions to this project are welcome. Please submit pull requests or issues through the repository's issue tracker.
This replication package provides a comprehensive framework for analyzing government-to-citizen software user documentation practices. It is designed to be flexible, allowing researchers to replicate the study and build upon its findings.
For any problem or question, please contact me at [email protected]