Web Studies is a companion extension to the 4CAT Capture and Analysis Toolkit. It adds functionality to 4CAT by using Selenium with a Firefox browser to collect data from web sources.
- Selenium URL Collector
- Collect HTML, text, and links from a list of URLs
- Web Archive Collector
- Use Web Archive's Wayback Machine to collect archives of a URL over time
- Screenshot Generator
- Take screenshots of web pages
- Microsoft Azure App Store
- Collect data on Microsoft Azure applications
- Amazon Web Services (AWS) Marketplace
- Collect data on AWS applications
- Take screenshots of any column containing URLs
- Detect trackers
- Provide a list of various source code to search for in collected HTML
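As an illustration of the idea behind tracker detection (searching collected HTML for a user-supplied list of source code snippets), here is a minimal sketch. It is not the processor's actual implementation: the function name `detect_trackers` and the signature list are hypothetical and chosen only for the example.

```python
# Hypothetical signature list: display name -> source-code snippet to look for.
# These names and snippets are illustrative, not a list shipped with the extension.
TRACKER_SIGNATURES = {
    "Google Analytics": "google-analytics.com/analytics.js",
    "Google Tag Manager": "googletagmanager.com/gtm.js",
    "Meta Pixel": "connect.facebook.net",
}


def detect_trackers(html, signatures=TRACKER_SIGNATURES):
    """Return the sorted names of all signatures whose snippet occurs in the HTML."""
    lowered = html.lower()
    return sorted(
        name for name, snippet in signatures.items() if snippet.lower() in lowered
    )


sample_html = '<script src="https://www.googletagmanager.com/gtm.js?id=GTM-XXXX"></script>'
print(detect_trackers(sample_html))
```

A plain case-insensitive substring match keeps the sketch simple; a real list of signatures may need regular expressions or more careful matching.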
These extensions are designed to work with 4CAT v1.46 or later.
For instructions on adding the "I do not care about cookies" browser extension, see below.
- Download/clone extensions into both 4CAT backend and frontend containers
docker exec 4cat_backend git clone https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions.git extensions/web_studies/
docker exec 4cat_frontend git clone https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions.git extensions/web_studies/
- Restart 4CAT containers
docker compose restart
from the 4CAT directory where the docker-compose.yml and .env files were previously downloaded
- This will automatically install necessary dependencies, Firefox, and Geckodriver
- Activate desired new datasources from the 4CAT Control Panel
- Control Panel -> Settings -> Data sources
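If you prefer to script the Docker steps above, the commands can be assembled in Python. This is only a sketch: the container names, repository URL, and target folder are taken from the commands above, while the helper name `build_install_commands` is hypothetical. It prints the commands rather than running them.

```python
import shlex

REPO_URL = "https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions.git"
TARGET_DIR = "extensions/web_studies/"
CONTAINERS = ("4cat_backend", "4cat_frontend")


def build_install_commands(containers=CONTAINERS):
    """Return one clone command per container, followed by the restart command."""
    commands = [
        ["docker", "exec", container, "git", "clone", REPO_URL, TARGET_DIR]
        for container in containers
    ]
    commands.append(["docker", "compose", "restart"])
    return commands


# Dry run: print each command as it would be typed in a shell.
for command in build_install_commands():
    print(shlex.join(command))
```

To actually execute the steps, each command list could be passed to `subprocess.run(command, check=True)` from the 4CAT directory that contains docker-compose.yml.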
- Download or clone this repository and copy the folders into the extensions folder in your 4CAT directory
git clone https://github.com/digitalmethodsinitiative/4cat_web_studies_extensions.git extensions/web_studies/
- Run 4CAT's migrate script to install necessary packages
python helper-scripts/migrate.py
- Note: fourcat_install.py is only designed to run on Linux systems. For other systems you will need to set up the following:
- Install Python packages from requirements.txt
- Download Firefox
- Download the appropriate Geckodriver compatible with that version of Firefox (https://github.com/mozilla/geckodriver/releases/)
- Adjust settings in the 4CAT interface via Control Panel -> Settings -> selenium to point to the Firefox/Geckodriver programs
- Activate desired datasources from the 4CAT Control Panel
- Control Panel -> Settings -> Data sources
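For a manual setup, it can help to find where Firefox and Geckodriver are installed before entering their paths in the selenium settings. The sketch below uses only the Python standard library. It assumes the executables are named `firefox` and `geckodriver` and are on your PATH, which may not hold on every system; the helper name `locate_browser_tools` is hypothetical.

```python
import shutil


def locate_browser_tools(names=("firefox", "geckodriver")):
    """Map each program name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in locate_browser_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

Any path reported here is a candidate value for the corresponding selenium setting; a program reported as not found must be located manually.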
Some datasources/processors can make use of a Firefox extension that dismisses cookie consent banners. To install:
docker exec 4cat_backend wget https://addons.mozilla.org/firefox/downloads/file/4216095/istilldontcareaboutcookies-1.1.4.xpi
# you can find the most recent version at the above link
- Enable the extension in the 4CAT Control Panel
- Control Panel -> Settings -> selenium
- Update "Firefox Extensions" by adding the filename to the path section
- e.g.
{"i_dont_care_about_cookies": {"path": "istilldontcareaboutcookies-1.1.4.xpi", "always_enabled": false}}
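Because the "Firefox Extensions" setting is a JSON object, hand-editing it is prone to syntax errors. The following sketch shows one way to generate the value. It assumes the setting has the shape shown in the example above (extension name mapped to `path` and `always_enabled`); the helper name `add_firefox_extension` is hypothetical.

```python
import json


def add_firefox_extension(current_setting, name, path, always_enabled=False):
    """Return the JSON setting with one extension entry added or replaced."""
    # An empty setting is treated as an empty JSON object.
    extensions = json.loads(current_setting) if current_setting.strip() else {}
    extensions[name] = {"path": path, "always_enabled": always_enabled}
    return json.dumps(extensions)


print(
    add_firefox_extension(
        "", "i_dont_care_about_cookies", "istilldontcareaboutcookies-1.1.4.xpi"
    )
)
```

The printed string can be pasted into the "Firefox Extensions" field. Passing the field's current value as the first argument keeps any extensions that are already configured.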