diff --git a/README.md b/README.md index 1eb5f15..093de43 100755 --- a/README.md +++ b/README.md @@ -1,5 +1,7 @@ # ukwa-services + + - [Introduction](#introduction) - [Structure](#structure) - [Deployment Process](#deployment-process) @@ -8,26 +10,25 @@ Deployment configuration for all UKWA services stacks. -These [Docker Stack](https://docs.docker.com/engine/reference/commandline/stack/) configurations and related scripts are used to launch and manage our main services. No internal or sensitive data is kept here -- that is stored in the internal `ukwa-services-env` repository and pulled in when needed. +These [Docker Stack](https://docs.docker.com/engine/reference/commandline/stack/) configurations and related scripts are used to launch and manage our main services. No internal or sensitive data is kept here -- that is stored in the internal `ukwa-services-env` repository as environment variable scripts required for deployment, or as part of the CI/CD system. See the [change log](./CHANGELOG.md) for information on how this setup has changed over time. ## Structure -Service stacks are grouped by broad service area, e.g. [`access`](./access) contains the stacks that provide the access services, and the [access README](./access/README.md) provides detailed documentation on how the access services are deployed. +Service stacks are grouped by broad service area, e.g. [`access`](./access) contains the stacks that provide the access services, and the [access README](./access/README.md) provides detailed documentation on how the access services are deployed. The service areas are: -Within each sub-folder, e.g. `access/website`, we should have a single `docker-compose.yml` file which should be used for all deployment contexts (`dev`, `beta` and `prod`). Any necessary variations should be defined via environment variables. +- [`ingest`](./ingest) covers all services relating to the curation and ingest of web archives +- [`access`](./access) covers all services relating to how we make the web archives accessible to the public +- [`manage`](./manage) covers all internal services relating to the management of the web archive, including automation and workflows that orchestrate activities from ingest to storage and then to access -These variables, and any other context-specific configuration, should be held in `dev`, `beta` and `prod` subdirectories. For example, if `access/website/docker-compose.yml` is the main stack definition file, any additional services needed only on `dev` might be declared in `access/website/dev/docker-compose.yml` and would be deployed separately. +Within each sub-folder, e.g. `access/website`, we should have a single `docker-compose.yml` file which should be used for all deployment contexts (e.g. `dev`, `beta` and `prod`). Any necessary variations should be defined via environment variables. -A top-level guide to all the different automated tasks is provided in [`TASKS.md`](./TASKS.md). +These variables, and any other context-specific configuration, should be held in subdirectories. For example, if `access/website/docker-compose.yml` is the main stack definition file, any additional services needed only on `dev` might be declared in `access/website/dev/docker-compose.yml` and would be deployed separately. ## Deployment Process -First, individual components should be developed and tested on developers' own machines/VMs, using the [Docker Compose](https://docs.docker.com/compose/compose-file/) files within each tool's repository. e.g. 
- -- [w3act](https://github.com/ukwa/w3act/blob/master/docker-compose.yml) -- [crawl-log-viewer](https://github.com/ukwa/crawl-log-viewer#local-development-setup) +First, individual components should be developed and tested on developers' own machines/VMs, using the [Docker Compose](https://docs.docker.com/compose/compose-file/) files within each tool's repository, e.g. [w3act](https://github.com/ukwa/w3act/blob/master/docker-compose.yml). These are intended to be self-contained, i.e. where possible they should not depend on external services, but use dummy ones populated with test data. diff --git a/access/README.md b/access/README.md index abde60f..35933d9 100644 --- a/access/README.md +++ b/access/README.md @@ -1,212 +1,16 @@ -The Access Stack -================ +The Access Stacks +================= + + - [Introduction](#introduction) - - [Integration Points](#integration-points) -- [The Access Data Stack](#the-access-data-stack) - - [Deployment](#deployment) - - [Components](#components) - - [W3ACT Exports](#w3act-exports) - - [Crawl Log Analyser](#crawl-log-analyser) - - [Cron Tasks](#cron-tasks) -- [The Website Stack](#the-website-stack) - - [Deployment](#deployment-1) - - [NGINX Proxies](#nginx-proxies) - - [Components](#components-1) - - [Shine Database](#shine-database) - - [Stop the Shine service](#stop-the-shine-service) - - [Creating the Shine database](#creating-the-shine-database) - - [Restoring the Shine database from a backup](#restoring-the-shine-database-from-a-backup) - - [Restart the Shine service](#restart-the-shine-service) - - [Creating a backup of the Shine database](#creating-a-backup-of-the-shine-database) - - [Cron Jobs](#cron-jobs) - [The Website Regression Test Stack](#the-website-regression-test-stack) - - [Cron Jobs](#cron-jobs-1) - [The Reading Room Wayback Stack](#the-reading-room-wayback-stack) -- [Monitoring](#monitoring) # Introduction -This folder contains the components used for access to our web archives. It's made up of a number of separate stacks, with the first, 'Access Data', providing support for the others. - -## Integration Points - -These services can be deployed in different contexts (dev/beta/prod/etc.) but in all cases are designed to run (read-only!) against: - -- The WebHDFS API. -- The OutbackCDX API. -- The Solr full-text search API(s). -- The Prometheus Push Gateway metrics API. - -These are defined in the stack launch scripts, and can be changed as needed, based on deployment context if necessary. - -The web site part is designed to be run behind an edge server that handles the SSL/non-SSL transition and proxies the requests downstream. More details are provided in the relevant Deployment section. - -# The Access Data Stack - -The other access stacks depend on a number of data sources and the `access_data` stack handles those. The [access_data stack definition](./data/docker-compose.yml) describes data volumes as well as services that the other stacks can refer to. - -**NOTE** that this means that the stacks should be deployed consistently under the same names, as the `access_website` stack will not be able to find the networks associated with the `access_data` stack if the stack has been deployed under a different name. - -## Deployment - -The stack is deployed using: - - cd data - ./deploy-access-data.sh dev - -The deployment shell script sets up the right environment variables for each context (dev/beta/prod) before launching the services. 
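As a rough sketch of the shape of such a launch script (the full `deploy-access-data.sh`, removed later in this diff, is the authoritative version):

```bash
#!/bin/bash
# Minimal sketch of a context-aware launch script (see deploy-access-data.sh below):
set -e

ENVIRON=$1  # one of: dev | beta | prod

# Context-specific settings, e.g. where persistent data lives:
if [[ ${ENVIRON} == 'prod' ]]; then
    export STORAGE_PATH=/mnt/nfs/prod1/data/access_data
else
    export STORAGE_PATH=/mnt/nfs/data/access_data
fi

# Launch the stack with these environment variables set
# (the real script also layers in ../docker-compose.shared.yml):
docker stack deploy -c docker-compose.yml access_data
```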
This sets the `STORAGE_PATH` location where service data should be held, and this needs to be updated depending on what file system the Swarm nodes in a given deployment context share. - -**NOTE** that after deployment, the Solr collection data is pulled into the service, which takes ~10 minutes to appear. - -## Components - -### W3ACT Exports - -The `w3act_export` service downloads the regular W3ACT database dump from HDFS (`/9_processing/w3act/w3act-db-csv.zip`) and uses it to generate the data sources the rest of the stack needs. The service runs once when the stack is deployed or when it is updated. Regular updates can be orchestrated by using cron to run: - - docker service update --force access_data_w3act_export - -The outputs of the `w3act_export` service are placed on a volume called `access_data_w3act_export`. If all goes well, this should include: - -- The `allows.aclj` and `blocks.aclj` files needed by the [pywb access control system](https://github.com/ukwa/ukwa-pywb/blob/master/docs/access_controls.md#access-control-system). The `allows.aclj` file is generated from the data in W3ACT, based on the license status. The `blocks.aclj` file is managed in GitLab, and is downloaded from there. -- The `allows.txt` and `annotations.json` files needed for full-text Solr indexing. - -The service also populates the secondary Solr collection used to generate the _Topics & Themes_ pages of the UKWA website. The Solr instance and schema are managed as a Docker container in this stack. - -TODO: On completing these tasks, the service sends metrics to Prometheus for monitoring (TBA). - -### Crawl Log Analyser - -The `analyse` service connects to the Kafka crawl log of the frequent crawler, and aggregates statistics on recent crawling activity. This is summarised into a regularly-updated JSON file that the UKWA Access API part of the website stack makes available for users. This is used by the https://ukwa-vis.glitch.me/ live crawler glitch experiment. - -## Cron Tasks - -As mentioned above, a cron task should be set up to run the W3ACT Export. This cron task should run hourly. - -# The Website Stack - The [access_website stack](./website/docker-compose.yml) runs the services that actually provide the end-user website for https://www.webarchive.org.uk/ or https://beta.webarchive.org.uk/ or https://dev.webarchive.org.uk. -## Deployment - -The stack is deployed using: - - cd website/ - ./deploy-access-website.sh dev - -As with the data stack, this script must be set up for the variations across deployment contexts. For example, the DEV version is password protected and is configured to pick this up from our internal repository. - -**NOTE** that this website stack generates and caches images of archived web pages, and hence will require a reasonable amount of storage for this cache (see below for details). - -### NGINX Proxies - -The website is designed to be run behind a boundary web proxy that handles SSL etc. To make use of this stack of services, the server that provides e.g. `dev.webarchive.org.uk` will need to be configured to point to the right API endpoint, which by convention is `website.dapi.wa.bl.uk`. - -The set of current proxies and historical redirects associated with the website are now contained in the [internal nginx.conf](./config/nginx.conf). This sets up a service on port 80 where all the site components can be accessed. Once running, the entire system should be exposed properly via the API gateway. 
For example, for accessing the dev system we want `website.dapi.wa.bl.uk` to point to `dev-swarm-members:80`. - -Because most of the complexity of the NGINX setup is in the internal NGINX, the proxy setup at the edge is much simpler. e.g. for DEV, the external-facing NGINX configuration looks like: - -``` - location / { - # Used to tell downstream services what external host/port/etc. is: - proxy_set_header Host $host; - proxy_set_header X-Forwarded-Proto $scheme; - proxy_set_header X-Forwarded-Host $host; - proxy_set_header X-Forwarded-Port $server_port; - proxy_set_header X-Forwarded-For $remote_addr; - # Used for rate-limiting Mementos lookups: - proxy_set_header X-Real-IP $remote_addr; - proxy_pass http://website.dapi.wa.bl.uk/; - } -``` - -(Internal users can see the `dev_443.conf` setup for details.) - -The [internal NGINX configuration](./website/config/nginx.conf) is more complex, merging together the various back-end systems and passing on the configuration as appropriate. For example, [the configuration for the public PyWB service](https://github.com/ukwa/ukwa-services/blob/d68e54d6d7d44e714df24bf31223c8f8f46e5ff6/access/website/config/nginx.conf#L40-L42) includes: - -``` - uwsgi_param UWSGI_SCHEME $http_x_forwarded_proto; - uwsgi_param SCRIPT_NAME /wayback; -``` - -The service picks up the host name from the standard HTTP `Host` header, but here we add the scheme (http/https, passed from the upstream NGINX server via the `X-Forwarded-Proto` header) and fix the deployment path using the `SCRIPT_NAME` CGI variable. - -Having set this chain up, if we visit e.g. `dev.webarchive.org.uk` the traffic should show up on the API server as well as the Docker container. - -**NOTE** that changes to the internal NGINX configuration are only picked up when it starts, so it is necessary to run: - - docker service update --force access_website_nginx - -After which NGINX should restart and pick up any configuration changes and re-check whether it can connect to any proxied services inside the stack. - -Because the chain of proxies is quite complicated, we also add a `Via` header at each layer, e.g. - -``` - # Add header for tracing where issues occur: - add_header Via $hostname always; -``` - -This adds a hostname for every successful proxy request, so the number of `Via` headers and their values can be used to trace problems with the proxies. - - -## Components - -Behind the NGINX, we have a set of modular components: - -- The [ukwa-ui](https://github.com/ukwa/ukwa-ui) service that provides the main user interface. -- The [ukwa-pywb](https://github.com/ukwa/ukwa-pywb) service that provides access to archived web pages. -- The [mementos](https://github.com/ukwa/mementoweb-webclient) service that allows users to look up URLs via Memento. -- The [shine](https://github.com/ukwa/shine) and shinedb services that provide our older prototype researcher interface. -- The [ukwa-access-api](https://github.com/ukwa/ukwa-access-api) and related services (pywb-nobanner, webrender-api, Cantaloupe) that provide API services. - - The API services include a caching image server ([Cantaloupe](https://cantaloupe-project.github.io/)) that takes rendered versions of archived websites and exposes them via the standard [IIIF Image API](https://iiif.io/api/image/2.1/). This will need substantial disk space (~1TB). - -### Shine Database - -Shine requires a PostgreSQL database, so additional setup is required using the scripts in [./scripts/postgres](./scripts/postgres). 
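In outline, a typical create-and-restore sequence runs the steps sketched below, each of which is detailed in the following subsections:

```bash
# Sketch of the overall Shine database restore flow (details in the subsections below):
docker service scale access_website_shine=0   # stop Shine so it cannot insert an empty database
./create-db.sh                                # create the database
./create-user.sh                              # set up a suitable user via setup_user.sql
./list-db.sh                                  # confirm the database exists
./download-shine-db-dump.sh                   # fetch a dated dump from HDFS (edit the date first)
./restore-shine-db-from-dump.sh               # populate the database from the dump
docker service scale access_website_shine=1   # restart Shine against the restored database
```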
- -#### Stop the Shine service - -When modifying the database, and having deployed the stack, you first need to stop Shine itself from running, as otherwise it will attempt to start up and will insert an empty database into PostgreSQL and this will interfere with the restore process. So, use - - $ docker service scale access_website_shine=0 - -This will drop the Shine service but leave all the rest of the stack running. - -#### Creating the Shine database - -* `create-db.sh` -* `create-user.sh` -* `list-db.sh` - -Within `scripts/postgres/`, you can run `create-db.sh` to create the database itself. Then, run `create-user.sh` to run the `setup_user.sql` script and set up a suitable user with access to the database. Use `list-db.sh` to check the database is there at this point. - -#### Restoring the Shine database from a backup - -* Edit `download-shine-db-dump.sh` to use the most recent date version from HDFS -* `download-shine-db-dump.sh` -* `restore-shine-db-from-dump.sh` - -To do a restore, you need to grab a database dump from HDFS. Currently, the backups are dated and are in the HDFS `/2_backups/access/access_shinedb/` folder, so you'll need to edit the file to use the appropriate date, then run `download-shine-db-dump.sh` to actually get the database dump. Now, running `restore-shine-db-from-dump.sh` should populate the database. - -#### Restart the Shine service - -Once you have created and restored the database as needed, re-scale the service and Shine will restart using the restored database. - - $ docker service scale access_website_shine=1 - -#### Creating a backup of the Shine database - -An additional helper script will download a dated dump file of the live database and push it to HDFS, `backup-shine-db-to-hdfs.sh`. - - ./backup-shine-db-to-hdfs.sh dev - -This should be run daily. - -## Cron Jobs - -There should be a daily (early morning) backup of the Shine database. - # The Website Regression Test Stack A series of tests for the website are held under the `tests` folder. As well as checking service features and critical APIs, these tests also cover features relating to legal compliance. @@ -244,20 +48,8 @@ The tests are run once on startup, and results are posted to Prometheus. Follow These can be run each morning, and the metrics posted to Prometheus used to track compliance and raise alerts if needed. -## Cron Jobs - -There should be a daily (early morning) run of the website tests. - # The Reading Room Wayback Stack The `rrwb` stack defines the necessary services for running our reading room access services via proxied connections rather than DLS VMs. This new approach is on hold at present. - -# Monitoring - -Having deployed all of the above, the cron jobs mentioned above should be in place. - -The `ukwa-monitor` service should be used to check that these are running, and that the W3ACT database export file on HDFS is being updated. - -...monitoring setup TBC... diff --git a/access/data/deploy-access-data.sh b/access/data/deploy-access-data.sh deleted file mode 100755 index 2ad682a..0000000 --- a/access/data/deploy-access-data.sh +++ /dev/null @@ -1,40 +0,0 @@ -#!/bin/bash - -# Fail on errors: -set -e - -# read script environ argument -ENVIRON=$1 -if ! 
[[ ${ENVIRON} =~ ^(dev|beta|prod)$ ]]; then - echo "ERROR: Script $0 requires environment argument (dev|beta|prod)" - exit 1 -fi - - -# Where to store persistent data (currently the same for dev|beta|prod): -if [[ ${ENVIRON} == 'prod' ]]; then - export STORAGE_PATH=/mnt/nfs/prod1/data/access_data -elif [[ ${ENVIRON} == 'beta' ]]; then - export STORAGE_PATH=/mnt/nfs/data/access_data -else - # dev vars - export STORAGE_PATH=/mnt/nfs/data/access_data -fi - -# Which Kafka to talk to for recent FC activity: -# 192.168.45.15 is crawler05.n45 -export KAFKA_BROKER=192.168.45.15:9094 - -# Get the UID so we can run services as the same UID: -export CURRENT_UID=$(id -u):$(id -g) - -# Create needed folders: -mkdir -p $STORAGE_PATH/w3act_export -mkdir -p $STORAGE_PATH/fc_analysis -mkdir -p $STORAGE_PATH/collections_solr_cores -mkdir -p $STORAGE_PATH/collections_solr_logs -chown -R ${CURRENT_UID} ${STORAGE_PATH} -chmod a+w $STORAGE_PATH/collections_solr_* - -# Launch the common configuration with these environment variables: -docker stack deploy -c ../docker-compose.shared.yml -c docker-compose.yml access_data diff --git a/access/data/docker-compose.yml b/access/data/docker-compose.yml deleted file mode 100755 index fdddd24..0000000 --- a/access/data/docker-compose.yml +++ /dev/null @@ -1,56 +0,0 @@ -# ------------------------------------------------------------- -# This service configuration defines shared access data services -# ------------------------------------------------------------- - -version: '3.2' - -services: - # ------------------------------------------------------------- - # Get W3ACT data from HDFS and generate derivatives - # ------------------------------------------------------------- - w3act_export: - image: ukwa/python-w3act - user: "${CURRENT_UID}" - command: "/w3act_export_scripts/export.sh" - volumes: - - "./w3act_export_scripts:/w3act_export_scripts" - - "w3act_export:/w3act_export" - deploy: - restart_policy: - # Run once: - condition: on-failure - # If it fails, retry every 15 mins: - delay: 15m - - - # ------------------------------------------------------------- - # Collections index for Topics & Themes of the UKWA UI - # ------------------------------------------------------------- - collections_solr: - image: ukwa/ukwa-ui-collections-solr:1.1.1 - user: "${CURRENT_UID}" - volumes: - - "${STORAGE_PATH}/collections_solr_cores:/opt/solr/server/solr/mycores" - - "${STORAGE_PATH}/collections_solr_logs:/opt/solr/server/logs" - ports: - - "9021:8983" # Exposed port so external clients can run checks (TBC) - - # ---------------------------------------------------------------------- - # Analyses recent crawl behaviour by processing the crawled data stream: - # ---------------------------------------------------------------------- - analyse: - image: ukwa/crawl-streams - user: "${CURRENT_UID}" - command: "analyse -k ${KAFKA_BROKER} -u 2 -o /analysis/fc.crawled.json" - volumes: - - "fc_analysis:/analysis" - -# Volumes and networks supporting the above -# ----------------------------------------- - -networks: - # This attachable network is needed so the website stack can see the Collections Solr without having to expose a host port. 
- default: - driver: overlay - attachable: true - diff --git a/access/data/w3act_export_scripts/download_blocks.py b/access/data/w3act_export_scripts/download_blocks.py deleted file mode 100644 index 05d265e..0000000 --- a/access/data/w3act_export_scripts/download_blocks.py +++ /dev/null @@ -1,4 +0,0 @@ -import urllib.request - -urllib.request.urlretrieve("http://git.wa.bl.uk/bl-services/wayback_excludes_update/-/raw/master/oukwa/acl/blocks.aclj", "blocks.aclj.new") - diff --git a/access/data/w3act_export_scripts/download_w3act_dump.py b/access/data/w3act_export_scripts/download_w3act_dump.py deleted file mode 100644 index 3f01a24..0000000 --- a/access/data/w3act_export_scripts/download_w3act_dump.py +++ /dev/null @@ -1,7 +0,0 @@ -import urllib.request -import zipfile - -urllib.request.urlretrieve("http://hdfs.api.wa.bl.uk/webhdfs/v1/9_processing/w3act/w3act-db-csv.zip?user.name=access&op=OPEN", "w3act-db-csv.zip") - -with zipfile.ZipFile("w3act-db-csv.zip", 'r') as zip_ref: - zip_ref.extractall(".") diff --git a/access/data/w3act_export_scripts/export.sh b/access/data/w3act_export_scripts/export.sh deleted file mode 100755 index 4efda78..0000000 --- a/access/data/w3act_export_scripts/export.sh +++ /dev/null @@ -1,40 +0,0 @@ -#!/bin/bash - -# Stop on errors: -set -e - -# Source env -#export HDFS_W3ACT_DB_CSV=/9_processing/w3act/w3act-db-csv.zip - -# Change into data export folder: -cd /w3act_export - -# Dump W3ACT DB as CSV -echo "Downloading W3ACT CSV from Hadoop..." -rm -f w3act-db-csv.zip w3act-db-csv/*.* -python /w3act_export_scripts/download_w3act_dump.py - -# Generate the OA access list -echo "Generating open-access allow list..." -w3act -d w3act-db-csv gen-acl allows.aclj.new -w3act -d w3act-db-csv gen-acl --format surts allows.txt.new -w3act -d w3act-db-csv gen-annotations annotations.json.new - -# Copy the allows, and then update atomically: -echo "Updating allow lists etc..." -mv -f allows.aclj.new allows.aclj -mv -f allows.txt.new allows.txt -mv -f annotations.json.new annotations.json - -# Update the OA service blocks list: -echo "Pulling latest blocks file from GitLab..." -python /w3act_export_scripts/download_blocks.py - -# Copy the blocks, and then update atomically: -echo "Updating block list..." 
-mv -f blocks.aclj.new blocks.aclj - -# Now update the Collections Solr: -w3act -d w3act-db-csv update-collections-solr http://collections_solr:8983/solr/collections - - diff --git a/access/rrwb/README.md b/access/rrwb/README.md deleted file mode 100644 index 91c668b..0000000 --- a/access/rrwb/README.md +++ /dev/null @@ -1,354 +0,0 @@ -Reading Room Wayback Service Stack -================================== - -- [Introduction](#introduction) -- [To Do](#to-do) -- [Overview](#overview) - - [Deployment Architecture](#deployment-architecture) -- [The Central Services](#the-central-services) - - [Pre-requisites](#pre-requisites) - - [Operations](#operations) - - [Deploying and Updating the Stack](#deploying-and-updating-the-stack) - - [Setting up logging](#setting-up-logging) - - [Setting up monitoring](#setting-up-monitoring) - - [Updating the Block List](#updating-the-block-list) - - [Inspecting and Managing SCU locks](#inspecting-and-managing-scu-locks) - - [Deployment Testing](#deployment-testing) -- [Access in Reading Rooms](#access-in-reading-rooms) - - [Via Secure Terminals](#via-secure-terminals) - - [Via the NPLD Player](#via-the-npld-player) - - [Connection to the Central Services](#connection-to-the-central-services) - - [Deploying the NPLD Player](#deploying-the-npld-player) - - [Printing](#printing) -- [Testing](#testing) -- [Monitoring](#monitoring) -- [MI Reporting](#mi-reporting) - -Introduction ------------- - -This [Docker Swarm Stack](https://docs.docker.com/engine/swarm/key-concepts/) deploys the back-end services required to provide reading-room and staff access to Non-Print Legal Deposit material. - -This system provides a web-based access point for every Legal Deposit library, and one more for BL Staff, through which NPLD material can be accessed. This covers items delivered to us by publishers (supporting eBook and ePub formats at this time), and web pages captured by the UK Web Archive. This system implements the access restrictions required by the NPLD regulations. - -This replaces the remote-desktop-based access system by using [UK Web Archive Python Wayback](https://github.com/ukwa/ukwa-pywb) (UKWA PyWB) to provide access to content directly to secure browsers in reading rooms (either directly, or via the forthcoming [NPLD Player](https://github.com/ukwa/npld-player)). The UKWA PyWB system also implements the Single-Concurrent Usage (SCU) locks, and provides a way for staff to manage those locks if needed. - -To Do ----- - -This section has been moved to: https://github.com/ukwa/ukwa-services/issues/69 - -Overview -------- - -To ensure a smooth transition, this service maintains the same pattern of localized URLs for accessing content as the current system. e.g. - -- https://blstaff.ldls.org.uk/welcome.html?ark:/81055/vdc_100090432161.0x000001 -- http://bodleian.ldls.org.uk/ark:/81055/vdc_100090432161.0x000001 -- https://bl.ldls.org.uk/welcome.html?10000101000000/http://www.downstairsatthekingshead.com -- https://nls.ldls.org.uk/10000101000000/http://www.downstairsatthekingshead.com _TBC: Is this syntax supported? i.e. no `welcome.html`?_ - -The items with ARK identifiers are handled by the PyWB `doc` collection that proxies the request downstream to the digital library access service, and the `TIMESTAMP/URL` identifiers are passed to a second `web` collection that 'replays' the archived web pages back using UKWA internal services. NGINX is used to perform the mappings from expected URLs to those supported by PyWB. 
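A simplified sketch of the kind of rewrites involved is shown below; the production rules live in `nginx/conf.d/common.inc`, which appears in full later in this diff:

```
# Sketch of the welcome.html and identifier mappings (see conf.d/common.inc):
location /welcome.html {
    # Turn the query string into the path, e.g. /welcome.html?TIMESTAMP/URL -> /TIMESTAMP/URL
    set $new_path $args;
    set $args '';
    rewrite ^/(.*)$ /$new_path last;
}

location / {
    # ARK identifiers go to the 'doc' collection, with a fixed timestamp:
    rewrite ^/ark:/(\d+)/([^\/]+) /doc/20130401000000/http://doc-streamer:8000/ark:/$1/$2/ permanent;
    # TIMESTAMP/URL identifiers go to the 'web' collection:
    rewrite ^/(\d+)/(.*)$ /web/$1/$2 permanent;
}
```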
- -For example, if a BL Reading Room patron uses this Access URL to get a URL from the web archive: - -- https://blstaff.ldls.org.uk/welcome.html?10000101000000/http://www.downstairsatthekingshead.com - -Then the URL will get mapped to this PyWB URL: - -- https://blstaff.ldls.org.uk/web/10000101000000/http://www.downstairsatthekingshead.com - -Alternatively, if a BL Staff Access URL is used to get an eBook from DLS: - -- https://blstaff.ldls.org.uk/welcome.html?ark:/81055/vdc_100090432161.0x000001 - -Then the content will be served from this URL: - -- https://blstaff.ldls.org.uk/doc/20010101120000/http://staffaccess.dl.bl.uk/ark:/81055/vdc_100090432161.0x000001 - -In this case, a fixed timestamp is used for all ARKs and the `http://staffaccess.dl.bl.uk` prefix has been added, as PyWB needs both a timestamp and a URL to get the content and manage the SCU locks. Requests from reading rooms would use `http://access.dl.bl.uk`, e.g. http://access.dl.bl.uk/ark:/81055/vdc_100022588767.0x000002. These URLs are just about how the web service acts as a proxy to the archival store, and are implementation details that do not affect how the service is used, except in that the combined `TIMESTAMP/URI` identifier is the key upon which SCU locks are minted and managed. - -### Deployment Architecture - -It is expected that the services in this stack are used as the back-end for an upstream proxy. For example, for the British Library, there is some frontend proxy that the `bl.ldls.org.uk` domain name resolves to. That 'front door' proxy will then pass the request on to the relevant back-end services provided by this service stack, which will be deployed in BSP and STP, and connected up using the existing failover mechanism. This backend system can be used directly from secure reading room PCs, or using the NPLD Player on unsecured reading room PCs. Note that the back-end setup is the same in either case, as the access restrictions are implemented at the network level, and the NPLD Player authentication is handled upstream. - -```mermaid -graph LR; - NP(NPLD Player on Insecure PC) --> AP(Authenticating Proxy); - AP --> LDL; - - WB(Browser on Secure Reading Room PC) --> LDL; - - LDL(*.ldls.org.uk proxy) --> S1(BSP Stack); - LDL -.-> S2(STP Stack); - - S1 --> DA(access.dl.bl.uk BSP) - S1 --> DS(staffaccess.dl.bl.uk BSP) - S1 --> UKWA(*.api.wa.bl.uk BSP) - - S2 -.-> DA2(access.dl.bl.uk STP) - S2 -.-> DS2(staffaccess.dl.bl.uk STP) - S2 -.-> UKWA -``` - -Note that the web archive is only accessible via the BSP site at present, so will become unavailable if BSP is down and all content is being served via STP. Access to NPLD documents should work fine, as the `*.dl.bl.uk` services are available at both sites. - - -The Central Services -------------------- - -To provide the central services on the _BSP Stack_ and _STP Stack_, each stack runs the following set of services: - -- An NGINX service to provide URL management, with a shared port and separate ports for each service. This also includes a [mtail](https://github.com/google/mtail) process used for monitoring the service. -- Seven PyWB services, one for each Legal Deposit Library (BL/NLW/NLS/Bod/CUL/TCD managing SCU locks for each), and one for staff access (no SCU locks). -- A Redis service, which holds the SCU lock state for all the PyWB services. -- A [PushProx](https://github.com/prometheus-community/PushProx) client service, which allows NGINX to be monitored by pushing metrics to a remote [Prometheus](https://prometheus.io/) service via a PushProx proxy. 
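As described below, NGINX routes shared-port requests by their `Host` header, so a quick (hypothetical) smoke test can exercise a specific instance without relying on DNS; here `<swarm-node>` stands in for whichever node runs the stack:

```bash
# Ask for the BL instance via the shared port (8100), using one of the
# deployment-test URLs given later in this document:
curl -v -H 'Host: bl.ldls.org.uk' \
  'http://<swarm-node>:8100/web/19950418155600/http://portico.bl.uk/'
```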
- -Each service supports two host names, the real `*.ldls.org.uk` name and a `*.beta.ldls.org.uk` version that could be used if it is necessary to test this system in parallel with the original system. When accessed over the shared port, NGINX uses the `Host` in the request to determine which service is being called. Each PyWB service also exposes a dedicated port, but this is intended for debugging rather than production use. - -
-| Server Name            | Beta Server Name            | Shared NGINX Port | Dedicated NGINX Port | Direct PyWB Port (for debugging) |
-|------------------------|-----------------------------|-------------------|----------------------|----------------------------------|
-| bl.ldls.org.uk         | bl.beta.ldls.org.uk         | 8100              | 8200                 | 8300                             |
-| nls.ldls.org.uk        | nls.beta.ldls.org.uk        | 8100              | 8201                 | 8301                             |
-| llgc.ldls.org.uk       | llgc.beta.ldls.org.uk       | 8100              | 8202                 | 8302                             |
-| cam.ldls.org.uk        | cam.beta.ldls.org.uk        | 8100              | 8203                 | 8303                             |
-| bodleian.ldls.org.uk   | bodleian.beta.ldls.org.uk   | 8100              | 8204                 | 8304                             |
-| tcdlibrary.ldls.org.uk | tcdlibrary.beta.ldls.org.uk | 8100              | 8205                 | 8305                             |
-| blstaff.ldls.org.uk    | blstaff.beta.ldls.org.uk    | 8100              | 8209                 | 8309                             |
- -This NGINX setup assumes that any failover redirection, SSL encryption, authentication, token validation or user identification has all been handled upstream of this service stack. - -For testing purposes, a local `/etc/hosts` file can be used to point the `*.ldls.org.uk` domain names to the service stack, allowing the service to be viewed in a web browser. Of course this won't include any of the services that are handled upstream. - -### Pre-requisites - -In each deployment location: - -- One or more Linux servers with Docker installed and running in Swarm mode. -- Network access to: - - The public web, if only temporarily, to install these files and to download the Docker images during installation/deployment. - - If this is not possible [offline Docker image installation can be used](https://serverfault.com/a/718470). - - The BL internal nameservers, so `*.api.wa.bl.uk` service domains can be resolved. - - The DLS back-end systems where ARK-based resources can be downloaded (e.g. `access.dl.bl.uk`, `staffaccess.dl.bl.uk`). - - The UKWA back-end systems: - - CDX index for URL lookups (`cdx.api.wa.bl.uk`). - - WARC record retrieval (`warc-server.api.wa.bl.uk`). - - GitLab where the URL block list is stored ([`git.wa.bl.uk`](http://git.wa.bl.uk/bl-services/wayback_excludes_update/-/tree/master/ldukwa/acl)). - - If deployed on the Access VLAN, the existing UKWA service proxy can be used to reach these systems. - - -### Operations - -When running operations on the server, the operator should use a non-root user account that is able to use Docker (i.e. a member of the `docker` group on the machine). e.g. - -``` -[root@demo ~]# useradd -G docker access -[root@demo ~]# su - access -[access@demo ~]$ docker run hello-world -``` - -#### Deploying and Updating the Stack - -First get the `ukwa-services` repository and change to the relevant directory: - -``` - git clone https://github.com/ukwa/ukwa-services.git - cd ukwa-services/access/rrwb - ``` - -The Swarm deployment needs access to a host drive location where the list of blocked URLs is stored. 
The `deploy-rrwb-dev.sh` script shows an example of how this is done for the UKWA DEV system: - -``` -#!/bin/sh - -# Where to store shared files: -export STORAGE_PATH_SHARED=/mnt/nfs/data/airflow/data_exports - -# Username and password to use to access the locks pages: -export LOCKS_AUTH=demouser:demopass - -# Which version of PyWB to use: -export PYWB_IMAGE=ukwa/ukwa-pywb:2.6.4 - -# Deploy as a Docker Stack -docker stack deploy -c docker-compose.yml access_rrwb -``` - -A similar deployment script should be created for each deployment context, setting the `STORAGE_PATH_SHARED` environment variable before deploying the stack, and setting the `LOCKS_AUTH` username and password as required. - -Before running the deployment script, a copy of the URL block access control list should be placed in your shared folder, as per the [Updating the Blocks List section below](#updating-the-block-list). Once that's in place, you can run your script to deploy the services. - -Assuming the required Docker images can be downloaded (or have already been installed offline/manually), the services should start up and start to come online. In a few moments, you should see: - -``` -[access@demo rrwb]$ docker service ls -ID NAME MODE REPLICAS IMAGE PORTS -8de1fqo812x2 access_rrwb_nginx replicated 1/1 nginx:1-alpine *:8100->8100/tcp, *:8200-8205->8200-8205/tcp, *:8209->8209/tcp -0nrr4jvzo1z5 access_rrwb_pywb-bl replicated 1/1 ukwa/ukwa-pywb:2.6.4 *:8300->8080/tcp -oce47sczlkbi access_rrwb_pywb-bod replicated 1/1 ukwa/ukwa-pywb:2.6.4 *:8304->8080/tcp -pbhou0zmso6f access_rrwb_pywb-cam replicated 1/1 ukwa/ukwa-pywb:2.6.4 *:8303->8080/tcp -a1ixwrebslj0 access_rrwb_pywb-llgc replicated 1/1 ukwa/ukwa-pywb:2.6.4 *:8302->8080/tcp -oczh6d2c4oh8 access_rrwb_pywb-nls replicated 1/1 ukwa/ukwa-pywb:2.6.4 *:8301->8080/tcp -lddlkbb80ez7 access_rrwb_pywb-staff replicated 1/1 ukwa/ukwa-pywb:2.6.4 *:8309->8080/tcp -9s1wyzmlshx0 access_rrwb_pywb-tcd replicated 1/1 ukwa/ukwa-pywb:2.6.4 *:8305->8080/tcp -e54xnbxkkk14 access_rrwb_redis replicated 1/1 redis:6 -``` - -Where all service replicas are `1/1`. If any are stuck at `0/1` then they are having trouble starting, and you can use commands like `docker service ps --no-trunc access_rrwb_nginx` to check on individual services. - -If the `docker-compose.yml` file is updated, the stack can be redeployed in order to update the Swarm configuration. However, note that most of the specific configuration is in files held on disk, e.g. the NGINX configuration files. If these are changed, the services can be restarted, forcing the configuration to be reloaded, e.g. - - docker service update --force access_rrwb_nginx - -In case things seem to get into a confused state, it is possible to completely remove the whole service stack and then redeploy it, e.g. - -```bash -docker stack rm access_rrwb -# Wait a couple of minutes while everything gets tidied up, then -./deploy-rrwb-dev.sh -``` -#### Setting up logging - -_TBA: How should logging be set up, for MI and for security?_ - -Currently, the stack extracts Prometheus metrics from web access logs as they stream past, and all the actual log files are managed by Docker. - -If we need to keep access log files from NGINX for analysis, there are various options: - -- Change Docker to write logs to files, and do log rotation like [this](https://www.digitalocean.com/community/tutorials/how-to-configure-logging-and-log-rotation-in-nginx-on-an-ubuntu-vps). -- Push logs to Logstash and use it to make dated log files. 
-- Push all Docker logs to a syslog server (okay for security, not much use for M.I.). - -_The precise details depend on how M.I. integration works._ - - -#### Setting up monitoring - -The NGINX metrics are exposed on port 3903, but are not accessible directly, due to DLS network restrictions. However, we are allowed to make some outward connections, so Prometheus monitoring can be facilitated by PushProx. - -The Web Archive team can ensure the proxy is in place, and configure their monitoring services to gather metrics from the live service. There should not be any further setup required. - -#### Updating the Block List - -The list of URLs that are blocked from access in the Reading Rooms needs to be installed when deploying the service, and will need to be updated periodically (when the web archive team receives take-down requests). - -The blocks list is version controlled and held in: http://git.wa.bl.uk/bl-services/wayback_excludes_update/-/tree/master/ldukwa/acl - -It needs to be downloaded from there on a regular basis. e.g. a daily cron job like: - - curl -o /shared-folder/blocks.aclj http://git.wa.bl.uk/bl-services/wayback_excludes_update/-/raw/master/ldukwa/acl/blocks.aclj - -#### Inspecting and Managing SCU locks - -The UKWA PyWB system includes [an improved version of the SCU locking mechanism](https://github.com/ukwa/ukwa-pywb/blob/master/docs/locks.md#single-concurrent-lock-system). When an item is first retrieved, a lock for that item is minted against a session cookie in the secure browser. This initial lock is stored in Redis and set to expire at the end of the day. - -However, while the item is being accessed, a JavaScript client is used to update the lock status, and changes the expiration of the lock to be five minutes in the future. This lock is refreshed every minute or so, so it keeps being pushed back into the future while the item is in use. Once the item is no longer being used, the lock updates stop, and the lock is released shortly afterwards. This mechanism is expected to release item locks more reliably than the previous approach. - -See [the Admin Page documentation](https://github.com/ukwa/ukwa-pywb/blob/master/docs/locks.md#admin-page-and-api) to see how to access and manage the SCU locks. - -Access to this page is managed by HTTP Basic authentication via the `LOCKS_AUTH=username:pw` environment variable that must be set on launch. - - -#### Deployment Testing - -While proper testing needs to be done from the end user perspective, basic testing of the deployed services can be used to check the basic functions and connectivity are in place. - -For the web archive: - -- http://host:8209/web/19950418155600/http://portico.bl.uk/ - -For the documents: - -- http://host:8209/doc/20010101120000/http://staffaccess.dl.bl.uk/ark:/81055/vdc_100090432161.0x000001 - - -_...TBA PDF and ePub and one or two more of each..._ - - -Access in Reading Rooms ------------------------- - -How access works depends on the terminals in use. - -### Via Secure Terminals - -In reading rooms with locked-down access terminals, readers may access the central services directly in the access terminal's web browser. - -In this case, the domain name of the relevant service, e.g. nls.ldls.org.uk, should resolve to the IP address of the machine acting as the `ldls.org.uk proxy`? Or does the DNS name refer directly to the BSP or STP Stack? - -Access to that IP address should be managed at the network and firewall level. Only official Reading Room IP addresses should be able to access the services. 
For services outside the British Library, access to the central services is enabled by the DLS Access VLAN, which spans all BL/LLGC/NS properties. - - -### Via the NPLD Player - -In reading rooms without locked-down terminals, readers must use the NPLD Player to access content. This means deploying two components: - -- the [NPLD Player](https://github.com/ukwa/npld-player) on reading room terminals, including a bundled secret key that the Player will use to authenticate itself. _n.b. as yet there are no builds of the player package/installer available for deployment_ -- the authenticating proxy, that checks the secret key, and proxies the requests on to the central services. - -There are two possible deployment patterns, depending on whether the library in question is directly peered onto the DLS Access VLAN, or uses a remote outbound proxy to connect to the central services. - -In both cases, an additional proxy configuration is required for the client IP address range corresponding to the reading rooms where the NPLD Player will be used. This additional proxy configuration should check the secret key is present. If not, it should present a web page that will redirect the user to the NPLD Player via the custom protocol scheme. - -If the secret key is verified, the request should be proxied on towards the central services. For reading rooms with access to the DLS Access VLAN, the request should be proxied onwards in the same way as for secure terminals. - -For reading rooms without access to the DLS Access VLAN, the request should be proxied onwards to the Outbound Proxy that securely connects to the central services over the internet. This should only need a single Linux VM in each network area. For incoming connections, this would handle NPLD Player authentication or provide access depending on client IP ranges, and would pass outgoing connections on to the central services using the appropriate TLS certificates and access keys. - -No further URL rewriting or other complicated configuration should be required, as any such manipulations should now be managed centrally. - -### Connection to the Central Services - -Both modes of access depend on the Central Services being available. The national libraries of Scotland, Wales and Britain should all be able to access the central services directly via the DLS Access VLAN, but the university libraries, and any library wishing to use the NPLD Player, will need to deploy an additional proxy server locally: - -```mermaid -graph LR; - NP(NPLD Player on Insecure PC) --> LDLl; - - WB(Browser on Secure Reading Room PC) --> LDLl; - - LDLl(*.ldls.org.uk inc. Auth - LOCAL ) -->|Secure WAN Connection| LDLc; - - LDLc(*.ldls.org.uk - CENTRAL); -``` - -The role of this server is to proxy user requests to the central services over the _Secure WAN Connection_. This is expected to be an NGINX instance that verifies the source IP addresses, handles the validation of the NPLD Player secure token, and sets up the ongoing secure connection to the central services. This should not require a lot of resources, e.g. a Linux server with 2GB of RAM and at least 2 CPUs. - -### Deploying the NPLD Player - -Where Readers are expected to use the NPLD Player, this will need to be installed on the appropriate access terminals. 
- -Installation packages, built with the secret access token bundled inside, will be made available via this _private_ GitHub repository: https://github.com/ukwa/npld-player-builds. The intention is that all necessary local configuration will be held there and embedded in the distribution packages (one for each legal deposit library). It will be the responsibility of the deploying library to ensure that the bundled secret access token is accepted by the local authenticating proxy. - -Any further documentation that is needed will be found at: https://github.com/ukwa/npld-player#readme - -Note that as with the previous solution, the secure deployment of the NPLD Player is critically dependent on the careful management of the Outbound Proxy that links back to the centralized services. This should be locked down so that it can only be used from IP ranges that correspond to reading rooms, and so that the IP range corresponding to reading rooms without locked-down terminals is also configured to require the secure token header, thus ensuring only the NPLD Player can access the material. - -### Printing - -Printing should work exactly the same as for any other web page being viewed in the reading room. When the NPLD Player is in use, this will act like any other application on the machine, and use the local print system via standard OS calls. - -Note that to avoid copies of material being taken away, libraries should not allow `Print to file`. - -## Testing - -_...TBA: Add or link to suite of test cases and acceptance criteria..._ - -## Monitoring - -_...TBA: Any notes on current monitoring setup. The Web Archive currently uses Prometheus to monitor services, but this is 'pull-based', meaning our Prometheus server makes calls to the services it is monitoring, rather than the services posting data to it. Our current DLS setup is able to pull data from internal systems, but it is not possible to proxy connections the other way. However, it seems [PushProx `prom/pushprox:v0.1.0`](https://github.com/prometheus-community/PushProx) provides a standard way to handle this situation. This should allow service metric monitoring for the central services._ - -_It would be nice to extend this to the local proxies, either directly monitoring the far-end NGINX services and/or using a `cron` job to ping back to the central services Prometheus instance._ - -_In principle, this could be extended to running Robot Framework tests every morning, and reporting that all is well, or not._ - -## MI Reporting - -It should be possible to analyse the NGINX logs to gather the information we need. This could be done for local proxies or for the central services, ideally both so that one can act as a check for the other. - -_...TBD: logging locally, including turnaway details, and set up a data feed back to BL? 
Or a modified version of current MI logging?_ _How are turnaways identified?_ - -_...TBA: Some information on how this works at present..._ - -_...TBA: Some information on whether/how individual user sessions should be identified..._ \ No newline at end of file diff --git a/access/rrwb/deploy-rrwb-demo.sh b/access/rrwb/deploy-rrwb-demo.sh deleted file mode 100755 index 7c95f6f..0000000 --- a/access/rrwb/deploy-rrwb-demo.sh +++ /dev/null @@ -1,13 +0,0 @@ -#!/bin/sh - -# Where to store shared files: -export STORAGE_PATH_SHARED=/home/access/rrwb-acls - -# Username and password to use to access the locks pages: -export LOCKS_AUTH=demouser:demopass - -# Which version of PyWB to use: -export PYWB_IMAGE=ukwa/ukwa-pywb:master - -# Deploy as a Docker Stack -docker stack deploy -c docker-compose.yml access_rrwb diff --git a/access/rrwb/deploy-rrwb-dev.sh b/access/rrwb/deploy-rrwb-dev.sh deleted file mode 100755 index faf85c8..0000000 --- a/access/rrwb/deploy-rrwb-dev.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/bin/sh - -# Pull in vars -set -a && . ./dev.env && set +a - -# Deploy as a Docker Stack -docker stack deploy -c docker-compose.yml $EXTRA_CONFIG access_rrwb diff --git a/access/rrwb/dev.env b/access/rrwb/dev.env deleted file mode 100755 index d498097..0000000 --- a/access/rrwb/dev.env +++ /dev/null @@ -1,7 +0,0 @@ -LOCKS_AUTH=demouser:demopass -STORAGE_PATH_SHARED=/mnt/nfs/data/airflow/data_exports -PYWB_IMAGE=ukwa/ukwa-pywb:custom-viewers -PUSHPROX_URL="http://dev1.n45.wa.bl.uk:9494/" -PUSHPROX_FQDN=dev1.n45.wa.bl.uk -EXTRA_CONFIG="-c docker-compose.dev.yml" -DLS_ACCESS_SERVER=http://staffaccess.dl.bl.uk diff --git a/access/rrwb/docker-compose.dev.yml b/access/rrwb/docker-compose.dev.yml deleted file mode 100755 index a8d864d..0000000 --- a/access/rrwb/docker-compose.dev.yml +++ /dev/null @@ -1,16 +0,0 @@ - -# This is the base configuration. -version: '3.2' - -services: - - # Redis Browser - redis-commander: - image: rediscommander/redis-commander:latest - environment: - - REDIS_HOSTS=redis_bl:redis:6379:0,redis_nls:redis:6379:1,redis_llgc:redis:6379:2,redis_cam:redis:6379:3,redis_bod:redis:6379:4,redis_tcd:redis:6379:5 - ports: - - "8081:8081" - depends_on: - - redis - diff --git a/access/rrwb/docker-compose.yml b/access/rrwb/docker-compose.yml deleted file mode 100755 index 05e1d3c..0000000 --- a/access/rrwb/docker-compose.yml +++ /dev/null @@ -1,200 +0,0 @@ - -# This is the base configuration. 
-version: '3.2' - -services: - - # ------------------------------------------------------------- - # Staff Access Service Configuration (no locks): - # ------------------------------------------------------------- - pywb-staff: - image: ${PYWB_IMAGE} - environment: - - "TLDEXTRACT_CACHE_TIMEOUT=0.1" - volumes: - - ./pywb/staff.yaml:/webarchive/config.yaml - - ${STORAGE_PATH_SHARED}:/ukwa_pywb/acl/ - - ./logos/logo-staff.png:/ukwa_pywb/static/ukwa-2018-w-med.png - depends_on: - - redis - - doc-streamer - ports: - - "8309:8080" - - - # ------------------------------------------------------------- - # Reading Room Wayback with SCU locks, per LDL: - # ------------------------------------------------------------- - pywb-bl: - image: ${PYWB_IMAGE} - environment: - - "REDIS_URL=redis://redis:6379/0" # Locks stored in Redis DB 0 - - "LOCKS_AUTH=${LOCKS_AUTH}" - - "TLDEXTRACT_CACHE_TIMEOUT=0.1" - - "LOCK_PING_INTERVAL=5" - - "LOCK_PING_EXTEND_TIME=10" - volumes: - - ./pywb/readingroom.yaml:/webarchive/config.yaml - - ${STORAGE_PATH_SHARED}:/ukwa_pywb/acl/ - - ./logos/bl_logo.png:/ukwa_pywb/static/ukwa-2018-w-med.png - depends_on: - - redis - - doc-streamer - ports: - - "8300:8080" - - pywb-nls: - image: ${PYWB_IMAGE} - environment: - - "REDIS_URL=redis://redis:6379/1" - - "LOCKS_AUTH=${LOCKS_AUTH}" - - "TLDEXTRACT_CACHE_TIMEOUT=0.1" - - "LOCK_PING_INTERVAL=5" - - "LOCK_PING_EXTEND_TIME=10" - volumes: - - ./pywb/readingroom.yaml:/webarchive/config.yaml - - ${STORAGE_PATH_SHARED}:/ukwa_pywb/acl/ - - ./logos/nls_logo.png:/ukwa_pywb/static/ukwa-2018-w-med.png - depends_on: - - redis - - doc-streamer - ports: - - "8301:8080" - - pywb-llgc: - image: ${PYWB_IMAGE} - environment: - - "REDIS_URL=redis://redis:6379/2" - - "LOCKS_AUTH=${LOCKS_AUTH}" - - "TLDEXTRACT_CACHE_TIMEOUT=0.1" - - "LOCK_PING_INTERVAL=5" - - "LOCK_PING_EXTEND_TIME=10" - volumes: - - ./pywb/readingroom.yaml:/webarchive/config.yaml - - ${STORAGE_PATH_SHARED}:/ukwa_pywb/acl/ - - ./logos/llgc_logo.png:/ukwa_pywb/static/ukwa-2018-w-med.png - depends_on: - - redis - - doc-streamer - ports: - - "8302:8080" - - pywb-cam: - image: ${PYWB_IMAGE} - environment: - - "REDIS_URL=redis://redis:6379/3" - - "LOCKS_AUTH=${LOCKS_AUTH}" - - "TLDEXTRACT_CACHE_TIMEOUT=0.1" - - "LOCK_PING_INTERVAL=5" - - "LOCK_PING_EXTEND_TIME=10" - volumes: - - ./pywb/readingroom.yaml:/webarchive/config.yaml - - ${STORAGE_PATH_SHARED}:/ukwa_pywb/acl/ - - ./logos/cambridge_logo.png:/ukwa_pywb/static/ukwa-2018-w-med.png - depends_on: - - redis - - doc-streamer - ports: - - "8303:8080" - - pywb-bod: - image: ${PYWB_IMAGE} - environment: - - "REDIS_URL=redis://redis:6379/4" - - "LOCKS_AUTH=${LOCKS_AUTH}" - - "TLDEXTRACT_CACHE_TIMEOUT=0.1" - - "LOCK_PING_INTERVAL=5" - - "LOCK_PING_EXTEND_TIME=10" - volumes: - - ./pywb/readingroom.yaml:/webarchive/config.yaml - - ${STORAGE_PATH_SHARED}:/ukwa_pywb/acl/ - - ./logos/bodleian_logo.png:/ukwa_pywb/static/ukwa-2018-w-med.png - depends_on: - - redis - - doc-streamer - ports: - - "8304:8080" - - pywb-tcd: - image: ${PYWB_IMAGE} - environment: - - "REDIS_URL=redis://redis:6379/5" - - "LOCKS_AUTH=${LOCKS_AUTH}" - - "TLDEXTRACT_CACHE_TIMEOUT=0.1" - - "LOCK_PING_INTERVAL=5" - - "LOCK_PING_EXTEND_TIME=10" - volumes: - - ./pywb/readingroom.yaml:/webarchive/config.yaml - - ${STORAGE_PATH_SHARED}:/ukwa_pywb/acl/ - - ./logos/trinity_logo.png:/ukwa_pywb/static/ukwa-2018-w-med.png - depends_on: - - redis - - doc-streamer - ports: - - "8305:8080" - - - - # ------------------------------------------------------------- - # Supporting Services: - # 
------------------------------------------------------------- - - # Redis service to hold locks - redis: - image: redis:6 - - # PushProx to enable integration with Prometheus monitoring - pushprox-client: - image: prom/pushprox:v0.1.0 - entrypoint: '/app/pushprox-client' - command: - - '--fqdn=${PUSHPROX_FQDN}' # Should point to the host server so any ports can be accessed. - - '--proxy-url=${PUSHPROX_URL}' - - # Add some setup via NGINX - nginx: - image: nginx:1 - command: /opt/mtail/entrypoint.sh - volumes: - - ./nginx/conf.d:/etc/nginx/conf.d/:ro - - ./nginx/mtail:/opt/mtail:ro - #- ./nginx/logs:/var/log/nginx:rw - ports: - - "8100:8100" # Shared port (service determined by Host header) - - "8200:8200" # BL - - "8201:8201" # NLS - - "8202:8202" # LLGC - - "8203:8203" # CUL - - "8204:8204" # BOD - - "8205:8205" # TCD - - "8209:8209" # Staff (default when Host header does not match) - - "3903:3903" # mtail port for monitoring - depends_on: - - pywb-staff - - pywb-bl - - pywb-nls - - pywb-llgc - - pywb-cam - - pywb-bod - - pywb-tcd - networks: - default: - aliases: # So Docker services on the internal network can resolve services by name: - - bl.ldls.org.uk - - nls.ldls.org.uk - - llgc.ldls.org.uk - - cam.ldls.org.uk - - bodleian.ldls.org.uk - - tcdlibrary.ldls.org.uk - - # ePub Streamer/Unzipper - doc-streamer: - image: ukwa/epub-streamer:main - environment: - - ARK_SERVER=${DLS_ACCESS_SERVER} - -#logging: -# driver: gelf -# options: -# gelf-address: "udp://logs.wa.bl.uk:12201" - diff --git a/access/rrwb/exporter.sh b/access/rrwb/exporter.sh deleted file mode 100755 index 2f3d428..0000000 --- a/access/rrwb/exporter.sh +++ /dev/null @@ -1,23 +0,0 @@ -#!/bin/bash - -envfile=$1 - -echo Reading env from $envfile ... - -# Pull in production settings: -set -a && . $envfile && set +a - -# Get a list of the images: -echo Scanning and pulling images... -for img in $(docker-compose config | awk '{if ($1 == "image:") print $2;}'); do - # Unique image names only... - if [[ ! "$images" == *"$img"* ]]; then - images="$images $img" - # Pull it in case this host hasn't already pulled it: - docker pull $img - fi -done - -# Save the images in a composite file (loadable using 'docker load < services.tar') -echo Saving $images ... 
-docker save -o services.tar $images diff --git a/access/rrwb/logos/bl_logo.png b/access/rrwb/logos/bl_logo.png deleted file mode 100644 index ced0dc6..0000000 Binary files a/access/rrwb/logos/bl_logo.png and /dev/null differ diff --git a/access/rrwb/logos/bodleian_logo.jpg b/access/rrwb/logos/bodleian_logo.jpg deleted file mode 100644 index 9b72243..0000000 Binary files a/access/rrwb/logos/bodleian_logo.jpg and /dev/null differ diff --git a/access/rrwb/logos/bodleian_logo.png b/access/rrwb/logos/bodleian_logo.png deleted file mode 100644 index 898493b..0000000 Binary files a/access/rrwb/logos/bodleian_logo.png and /dev/null differ diff --git a/access/rrwb/logos/cambridge_logo.jpg b/access/rrwb/logos/cambridge_logo.jpg deleted file mode 100644 index 17df021..0000000 Binary files a/access/rrwb/logos/cambridge_logo.jpg and /dev/null differ diff --git a/access/rrwb/logos/cambridge_logo.png b/access/rrwb/logos/cambridge_logo.png deleted file mode 100644 index 66fd6c0..0000000 Binary files a/access/rrwb/logos/cambridge_logo.png and /dev/null differ diff --git a/access/rrwb/logos/llgc_logo.png b/access/rrwb/logos/llgc_logo.png deleted file mode 100644 index bb6da24..0000000 Binary files a/access/rrwb/logos/llgc_logo.png and /dev/null differ diff --git a/access/rrwb/logos/logo-original.png b/access/rrwb/logos/logo-original.png deleted file mode 100644 index 9c86ad3..0000000 Binary files a/access/rrwb/logos/logo-original.png and /dev/null differ diff --git a/access/rrwb/logos/logo-staff.png b/access/rrwb/logos/logo-staff.png deleted file mode 100644 index 01b7b0d..0000000 Binary files a/access/rrwb/logos/logo-staff.png and /dev/null differ diff --git a/access/rrwb/logos/nls_logo.png b/access/rrwb/logos/nls_logo.png deleted file mode 100644 index c4f2d51..0000000 Binary files a/access/rrwb/logos/nls_logo.png and /dev/null differ diff --git a/access/rrwb/logos/trinity_logo.jpeg b/access/rrwb/logos/trinity_logo.jpeg deleted file mode 100644 index 7f212ad..0000000 Binary files a/access/rrwb/logos/trinity_logo.jpeg and /dev/null differ diff --git a/access/rrwb/logos/trinity_logo.png b/access/rrwb/logos/trinity_logo.png deleted file mode 100644 index e40086a..0000000 Binary files a/access/rrwb/logos/trinity_logo.png and /dev/null differ diff --git a/access/rrwb/nginx/conf.d/common.inc b/access/rrwb/nginx/conf.d/common.inc deleted file mode 100644 index 60c727b..0000000 --- a/access/rrwb/nginx/conf.d/common.inc +++ /dev/null @@ -1,23 +0,0 @@ -# -# Common redirects to support the original URLs, included in other configs: -# - -location /welcome.html { - # Also need 'npld_access_staff_autostart.html?' ? - - # Strip the matched path, converting the query string to be the path: - set $new_path $args; - set $args ''; - rewrite ^/(.*)$ /$new_path last; -} - -location / { - # Explicit ARKs - rewrite ^/ark:/(\d+)/([^\/]+) /doc/20130401000000/http://doc-streamer:8000/ark:/$1/$2/ permanent; - - # Implied ARKs (starting with e.g. vdc_) - rewrite ^/(vd[^\/]+) /doc/20130401000000/http://doc-streamer:8000/ark:/81055/$1/ permanent; - - # UKWA IDs e.g. 
TIMESTAMP/URL: - rewrite ^/(\d+)/(.*)$ /web/$1/$2 permanent; -} \ No newline at end of file diff --git a/access/rrwb/nginx/conf.d/mtail.conf b/access/rrwb/nginx/conf.d/mtail.conf deleted file mode 100644 index 9eb130f..0000000 --- a/access/rrwb/nginx/conf.d/mtail.conf +++ /dev/null @@ -1,3 +0,0 @@ -log_format mtail '$server_name $remote_addr - $remote_user [$time_local] ' - '"$request" $status $bytes_sent $request_time ' - '"$http_referer" "$http_user_agent" "$content_type"'; diff --git a/access/rrwb/nginx/conf.d/readingroom.conf b/access/rrwb/nginx/conf.d/readingroom.conf deleted file mode 100644 index d380463..0000000 --- a/access/rrwb/nginx/conf.d/readingroom.conf +++ /dev/null @@ -1,142 +0,0 @@ -server { - listen 8100; # Shared port - listen 8200; # Dedicated port - - # Declare this as the BL instance: - server_name bl.ldls.org.uk bl-beta.ldls.org.uk; - - # Include common configuration, e.g. URL mappings: - include "conf.d/common.inc"; - - # Enforce URL patterns, only supporting a single Document timestamp: - location ~ "^(/doc/20130401000000(\w\w_|)/|/web/|/static/|/_locks)" { - include uwsgi_params; - uwsgi_param UWSGI_SCHEME $scheme; - uwsgi_param SCRIPT_NAME ""; - - uwsgi_pass pywb-bl:8081; - } - - access_log /var/log/nginx/access.log mtail; - -} - -# NLS instance, same port, different Host: -server { - listen 8100; # Shared port - listen 8201; # Dedicated port - - # Declare this as the NLS instance: - server_name nls.ldls.org.uk nls.beta.ldls.org.uk; - - # Include common configuration, e.g. URL mappings: - include "conf.d/common.inc"; - - # Enforce URL patterns, only supporting a single Document timestamp: - location ~ "^(/doc/20130401000000(\w\w_|)/|/web/|/static/|/_locks)" { - include uwsgi_params; - uwsgi_param UWSGI_SCHEME $scheme; - uwsgi_param SCRIPT_NAME ""; - - uwsgi_pass pywb-nls:8081; - } - - access_log /var/log/nginx/access.log mtail; - -} - -# LLGC instance, same port, different Host: -server { - listen 8100; # Shared port - listen 8202; # Dedicated port - - # Declare this instance: - server_name llgc.ldls.org.uk llgc.beta.ldls.org.uk; - - # Include common configuration, e.g. URL mappings: - include "conf.d/common.inc"; - - # Enforce URL patterns, only supporting a single Document timestamp: - location ~ "^(/doc/20130401000000(\w\w_|)/|/web/|/static/|/_locks)" { - include uwsgi_params; - uwsgi_param UWSGI_SCHEME $scheme; - uwsgi_param SCRIPT_NAME ""; - - uwsgi_pass pywb-llgc:8081; - } - - access_log /var/log/nginx/access.log mtail; - -} - -# CUL instance, same port, different Host: -server { - listen 8100; # Shared port - listen 8203; # Dedicated port - - # Declare this instance: - server_name cam.ldls.org.uk cam.beta.ldls.org.uk; - - # Include common configuration, e.g. URL mappings: - include "conf.d/common.inc"; - - # Enforce URL patterns, only supporting a single Document timestamp: - location ~ "^(/doc/20130401000000(\w\w_|)/|/web/|/static/|/_locks)" { - include uwsgi_params; - uwsgi_param UWSGI_SCHEME $scheme; - uwsgi_param SCRIPT_NAME ""; - - uwsgi_pass pywb-cam:8081; - } - - access_log /var/log/nginx/access.log mtail; - -} - -# Bodleian instance, same port, different Host: -server { - listen 8100; # Shared port - listen 8204; # Dedicated port - - # Declare this instance: - server_name bodleian.ldls.org.uk bodleian.beta.ldls.org.uk; - - # Include common configuration, e.g. 
URL mappings: - include "conf.d/common.inc"; - - # Enforce URL patterns, only supporting a single Document timestamp: - location ~ "^(/doc/20130401000000(\w\w_|)/|/web/|/static/|/_locks)" { - include uwsgi_params; - uwsgi_param UWSGI_SCHEME $scheme; - uwsgi_param SCRIPT_NAME ""; - - uwsgi_pass pywb-bod:8081; - } - - access_log /var/log/nginx/access.log mtail; - -} - -# TCD instance, same port, different Host: -server { - listen 8100; # Shared port - listen 8205; # Dedicated port - - # Declare this instance: - server_name tcdlibrary.ldls.org.uk tcdlibrary.beta.ldls.org.uk; - - # Include common configuration, e.g. URL mappings: - include "conf.d/common.inc"; - - # Enforce URL patterns, only supporting a single Document timestamp: - location ~ "^(/doc/20130401000000(\w\w_|)/|/web/|/static/|/_locks)" { - include uwsgi_params; - uwsgi_param UWSGI_SCHEME $scheme; - uwsgi_param SCRIPT_NAME ""; - - uwsgi_pass pywb-tcd:8081; - } - - access_log /var/log/nginx/access.log mtail; - -} diff --git a/access/rrwb/nginx/conf.d/staff.conf b/access/rrwb/nginx/conf.d/staff.conf deleted file mode 100644 index 3c738b4..0000000 --- a/access/rrwb/nginx/conf.d/staff.conf +++ /dev/null @@ -1,23 +0,0 @@ -server { - listen 8100 default_server; # Shared port - listen 8209; # Dedicated port - - # Declare this as the BL instance: - server_name blstaff.ldls.org.uk blstaff.beta.ldls.org.uk; - - # Include common configuration, e.g. URL mappings: - include "conf.d/common.inc"; - - # Enforce URL patterns, only supporting a single Document timestamp: - location ~ "^(/doc/20130401000000(\w\w_|)/|/web/|/static/)" { - - # Pass requests to PyWB over uwSGI: - include uwsgi_params; - uwsgi_param UWSGI_SCHEME $scheme; - - uwsgi_pass pywb-staff:8081; - } - - access_log /var/log/nginx/access.log mtail; - -} diff --git a/access/rrwb/nginx/logs/.keep b/access/rrwb/nginx/logs/.keep deleted file mode 100644 index e69de29..0000000 diff --git a/access/rrwb/nginx/mtail/entrypoint.sh b/access/rrwb/nginx/mtail/entrypoint.sh deleted file mode 100755 index 5370548..0000000 --- a/access/rrwb/nginx/mtail/entrypoint.sh +++ /dev/null @@ -1,12 +0,0 @@ -#!/bin/sh - -# Hook log files to stdout/stderr -# For NGINX, these are already hooked up correctly! 
-#ln -sf /dev/stdout /var/log/nginx/access.log -#ln -sf /dev/stderr /var/log/nginx/error.log - -# Start mtail in the background: -/opt/mtail/mtail -progs /opt/mtail/progs -logs /var/log/nginx/access.log & - -# Start NGINX -nginx -g 'daemon off;' diff --git a/access/rrwb/nginx/mtail/mtail b/access/rrwb/nginx/mtail/mtail deleted file mode 100755 index 1e5f8ea..0000000 Binary files a/access/rrwb/nginx/mtail/mtail and /dev/null differ diff --git a/access/rrwb/nginx/mtail/progs/nginx.mtail b/access/rrwb/nginx/mtail/progs/nginx.mtail deleted file mode 100644 index 29470c1..0000000 --- a/access/rrwb/nginx/mtail/progs/nginx.mtail +++ /dev/null @@ -1,26 +0,0 @@ -counter http_requests_total by vhost, method, code, content_type -counter http_request_duration_milliseconds_sum by vhost, method, code, content_type -counter http_response_size_bytes_sum by vhost, method, code, content_type - -# log_format mtail '$server_name $remote_addr - $remote_user [$time_local] ' -# '"$request" $status $bytes_sent $request_time' -# '"$http_referer" "$http_user_agent" "$content_type"'; - -/^/ + -/(?P[0-9A-Za-z\.\-:]+) / + -/(?P[0-9A-Za-z\.\-:]+) / + -/- / + -/(?P[0-9A-Za-z\-]+) / + -/(?P\[\d{2}\/\w{3}\/\d{4}:\d{2}:\d{2}:\d{2} [\-\+]\d{4}\]) / + -/"(?P[A-Z]+) (?P\S+) (?PHTTP\/[0-9\.]+)" / + -/(?P\d{3}) / + -/(?P\d+) / + -/(?P\d+)\.(?P\d+) / + -/"(?P\S+)" / + -/"(?P[[:print:]]+)" / + -/"(?P[^;\\]+)(;.*)?"/ + -/$/ { - http_requests_total[$vhost][tolower($request_method)][$status][$content_type]++ - http_request_duration_milliseconds_sum[$vhost][tolower($request_method)][$status][$content_type] += $request_seconds * 1000 + $request_milliseconds - http_response_size_bytes_sum[$vhost][tolower($request_method)][$status][$content_type] += $bytes_sent -} diff --git a/access/rrwb/prometheus/prometheus.yml b/access/rrwb/prometheus/prometheus.yml deleted file mode 100644 index b6df5dc..0000000 --- a/access/rrwb/prometheus/prometheus.yml +++ /dev/null @@ -1,16 +0,0 @@ -global: - external_labels: - system: 'rrwb' - system_name: 'reading-room-wayback' - -scrape_configs: - - - job_name: 'prometheus' - static_configs: - - targets: ['prometheus:9090'] - - - job_name: 'rrwb-exporter' - proxy_url: 'http://pushprox:8080/' - static_configs: - - targets: ['dev1.n45.wa.bl.uk:3903'] - diff --git a/access/rrwb/pywb/readingroom.yaml b/access/rrwb/pywb/readingroom.yaml deleted file mode 100755 index 2fba814..0000000 --- a/access/rrwb/pywb/readingroom.yaml +++ /dev/null @@ -1,92 +0,0 @@ -collections: - # NPLD web archive access under /web/ - web: - index: - type: cdx - api_url: "http://cdx.api.wa.bl.uk/data-heritrix?url={url}&closest={closest}&sort=closest&filter=!statuscode:429&filter=!mimetype:warc/revisit" - replay_url: "" - archive_paths: "webhdfs://warc-server.api.wa.bl.uk/by-filename/" - - acl_paths: - - ./acl/blocks.aclj - - default_access: allow - - # up the query limit: - query_limit: 100000 - - # Enable SCU locks - single-use-lock: true - - add_headers: - Cache-Control: 'max-age=0, no-cache, must-revalidate, proxy-revalidate, private' - Expires: 'Thu, 01 Jan 1970 00:00:00 GMT' - - ext_redirects: - 'epub': '/static/viewers/epub_viewer/index.html?bookPath={0}' - - content_type_redirects: - # allows - 'text/': 'allow' - 'image/': 'allow' - 'video/': 'allow' - 'audio/': 'allow' - 'application/javascript': 'allow' - - 'text/rtf': 'https://example.com/viewer?{0}' - 'application/epub+zip': '/static/viewers/epub_viewer/index.html?bookPath={0}' - 'application/pdf': '/static/viewers/pdf_viewer/web/viewer.html?file={0}' - - #'application/': 
'allowed' - - # default redirects - '': 'https://example.com/blocked?url={0}' - '*': 'https://example.com/blocked?url={0}' - - - # Access to NPLD documents using live web support, under /doc/: - doc: - index: $live - single-use-lock: true - - add_headers: - Cache-Control: 'max-age=0, no-cache, must-revalidate, proxy-revalidate, private' - Expires: 'Thu, 01 Jan 1970 00:00:00 GMT' - - ext_redirects: - 'epub': '/static/viewers/epub_viewer/index.html?bookPath={0}' - - content_type_redirects: - # allows - 'text/': 'allow' - 'image/': 'allow' - 'video/': 'allow' - 'audio/': 'allow' - 'application/javascript': 'allow' - - 'text/rtf': 'https://example.com/viewer?{0}' - 'application/epub+zip': '/static/viewers/epub_viewer/index.html?bookPath={0}' - 'application/pdf': '/static/viewers/pdf_viewer/web/viewer.html?file={0}' - - #'application/': 'allowed' - - # default redirects - '': 'https://example.com/blocked?url={0}' - '*': 'https://example.com/blocked?url={0}' - - -# redirect to exact url behavior -redirect_to_exact: true - -# enable memento -enable_memento: true - -# enable experimental Memento Prefer -enable_prefer: true - -# Locale setup -locales_root_dir: ./i18n/translations/ -locales: - - en - - cy - diff --git a/access/rrwb/pywb/staff.yaml b/access/rrwb/pywb/staff.yaml deleted file mode 100755 index f1491b2..0000000 --- a/access/rrwb/pywb/staff.yaml +++ /dev/null @@ -1,90 +0,0 @@ -collections: - # NPLD web archive access under /web/ - web: - index: - type: cdx - api_url: "http://cdx.api.wa.bl.uk/data-heritrix?url={url}&closest={closest}&sort=closest&filter=!statuscode:429&filter=!mimetype:warc/revisit" - replay_url: "" - archive_paths: "webhdfs://warc-server.api.wa.bl.uk/by-filename/" - - acl_paths: - - ./acl/blocks.aclj - - default_access: allow - - # up the query limit: - query_limit: 100000 - - # Do not enable SCU locks - single-use-lock: false - - add_headers: - Cache-Control: 'max-age=0, no-cache, must-revalidate, proxy-revalidate, private' - Expires: 'Thu, 01 Jan 1970 00:00:00 GMT' - - ext_redirects: - 'epub': '/static/viewers/epub_viewer/index.html?bookPath={0}' - - content_type_redirects: - # allows - 'text/': 'allow' - 'image/': 'allow' - 'video/': 'allow' - 'audio/': 'allow' - 'application/javascript': 'allow' - - 'text/rtf': 'https://example.com/viewer?{0}' - 'application/epub+zip': '/static/viewers/epub_viewer/index.html?bookPath={0}' - 'application/pdf': '/static/viewers/pdf_viewer/web/viewer.html?file={0}' - - #'application/': 'allowed' - - # default redirects - '': 'https://example.com/blocked?url={0}' - '*': 'https://example.com/blocked?url={0}' - - # Access to NPLD documents using live web support, under /doc/: - doc: - index: $live - single-use-lock: false - - add_headers: - Cache-Control: 'max-age=0, no-cache, must-revalidate, proxy-revalidate, private' - Expires: 'Thu, 01 Jan 1970 00:00:00 GMT' - - ext_redirects: - 'epub': '/static/viewers/epub_viewer/index.html?bookPath={0}' - - content_type_redirects: - # allows - 'text/': 'allow' - 'image/': 'allow' - 'video/': 'allow' - 'audio/': 'allow' - 'application/javascript': 'allow' - - 'text/rtf': 'https://example.com/viewer?{0}' - 'application/epub+zip': '/static/viewers/epub_viewer/index.html?bookPath={0}' - 'application/pdf': '/static/viewers/pdf_viewer/web/viewer.html?file={0}' - - #'application/': 'allowed' - - # default redirects - '': 'https://example.com/blocked?url={0}' - '*': 'https://example.com/blocked?url={0}' - -# redirect to exact url behavior -redirect_to_exact: true - -# enable memento -enable_memento: true - 
-# enable experimental Memento Prefer
-enable_prefer: true
-
-# Locale setup
-locales_root_dir: ./i18n/translations/
-locales:
-  - en
-  - cy
-
diff --git a/access/rrwb/update-blocks-demo.sh b/access/rrwb/update-blocks-demo.sh
deleted file mode 100755
index 5aede9d..0000000
--- a/access/rrwb/update-blocks-demo.sh
+++ /dev/null
@@ -1,2 +0,0 @@
-#!/bin/sh
-curl -o /home/access/rrwb-acls/blocks.aclj http://git.wa.bl.uk/bl-services/wayback_excludes_update/-/raw/master/ldukwa/acl/blocks.aclj
diff --git a/access/website/README.md b/access/website/README.md
new file mode 100644
index 0000000..caf2d24
--- /dev/null
+++ b/access/website/README.md
@@ -0,0 +1,268 @@
+The Access Stacks
+=================
+
+
+
+- [Introduction](#introduction)
+  - [Integration Points](#integration-points)
+- [The Access Data Stack](#the-access-data-stack)
+  - [Deployment](#deployment)
+  - [Components](#components)
+    - [W3ACT Exports](#w3act-exports)
+    - [Crawl Log Analyser](#crawl-log-analyser)
+  - [Cron Tasks](#cron-tasks)
+- [The Website Stack](#the-website-stack)
+  - [Deployment](#deployment-1)
+  - [NGINX Proxies](#nginx-proxies)
+  - [Components](#components-1)
+    - [Shine Database](#shine-database)
+      - [Stop the Shine service](#stop-the-shine-service)
+      - [Creating the Shine database](#creating-the-shine-database)
+      - [Restoring the Shine database from a backup](#restoring-the-shine-database-from-a-backup)
+      - [Restart the Shine service](#restart-the-shine-service)
+      - [Creating a backup of the Shine database](#creating-a-backup-of-the-shine-database)
+  - [Cron Jobs](#cron-jobs)
+- [The Website Regression Test Stack](#the-website-regression-test-stack)
+  - [Cron Jobs](#cron-jobs-1)
+- [The Reading Room Wayback Stack](#the-reading-room-wayback-stack)
+- [Monitoring](#monitoring)
+
+# Introduction
+
+This folder contains the components used for access to our web archives. It's made up of a number of separate stacks, with the first, 'Access Data', providing support for the others.
+
+## Integration Points
+
+These services can be deployed in different contexts (dev/beta/prod/etc.) but in all cases are designed to run (read-only!) against:
+
+- The TrackDB, which knows where all the WARCs are and provides WebHDFS URLs for them.
+- The WebHDFS API, which serves the WARC records from each HDFS cluster.
+- The OutbackCDX API, which links URL searches to the WARCs that contain the records for that URL.
+- The Solr full-text search API(s), which indicate which URLs contain which search terms.
+- The Prometheus Push Gateway metrics API, which is used to monitor what's happening.
+
+These are defined in the stack launch scripts, and can be changed as needed, based on deployment context if necessary.
+
+The access service also depends on a number of data files, generated by the `w3act_export` workflow run under Apache Airflow. These include:
+
+- The `allows.aclj` and `blocks.aclj` files needed by the [pywb access control system](https://github.com/ukwa/ukwa-pywb/blob/master/docs/access_controls.md#access-control-system) (see the sketch of the file format below). The `allows.aclj` file is generated from the data in W3ACT, based on the license status. The `blocks.aclj` file is managed in GitLab, and is downloaded from there.
+- The `allows.txt` and `annotations.json` files needed for full-text Solr indexing.
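+For reference, these `.aclj` files contain one access rule per line, keyed by a SURT-form URL prefix followed by a JSON payload, as described in the pywb access-control documentation linked above. A minimal sketch, with illustrative URLs rather than real entries:
+
+```
+uk,bl)/ - {"access": "allow", "url": "http://www.bl.uk/"}
+uk,example)/private - {"access": "block", "url": "http://example.uk/private/"}
+```
+
+pywb keeps these rules sorted and applies the longest matching prefix, so more specific rules override broader ones.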
+
+The workflow also populates the secondary Solr collection used to generate the _Topics & Themes_ pages of the UKWA website. The Solr instance and its schema are managed as a Docker container in this stack.
+
+The web site part is designed to be run behind an edge server that handles the SSL/non-SSL transition and proxies the requests downstream. More details are provided in the relevant Deployment section.
+
+# The Access Data Stack
+
+The other access stacks depend on a number of data sources and the `access_data` stack handles those. The [access_data stack definition](./data/docker-compose.yml) describes data volumes as well as services that the other stacks can refer to.
+
+**NOTE** that this means that the stacks should be deployed consistently under the same names, as the `access_website` stack will not be able to find the networks associated with the `access_data` stack if the stack has been deployed under a different name.
+
+## Deployment
+
+The stack is deployed using:
+
+    cd data
+    ./deploy-access-data.sh dev
+
+The deployment shell script sets up the right environment variables for each context (dev/beta/prod) before launching the services. This sets the `STORAGE_PATH` location where service data should be held, and this needs to be updated depending on what file system the Swarm nodes in a given deployment context share.
+
+**NOTE** that after deployment, the Solr collection data is pulled into the service, which takes ~10 minutes to appear.
+
+## Components
+
+### W3ACT Exports
+
+The `w3act_export` service downloads the regular W3ACT database dump from HDFS (`/9_processing/w3act/w3act-db-csv.zip`) and uses it to generate the data sources the rest of the stack needs. The service runs once when the stack is deployed or when it is updated. Regular updates can be orchestrated by using cron to run:
+
+    docker service update --force access_data_w3act_export
+
+TODO: On completing these tasks, the service sends metrics to Prometheus for monitoring (TBA).
+
+### Crawl Log Analyser
+
+The `analyse` service connects to the Kafka crawl log of the frequent crawler, and aggregates statistics on recent crawling activity. This is summarised into a regularly-updated JSON file that the UKWA Access API part of the website stack makes available for users. This is used by the https://ukwa-vis.glitch.me/ live crawler glitch experiment.
+
+## Cron Tasks
+
+As mentioned above, a cron task should be set up to run the W3ACT Export. This cron task should run hourly.
+
+# The Website Stack
+
+The [access_website stack](./website/docker-compose.yml) runs the services that actually provide the end-user website for https://www.webarchive.org.uk/ or https://beta.webarchive.org.uk/ or https://dev.webarchive.org.uk.
+
+## Deployment
+
+The stack is deployed using:
+
+    cd website/
+    ./deploy-access-website.sh dev
+
+As with the data stack, this script must be set up for the variations across deployment contexts. For example, the DEV version is password-protected, and is configured to pick this up from our internal repository.
+
+**NOTE** that this website stack generates and caches images of archived web pages, and hence will require a reasonable amount of storage for this cache (see below for details).
+
+### NGINX Proxies
+
+The website is designed to be run behind a boundary web proxy that handles SSL etc. To make use of this stack of services, the server that provides e.g. `dev.webarchive.org.uk` will need to be configured to point to the right API endpoint, which by convention is `website.dapi.wa.bl.uk`.
+
+The set of current proxies and historical redirects associated with the website are now contained in the [internal nginx.conf](./config/nginx.conf). This sets up a service on port 80 where all the site components can be accessed. Once running, the entire system should be exposed properly via the API gateway. For example, for accessing the dev system we want `website.dapi.wa.bl.uk` to point to `dev-swarm-members:80`.
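+A quick way to sanity-check this mapping is to call the gateway endpoint directly, passing the public hostname in the `Host` header so downstream services see the expected value. This is only a hedged sketch (the hostnames follow the conventions above, and it assumes `curl` on a machine that can reach the gateway):
+
+    curl -s -o /dev/null -w '%{http_code}\n' -H 'Host: dev.webarchive.org.uk' http://website.dapi.wa.bl.uk/
+
+A `200` (or a redirect code) suggests the chain is in place; a `502` usually means one of the proxied services inside the stack is not up yet.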
+
+Because most of the complexity of the NGINX setup is in the internal NGINX, the proxy setup at the edge is much simpler. e.g. for DEV, the external-facing NGINX configuration looks like:
+
+```
+    location / {
+        # Used to tell downstream services what external host/port/etc. is:
+        proxy_set_header Host $host;
+        proxy_set_header X-Forwarded-Proto $scheme;
+        proxy_set_header X-Forwarded-Host $host;
+        proxy_set_header X-Forwarded-Port $server_port;
+        proxy_set_header X-Forwarded-For $remote_addr;
+        # Used for rate-limiting Mementos lookups:
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_pass http://website.dapi.wa.bl.uk/;
+    }
+```
+
+(Internal users can see the `dev_443.conf` setup for details.)
+
+The [internal NGINX configuration](./config/nginx.conf) is more complex, merging together the various back-end systems and passing on the configuration as appropriate. For example, [the configuration for the public PyWB service](https://github.com/ukwa/ukwa-services/blob/d68e54d6d7d44e714df24bf31223c8f8f46e5ff6/access/website/config/nginx.conf#L40-L42) includes:
+
+```
+    uwsgi_param UWSGI_SCHEME $http_x_forwarded_proto;
+    uwsgi_param SCRIPT_NAME /wayback;
+```
+
+The service picks up the host name from the standard HTTP `Host` header, but here we add the scheme (http/https, passed from the upstream NGINX server via the `X-Forwarded-Proto` header) and fix the deployment path using the `SCRIPT_NAME` CGI variable.
+
+Having set this chain up, if we visit e.g. `dev.webarchive.org.uk` the traffic should show up on the API server as well as on the Docker container.
+
+**NOTE** that changes to the internal NGINX configuration are only picked up when it starts, so it is necessary to run:
+
+    docker service update --force access_website_nginx
+
+after which NGINX should restart, pick up any configuration changes, and re-check whether it can connect to the proxied services inside the stack.
+
+Because the chain of proxies is quite complicated, we also add a `Via` header at each layer, e.g.
+
+```
+    # Add header for tracing where issues occur:
+    add_header Via $hostname always;
+```
+
+This adds a hostname for every successful proxy request, so the number of `Via` headers and their values can be used to trace problems with the proxies.
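+For example, a simple way to inspect the chain from outside (the hostname here is illustrative):
+
+    curl -sI https://dev.webarchive.org.uk/ | grep -i '^via:'
+
+Each layer that handled the response should contribute one `Via` value, so a missing or truncated list indicates which hop to investigate.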
+
+## Components
+
+Behind the NGINX, we have a set of modular components:
+
+- The [ukwa-ui](https://github.com/ukwa/ukwa-ui) service that provides the main user interface.
+- The [ukwa-pywb](https://github.com/ukwa/ukwa-pywb) service that provides access to archived web pages.
+- The [mementos](https://github.com/ukwa/mementoweb-webclient) service that allows users to look up URLs via Memento.
+- The [shine](https://github.com/ukwa/shine) and shinedb services that provide our older prototype researcher interface.
+- The [ukwa-access-api](https://github.com/ukwa/ukwa-access-api) and related services (pywb-nobanner, webrender-api, Cantaloupe) that provide API services.
+  - The API services include a caching image server ([Cantaloupe](https://cantaloupe-project.github.io/)) that takes rendered versions of archived websites and exposes them via the standard [IIIF Image API](https://iiif.io/api/image/2.1/). This will need substantial disk space (~1TB).
+
+### Shine Database
+
+Shine requires a PostgreSQL database, so additional setup is required using the scripts in [./scripts/postgres](./scripts/postgres).
+
+#### Stop the Shine service
+
+When modifying the database, and having deployed the stack, you first need to stop Shine itself from running, as otherwise it will attempt to start up and insert an empty database into PostgreSQL, which will interfere with the restore process. So, use:
+
+    $ docker service scale access_website_shine=0
+
+This will drop the Shine service but leave all the rest of the stack running.
+
+#### Creating the Shine database
+
+* `create-db.sh`
+* `create-user.sh`
+* `list-db.sh`
+
+Within `scripts/postgres/`, you can run `create-db.sh` to create the database itself. Then, run `create-user.sh` to run the `setup_user.sql` script and set up a suitable user with access to the database. Use `list-db.sh` to check the database is there at this point.
+
+#### Restoring the Shine database from a backup
+
+* Edit `download-shine-db-dump.sh` to use the most recent dated version from HDFS
+* `download-shine-db-dump.sh`
+* `restore-shine-db-from-dump.sh`
+
+To do a restore, you need to grab a database dump from HDFS. Currently, the backups are dated and are in the HDFS `/2_backups/access/access_shinedb/` folder, so you'll need to edit the file to use the appropriate date, then run `download-shine-db-dump.sh` to actually get the database dump. Now, running `restore-shine-db-from-dump.sh` should populate the database.
+
+#### Restart the Shine service
+
+Once you have created and restored the database as needed, re-scale the service and Shine will restart using the restored database.
+
+    $ docker service scale access_website_shine=1
+
+#### Creating a backup of the Shine database
+
+An additional helper script, `backup-shine-db-to-hdfs.sh`, will download a dated dump file of the live database and push it to HDFS:
+
+    ./backup-shine-db-to-hdfs.sh dev
+
+This should be run daily.
+
+## Cron Jobs
+
+There should be a daily (early morning) backup of the Shine database.
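+As a sketch, a crontab entry along the following lines could implement this daily backup (the checkout path and exact schedule here are assumptions, not the deployed configuration):
+
+```
+# Hypothetical crontab entry: back up the Shine database to HDFS at 05:30 each day
+30 5 * * * cd /path/to/ukwa-services/access/website/scripts/postgres && ./backup-shine-db-to-hdfs.sh prod
+```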
+
+# The Website Regression Test Stack
+
+A series of tests for the website are held under the `tests` folder. As well as checking service features and critical APIs, these tests also cover features relating to legal compliance.
+
+The tests are defined as [Robot Framework](https://robotframework.org/) acceptance tests. In [`tests/robot/tests`](./tests/robot/tests) we have a set of `.robot` files that define tests for each major web site feature (e.g. [Wayback](./tests/robot/tests/wayback.robot)). The tests are written in a [pseudo-natural-language tabular format](https://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#test-case-syntax), relying heavily on [web testing automation keywords](https://robotframework.org/SeleniumLibrary/SeleniumLibrary.html) provided by the [Robot Framework Selenium Library](https://github.com/robotframework/SeleniumLibrary).
+
+Here's an example of a simple test sequence...
+
+```
+Open Browser
+    Open Browser To Home Page
+
+Check Wayback EN Home Page
+    [Tags]  wayback  locale  en
+    Go To  %{HOST}/wayback/en
+    Page Should Contain  UK Web Archive Access System
+```
+
+The first test (`Open Browser`) uses the `Open Browser To Home Page` keyword, which we've defined in the shared [`_resource.robot`](./tests/robot/tests/_resource.robot) file. This sets up the right test browser with the right configuration for the tests in this file (when developing tests, take care to ensure that `Open Browser` is _only_ called once per test file: it tends to hang if it's called multiple times). The next test (`Check Wayback EN Home Page`) loads the English-language Wayback homepage and checks the page contains a particular text string (n.b. matching is case-sensitive).
+
+This provides a simple language for describing the expected behaviour of the site, and makes it easy to add further tests. By putting the host name in an environment variable (referenced as `%{HOST}`), we can run the same test sequence across `HOST=https://www.webarchive.org.uk`, `HOST=https://beta.webarchive.org.uk` or even `HOST=https://username:password@dev.webarchive.org.uk`.
+
+The deployment script can be run like this:
+
+    cd website_tests
+    ./deploy-website-tests.sh dev
+
+The script handles setting up the `HOST` based on the deployment context.
+
+The stack will spin up the necessary Selenium containers (with the [Selenium Hub](https://www.guru99.com/introduction-to-selenium-grid.html) exposed on port 4444 in case you want to take a look), and then run the tests. The results will be visible in summary in the console, and in detail via the `results/report.html` report file. If you hit errors, the system should automatically take screenshots so you can see what the browser looked like at the point the error occurred.
+
+The tests are run once on startup, and results are posted to Prometheus. Further test runs can be orchestrated by using cron to run:
+
+    docker service update --force access_website_tests_robot
+
+These can be run each morning, and the metrics posted to Prometheus used to track compliance and raise alerts if needed.
+
+## Cron Jobs
+
+There should be a daily (early morning) run of the website tests.
+
+
+# The Reading Room Wayback Stack
+
+The `rrwb` stack defines the necessary services for running our reading room access services via proxied connections rather than DLS VMs. This new approach is on hold at present.
+
+
+# Monitoring
+
+Having deployed all of the above, the cron jobs described in the preceding sections should be in place.
+
+The `ukwa-monitor` service should be used to check that these are running, and that the W3ACT database export file on HDFS is being updated.
+
+...monitoring setup TBC...
diff --git a/ingest/README.md b/ingest/README.md
new file mode 100644
index 0000000..adea22b
--- /dev/null
+++ b/ingest/README.md
@@ -0,0 +1,10 @@
+The Ingest Stacks
+=================
+
+This section covers the service stacks that are used for curation and for crawling.
+
+- [`w3act`](./w3act/) - where curators define what should be crawled, and describe what has been crawled.
+- [`fc`](./fc/) - the Frequent Crawler, which crawls sites as instructed in `w3act`.
+- [`dc`](./dc/) - the Domain Crawler, which is used to crawl all UK sites, once per year.
+
+The [`crawl_log_db`](./crawl_log_db/) service is not in use, but contains a useful example of how a Solr service and its associated schema can be set up using the Solr API rather than maintaining XML configuration files.
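+For the curious, the pattern it demonstrates uses Solr's standard Schema API to define fields programmatically. A minimal sketch (the collection name and field definition here are illustrative, not the actual crawl-log schema):
+
+```
+# Add a field to a Solr collection via the Schema API, rather than hand-editing schema XML:
+curl -X POST -H 'Content-Type: application/json' \
+  --data '{"add-field": {"name": "status_code", "type": "pint", "stored": true}}' \
+  http://localhost:8983/solr/crawl_log/schema
+```
+
+Because the schema is applied through API calls, it can be version-controlled as a script and replayed against a fresh Solr instance.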
diff --git a/ingest/ingest_tests/.gitignore b/ingest/ingest_tests/.gitignore deleted file mode 100644 index 1a06816..0000000 --- a/ingest/ingest_tests/.gitignore +++ /dev/null @@ -1 +0,0 @@ -results diff --git a/ingest/ingest_tests/README.md b/ingest/ingest_tests/README.md deleted file mode 100644 index 2115801..0000000 --- a/ingest/ingest_tests/README.md +++ /dev/null @@ -1,33 +0,0 @@ -Ingest Tests -============ - -This is a system that is intended to be used to run tests on both developmetn and production systems, and post the results to Prometheus. It is also designed to be run against near-production versions by changing the `TEST_HOST` environment variable, so that it can be used to verify that a new version of a service passes the tests ahead of deployment to production. - -Tests that can be safely run on production systems without modifying any important data are in the `tests` folder. - -Any tests that may modify data are kept separate, placed in the `tests_destructive` folder. These should only run on BETA or DEV deployments. - -Most of the depenendencies are handled by `ukwa/robot-framework` docker image on which this relies. This ensure the additional libraries to run web browsers and talks to Prometheus are in place. Specifically, the container includes [RequestsLibrary](https://marketsquare.github.io/robotframework-requests/doc/RequestsLibrary.html) for testing APIs, and the [robotframework-browser](https://robotframework-browser.org/) library (based on [Playwright](https://playwright.dev/)) and [SeleniumLibrary](http://robotframework.org/SeleniumLibrary/) for browser-driven tests. The Playwright-based library is a bit simpler to deploy than the Selenium-based on, so tests should be switched over to the former where possible. - -The `deploy-ingest-tests.sh` shows how the script can be run as a Docker Service. However, when developing tests, it can be easier to set up the necessary environment variables and run the tests using Docker Compose. e.g. - -Set up the environment variables for running tests against the DEV service: - - source /mnt/nfs/config/gitlab/ukwa-services-env/dev.env - -And now run the tests: - - docker-compose run robot - -Which runs the tests and reports to the console, rather than running them as a background service (as `deploy-ingest-tests.sh` would). - -Using Docker avoids having to install dependencies. If using Docker is not an option, you could set up a Python virtual environent and install `robotframework-requests` and `robotframework-browser`, the run the tests like this: - - robot --outputdir ./results ./tests - -Or, if running the full (destructive) test suite against DEV/BETA: - - robot --outputdir ./results ./tests ./tests_destructive - -Once the tests have run, the results of the tests will be in the `results` folder. This is very detailed, and the system will capture screenshots when things go wrong, so this can all be very useful for determining the cause of test failure. - diff --git a/ingest/ingest_tests/deploy-ingest-tests.sh b/ingest/ingest_tests/deploy-ingest-tests.sh deleted file mode 100755 index 73ca6cf..0000000 --- a/ingest/ingest_tests/deploy-ingest-tests.sh +++ /dev/null @@ -1,28 +0,0 @@ -#!/bin/sh - -# read script environ argument -ENVIRON=$1 -if ! 
[[ ${ENVIRON} =~ dev|beta|prod ]]; then - echo "ERROR: Script $0 requires environment argument (dev|beta|prod)" - exit -fi - -if [[ ${ENVIRON} == 'dev' ]]; then - # Set up the dev.webarchive.org.uk vars - set -a # automatically export all variables - source /mnt/nfs/config/gitlab/ukwa-services-env/dev.env - set +a - export EXTRA_TESTS="/tests_destructive" -else - export PUSH_GATEWAY=monitor.wa.bl.uk:9091 - echo "ERROR - not yet configured!" - exit -fi - -# -- -echo Running tests using TEST_HOST = $TEST_HOST -echo WARNING! Tests will fail if the TEST_HOST variable has a trailing slash! - -#docker stack deploy -c docker-compose.yml ingest_tests -docker-compose run robot - diff --git a/ingest/ingest_tests/docker-compose.yml b/ingest/ingest_tests/docker-compose.yml deleted file mode 100644 index 251e2f2..0000000 --- a/ingest/ingest_tests/docker-compose.yml +++ /dev/null @@ -1,32 +0,0 @@ -version: '3.2' - -services: - -# ----------------------------------------------------------- -# Automated test engine - test services from 'outside' -# ----------------------------------------------------------- - - robot: - image: ukwa/robot-framework:main - command: "/tests ${EXTRA_TESTS}" - environment: - - "HOST=${TEST_HOST}" - - "HOST_NO_AUTH=${HOST_NO_AUTH}" - - "PUSH_GATEWAY=${PUSH_GATEWAY}" - - "PROMETHEUS_JOB_NAME=${PROMETHEUS_JOB_NAME}" - - "HTTP_PROXY=${HTTP_PROXY}" - - "HTTPS_PROXY=${HTTPS_PROXY}" - - "W3ACT_USERNAME=${W3ACT_USERNAME}" - - "W3ACT_PASSWORD=${W3ACT_PASSWORD}" - volumes: - - ./tests:/tests:ro - - ./tests_destructive:/tests_destructive:ro - - ./results:/results:rw - deploy: - restart_policy: - # Run once: - condition: on-failure - # If it fails, run every hour: - delay: 60m - - diff --git a/ingest/ingest_tests/tests/api.robot b/ingest/ingest_tests/tests/api.robot deleted file mode 100644 index 9fec841..0000000 --- a/ingest/ingest_tests/tests/api.robot +++ /dev/null @@ -1,15 +0,0 @@ - -*** Settings *** -Library Collections -Library RequestsLibrary - -# Set up a session for this whole sequence of tests: -Suite Setup Create Session act_api %{HOST} disable_warnings=1 - -*** Test Cases *** -Log into API - &{data}= Create Dictionary email=%{W3ACT_USERNAME} password=%{W3ACT_PASSWORD} - ${resp}= POST On Session act_api %{HOST}/act/login data=${data} - Should Be Equal As Strings ${resp.status_code} 200 - - diff --git a/ingest/ingest_tests/tests/browse.robot b/ingest/ingest_tests/tests/browse.robot deleted file mode 100644 index 96c75c1..0000000 --- a/ingest/ingest_tests/tests/browse.robot +++ /dev/null @@ -1,69 +0,0 @@ - -*** Settings *** -Library Browser auto_closing_level=SUITE - -# Set up a browser context for this whole sequence of tests: -Suite Setup New Page %{HOST} # HOST includes any web server authentication - - -*** Test Cases *** - -W3ACT Not Logged In Requires Authentication # W3ACT authentication; web server already logged in if needed - New Page %{HOST_NO_AUTH}/act/about # redirects to login - ${test}= Get URL - Should Be Equal As Strings ${test} %{HOST_NO_AUTH}/act/login - -W3ACT Not Logged In Allows Login - Get Text button == Login - -Wayback Not Logged In Requires Authentication - &{response}= HTTP %{HOST_NO_AUTH}/act/wayback/ - Should Be Equal As Strings ${response.status} 401 - -Notebook Apps Not Logged In Requires Authentication - &{response}= HTTP %{HOST_NO_AUTH}/act/nbapps/ - Should Be Equal As Strings ${response.status} 401 - -Log Viewer Not Logged In Requires Authentication - &{response}= HTTP %{HOST_NO_AUTH}/act/logs/ - Should Be Equal As Strings ${response.status} 401 
- -W3ACT Log In - New Page %{HOST} # not sure why we need to re-init web server auth considering the auto close scope - Go To %{HOST_NO_AUTH}/act/login - Fill Secret input#email %W3ACT_USERNAME - Fill Secret input#password %W3ACT_PASSWORD - Click button#submit # takes us to About page - ${test}= Get URL - Should Be Equal As Strings ${test} %{HOST_NO_AUTH}/act/about - -Wayback Logged In - &{response}= HTTP %{HOST_NO_AUTH}/act/wayback/ - Should Be Equal As Strings ${response.status} 200 - -Notebook Apps Logged In - &{response}= HTTP %{HOST_NO_AUTH}/act/nbapps/ - Should Be Equal As Strings ${response.status} 200 - -Log Viewer Logged In - &{response}= HTTP %{HOST_NO_AUTH}/act/logs/ - Should Be Equal As Strings ${response.status} 200 - -Log Out Returns To Login - Click text=Logout - Get Text button == Login - -# test that logging out denies access as before to the various services - -Wayback Not Logged In Requires Authentication - &{response}= HTTP %{HOST_NO_AUTH}/act/wayback/ - Should Be Equal As Strings ${response.status} 401 - -Notebook Apps Not Logged In Requires Authentication - &{response}= HTTP %{HOST_NO_AUTH}/act/nbapps/ - Should Be Equal As Strings ${response.status} 401 - -Log Viewer Not Logged In Requires Authentication - &{response}= HTTP %{HOST_NO_AUTH}/act/logs/ - Should Be Equal As Strings ${response.status} 401 - diff --git a/ingest/ingest_tests/tests_destructive/api_ddhapt_upload.robot b/ingest/ingest_tests/tests_destructive/api_ddhapt_upload.robot deleted file mode 100644 index 0c638a1..0000000 --- a/ingest/ingest_tests/tests_destructive/api_ddhapt_upload.robot +++ /dev/null @@ -1,20 +0,0 @@ - -*** Settings *** -Library Collections -Library RequestsLibrary - -# Set up a session for this whole sequence of tests: -Suite Setup Create Session act_api %{HOST} disable_warnings=1 - -*** Test Cases *** -Log into API - &{data}= Create Dictionary email=%{W3ACT_USERNAME} password=%{W3ACT_PASSWORD} - ${resp}= POST On Session act_api %{HOST}/act/login data=${data} - Should Be Equal As Strings ${resp.status_code} 200 - -POST a Document - &{data}= Create Dictionary target_id=9022 wayback_timestamp=20211003002015 document_url=https://www.amnesty.org/download/Documents/EUR2500882019ENGLISH.PDF landing_page_url=https://www.amnesty.org/en/documents/eur25/0088/2019/en/ filename=EUR2500882019ENGLISH.PDF - ${resp}= POST On Session act_api %{HOST}/act/documents json=[${data}] - Should Be Equal As Strings ${resp.status_code} 200 - Should Be Equal As Strings ${resp.text} No new documents added -