Privacy Sql Tracking Detection Using Easylist Adservers #3730

Merged
Changes from 33 commits
Commits
41 commits
1fcf6a5
Add GA4 fields to match documentation (#3679)
tunetheweb Jun 10, 2024
523ae0d
Update Timestamps (#3680)
github-actions[bot] Jun 10, 2024
312ef27
Bump web-vitals from 4.1.0 to 4.1.1 in /src (#3681)
dependabot[bot] Jun 11, 2024
d215f0a
Bump puppeteer from 22.10.0 to 22.10.1 in /src (#3682)
dependabot[bot] Jun 11, 2024
0cdf408
Bump prettier from 3.3.1 to 3.3.2 in /src (#3683)
dependabot[bot] Jun 11, 2024
5e008a6
Bump puppeteer from 22.10.1 to 22.11.0 in /src (#3684)
dependabot[bot] Jun 14, 2024
91a5fb9
Translation of security chapter to Japanese (#3685)
ksakae1216 Jun 14, 2024
a610110
Bump puppeteer from 22.11.0 to 22.11.2 in /src (#3688)
dependabot[bot] Jun 19, 2024
7a71ed2
Bump web-vitals from 4.1.1 to 4.2.0 in /src (#3690)
dependabot[bot] Jun 21, 2024
0974d6b
Bump puppeteer from 22.11.2 to 22.12.0 in /src (#3689)
dependabot[bot] Jun 21, 2024
1a8586a
Update Timestamps (#3691)
github-actions[bot] Jun 21, 2024
5107c81
Remove deploy.zip step of deployment (#3692)
tunetheweb Jun 21, 2024
795caeb
Bump puppeteer from 22.12.0 to 22.12.1 in /src (#3694)
dependabot[bot] Jun 26, 2024
4219fbc
Bump treosh/lighthouse-ci-action from 11.4.0 to 12.1.0 (#3693)
dependabot[bot] Jun 26, 2024
28488e8
Bump web-vitals from 4.2.0 to 4.2.1 in /src (#3695)
dependabot[bot] Jul 6, 2024
2fb6060
Bump actions/setup-python from 5.1.0 to 5.1.1 (#3699)
dependabot[bot] Jul 13, 2024
dd0fafc
Bump puppeteer from 22.12.1 to 22.13.0 in /src (#3698)
dependabot[bot] Jul 13, 2024
e0c67bd
Translation of mobile-web chapter to Japanese (#3700)
ksakae1216 Jul 13, 2024
61b52d5
Bump puppeteer from 22.13.0 to 22.15.0 in /src (#3711)
dependabot[bot] Aug 1, 2024
eeb98b2
Bump jsdom from 24.1.0 to 24.1.1 in /src (#3707)
dependabot[bot] Aug 1, 2024
43a2ac9
Bump web-vitals from 4.2.1 to 4.2.2 in /src (#3706)
dependabot[bot] Aug 1, 2024
48d86da
Bump prettier from 3.3.2 to 3.3.3 in /src (#3702)
dependabot[bot] Aug 1, 2024
0b4e6ff
Bump web-vitals from 4.2.2 to 4.2.3 in /src (#3715)
dependabot[bot] Aug 7, 2024
90e7e8a
Update Timestamps (#3716)
github-actions[bot] Aug 7, 2024
b637da1
tracking detection using easylist adservers
Aug 13, 2024
f7b1283
easylist_adserver tracking detection and query
Aug 13, 2024
f428a3e
2022 cdn portuguese (#3725)
HakaCode Aug 14, 2024
2c361dd
Bump puppeteer from 22.15.0 to 23.0.2 in /src (#3719)
dependabot[bot] Aug 14, 2024
74330d7
Update Timestamps (#3726)
github-actions[bot] Aug 14, 2024
5b7997e
Replace `<object>` with `<iframe>` for embedded SVG (#3727)
tunetheweb Aug 14, 2024
bf7fee0
auto upload easylist data to table
Aug 14, 2024
da71b3e
Fix the build to ignore 2024 chapters (for now) (#3728)
tunetheweb Aug 14, 2024
6e01d17
Update Timestamps (#3729)
github-actions[bot] Aug 14, 2024
26965e7
liniting
Aug 15, 2024
3107288
liniting
Aug 15, 2024
bac6526
linting
Aug 15, 2024
e6ec23d
linting
Aug 15, 2024
8331ecf
linting
Aug 15, 2024
d48eafc
linting
Aug 15, 2024
90be0cf
fixes of Simplified Chinese translation for 2020 Performance (#3734)
Zuckjet Aug 16, 2024
3128f3a
Merge branch 'main' into privacy-sql-tracking-detection-easylist
max-ostapenko Aug 16, 2024
2 changes: 1 addition & 1 deletion .github/workflows/code-static-analysis.yml
@@ -35,7 +35,7 @@ jobs:
uses: actions/checkout@v4
- name: Set up Python 3.8
if: ${{ matrix.language == 'python' }}
uses: actions/setup-python@v5.1.0
uses: actions/setup-python@v5.1.1
with:
python-version: '3.8'
- name: Install dependencies
2 changes: 1 addition & 1 deletion .github/workflows/lintsql.yml
@@ -19,7 +19,7 @@ jobs:
# Full git history is needed to get a proper list of changed files within `super-linter`
fetch-depth: 0
- name: Set up Python 3.8
uses: actions/setup-python@v5.1.0
uses: actions/setup-python@v5.1.1
with:
python-version: '3.8'
- name: Lint SQL code
4 changes: 2 additions & 2 deletions .github/workflows/predeploy.yml
@@ -35,9 +35,9 @@ jobs:
- name: Setup Node.js for use with actions
uses: actions/setup-node@v4
with:
node-version: '16'
node-version: '20'
- name: Set up Python 3.8
uses: actions/setup-python@v5.1.0
uses: actions/setup-python@v5.1.1
with:
python-version: '3.8'
- name: Install Asian Fonts
2 changes: 1 addition & 1 deletion .github/workflows/production-checks.yml
@@ -28,7 +28,7 @@ jobs:
- name: Set the list of URLs for Lighthouse to check
run: ./src/tools/scripts/set_lighthouse_urls.sh -p
- name: Audit URLs using Lighthouse
uses: treosh/lighthouse-ci-action@11.4.0
uses: treosh/lighthouse-ci-action@12.1.0
id: LHCIAction
with:
# For prod, we simply check for 100% in Accessibility, Best Practices and SEO
2 changes: 1 addition & 1 deletion .github/workflows/test-template-changes.yml
@@ -35,7 +35,7 @@ jobs:
- name: Setup Node.js for use with actions
uses: actions/setup-node@v4
with:
node-version: '16'
node-version: '20'
- name: Test Template Changes
run: ./src/tools/scripts/test_template_changes.sh
- name: 'Comment PR'
6 changes: 3 additions & 3 deletions .github/workflows/test_website.yml
@@ -28,9 +28,9 @@ jobs:
- name: Setup Node.js for use with actions
uses: actions/setup-node@v4
with:
node-version: '16'
node-version: '20'
- name: Set up Python 3.8
uses: actions/setup-python@v5.1.0
uses: actions/setup-python@v5.1.1
with:
python-version: '3.8'
- name: Run the website
@@ -53,7 +53,7 @@ jobs:
COMMIT_SHA: ${{ github.sha }}
run: ./src/tools/scripts/set_lighthouse_urls.sh
- name: Audit URLs using Lighthouse
uses: treosh/lighthouse-ci-action@11.4.0
uses: treosh/lighthouse-ci-action@12.1.0
id: LHCIAction
with:
# For dev, turn off all timing perf audits (too unreliable) and a few others that don't work on dev
38 changes: 38 additions & 0 deletions sql/2024/privacy/tracking-detection/easylist-tracker-detection.sql
@@ -0,0 +1,38 @@
-- Returns 1 if `domain` occurs anywhere in `url` (a substring match), else 0
CREATE TEMP FUNCTION
CheckDomainInURL(url STRING, domain STRING)
RETURNS INT64
LANGUAGE js AS """
  return url.includes(domain) ? 1 : 0;
""";

-- The `httparchive.almanac.easylist_adservers` table must first be populated with the list of domains to block,
-- loaded from `easylist_adservers.csv` (see sql/util/populate_easylist_adserver.py), which is derived from
-- https://github.com/easylist/easylist/blob/master/easylist/easylist_adservers.txt
WITH easylist_data AS (
  SELECT string_field_0
  FROM `httparchive.almanac.easylist_adservers`
),

requests_data AS (
  SELECT url
  FROM `httparchive.all.requests`
  WHERE
    date = '2024-06-01' AND
    is_root_page = TRUE
),

block_status AS (
  SELECT
    r.url,
    MAX(
      CASE
        WHEN CheckDomainInURL(r.url, e.string_field_0) = 1 THEN 1
        ELSE 0
      END
    ) AS should_block
  FROM requests_data r
  LEFT JOIN easylist_data e
  ON CheckDomainInURL(r.url, e.string_field_0) = 1
  GROUP BY r.url
)

SELECT
  COUNT(0) AS blocked_url_count
FROM block_status
WHERE should_block = 1;
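Note on the matching rule: `CheckDomainInURL` tests whether the domain string occurs anywhere in the URL, so a listed domain that only appears in a path or query string also counts as a match. The following is a minimal sketch, in Python rather than BigQuery SQL, of hostname-based matching that is closer to how a `||domain^` rule is meant to apply (the domain itself and its subdomains). The helper names `host_suffixes` and `should_block` and the sample domains are hypothetical and are not part of this PR.

```python
from urllib.parse import urlsplit


def host_suffixes(host):
    """Yield the host and each parent domain: a.b.c -> a.b.c, b.c, c."""
    labels = host.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])


def should_block(url, blocked_domains):
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = urlsplit(url).hostname or ""
    return any(suffix in blocked_domains for suffix in host_suffixes(host))


# Hypothetical sample data, not taken from EasyList
blocked = {"ads.example"}

# Host is a subdomain of a blocked domain
print(should_block("https://cdn.ads.example/pixel.gif", blocked))  # True
# Blocked domain only appears in the query string
print(should_block("https://news.example/?ref=ads.example", blocked))  # False
# A plain substring test, as in the UDF, matches the second URL as well
print("ads.example" in "https://news.example/?ref=ads.example")  # True
```

Because each host is reduced to a handful of suffixes that are looked up in a set, this approach also avoids comparing every URL with every listed domain.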
76 changes: 76 additions & 0 deletions sql/util/populate_easylist_adserver.py
@@ -0,0 +1,76 @@
# pylint: disable=import-error
import requests
import pandas as pd
from google.cloud import bigquery


def extract_domains_from_file(file_path):
    domains = []
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            for line in file:
                rule = line.strip()
                # Skip blank lines, '!' comments and '[...]' headers
                if not rule or rule.startswith(("!", "[")):
                    continue
                # Remove the '||' prefix, any '$options' and the '^' suffix
                domain = rule.lstrip("|").split("$", 1)[0].rstrip("^")
                if domain:  # Ensure the rule is not empty
                    domains.append(domain)
    except FileNotFoundError:
        print(f"Error: The file {file_path} does not exist.")
    except Exception as e:
        print(f"An error occurred: {e}")
    return domains


def save_domains_to_csv(domains, csv_file_path):
    try:
        # Create a DataFrame from the list of domains
        df = pd.DataFrame(domains, columns=["Domain"])
        # Save the DataFrame to a CSV file
        df.to_csv(csv_file_path, index=False)
    except Exception as e:
        print(f"An error occurred while writing to CSV: {e}")


def upload_csv_to_bigquery(csv_file_path):
    # this needs the GOOGLE_APPLICATION_CREDENTIALS env variable to be set
    client = bigquery.Client()

    # Configure the job
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # Adjust if your CSV doesn't have a header row
        autodetect=True,  # Automatically infer schema
    )

    # Load data from the CSV file
    with open(csv_file_path, "rb") as source_file:
        load_job = client.load_table_from_file(
            source_file, "httparchive.almanac.easylist_adservers",
            job_config=job_config
        )

    # Wait for the job to complete
    load_job.result()


# URL of the text file containing the EasyList ad-server filter rules
url = "https://raw.githubusercontent.com/easylist/easylist/master/" \
    "easylist/easylist_adservers.txt"
file_path = "easylist_adservers.txt"
# Path to the output CSV file
csv_file_path = "easylist_adservers.csv"

# Download the file and save it locally
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop here rather than parse an error page
with open(file_path, "wb") as file:
    file.write(response.content)

# Extract domains
domains = extract_domains_from_file(file_path)

# Save domains to CSV
save_domains_to_csv(domains, csv_file_path)

# upload domains to BQ
upload_csv_to_bigquery(csv_file_path)

print(f"Domains have been saved to {csv_file_path}")
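Note on the rule format: lines in an Adblock-style filter list are not all of the form `||domain^`. Comment lines start with `!`, and a rule can carry options after a `$`. The following is a minimal sketch of reducing such lines to bare domains, using only the standard library. The function name `parse_adserver_rule` and the sample lines are hypothetical and are not part of this PR.

```python
def parse_adserver_rule(line):
    """Return the domain of a `||domain^` rule, or None for other lines."""
    rule = line.strip()
    if not rule.startswith("||"):
        return None  # comments ('!'), headers and other rule types
    # Drop the '||' anchor, then any '$option' list, then the '^' separator
    domain = rule[2:].split("$", 1)[0].rstrip("^")
    return domain or None


# Hypothetical sample lines, not copied from EasyList
sample = [
    "! *** hypothetical comment line ***",
    "||ads.example^",
    "||tracker.example^$third-party",
    "",
]
domains = [d for d in map(parse_adserver_rule, sample) if d]
print(domains)  # ['ads.example', 'tracker.example']
```

Returning `None` for anything that is not a `||` rule keeps comments and unsupported rule types out of the table instead of storing them as if they were domains.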
1 change: 0 additions & 1 deletion src/.gcloudignore
@@ -43,5 +43,4 @@ static/pdfs/*
**/.DS_Store
Dockerfile
.dockerignore
deployed.zip
.coverage
1 change: 0 additions & 1 deletion src/.gitignore
@@ -8,5 +8,4 @@ templates/*/rss.xml
templates/sitemap.xml
static/html/
static/js/web-vitals.js
deployed.zip
.coverage
3 changes: 1 addition & 2 deletions src/README.md
@@ -14,7 +14,7 @@ An `.editorconfig` file exists for those using [EditorConfig](https://editorconf

Make sure you run the following commands from within the `src` directory by executing `cd src` first.

Make sure Python (3.8 or above), pip and NodeJS (v16 or above) are installed on your machine.
Make sure Python (3.8 or above), pip and NodeJS (v20 or above) are installed on your machine.

1. If you don't have virtualenv, install it using pip.

@@ -376,7 +376,6 @@ The deploy script will do the following:
- Ask you to complete any local tests and confirm good to deploy
- Ask for a version number (suggesting the last version tagged and incrementing the patch)
- Tag the release (after asking you for the version number to use)
- Generate a `deploy.zip` file of what has been deployed
- Deploy to GCP
- Push changes to `production` branch on GitHub
- Switch you back to the `main` branch.
18 changes: 14 additions & 4 deletions src/config/last_updated.json
@@ -36,13 +36,13 @@
},
"/static/js/send-web-vitals.js": {
"date_published": "2021-02-24T00:00:00.000Z",
"date_modified": "2024-06-05T00:00:00.000Z",
"hash": "93b415ccbc2d2f5de8627d6019546f09"
"date_modified": "2024-06-10T00:00:00.000Z",
"hash": "b7224f484fe762e075d4838286ddb066"
},
"/static/js/web-vitals.js": {
"date_published": "2020-11-13T00:00:00.000Z",
"date_modified": "2024-06-05T00:00:00.000Z",
"hash": "10638eba1611ff0dc07edbe721e3eb45"
"date_modified": "2024-08-07T00:00:00.000Z",
"hash": "94d123623c67f0e7774480cf1ad078cd"
},
"/static/js/webmentions.js": {
"date_published": "2021-12-01T00:00:00.000Z",
@@ -1768,6 +1768,11 @@
"date_modified": "2023-04-05T00:00:00.000Z",
"hash": "0841b96b9550aada0ca96c1ba297a702"
},
"ja/2022/chapters/mobile-web.html": {
"date_published": "2024-08-07T00:00:00.000Z",
"date_modified": "2024-08-07T00:00:00.000Z",
"hash": "9847699b89d16424f1dd2c4fab17b713"
},
"ja/2022/chapters/performance.html": {
"date_published": "2024-02-18T00:00:00.000Z",
"date_modified": "2024-02-18T00:00:00.000Z",
@@ -1778,6 +1783,11 @@
"date_modified": "2024-05-07T00:00:00.000Z",
"hash": "025f7034129e8d56d6b4a7ee0d699762"
},
"ja/2022/chapters/security.html": {
"date_published": "2024-06-21T00:00:00.000Z",
"date_modified": "2024-06-21T00:00:00.000Z",
"hash": "805a470811f9e404912bac394555f032"
},
"ja/2022/chapters/seo.html": {
"date_published": "2023-10-19T00:00:00.000Z",
"date_modified": "2023-10-19T00:00:00.000Z",