Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cookies sql 2024 #3741

Merged
merged 25 commits into from
Nov 10, 2024
Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 39 additions & 0 deletions sql/2024/cookies/0_create_desktop_cookies.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
-- Create an intermediate table containing all cookies that were set during the
-- <DATE> crawl on <CLIENT> when visiting sites of rank <= <RANK>. This table
-- can then be reused in consecutive queries without having to reextract the
-- data every time
-- Export the table as httparchive.almanac.DATE_CLIENT_RANK_cookies

WITH intermediate_cookie AS (
SELECT
page,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to restrict this to just root_pages? Or sum up cookies across both home and secondary pages for the same root_page so we can count sites using cookies?

I see you use NET.HOST(page) later on but root_page would perhaps be better and allow this filtering up front.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to be as general as possible when extracting the cookies table, and I did not know up front if we were going to only stick to root_page or no for this year's queries, same for in the future.

rank,
JSON_VALUE(summary, '$.startedDateTime') AS startedDateTime,
cookie
FROM
`httparchive.all.pages`,
UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.cookies')) AS cookie
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
rank <= 1000000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to restrict it like this? Can we add a comment to explain why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to reduce the amount of data processed/parsed; without any cap on the rank, bigquery was returning an error "Response too large to return" or something like that. A solution is to specify a table where to output the results, but I don't have the permissions on the HTTP Archive project to create that.

So, I created the table on a personal project/dataset, but then because I am on the free plan, there are some limits on amount of data I can store, and cookies on top 1M for both mobile and desktop were right below that limit, and top 1M seem plenty of enough data already.

Ideally, these cookies should not be in custom metrics but in their own separate table in my opinion. I believe I saw at some point an issue/discussion on HTTPArchive project where it was proposed to break down some fields in custom metrics in their table/column, instead of the current big blob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can this table, which should be much cheaper to query:

-- Create an intermediate table containing all cookies that were set during the
-- <DATE> crawl on <CLIENT> when visiting sites of rank <= <RANK>. This table
-- can then be reused in consecutive queries without having to reextract the
-- data every time
-- Export the table as httparchive.almanac.DATE_CLIENT_RANK_cookies

CREATE TABLE `httparchive.almanac.cookies` 
(
  date DATE,
  client STRING,
  page STRING,
  root_page STRING,
  rank INTEGER,
  startedDateTime STRING,
  firstPartyCookie BOOL,
  name STRING,
  domain STRING,
  path STRING,
  expires STRING,
  size STRING,
  httpOnly STRING,
  secure STRING,
  session STRING,
  sameSite STRING,
  sameParty STRING,
  partitionKey STRING,
  partitionKeyOpaque STRING
)
PARTITION BY date
CLUSTER BY
  client, rank, page

AS

WITH intermediate_cookie AS (
  SELECT
    date,
    client,
    page,
    root_page,
    rank,
    JSON_VALUE(summary, '$.startedDateTime') AS startedDateTime,
    cookie
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.cookies')) AS cookie
  WHERE
    date = '2024-06-01'
)

SELECT
  date,
  client,
  page,
  root_page,
  rank,
  startedDateTime,
  ENDS_WITH(NET.HOST(page), NET.REG_DOMAIN(JSON_VALUE(cookie, '$.domain'))) AS firstPartyCookie,
  JSON_VALUE(cookie, '$.name') AS name,
  JSON_VALUE(cookie, '$.domain') AS domain,
  JSON_VALUE(cookie, '$.path') AS path,
  JSON_VALUE(cookie, '$.expires') AS expires,
  JSON_VALUE(cookie, '$.size') AS size,
  JSON_VALUE(cookie, '$.httpOnly') AS httpOnly,
  JSON_VALUE(cookie, '$.secure') AS secure,
  JSON_VALUE(cookie, '$.session') AS session,
  JSON_VALUE(cookie, '$.sameSite') AS sameSite,
  JSON_VALUE(cookie, '$.sameParty') AS sameParty,
  JSON_VALUE(cookie, '$.partitionKey') AS partitionKey,
  JSON_VALUE(cookie, '$.partitionKeyOpaque') AS partitionKeyOpaque
FROM intermediate_cookie

Could you update this query to this:

INSERT INTO `httparchive.almanac.cookies` 
WITH intermediate_cookie AS (
  SELECT
    date,
...

But no need to run as I've run it.

Then use this new httparchive.almanac.cookies table in your other queries? I presume it won't change your results much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thank you! Looking into the new table you created and updating the queries for the cookies chapter accordingly.

)

SELECT
page,
rank,
startedDateTime,
ENDS_WITH(NET.HOST(page), NET.REG_DOMAIN(JSON_VALUE(cookie, '$.domain'))) AS firstPartyCookie,
JSON_VALUE(cookie, '$.name') AS name,
JSON_VALUE(cookie, '$.domain') AS domain,
JSON_VALUE(cookie, '$.path') AS path,
JSON_VALUE(cookie, '$.expires') AS expires,
JSON_VALUE(cookie, '$.size') AS size,
JSON_VALUE(cookie, '$.httpOnly') AS httpOnly,
JSON_VALUE(cookie, '$.secure') AS secure,
JSON_VALUE(cookie, '$.session') AS session,
JSON_VALUE(cookie, '$.sameSite') AS sameSite,
JSON_VALUE(cookie, '$.sameParty') AS sameParty,
JSON_VALUE(cookie, '$.partitionKey') AS partitionKey,
JSON_VALUE(cookie, '$.partitionKeyOpaque') AS partitionKeyOpaque
FROM intermediate_cookie
39 changes: 39 additions & 0 deletions sql/2024/cookies/0_create_mobile_cookies.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
-- Create an intermediate table containing all cookies that were set during the
-- <DATE> crawl on <CLIENT> when visiting sites of rank <= <RANK>. This table
-- can then be reused in consecutive queries without having to reextract the
-- data every time
-- Export the table as httparchive.almanac.DATE_CLIENT_RANK_cookies

WITH intermediate_cookie AS (
SELECT
page,
rank,
JSON_VALUE(summary, '$.startedDateTime') AS startedDateTime,
cookie
FROM
`httparchive.all.pages`,
UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.cookies')) AS cookie
WHERE
date = '2024-06-01' AND
client = 'mobile' AND
rank <= 1000000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

)

SELECT
page,
rank,
startedDateTime,
ENDS_WITH(NET.HOST(page), NET.REG_DOMAIN(JSON_VALUE(cookie, '$.domain'))) AS firstPartyCookie,
JSON_VALUE(cookie, '$.name') AS name,
JSON_VALUE(cookie, '$.domain') AS domain,
JSON_VALUE(cookie, '$.path') AS path,
JSON_VALUE(cookie, '$.expires') AS expires,
JSON_VALUE(cookie, '$.size') AS size,
JSON_VALUE(cookie, '$.httpOnly') AS httpOnly,
JSON_VALUE(cookie, '$.secure') AS secure,
JSON_VALUE(cookie, '$.session') AS session,
JSON_VALUE(cookie, '$.sameSite') AS sameSite,
JSON_VALUE(cookie, '$.sameParty') AS sameParty,
JSON_VALUE(cookie, '$.partitionKey') AS partitionKey,
JSON_VALUE(cookie, '$.partitionKeyOpaque') AS partitionKeyOpaque
FROM intermediate_cookie
15 changes: 15 additions & 0 deletions sql/2024/cookies/CHIPS_top_20_first_party_cookies.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
-- Extract the top 20 first party cookies seen across websites that are
-- partitioned. Note: it is a bit weird that 1st party cookies would also be
-- partitioned, as CHIPS is meant for a 3rd party context...
-- Before running query: edit table name

SELECT
name,
COUNT(DISTINCT NET.HOST(page)) / (SELECT (COUNT(DISTINCT NET.HOST(page))) FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`) AS percentWebsites
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie = TRUE AND
partitionKey IS NOT NULL
GROUP BY name
ORDER BY percentWebsites DESC
LIMIT 20
16 changes: 16 additions & 0 deletions sql/2024/cookies/CHIPS_top_20_third_party_cookies.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
-- Extract the top 20 first party cookies seen across websites that are
-- partitioned. Note: it is a bit weird that 1st party cookies would also be
-- partitioned, as CHIPS is meant for a 3rd party context...
-- Before running query: edit table name

SELECT
name,
domain,
COUNT(DISTINCT NET.HOST(page)) / (SELECT (COUNT(DISTINCT NET.HOST(page))) FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`) AS percentWebsites
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie = FALSE AND
partitionKey IS NOT NULL
GROUP BY name, domain
ORDER BY percentWebsites DESC
LIMIT 20
8 changes: 8 additions & 0 deletions sql/2024/cookies/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,3 +18,11 @@
[~google-doc]: https://docs.google.com/document/d/1o2AgdsDq_x3OvthZF7Kb50rUKMVLn7UANT9Stz7ku2I/edit#heading=h.ymg495uvm3yx
[~google-sheets]: https://docs.google.com/spreadsheets/d/1wDGnUkO0rgcU5_V6hmUrhm1pq60VU2XbeMHgYJEEaSM/edit#gid=454016814
[~chapter-markdown]: https://github.com/HTTPArchive/almanac.httparchive.org/tree/main/src/content/en/2024/cookies.md

## Note about SQL queries

First execute [`0_create_desktop_cookies.sql`](0_create_desktop_cookies.sql) and
[`0_create_mobile_cookies.sql`](0_create_mobile_cookies.sql) and export results
into a new table that will be used in other SQL queries. These other SQL queries
assume that the results were exported to
`httparchive.almanac.DATE_CLIENT_RANK_cookies`, rename as needed.
25 changes: 25 additions & 0 deletions sql/2024/cookies/age_expire_cookies_per_type_quantiles.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
-- Extract stats around the expire date (age rounded to closest number of days)
-- (only positive values, non session cookies)
-- Before running query: edit table name

WITH cookies_age AS (
SELECT
firstPartyCookie,
ROUND((CAST(expires AS FLOAT64) - CAST(startedDateTime AS FLOAT64)) / (24 * 3600), 0) AS age
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL AND
CAST(expires AS FLOAT64) >= 0
)

SELECT
firstPartyCookie,
MIN(age) AS min,
APPROX_QUANTILES(age, 100)[OFFSET(25)] AS p25,
APPROX_QUANTILES(age, 100)[OFFSET(50)] AS median,
APPROX_QUANTILES(age, 100)[OFFSET(75)] AS p75,
APPROX_QUANTILES(age, 100)[OFFSET(90)] AS p90,
APPROX_QUANTILES(age, 100)[OFFSET(99)] AS p99,
MAX(age) AS max
FROM cookies_age
GROUP BY firstPartyCookie
22 changes: 22 additions & 0 deletions sql/2024/cookies/age_expire_cookies_quantiles.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
-- Extract stats around the expire date (age rounded to closest number of days)
-- (only positive values, non session cookies)
-- Before running query: edit table name

WITH cookies_age AS (
SELECT
ROUND((CAST(expires AS FLOAT64) - CAST(startedDateTime AS FLOAT64)) / (24 * 3600), 0) AS age
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL AND
CAST(expires AS FLOAT64) >= 0
)

SELECT
MIN(age) AS min,
APPROX_QUANTILES(age, 100)[OFFSET(25)] AS p25,
APPROX_QUANTILES(age, 100)[OFFSET(50)] AS median,
APPROX_QUANTILES(age, 100)[OFFSET(75)] AS p75,
APPROX_QUANTILES(age, 100)[OFFSET(90)] AS p90,
APPROX_QUANTILES(age, 100)[OFFSET(99)] AS p99,
MAX(age) AS max
FROM cookies_age
19 changes: 19 additions & 0 deletions sql/2024/cookies/age_expires_cookies_cdf.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
-- Extract the expire date (age rounded to closest number of days)
-- (only positive values, non session cookies)
-- Before running query: edit table name

WITH cookies_age AS (
SELECT
ROUND((CAST(expires AS FLOAT64) - CAST(startedDateTime AS FLOAT64)) / (24 * 3600), 0) AS age
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL AND
CAST(expires AS FLOAT64) >= 0
)

SELECT
age,
COUNT(0) AS nbCookies
FROM cookies_age
GROUP BY age
ORDER BY age ASC
21 changes: 21 additions & 0 deletions sql/2024/cookies/nb_cookies_cdf.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
-- Extract the nb of cookies
-- Before running query: edit table name

WITH nb_cookies_per_website AS (
SELECT
firstPartyCookie,
NET.HOST(page) AS pageFirstPartyHost,
COUNT(DISTINCT CONCAT(name, domain)) AS distinctNbCookies
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL
GROUP BY firstPartyCookie, pageFirstPartyHost
)

SELECT
firstPartyCookie,
distinctNbCookies,
COUNT(DISTINCT pageFirstPartyHost) AS nbWebsites
FROM nb_cookies_per_website
GROUP BY firstPartyCookie, distinctNbCookies
ORDER BY firstPartyCookie, distinctNbCookies ASC
25 changes: 25 additions & 0 deletions sql/2024/cookies/nb_cookies_per_type_quantiles.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
-- Extract stats about the nb of cookies
-- Before running query: edit table name

WITH nb_cookies_per_website AS (
SELECT
firstPartyCookie,
NET.HOST(page) AS pageFirstPartyHost,
COUNT(DISTINCT CONCAT(name, domain)) AS distinctNbCookies
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL
GROUP BY firstPartyCookie, pageFirstPartyHost
)

SELECT
firstPartyCookie,
MIN(distinctNbCookies) AS min,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(25)] AS p25,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(50)] AS median,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(75)] AS p75,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(90)] AS p90,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(99)] AS p99,
MAX(distinctNbCookies) AS max
FROM nb_cookies_per_website
GROUP BY firstPartyCookie
22 changes: 22 additions & 0 deletions sql/2024/cookies/nb_cookies_quantiles.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
-- Extract stats about the nb of cookies
-- Before running query: edit table name

WITH nb_cookies_per_website AS (
SELECT
NET.HOST(page) AS pageFirstPartyHost,
COUNT(DISTINCT CONCAT(name, domain)) AS distinctNbCookies
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL
GROUP BY pageFirstPartyHost
)

SELECT
MIN(distinctNbCookies) AS min,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(25)] AS p25,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(50)] AS median,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(75)] AS p75,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(90)] AS p90,
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(99)] AS p99,
MAX(distinctNbCookies) AS max
FROM nb_cookies_per_website
22 changes: 22 additions & 0 deletions sql/2024/cookies/prevalence_attributes_per_type.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
-- Prevalence of cookies type and attributes per type of cookie (1st/3rd party)
-- Before running query: edit table name

SELECT
firstPartyCookie,
SUM(IF(httpOnly = 'true', 1, 0)) / COUNT(0) AS httpOnly,
SUM(IF(secure = 'true', 1, 0)) / COUNT(0) AS secure,
SUM(IF(session = 'true', 1, 0)) / COUNT(0) AS session,
SUM(IF(sameParty = 'true', 1, 0)) / COUNT(0) AS sameParty,
SUM(IF(sameSite = 'Lax', 1, 0)) / COUNT(0) AS sameSiteLax,
SUM(IF(sameSite = 'None', 1, 0)) / COUNT(0) AS sameSiteNone,
SUM(IF(sameSite = 'Strict', 1, 0)) / COUNT(0) AS sameSiteStrict,
SUM(IF(sameSite IS NULL, 1, 0)) / COUNT(0) AS sameSiteNull,
SUM(IF(partitionKey IS NOT NULL, 1, 0)) / COUNT(0) AS partitionKey,
SUM(IF(partitionKeyOpaque IS NOT NULL, 1, 0)) / COUNT(0) AS partitionKeyOpaque,
SUM(IF(STARTS_WITH(name, '__Host-'), 1, 0)) / COUNT(0) AS hostPrefix,
SUM(IF(STARTS_WITH(name, '__Secure-'), 1, 0)) / COUNT(0) AS securePrefix
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL -- just in case
GROUP BY
firstPartyCookie
26 changes: 26 additions & 0 deletions sql/2024/cookies/prevalence_type_attributes_per_rank.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
-- Prevalence of cookies type and attributes per rank grouping
-- Before running query: edit table name

SELECT
rank_grouping,
SUM(IF(firstPartyCookie = TRUE, 1, 0)) / COUNT(0) AS firstParty,
SUM(IF(firstPartyCookie = FALSE, 1, 0)) / COUNT(0) AS thirdParty,
SUM(IF(httpOnly = 'true', 1, 0)) / COUNT(0) AS httpOnly,
SUM(IF(secure = 'true', 1, 0)) / COUNT(0) AS secure,
SUM(IF(session = 'true', 1, 0)) / COUNT(0) AS session,
SUM(IF(sameParty = 'true', 1, 0)) / COUNT(0) AS sameParty,
SUM(IF(sameSite = 'Lax', 1, 0)) / COUNT(0) AS sameSiteLax,
SUM(IF(sameSite = 'None', 1, 0)) / COUNT(0) AS sameSiteNone,
SUM(IF(sameSite = 'Strict', 1, 0)) / COUNT(0) AS sameSiteStrict,
SUM(IF(sameSite IS NULL, 1, 0)) / COUNT(0) AS sameSiteNull,
SUM(IF(partitionKey IS NOT NULL, 1, 0)) / COUNT(0) AS partitionKey,
SUM(IF(partitionKeyOpaque IS NOT NULL, 1, 0)) / COUNT(0) AS partitionKeyOpaque
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`,
UNNEST([1000, 5000, 10000, 50000, 100000, 500000, 1000000]) AS rank_grouping
yohhaan marked this conversation as resolved.
Show resolved Hide resolved
WHERE
rank <= rank_grouping AND
firstPartyCookie IS NOT NULL -- just in case
GROUP BY
rank_grouping
ORDER BY
rank_grouping
11 changes: 11 additions & 0 deletions sql/2024/cookies/size_cookies_cdf.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
-- Extract the size of the cookies
-- Before running query: edit table name

SELECT
CAST(size AS INT) AS sizeCookie,
COUNT(0) AS nbCookies
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL
GROUP BY sizeCookie
ORDER BY sizeCookie ASC
16 changes: 16 additions & 0 deletions sql/2024/cookies/size_cookies_per_type_quantiles.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
-- Extract stats around the size of the cookies
-- Before running query: edit table name

SELECT
firstPartyCookie,
MIN(CAST(size AS INT)) AS min,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(25)] AS p25,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(50)] AS median,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(75)] AS p75,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(90)] AS p90,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(99)] AS p99,
MAX(CAST(size AS INT)) AS max
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL
GROUP BY firstPartyCookie
14 changes: 14 additions & 0 deletions sql/2024/cookies/size_cookies_quantiles.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
-- Extract stats around the size of the cookies
-- Before running query: edit table name

SELECT
MIN(CAST(size AS INT)) AS min,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(25)] AS p25,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(50)] AS median,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(75)] AS p75,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(90)] AS p90,
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(99)] AS p99,
MAX(CAST(size AS INT)) AS max
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL
12 changes: 12 additions & 0 deletions sql/2024/cookies/size_extract_largest.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
-- Extract largest cookies being set
-- Before running query: edit table name

SELECT
name,
domain,
CAST(size AS INT) AS sizeCookie
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie = TRUE
ORDER BY sizeCookie DESC
LIMIT 10
12 changes: 12 additions & 0 deletions sql/2024/cookies/top_20_domains_setting_cookies.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
-- Extract the top 20 registrable domains that set cookies
-- Before running query: edit table name

SELECT
NET.REG_DOMAIN(domain) AS regDomain,
COUNT(DISTINCT NET.HOST(page)) / (SELECT (COUNT(DISTINCT NET.HOST(page))) FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`) AS percentWebsites
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie IS NOT NULL
GROUP BY regDomain
ORDER BY percentWebsites DESC
LIMIT 20
12 changes: 12 additions & 0 deletions sql/2024/cookies/top_20_first_party_cookies.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
-- Extract the top 20 first party cookies seen across websites
-- Before running query: edit table name

SELECT
name,
COUNT(DISTINCT NET.HOST(page)) / (SELECT (COUNT(DISTINCT NET.HOST(page))) FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`) AS percentWebsites
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`
WHERE
firstPartyCookie = TRUE
GROUP BY name
ORDER BY percentWebsites DESC
LIMIT 20
Loading