-
-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cookies sql 2024 #3741
Cookies sql 2024 #3741
Changes from 20 commits
ce56cd4
125b40e
81bd54b
a9467ff
06ed1ba
52b275c
276fb71
f22363b
4707f61
dc0e705
fd8dd73
b448d2f
ff2f2d5
092745b
3b7b0c1
c323fee
fa37340
5fe9f58
3af0a20
e77dc7f
23f292f
9c62610
7fe2a8a
3ce844a
7feae53
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
-- Create an intermediate table containing all cookies that were set during the | ||
-- <DATE> crawl on <CLIENT> when visiting sites of rank <= <RANK>. This table | ||
-- can then be reused in consecutive queries without having to reextract the | ||
-- data every time | ||
-- Export the table as httparchive.almanac.DATE_CLIENT_RANK_cookies | ||
|
||
WITH intermediate_cookie AS ( | ||
SELECT | ||
page, | ||
rank, | ||
JSON_VALUE(summary, '$.startedDateTime') AS startedDateTime, | ||
cookie | ||
FROM | ||
`httparchive.all.pages`, | ||
UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.cookies')) AS cookie | ||
WHERE | ||
date = '2024-06-01' AND | ||
client = 'desktop' AND | ||
rank <= 1000000 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any reason to restrict it like this? Can we add a comment to explain why? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This was to reduce the amount of data processed/parsed; without any cap on the rank, bigquery was returning an error "Response too large to return" or something like that. A solution is to specify a table where to output the results, but I don't have the permissions on the HTTP Archive project to create that. So, I created the table on a personal project/dataset, but then because I am on the free plan, there are some limits on amount of data I can store, and cookies on top 1M for both mobile and desktop were right below that limit, and top 1M seem plenty of enough data already. Ideally, these cookies should not be in custom metrics but in their own separate table in my opinion. I believe I saw at some point an issue/discussion on HTTPArchive project where it was proposed to break down some fields in custom metrics in their table/column, instead of the current big blob. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can this table, which should be much cheaper to query: -- Create an intermediate table containing all cookies that were set during the
-- <DATE> crawl on <CLIENT> when visiting sites of rank <= <RANK>. This table
-- can then be reused in consecutive queries without having to reextract the
-- data every time
-- Export the table as httparchive.almanac.DATE_CLIENT_RANK_cookies
CREATE TABLE `httparchive.almanac.cookies`
(
date DATE,
client STRING,
page STRING,
root_page STRING,
rank INTEGER,
startedDateTime STRING,
firstPartyCookie BOOL,
name STRING,
domain STRING,
path STRING,
expires STRING,
size STRING,
httpOnly STRING,
secure STRING,
session STRING,
sameSite STRING,
sameParty STRING,
partitionKey STRING,
partitionKeyOpaque STRING
)
PARTITION BY date
CLUSTER BY
client, rank, page
AS
WITH intermediate_cookie AS (
SELECT
date,
client,
page,
root_page,
rank,
JSON_VALUE(summary, '$.startedDateTime') AS startedDateTime,
cookie
FROM
`httparchive.all.pages`,
UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.cookies')) AS cookie
WHERE
date = '2024-06-01'
)
SELECT
date,
client,
page,
root_page,
rank,
startedDateTime,
ENDS_WITH(NET.HOST(page), NET.REG_DOMAIN(JSON_VALUE(cookie, '$.domain'))) AS firstPartyCookie,
JSON_VALUE(cookie, '$.name') AS name,
JSON_VALUE(cookie, '$.domain') AS domain,
JSON_VALUE(cookie, '$.path') AS path,
JSON_VALUE(cookie, '$.expires') AS expires,
JSON_VALUE(cookie, '$.size') AS size,
JSON_VALUE(cookie, '$.httpOnly') AS httpOnly,
JSON_VALUE(cookie, '$.secure') AS secure,
JSON_VALUE(cookie, '$.session') AS session,
JSON_VALUE(cookie, '$.sameSite') AS sameSite,
JSON_VALUE(cookie, '$.sameParty') AS sameParty,
JSON_VALUE(cookie, '$.partitionKey') AS partitionKey,
JSON_VALUE(cookie, '$.partitionKeyOpaque') AS partitionKeyOpaque
FROM intermediate_cookie Could you update this query to this: INSERT INTO `httparchive.almanac.cookies`
WITH intermediate_cookie AS (
SELECT
date,
... But no need to run as I've run it. Then use this new There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice! Thank you! Looking into the new table you created and updating the queries for the cookies chapter accordingly. |
||
) | ||
|
||
SELECT | ||
page, | ||
rank, | ||
startedDateTime, | ||
ENDS_WITH(NET.HOST(page), NET.REG_DOMAIN(JSON_VALUE(cookie, '$.domain'))) AS firstPartyCookie, | ||
JSON_VALUE(cookie, '$.name') AS name, | ||
JSON_VALUE(cookie, '$.domain') AS domain, | ||
JSON_VALUE(cookie, '$.path') AS path, | ||
JSON_VALUE(cookie, '$.expires') AS expires, | ||
JSON_VALUE(cookie, '$.size') AS size, | ||
JSON_VALUE(cookie, '$.httpOnly') AS httpOnly, | ||
JSON_VALUE(cookie, '$.secure') AS secure, | ||
JSON_VALUE(cookie, '$.session') AS session, | ||
JSON_VALUE(cookie, '$.sameSite') AS sameSite, | ||
JSON_VALUE(cookie, '$.sameParty') AS sameParty, | ||
JSON_VALUE(cookie, '$.partitionKey') AS partitionKey, | ||
JSON_VALUE(cookie, '$.partitionKeyOpaque') AS partitionKeyOpaque | ||
FROM intermediate_cookie |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
-- Create an intermediate table containing all cookies that were set during the | ||
-- <DATE> crawl on <CLIENT> when visiting sites of rank <= <RANK>. This table | ||
-- can then be reused in consecutive queries without having to reextract the | ||
-- data every time | ||
-- Export the table as httparchive.almanac.DATE_CLIENT_RANK_cookies | ||
|
||
WITH intermediate_cookie AS ( | ||
SELECT | ||
page, | ||
rank, | ||
JSON_VALUE(summary, '$.startedDateTime') AS startedDateTime, | ||
cookie | ||
FROM | ||
`httparchive.all.pages`, | ||
UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.cookies')) AS cookie | ||
WHERE | ||
date = '2024-06-01' AND | ||
client = 'mobile' AND | ||
rank <= 1000000 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ditto. |
||
) | ||
|
||
SELECT | ||
page, | ||
rank, | ||
startedDateTime, | ||
ENDS_WITH(NET.HOST(page), NET.REG_DOMAIN(JSON_VALUE(cookie, '$.domain'))) AS firstPartyCookie, | ||
JSON_VALUE(cookie, '$.name') AS name, | ||
JSON_VALUE(cookie, '$.domain') AS domain, | ||
JSON_VALUE(cookie, '$.path') AS path, | ||
JSON_VALUE(cookie, '$.expires') AS expires, | ||
JSON_VALUE(cookie, '$.size') AS size, | ||
JSON_VALUE(cookie, '$.httpOnly') AS httpOnly, | ||
JSON_VALUE(cookie, '$.secure') AS secure, | ||
JSON_VALUE(cookie, '$.session') AS session, | ||
JSON_VALUE(cookie, '$.sameSite') AS sameSite, | ||
JSON_VALUE(cookie, '$.sameParty') AS sameParty, | ||
JSON_VALUE(cookie, '$.partitionKey') AS partitionKey, | ||
JSON_VALUE(cookie, '$.partitionKeyOpaque') AS partitionKeyOpaque | ||
FROM intermediate_cookie |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
-- Extract the top 20 first party cookies seen across websites that are | ||
-- partitioned. Note: it is a bit weird that 1st party cookies would also be | ||
-- partitioned, as CHIPS is meant for a 3rd party context... | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
name, | ||
COUNT(DISTINCT NET.HOST(page)) / (SELECT (COUNT(DISTINCT NET.HOST(page))) FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`) AS percentWebsites | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie = TRUE AND | ||
partitionKey IS NOT NULL | ||
GROUP BY name | ||
ORDER BY percentWebsites DESC | ||
LIMIT 20 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
-- Extract the top 20 first party cookies seen across websites that are | ||
-- partitioned. Note: it is a bit weird that 1st party cookies would also be | ||
-- partitioned, as CHIPS is meant for a 3rd party context... | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
name, | ||
domain, | ||
COUNT(DISTINCT NET.HOST(page)) / (SELECT (COUNT(DISTINCT NET.HOST(page))) FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`) AS percentWebsites | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie = FALSE AND | ||
partitionKey IS NOT NULL | ||
GROUP BY name, domain | ||
ORDER BY percentWebsites DESC | ||
LIMIT 20 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
-- Extract stats around the expire date (age rounded to closest number of days) | ||
-- (only positive values, non session cookies) | ||
-- Before running query: edit table name | ||
|
||
WITH cookies_age AS ( | ||
SELECT | ||
firstPartyCookie, | ||
ROUND((CAST(expires AS FLOAT64) - CAST(startedDateTime AS FLOAT64)) / (24 * 3600), 0) AS age | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL AND | ||
CAST(expires AS FLOAT64) >= 0 | ||
) | ||
|
||
SELECT | ||
firstPartyCookie, | ||
MIN(age) AS min, | ||
APPROX_QUANTILES(age, 100)[OFFSET(25)] AS p25, | ||
APPROX_QUANTILES(age, 100)[OFFSET(50)] AS median, | ||
APPROX_QUANTILES(age, 100)[OFFSET(75)] AS p75, | ||
APPROX_QUANTILES(age, 100)[OFFSET(90)] AS p90, | ||
APPROX_QUANTILES(age, 100)[OFFSET(99)] AS p99, | ||
MAX(age) AS max | ||
FROM cookies_age | ||
GROUP BY firstPartyCookie |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
-- Extract stats around the expire date (age rounded to closest number of days) | ||
-- (only positive values, non session cookies) | ||
-- Before running query: edit table name | ||
|
||
WITH cookies_age AS ( | ||
SELECT | ||
ROUND((CAST(expires AS FLOAT64) - CAST(startedDateTime AS FLOAT64)) / (24 * 3600), 0) AS age | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL AND | ||
CAST(expires AS FLOAT64) >= 0 | ||
) | ||
|
||
SELECT | ||
MIN(age) AS min, | ||
APPROX_QUANTILES(age, 100)[OFFSET(25)] AS p25, | ||
APPROX_QUANTILES(age, 100)[OFFSET(50)] AS median, | ||
APPROX_QUANTILES(age, 100)[OFFSET(75)] AS p75, | ||
APPROX_QUANTILES(age, 100)[OFFSET(90)] AS p90, | ||
APPROX_QUANTILES(age, 100)[OFFSET(99)] AS p99, | ||
MAX(age) AS max | ||
FROM cookies_age |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
-- Extract the expire date (age rounded to closest number of days) | ||
-- (only positive values, non session cookies) | ||
-- Before running query: edit table name | ||
|
||
WITH cookies_age AS ( | ||
SELECT | ||
ROUND((CAST(expires AS FLOAT64) - CAST(startedDateTime AS FLOAT64)) / (24 * 3600), 0) AS age | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL AND | ||
CAST(expires AS FLOAT64) >= 0 | ||
) | ||
|
||
SELECT | ||
age, | ||
COUNT(0) AS nbCookies | ||
FROM cookies_age | ||
GROUP BY age | ||
ORDER BY age ASC |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
-- Extract the nb of cookies | ||
-- Before running query: edit table name | ||
|
||
WITH nb_cookies_per_website AS ( | ||
SELECT | ||
firstPartyCookie, | ||
NET.HOST(page) AS pageFirstPartyHost, | ||
COUNT(DISTINCT CONCAT(name, domain)) AS distinctNbCookies | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL | ||
GROUP BY firstPartyCookie, pageFirstPartyHost | ||
) | ||
|
||
SELECT | ||
firstPartyCookie, | ||
distinctNbCookies, | ||
COUNT(DISTINCT pageFirstPartyHost) AS nbWebsites | ||
FROM nb_cookies_per_website | ||
GROUP BY firstPartyCookie, distinctNbCookies | ||
ORDER BY firstPartyCookie, distinctNbCookies ASC |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
-- Extract stats about the nb of cookies | ||
-- Before running query: edit table name | ||
|
||
WITH nb_cookies_per_website AS ( | ||
SELECT | ||
firstPartyCookie, | ||
NET.HOST(page) AS pageFirstPartyHost, | ||
COUNT(DISTINCT CONCAT(name, domain)) AS distinctNbCookies | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL | ||
GROUP BY firstPartyCookie, pageFirstPartyHost | ||
) | ||
|
||
SELECT | ||
firstPartyCookie, | ||
MIN(distinctNbCookies) AS min, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(25)] AS p25, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(50)] AS median, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(75)] AS p75, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(90)] AS p90, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(99)] AS p99, | ||
MAX(distinctNbCookies) AS max | ||
FROM nb_cookies_per_website | ||
GROUP BY firstPartyCookie |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
-- Extract stats about the nb of cookies | ||
-- Before running query: edit table name | ||
|
||
WITH nb_cookies_per_website AS ( | ||
SELECT | ||
NET.HOST(page) AS pageFirstPartyHost, | ||
COUNT(DISTINCT CONCAT(name, domain)) AS distinctNbCookies | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL | ||
GROUP BY pageFirstPartyHost | ||
) | ||
|
||
SELECT | ||
MIN(distinctNbCookies) AS min, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(25)] AS p25, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(50)] AS median, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(75)] AS p75, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(90)] AS p90, | ||
APPROX_QUANTILES(distinctNbCookies, 100)[OFFSET(99)] AS p99, | ||
MAX(distinctNbCookies) AS max | ||
FROM nb_cookies_per_website |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
-- Prevalence of cookies type and attributes per type of cookie (1st/3rd party) | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
firstPartyCookie, | ||
SUM(IF(httpOnly = 'true', 1, 0)) / COUNT(0) AS httpOnly, | ||
SUM(IF(secure = 'true', 1, 0)) / COUNT(0) AS secure, | ||
SUM(IF(session = 'true', 1, 0)) / COUNT(0) AS session, | ||
SUM(IF(sameParty = 'true', 1, 0)) / COUNT(0) AS sameParty, | ||
SUM(IF(sameSite = 'Lax', 1, 0)) / COUNT(0) AS sameSiteLax, | ||
SUM(IF(sameSite = 'None', 1, 0)) / COUNT(0) AS sameSiteNone, | ||
SUM(IF(sameSite = 'Strict', 1, 0)) / COUNT(0) AS sameSiteStrict, | ||
SUM(IF(sameSite IS NULL, 1, 0)) / COUNT(0) AS sameSiteNull, | ||
SUM(IF(partitionKey IS NOT NULL, 1, 0)) / COUNT(0) AS partitionKey, | ||
SUM(IF(partitionKeyOpaque IS NOT NULL, 1, 0)) / COUNT(0) AS partitionKeyOpaque, | ||
SUM(IF(STARTS_WITH(name, '__Host-'), 1, 0)) / COUNT(0) AS hostPrefix, | ||
SUM(IF(STARTS_WITH(name, '__Secure-'), 1, 0)) / COUNT(0) AS securePrefix | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL -- just in case | ||
GROUP BY | ||
firstPartyCookie |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
-- Prevalence of cookies type and attributes per rank grouping | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
rank_grouping, | ||
SUM(IF(firstPartyCookie = TRUE, 1, 0)) / COUNT(0) AS firstParty, | ||
SUM(IF(firstPartyCookie = FALSE, 1, 0)) / COUNT(0) AS thirdParty, | ||
SUM(IF(httpOnly = 'true', 1, 0)) / COUNT(0) AS httpOnly, | ||
SUM(IF(secure = 'true', 1, 0)) / COUNT(0) AS secure, | ||
SUM(IF(session = 'true', 1, 0)) / COUNT(0) AS session, | ||
SUM(IF(sameParty = 'true', 1, 0)) / COUNT(0) AS sameParty, | ||
SUM(IF(sameSite = 'Lax', 1, 0)) / COUNT(0) AS sameSiteLax, | ||
SUM(IF(sameSite = 'None', 1, 0)) / COUNT(0) AS sameSiteNone, | ||
SUM(IF(sameSite = 'Strict', 1, 0)) / COUNT(0) AS sameSiteStrict, | ||
SUM(IF(sameSite IS NULL, 1, 0)) / COUNT(0) AS sameSiteNull, | ||
SUM(IF(partitionKey IS NOT NULL, 1, 0)) / COUNT(0) AS partitionKey, | ||
SUM(IF(partitionKeyOpaque IS NOT NULL, 1, 0)) / COUNT(0) AS partitionKeyOpaque | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`, | ||
UNNEST([1000, 5000, 10000, 50000, 100000, 500000, 1000000]) AS rank_grouping | ||
yohhaan marked this conversation as resolved.
Show resolved
Hide resolved
|
||
WHERE | ||
rank <= rank_grouping AND | ||
firstPartyCookie IS NOT NULL -- just in case | ||
GROUP BY | ||
rank_grouping | ||
ORDER BY | ||
rank_grouping |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
-- Extract the size of the cookies | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
CAST(size AS INT) AS sizeCookie, | ||
COUNT(0) AS nbCookies | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL | ||
GROUP BY sizeCookie | ||
ORDER BY sizeCookie ASC |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
-- Extract stats around the size of the cookies | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
firstPartyCookie, | ||
MIN(CAST(size AS INT)) AS min, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(25)] AS p25, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(50)] AS median, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(75)] AS p75, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(90)] AS p90, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(99)] AS p99, | ||
MAX(CAST(size AS INT)) AS max | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL | ||
GROUP BY firstPartyCookie |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
-- Extract stats around the size of the cookies | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
MIN(CAST(size AS INT)) AS min, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(25)] AS p25, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(50)] AS median, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(75)] AS p75, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(90)] AS p90, | ||
APPROX_QUANTILES(CAST(size AS INT), 100)[OFFSET(99)] AS p99, | ||
MAX(CAST(size AS INT)) AS max | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
-- Extract largest cookies being set | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
name, | ||
domain, | ||
CAST(size AS INT) AS sizeCookie | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie = TRUE | ||
ORDER BY sizeCookie DESC | ||
LIMIT 10 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
-- Extract the top 20 registrable domains that set cookies | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
NET.REG_DOMAIN(domain) AS regDomain, | ||
COUNT(DISTINCT NET.HOST(page)) / (SELECT (COUNT(DISTINCT NET.HOST(page))) FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`) AS percentWebsites | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie IS NOT NULL | ||
GROUP BY regDomain | ||
ORDER BY percentWebsites DESC | ||
LIMIT 20 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
-- Extract the top 20 first party cookies seen across websites | ||
-- Before running query: edit table name | ||
|
||
SELECT | ||
name, | ||
COUNT(DISTINCT NET.HOST(page)) / (SELECT (COUNT(DISTINCT NET.HOST(page))) FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies`) AS percentWebsites | ||
FROM `httparchive.almanac.DATE_CLIENT_RANK_cookies` | ||
WHERE | ||
firstPartyCookie = TRUE | ||
GROUP BY name | ||
ORDER BY percentWebsites DESC | ||
LIMIT 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want this to restrict this to just root_pages? Or sum up cookies across both home and secondary pages for the same
root_page
so we can count sites using cookies?I see you use
NET.HOST(page)
later on butroot_page
would perhaps be better and allow this filtering up front.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to be as general as possible when extracting the cookies table, and I did not know up front if we were going to only stick to
root_page
or no for this year's queries, same for in the future.