Cookies sql 2024 #3741
Conversation
Compare: 9c83396 to b448d2f
WHERE
  date = '2024-06-01' AND
  client = 'desktop' AND
  rank <= 1000000
Any reason to restrict it like this? Can we add a comment to explain why?
This was to reduce the amount of data processed/parsed; without any cap on the rank, BigQuery was returning an error along the lines of "Response too large to return". A solution is to specify a destination table for the results, but I don't have permissions on the HTTP Archive project to create one.
So I created the table in a personal project/dataset, but because I am on the free plan there are limits on the amount of data I can store; cookies for the top 1M for both mobile and desktop were right below that limit, and the top 1M seems like plenty of data already.
Ideally, these cookies should not be in custom metrics but in their own separate table, in my opinion. I believe I saw at some point an issue/discussion on the HTTPArchive project proposing to break some custom-metric fields out into their own table/columns, instead of the current big blob.
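For reference, the destination-table workaround described above can be sketched as follows. This is an illustrative sketch, not code from the PR; `my_project.scratch.cookies_2024` is a hypothetical personal project/dataset/table name:

```sql
-- Hypothetical sketch: materialize large results into a destination table
-- instead of returning them to the client, which avoids BigQuery's
-- "Response too large to return" error. Requires table-creation permission
-- on the target dataset (my_project.scratch is a placeholder).
CREATE OR REPLACE TABLE `my_project.scratch.cookies_2024`
PARTITION BY date
AS
SELECT
  date,
  client,
  rank,
  page,
  custom_metrics
FROM `httparchive.all.pages`
WHERE
  date = '2024-06-01' AND
  rank <= 1000000
```

Storing results this way trades query-time egress limits for storage quota, which is why the free-plan storage cap mentioned above becomes the binding constraint.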
I created this table, which should be much cheaper to query:
-- Create an intermediate table containing all cookies that were set during the
-- <DATE> crawl on <CLIENT> when visiting sites of rank <= <RANK>. This table
-- can then be reused in consecutive queries without having to reextract the
-- data every time
-- Export the table as httparchive.almanac.DATE_CLIENT_RANK_cookies
CREATE TABLE `httparchive.almanac.cookies`
(
date DATE,
client STRING,
page STRING,
root_page STRING,
rank INTEGER,
startedDateTime STRING,
firstPartyCookie BOOL,
name STRING,
domain STRING,
path STRING,
expires STRING,
size STRING,
httpOnly STRING,
secure STRING,
session STRING,
sameSite STRING,
sameParty STRING,
partitionKey STRING,
partitionKeyOpaque STRING
)
PARTITION BY date
CLUSTER BY
client, rank, page
AS
WITH intermediate_cookie AS (
SELECT
date,
client,
page,
root_page,
rank,
JSON_VALUE(summary, '$.startedDateTime') AS startedDateTime,
cookie
FROM
`httparchive.all.pages`,
UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.cookies')) AS cookie
WHERE
date = '2024-06-01'
)
SELECT
date,
client,
page,
root_page,
rank,
startedDateTime,
ENDS_WITH(NET.HOST(page), NET.REG_DOMAIN(JSON_VALUE(cookie, '$.domain'))) AS firstPartyCookie,
JSON_VALUE(cookie, '$.name') AS name,
JSON_VALUE(cookie, '$.domain') AS domain,
JSON_VALUE(cookie, '$.path') AS path,
JSON_VALUE(cookie, '$.expires') AS expires,
JSON_VALUE(cookie, '$.size') AS size,
JSON_VALUE(cookie, '$.httpOnly') AS httpOnly,
JSON_VALUE(cookie, '$.secure') AS secure,
JSON_VALUE(cookie, '$.session') AS session,
JSON_VALUE(cookie, '$.sameSite') AS sameSite,
JSON_VALUE(cookie, '$.sameParty') AS sameParty,
JSON_VALUE(cookie, '$.partitionKey') AS partitionKey,
JSON_VALUE(cookie, '$.partitionKeyOpaque') AS partitionKeyOpaque
FROM intermediate_cookie
Could you update this query to this:
INSERT INTO `httparchive.almanac.cookies`
WITH intermediate_cookie AS (
SELECT
date,
...
But no need to run it, as I've already run it.
Then use this new httparchive.almanac.cookies table in your other queries? I presume it won't change your results much.
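Once the shared table exists, downstream chapter queries can stay small. An illustrative example (not from the PR) counting first- versus third-party cookies per client against the new table:

```sql
-- Illustrative query against the intermediate table; filtering on the date
-- partition and the client/rank clustering columns keeps the scan cheap.
SELECT
  client,
  firstPartyCookie,
  COUNT(0) AS num_cookies
FROM `httparchive.almanac.cookies`
WHERE
  date = '2024-06-01' AND
  rank <= 1000000
GROUP BY client, firstPartyCookie
ORDER BY client, firstPartyCookie
```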
Nice! Thank you! Looking into the new table you created and updating the queries for the cookies chapter accordingly.
WHERE
  date = '2024-06-01' AND
  client = 'mobile' AND
  rank <= 1000000
Ditto.
WITH intermediate_cookie AS (
  SELECT
    page,
Do we want to restrict this to just root_pages? Or sum up cookies across both home and secondary pages for the same root_page, so we can count sites using cookies?
I see you use NET.HOST(page) later on, but root_page would perhaps be better and would allow this filtering up front.
I tried to be as general as possible when extracting the cookies table, since I did not know up front whether we were going to stick to root_page only for this year's queries, or for future ones.
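The two filtering options discussed here can be sketched against the intermediate table above; both queries are illustrative, not from the PR:

```sql
-- Option 1: home pages only — keep rows where the crawled page is the root page.
SELECT COUNT(DISTINCT root_page) AS sites_with_cookies
FROM `httparchive.almanac.cookies`
WHERE
  date = '2024-06-01' AND
  client = 'mobile' AND
  page = root_page;

-- Option 2: include secondary pages, but still count sites via root_page
-- so a site with cookies on any of its pages is counted once.
SELECT COUNT(DISTINCT root_page) AS sites_with_cookies
FROM `httparchive.almanac.cookies`
WHERE
  date = '2024-06-01' AND
  client = 'mobile';
```

Comparing on root_page rather than NET.HOST(page) also sidesteps subdomain mismatches between a page URL and the site it belongs to.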
Co-authored-by: Barry Pollard <[email protected]>
…parchive.almanac.cookies table
@tunetheweb I just updated the SQL queries for the Cookies 2024 chapter to use the table you created. Thanks!
Queries for the Cookies Chapter 2024 (#3617)
- Extract cookies into an intermediate table to reduce the size of data processed
- Prevalence of cookie types and attributes
- Top cookies of each type, and top domains setting the most cookies
- Number of cookies
- Size of cookies
- Age of cookies
- New Privacy Sandbox APIs
  - CHIPS: who is using them?
  - RWS & Attestation File
cc @ydimova, @shaoormunir, @samdutton @ChrisBeeti