Cookies sql 2024 #3741

Merged
merged 25 commits into from
Nov 10, 2024

@yohhaan yohhaan commented Aug 18, 2024

Queries for the Cookies Chapter 2024 (#3617)

Extract cookies into intermediate table to reduce size of data processed

  • 0_create_desktop_cookies.sql
  • 0_create_mobile_cookies.sql

Prevalence of cookie types and attributes

  • prevalence_attributes_per_type.sql
  • prevalence_type_attributes_per_rank.sql

Top cookies of each type and top domains setting the most cookies

  • top_20_domains_setting_cookies.sql
  • top_20_first_party_cookies.sql
  • top_20_third_party_cookies.sql

Number of cookies

  • nb_cookies_cdf.sql
  • nb_cookies_per_type_quantiles.sql
  • nb_cookies_quantiles.sql

Size of cookies

  • size_cookies_cdf.sql
  • size_cookies_per_type_quantiles.sql
  • size_cookies_quantiles.sql
  • size_extract_largest.sql

Age of cookies

  • age_expire_cookies_per_type_quantiles.sql
  • age_expire_cookies_quantiles.sql
  • age_expires_cookies_cdf.sql

New Privacy Sandbox APIs

@tunetheweb tunetheweb added this to the 2024 Analysis milestone Aug 21, 2024
@tunetheweb tunetheweb added the analysis Querying the dataset label Aug 21, 2024
@yohhaan yohhaan marked this pull request as ready for review September 18, 2024 19:53
@yohhaan yohhaan marked this pull request as draft September 18, 2024 23:42
@yohhaan yohhaan marked this pull request as ready for review October 3, 2024 16:08
WHERE
  date = '2024-06-01' AND
  client = 'desktop' AND
  rank <= 1000000
Member

Any reason to restrict it like this? Can we add a comment to explain why?

Member Author

This was to reduce the amount of data processed and parsed. Without a cap on the rank, BigQuery returned an error along the lines of "Response too large to return". One solution is to specify a destination table for the results, but I don't have permission to create one in the HTTP Archive project.

So I created the table in a personal project/dataset. Because I am on the free plan, there are limits on the amount of data I can store; the cookies for the top 1M sites on both mobile and desktop came in just below that limit, and the top 1M already seems like plenty of data.

Ideally, in my opinion, these cookies should not live in custom metrics but in their own separate table. I believe I saw an issue/discussion on the HTTP Archive project at some point proposing to break some custom metrics fields out into their own tables/columns, instead of the current big blob.
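
For illustration, the destination-table workaround mentioned above could look roughly like the following. This is only a sketch: `my-project.my_dataset` and the table name `desktop_cookies` are hypothetical placeholders, not the project actually used, and the selected columns are a minimal subset.

```sql
-- Sketch only: `my-project.my_dataset.desktop_cookies` is a hypothetical
-- destination in a personal project, not an HTTP Archive table.
-- Writing the result into a table (CREATE TABLE ... AS SELECT) avoids
-- returning a very large result set directly to the client.
CREATE OR REPLACE TABLE `my-project.my_dataset.desktop_cookies` AS
SELECT
  page,
  rank,
  cookie  -- one raw JSON string per cookie
FROM
  `httparchive.all.pages`,
  UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.cookies')) AS cookie
WHERE
  date = '2024-06-01' AND
  client = 'desktop' AND
  rank <= 1000000  -- cap on rank to keep the table under storage limits
```

The rank cap is what keeps the stored data within the free-plan storage limit described above; dropping it would require a destination with a larger quota.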

Member

I created this table, which should be much cheaper to query:

-- Create an intermediate table containing all cookies that were set during the
-- <DATE> crawl on <CLIENT> when visiting sites of rank <= <RANK>. This table
-- can then be reused in consecutive queries without having to reextract the
-- data every time
-- Export the table as httparchive.almanac.DATE_CLIENT_RANK_cookies

CREATE TABLE `httparchive.almanac.cookies` 
(
  date DATE,
  client STRING,
  page STRING,
  root_page STRING,
  rank INTEGER,
  startedDateTime STRING,
  firstPartyCookie BOOL,
  name STRING,
  domain STRING,
  path STRING,
  expires STRING,
  size STRING,
  httpOnly STRING,
  secure STRING,
  session STRING,
  sameSite STRING,
  sameParty STRING,
  partitionKey STRING,
  partitionKeyOpaque STRING
)
PARTITION BY date
CLUSTER BY
  client, rank, page

AS

WITH intermediate_cookie AS (
  SELECT
    date,
    client,
    page,
    root_page,
    rank,
    JSON_VALUE(summary, '$.startedDateTime') AS startedDateTime,
    cookie
  FROM
    `httparchive.all.pages`,
    UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.cookies')) AS cookie
  WHERE
    date = '2024-06-01'
)

SELECT
  date,
  client,
  page,
  root_page,
  rank,
  startedDateTime,
  ENDS_WITH(NET.HOST(page), NET.REG_DOMAIN(JSON_VALUE(cookie, '$.domain'))) AS firstPartyCookie,
  JSON_VALUE(cookie, '$.name') AS name,
  JSON_VALUE(cookie, '$.domain') AS domain,
  JSON_VALUE(cookie, '$.path') AS path,
  JSON_VALUE(cookie, '$.expires') AS expires,
  JSON_VALUE(cookie, '$.size') AS size,
  JSON_VALUE(cookie, '$.httpOnly') AS httpOnly,
  JSON_VALUE(cookie, '$.secure') AS secure,
  JSON_VALUE(cookie, '$.session') AS session,
  JSON_VALUE(cookie, '$.sameSite') AS sameSite,
  JSON_VALUE(cookie, '$.sameParty') AS sameParty,
  JSON_VALUE(cookie, '$.partitionKey') AS partitionKey,
  JSON_VALUE(cookie, '$.partitionKeyOpaque') AS partitionKeyOpaque
FROM intermediate_cookie

Could you update this query to this:

INSERT INTO `httparchive.almanac.cookies` 
WITH intermediate_cookie AS (
  SELECT
    date,
...

But there is no need to run it, as I've already run it.

Could you then use this new httparchive.almanac.cookies table in your other queries? I presume it won't change your results much.
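
As a rough idea of what a downstream query against the new table might look like, here is a minimal sketch that counts first-party and third-party cookies per client. The column names come from the CREATE TABLE statement above; the alias `nb_cookies` and the choice to keep the top-1M rank cap are assumptions for illustration, not the chapter's actual queries.

```sql
-- Sketch: count cookies per client and party, reading from the shared
-- table instead of re-parsing custom_metrics in every query.
SELECT
  client,
  firstPartyCookie,
  COUNT(0) AS nb_cookies  -- illustrative alias
FROM
  `httparchive.almanac.cookies`
WHERE
  date = '2024-06-01' AND  -- the table is partitioned by date
  rank <= 1000000          -- assumed: same top-1M cap as the original queries
GROUP BY
  client,
  firstPartyCookie
ORDER BY
  client,
  firstPartyCookie
```

Filtering on `date` lets BigQuery prune partitions, and `client` and `rank` are clustering columns, which is why this should be cheaper than scanning `httparchive.all.pages`.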

Member Author

Nice! Thank you! Looking into the new table you created and updating the queries for the cookies chapter accordingly.

WHERE
  date = '2024-06-01' AND
  client = 'mobile' AND
  rank <= 1000000
Member

Ditto.


WITH intermediate_cookie AS (
  SELECT
    page,
Member

Do we want to restrict this to just root pages? Or sum up cookies across both the home page and secondary pages for the same root_page, so we can count sites using cookies?

I see you use NET.HOST(page) later on but root_page would perhaps be better and allow this filtering up front.
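
To make the second option concrete, here is a rough sketch of summing cookies per site by grouping on root_page. It assumes the httparchive.almanac.cookies table and column names shown earlier in this thread; the alias `nb_cookies` is illustrative, and the query is not taken from the PR itself.

```sql
-- Sketch: total cookies per site, summed over the home page and any
-- secondary pages that share the same root_page.
SELECT
  client,
  root_page,
  COUNT(0) AS nb_cookies  -- illustrative alias
FROM
  `httparchive.almanac.cookies`
WHERE
  date = '2024-06-01'
GROUP BY
  client,
  root_page
```

Counting the rows this query returns per client would give the number of sites that set at least one cookie, since sites without cookies have no rows in the table.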

Member Author

I tried to be as general as possible when extracting the cookies table: I did not know up front whether we would stick to root_page only for this year's queries, and the same goes for future years.

sql/2024/cookies/prevalence_type_attributes_per_rank.sql (outdated, resolved)

yohhaan commented Nov 5, 2024

@tunetheweb I just updated the SQL queries for the Cookies 2024 chapter to use the table you created. Thanks!

@tunetheweb tunetheweb merged commit 7a80150 into main Nov 10, 2024
4 checks passed
@tunetheweb tunetheweb deleted the cookies-sql-2024 branch November 10, 2024 00:04