Merge pull request #186 from alphagov/Adding-detail-to-BQ-information,-plus-fixing-typos

More BQ information and the odd typo fix
annecremin authored May 22, 2024
2 parents aeef8bf + a06eb8b commit 7d56c38
Showing 8 changed files with 83 additions and 35 deletions.
3 changes: 2 additions & 1 deletion source/analysis/govuk-ga4/find-in-ga4/index.html.md.erb
@@ -1,7 +1,7 @@
---
title: Find things in the GOV.UK GA4 data
weight: 4
last_reviewed_on: 2024-05-01
last_reviewed_on: 2024-05-21
review_in: 6 months
---

@@ -14,6 +14,7 @@ There are a [variety of different things you can learn from the GA4 data](/analy
This page provides some guidance on how to find commonly requested information in the GA4 data.

More information on the data source itself can be found on the [GOV.UK GA4 data source page](/data-sources/ga/ga4/).
Information on best practice accessing and using GA4 data can be found in the ['Use the GA4 data' section](/analysis/govuk-ga4/use-ga4/).


## Page views
12 changes: 10 additions & 2 deletions source/analysis/govuk-ga4/ga4-data-information/index.html.md.erb
@@ -1,14 +1,22 @@
---
title: What can I learn from the GOV.UK GA4 data?
weight: 1
last_reviewed_on: 2024-05-16
last_reviewed_on: 2024-05-21
review_in: 6 months
---

# What can I learn from the GOV.UK GA4 data?
<span style="color:red">This page is a work in progress.</span>

The [GOV.UK Google Analytics 4 data](/data-sources/ga/ga4/) contains a variety of information on users of GOV.UK, and how those users interacted with various pages on GOV.UK.
The [GOV.UK Google Analytics 4 (GA4) data](/data-sources/ga/ga4/) contains a variety of information on users of GOV.UK, and how those users interacted with various pages on GOV.UK.

## Limitations of GA4 data collection

GA4 data is only collected on GOV.UK when users consent to cookies that measure website use.

Data collection also relies on the [analytics JavaScript code](https://github.com/alphagov/govuk_publishing_components/blob/main/docs/analytics-ga4/analytics.md), and will only occur on browsers we are supporting.
Browsers such as Internet Explorer 11 are not supported (see the [RFC on removing support for legacy browsers](https://github.com/alphagov/govuk-rfcs/blob/rfc-168/rfc-171-remove-legacy-browser-js-support.md#loss-of-analytics-for-legacy-browsers)).


## Information about how users interacted with pages on GOV.UK

33 changes: 24 additions & 9 deletions source/analysis/govuk-ga4/use-ga4/bigquery/index.html.md.erb
@@ -1,7 +1,7 @@
---
title: BigQuery best practice
weight: 3
last_reviewed_on: 2024-04-23
last_reviewed_on: 2024-05-22
review_in: 6 months
---

@@ -12,16 +12,24 @@ GA4 data is available in BigQuery in the [raw](/data-sources/ga/ga4-bq/) or the

The raw and flattened GA4 data in BigQuery is stored as a table of events - each row, or record, represents an event.
A count of all records on a day will give you the total number of events that were recorded on that day.
A count of all records on a day filtered to only show records with the event name ‘page_view’ will give you the total number of page views recorded on that day.
Metrics such as users and sessions are not already available, but need to be calculated.
A count of all records on a day filtered to only show records with the event name 'page_view' will give you the total number of page views recorded on that day.
Metrics such as users and sessions need to be calculated.
Some example SQL to find common GA4 metrics can be found on the ['Find things in GA4' page](/analysis/govuk-ga4/find-in-ga4/#find-things-in-the-gov-uk-ga4-data).
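
As a minimal sketch of the page view count described above, the query below counts `page_view` events for a single day. The table name follows the flattened table pattern used elsewhere in this documentation. The date `20240501` and the `event_name` column are assumptions for illustration, so check the table schema before relying on them.

```SQL
-- Sketch only: the date shard and the event_name column are assumed
SELECT
  COUNT(*) AS page_views
FROM `ga4-analytics-352613.flattened_dataset.flattened_daily_ga_data_20240501`
WHERE event_name = 'page_view'
```

Removing the `WHERE` clause would give the total number of events recorded on that day instead.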

There are quotas in place on querying in BigQuery to ensure that costs do not get too high.
More information on these can be found in the [quotas guidance](/gcp/BQ/#quotas).
Details of the specific quotas set on various projects can be found under each project on the [GCP page](/gcp/).
## Querying GA4 data in BigQuery

If you are unfamiliar with BigQuery, it may help to review Google’s [documentation explaining the BigQuery user interface](https://cloud.google.com/bigquery/docs/bigquery-web-ui#open-ui).

To query GA4 data in BigQuery you will have to have permissions to view the data and to run queries in whichever project you are running the query in.
If your query fails to run, check whether you are running the query from the right project, and check the error message to see if there was a role or permission error.

You can save queries in BigQuery to return to later.

The GA4 data stored in BigQuery can also be used in Looker Studio. More information on this can be found on the [Looker Studio best practice page](/analysis/govuk-ga4/use-ga4/looker-studio/#use-the-bigquery-ga4-data-in-looker-studio).
There are quotas in place on querying in BigQuery to ensure that costs do not get too high.
More information on these can be found in the [quotas guidance](/tools/google-cloud-platform/bigquery/#quotas).
Details of the specific quotas set on various projects can be found under each project on the [GCP projects page](/tools/google-cloud-platform/gcp-projects/).

## Best practice
### Best practice

Avoid selecting all (`SELECT *`) - there are very few circumstances where you actually need every column from the data source!
If you would like to see what is in the dataset, you can PREVIEW a table in the BigQuery interface.
@@ -32,4 +40,11 @@ If you are running queries in BigQuery or connecting to BigQuery data, make sure
Using a wildcard in the place of the date at the end of the table queries all the flattened tables (the entire history of the data) and costs can rack up pretty quickly due to the amount of data we are collecting and storing.
Either specify a date in the tables you are selecting data from or make sure to use a WHERE statement where you define the date (or dates) you want.

Note that using a LIMIT statement does not reduce the amount of data queried, just the amount of rows returned to you. `SELECT * FROM [table] LIMIT 20` and `SELECT * FROM [table]` cost the exact same amount.
Note that using a LIMIT statement does not reduce the amount of data queried, just the amount of rows returned to you. `SELECT * FROM [table] LIMIT 20` and `SELECT * FROM [table]` cost the exact same amount.
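
A sketch of a query that follows the advice above: it names only the columns it needs and restricts the wildcard to a date range using `_TABLE_SUFFIX`. The dates and the `event_name` column are assumptions for illustration.

```SQL
SELECT
  event_name,
  COUNT(*) AS events
FROM `ga4-analytics-352613.flattened_dataset.flattened_daily_ga_data_*`
-- Only the tables for 1 to 7 May 2024 are scanned
WHERE _TABLE_SUFFIX BETWEEN '20240501' AND '20240507'
GROUP BY event_name
ORDER BY events DESC
```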

## Using GA4 data stored in BigQuery in other tools

The GA4 data stored in BigQuery can also be queried into visualisation tools or other products built to use this data.

Looker Studio connects very easily to data stored in BigQuery, and so is often used to display BigQuery data within GDS.
More guidance on this can be found on the [Looker Studio best practice page](/analysis/govuk-ga4/use-ga4/looker-studio/#use-the-bigquery-ga4-data-in-looker-studio).
4 changes: 2 additions & 2 deletions source/analysis/govuk-ga4/use-ga4/index.html.md.erb
@@ -1,7 +1,7 @@
---
title: Use the GOV.UK GA4 data
weight: 5
last_reviewed_on: 2024-04-24
last_reviewed_on: 2024-05-21
review_in: 6 months
---

@@ -22,5 +22,5 @@ This section contains information and best practice on how the GOV.UK GA4 data c

The GA4 data can also be accessed via:

- the [Content Data app](data-sources/content-data-app/)
- the [Content Data app](/analysis/content-data/)
- Data Services' custom data tools
@@ -1,7 +1,7 @@
---
title: Looker Studio best practice
weight: 2
last_reviewed_on: 2024-05-16
last_reviewed_on: 2024-05-21
review_in: 6 months
---

@@ -66,6 +66,16 @@ If you cannot use the above shared connection we recommend you use a custom quer
6. Tick the checkbox to ‘enable date range parameters’ if needed
7. Selecting ‘Add’


An example of a SQL query that could be used in step 5 above is:

```SQL
SELECT *
FROM `ga4-analytics-352613.flattened_dataset.flattened_daily_ga_data_*`
WHERE _TABLE_SUFFIX BETWEEN @DS_START_DATE AND @DS_END_DATE
```


### Best practice

Where possible, use the flattened data source. The flattened tables are much more efficient to query, and should be easier to use as well.
13 changes: 9 additions & 4 deletions source/data-sources/ga/ga4-bq/index.html.md.erb
@@ -1,7 +1,7 @@
---
title: GOV.UK GA4 (BigQuery export)
weight: 2
last_reviewed_on: 2024-04-23
last_reviewed_on: 2024-05-22
review_in: 6 months
---

@@ -19,13 +19,18 @@ More information can be found in our [GA access policy](/processes/ga-access/#wh

### Location
There are 3 GA4 datasets. These correspond to the integration, staging, and production or live GOV.UK websites.
All of these datasets are made up of sharded tables. This means that a new table is created each day with the suffix YYYYMMDD.

The GA4 data for the live GOV.UK site is located in BigQuery in the `ga4-analytics-352613.analytics_330577055` dataset.
The GA4 data for the staging site is located in BigQuery in the `ga4-analytics-352613.analytics_330580593` dataset.
The GA4 data for the integration site is located in BigQuery in the `ga4-analytics-352613.analytics_294475112` dataset.

These datasets are all within the [GA4 analytics project](/gcp/#ga4-analytics).
These datasets are all comprised of sharded tables - a new table is created each day with the suffix YYYYMMDD.

Our Google Analytics properties are set up to export GOV.UK data several times a day.
The data for the current day is temporarily stored in intraday tables.
At the end of the day, BigQuery automatically moves the data in the intraday tables to a date table (suffixed `YYYYMMDD`) and deletes the intraday tables in question.
New intraday tables are created and added to throughout the next day.

The GA4 datasets are all stored within the [GA4 analytics project](/gcp/#ga4-analytics).
For more information on the Google Cloud Platform projects, see our [GCP Project Documentation](/gcp/).
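
As a rough illustration of how the sharded and intraday tables affect queries, the sketch below counts distinct users over a week in the live dataset. It assumes the tables follow the standard GA4 export naming (`events_YYYYMMDD` and `events_intraday_YYYYMMDD`) and that the standard `user_pseudo_id` column is present. Neither is confirmed on this page.

```SQL
SELECT
  COUNT(DISTINCT user_pseudo_id) AS users
FROM `ga4-analytics-352613.analytics_330577055.events_*`
-- For a table named events_20240501 the suffix is '20240501'.
-- Intraday tables have a suffix starting 'intraday_', so they fall outside this range.
WHERE _TABLE_SUFFIX BETWEEN '20240501' AND '20240507'
```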

## Schema
4 changes: 2 additions & 2 deletions source/tools/ga4-user-admin-tool/index.html.md.erb
@@ -1,11 +1,11 @@
---
title: GA4 User Admin tool
weight: 5
last_reviewed_on: 2024-04-25
last_reviewed_on: 2024-05-21
review_in: 6 months
---

# GA4 User Admin Tool
# GA4 User Admin tool
The GA4 User Admin tool is used to add and delete users' access to www.gov.uk production Google Analytics data.
The tool adds a user to [www.gov.uk GA4 production data](/data-sources/ga/ga4/) as well as providing read access to GA4 nested data, flattened data and www.gov.uk Search Console data stored in BigQuery.

37 changes: 23 additions & 14 deletions source/tools/google-cloud-platform/bigquery/index.html.md.erb
@@ -1,7 +1,7 @@
---
title: Use BigQuery
weight: 3
last_reviewed_on: 2024-04-25
last_reviewed_on: 2024-05-22
review_in: 6 months
---

@@ -26,16 +26,13 @@ For example, our GOV.UK Universal Analytics data is sent to the `govuk-bigquery-
- `87773428` is the dataset name
- `ga_sessions_intraday_YYYYMMDD` is the table name

Our Google Analytics data is stored in sharded tables. These table names end with the suffix `YYYYMMDD`, representing the date in year-month-day format.

### Intraday tables
There are different types of tables that can be used to store data in BigQuery.

We have set up our Google Analytics properties to export GOV.UK data several times a day.
This day is temporarily stored in intraday tables.

At the end of the day, BigQuery automatically moves the data in the intraday table to a date table (suffixed `YYYYMMDD`) and deletes the intraday tables in question.
New intraday tables are created and added to throughout the next day.
Our [raw Google Analytics data](/data-sources/ga/ga4-bq/#location), for example, is stored in [sharded tables](https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard).
These table names end with the suffix `YYYYMMDD`, representing the date in year-month-day format.

[Partitioned tables](https://cloud.google.com/bigquery/docs/partitioned-tables) are in use in other datasets.

## Quotas
Several projects have quotas set up to limit the amount of data that can be queried.
@@ -47,7 +44,7 @@ The aim is not to be a hindrance to the need to use the data we store so please
Specific quotas can be found detailed under the project name on the [GCP page](https://docs.data-community.publishing.service.gov.uk/gcp/).


### How did I exceed my quota?
### How did I query more data than permitted by my quota?
This is a very good question that we are still investigating.

Our current thinking is that the dynamic concurrent query queue could be loaded with queries before the quota is breached.
@@ -72,13 +69,25 @@ This does not appear to be the case when using an existing data connection. In t
Where possible, use shared pre-existing data connections in Looker Studio.

## Roles
Google Cloud Platform permissions (IAM) can be a mysterious thing.

BigQuery Data Viewer at the Project level allows the person to see all the data held within a project, it can also be applied at a dataset level.
It does not provide the ability to query the data from that project, they would need to do so from a different project.
A role is a set of permissions. Users should only have the specific role or permissions they need to use the Google Cloud Platform.

Contact the Data Engineering community on Slack to ask for a role, permission or service account.

Common permissions we use include:

- BigQuery Data Viewer
- BigQuery Job User
- BigQuery Read Session User


BigQuery Data Viewer allows the user to access and view data. When granted at the project level, this means the user can see all the data held within a project.
This does not allow the user to query the data from within that project - a user granted only Data Viewer permissions to a given project would need to query the data from a different project.
This permission can also be applied at dataset level.

BigQuery Job User grants the user the ability to run queries from the project. The data could be held in a different project but the query cost it allocated to the querying project.
BigQuery Job User grants the user the ability to run queries from the project in question.
The data being queried could be held in a different project but the query cost is allocated to the querying project.

BigQuery Read Session User permissions are required if the Storage Read API is used when querying.

BigQuery Read Session User is needed if the Storage Read API is used when querying.
More information on roles and IAM permissions can be found in the Google Cloud documentation.
