This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Add a job to rerun master etl
This adds a job to repopulate the data that has been collected by the
[content data api etl
job](https://deploy.blue.production.govuk.digital/job/content_data_api_import_etl_master_process/).

The data collected by this initial ETL job was giving inaccurate results.
We traced this to a [delay in results showing up in Google
Analytics](https://support.google.com/analytics/answer/1070983?hl=en#:~:text=Data%20processing%20latency,for%20up%20to%20two%20days):
results can take 24-48 hours to appear in GA, but the initial ETL job
runs at 7am, leaving only 7 hours for the data to appear in GA.

The newly added job collects the data again after 2 days, leaving time
for the data to appear correctly in GA.
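
The timing argument above can be checked with a little arithmetic. The sketch below is illustrative only: the helper name and the example dates are hypothetical, the 3am rerun time is taken from the `'0 3 * * *'` schedule in this commit's hieradata, and it assumes both jobs and GA use the same timezone and that "a day's data" ends at midnight.

```ruby
require "date"

# Hypothetical helper: whole hours between the end of `data_date`
# (midnight at the start of the following day) and the moment a job runs.
def hours_since_day_ended(data_date, run_time)
  day_end = Time.utc(data_date.year, data_date.month, data_date.day) + 24 * 60 * 60
  ((run_time - day_end) / 3600.0).round
end

# Example day whose GA data we want (chosen for illustration).
data_date = Date.new(2020, 6, 16)

# Initial ETL: runs at 7am the next day and collects "yesterday".
initial_run = Time.utc(2020, 6, 17, 7, 0)

# Rerun: assumed to run at 3am two days after the data date.
rerun = Time.utc(2020, 6, 18, 3, 0)

initial_gap = hours_since_day_ended(data_date, initial_run)
rerun_gap   = hours_since_day_ended(data_date, rerun)

puts "initial ETL runs #{initial_gap}h after the day ended"
puts "rerun runs #{rerun_gap}h after the day ended"
```

Under these assumptions the rerun gives GA roughly a further day to process the data compared with the initial run.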

We added this new job instead of moving the current job back because
the content-data code contains many references to data collected
yesterday, and date ranges are also calculated from yesterday's data.
Changing this would be a big job, so initially we use the potentially
inaccurate data and then correct it after 2 days.

Trello card: https://trello.com/c/ODGWWnXt/2005-5-content-data-incorrect-users-who-found-this-useful-data-for-march
beccapearce committed Jun 18, 2020
1 parent fac4b51 commit f0539d9
Showing 8 changed files with 49 additions and 0 deletions.
1 change: 1 addition & 0 deletions Rakefile
@@ -423,6 +423,7 @@ task :check_consistency_between_aws_and_carrenza do
govuk_jenkins::deploy_all_apps::apps_on_nodes
govuk_jenkins::deploy_all_apps::deploy_environment
govuk_jenkins::jobs::content_data_api::rake_etl_master_process_cron_schedule
govuk_jenkins::jobs::content_data_api_re_run::re_run_rake_etl_master_process_cron_schedule
govuk_jenkins::jobs::deploy_app::graphite_host
govuk_jenkins::jobs::deploy_app::graphite_port
govuk_jenkins::jobs::deploy_emergency_banner::clear_cdn_cache
1 change: 1 addition & 0 deletions hieradata_aws/class/integration/jenkins.yaml
@@ -24,6 +24,7 @@ govuk_jenkins::job_builder::jobs:
- govuk_jenkins::jobs::govuk_taxonomy_supervised_learning
- govuk_jenkins::jobs::monitor_taxonomy_health
- govuk_jenkins::jobs::passive_checks
- govuk_jenkins::jobs::content_data_api_re_run
- govuk_jenkins::jobs::publication_delay_report
- govuk_jenkins::jobs::publish_special_routes
- govuk_jenkins::jobs::record_taxonomy_metrics
1 change: 1 addition & 0 deletions hieradata_aws/class/production/jenkins.yaml
@@ -53,6 +53,7 @@ govuk_jenkins::job_builder::jobs:
- govuk_jenkins::jobs::email_alert_check
- govuk_jenkins::jobs::enhanced_ecommerce_search_api
- govuk_jenkins::jobs::passive_checks
- govuk_jenkins::jobs::content_data_api_re_run
- govuk_jenkins::jobs::publish_special_routes
- govuk_jenkins::jobs::publishing_api_archive_events
- govuk_jenkins::jobs::remove_emergency_banner
1 change: 1 addition & 0 deletions hieradata_aws/class/staging/jenkins.yaml
@@ -59,6 +59,7 @@ govuk_jenkins::job_builder::jobs:
- govuk_jenkins::jobs::govuk_taxonomy_supervised_learning
- govuk_jenkins::jobs::monitor_taxonomy_health
- govuk_jenkins::jobs::passive_checks
- govuk_jenkins::jobs::content_data_api_re_run
- govuk_jenkins::jobs::publish_special_routes
- govuk_jenkins::jobs::record_taxonomy_metrics
- govuk_jenkins::jobs::remove_emergency_banner
1 change: 1 addition & 0 deletions hieradata_aws/common.yaml
@@ -879,6 +879,7 @@ govuk_jenkins::packages::govuk_python::apt_mirror_gpg_key_fingerprint: "%{hiera(

govuk_jenkins::jobs::deploy_app::graphite_host: "graphite.%{hiera('app_domain_internal')}"
govuk_jenkins::jobs::deploy_app::graphite_port: '443'
govuk_jenkins::jobs::content_data_api_re_run::re_run_rake_etl_master_process_cron_schedule: '0 3 * * *'

govuk_jenkins::deploy_all_apps::deploy_environment: "%{hiera('govuk_jenkins::job_builder::environment')}"

22 changes: 22 additions & 0 deletions modules/govuk_jenkins/manifests/jobs/content_data_api_re_run.pp
@@ -0,0 +1,22 @@
# == Class: govuk_jenkins::jobs::content_data_api_re_run
#
# Create a jenkins job to periodically run rake for the following tasks:
# - rake etl:rerun_master
#
# === Parameters:
#
# [*re_run_rake_etl_master_process_cron_schedule*]
#   The cron timings for the etl:rerun_master process
#   Required: no default is set
#
class govuk_jenkins::jobs::content_data_api_re_run (
  $re_run_rake_etl_master_process_cron_schedule,
  $app_domain = hiera('app_domain'),
) {

  file { '/etc/jenkins_jobs/jobs/content_data_api_re_run.yaml':
    ensure  => present,
    content => template('govuk_jenkins/jobs/content_data_api_re_run.yaml.erb'),
    notify  => Exec['jenkins_jobs_update'],
  }
}
21 changes: 21 additions & 0 deletions modules/govuk_jenkins/templates/jobs/content_data_api_re_run.yaml.erb
@@ -0,0 +1,21 @@
---
- job:
    name: content_data_api_re_run_re_run_rake_etl_master_process_cron_schedule
    display-name: Content Data API - rerun ETL master
    project-type: freestyle
    description: "<p>Rerun the etl:master rake task to populate missing data.</p>"
    builders:
      - trigger-builds:
        - project: run-rake-task
          block: true
          predefined-parameters: |
            TARGET_APPLICATION=content-data-api
            MACHINE_CLASS=backend
            RAKE_TASK=etl:rerun_master['<%= "#{(Time.now-2*24*60*60).strftime("%Y-%m-%d")}, #{(Time.now-2*24*60*60).strftime("%Y-%m-%d")}" %>']
    wrappers:
      - ansicolor:
          colormap: xterm
    triggers:
      - timed: <%= @re_run_rake_etl_master_process_cron_schedule %>
    logrotate:
      daysToKeep: 365
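
To see what the `RAKE_TASK` line in the template above expands to, the following sketch renders the same ERB expression with Ruby's standard `erb` library. It is only an illustration: `TEMPLATE`, `render_rake_task` and the injected `now` argument are hypothetical stand-ins (the real template calls `Time.now` directly), introduced so the output is reproducible.

```ruby
require "erb"

# Same expression as the template's RAKE_TASK line, with Time.now swapped
# for a local `now` variable (an assumption made for repeatable output).
TEMPLATE = %q{RAKE_TASK=etl:rerun_master['<%= "#{(now - 2*24*60*60).strftime("%Y-%m-%d")}, #{(now - 2*24*60*60).strftime("%Y-%m-%d")}" %>']}

# Hypothetical helper: render the line as if the template were evaluated at `now`.
def render_rake_task(now)
  # `binding` exposes the local `now` to the ERB expression.
  ERB.new(TEMPLATE).result(binding)
end

# Example: template evaluated at 03:00 UTC on 18 June 2020.
puts render_rake_task(Time.utc(2020, 6, 18, 3, 0, 0))
```

Both dates in the range resolve to the same day, two days before the evaluation time. Note that Puppet evaluates ERB when it renders the job definition, so the date written into the Jenkins job reflects the most recent Puppet run rather than the moment the job is triggered.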
1 change: 1 addition & 0 deletions spec/fixtures/hieradata/common.yaml
@@ -70,6 +70,7 @@ govuk_jenkins::config::github_web_uri: wibble
govuk_jenkins::jobs::deploy_app::applications: *deployable_applications
govuk_jenkins::jobs::deploy_app_downstream::applications: *deployable_applications
govuk_jenkins::jobs::run_rake_task::applications: *deployable_applications
govuk_jenkins::jobs::content_data_api_re_run::re_run_rake_etl_master_process_cron_schedule: '0 3 * * *'

govuk_jenkins::packages::gcloud::apt_mirror_hostname: "%{hiera('apt_mirror_hostname')}"
govuk_jenkins::packages::gcloud::apt_mirror_gpg_key_fingerprint: "%{hiera('apt_mirror_fingerprint')}"