Refactor the GIAS importer to follow ETL pattern #460

ollietreend · 2024-04-09T16:55:07Z

Context

Currently, school data is imported from GIAS in two steps:

Download the CSV
Import the CSV (filtering out unwanted schools in the process)

We need to perform some 'massaging' of the data. For example, the importer excludes schools which aren't needed within our app. These are filtered out because either the school has closed, or it isn't in England (which is where the ITT policy currently applies).

We also have an upcoming need to convert the easting/northing fields provided by GIAS into equivalent latitude/longitude values, which will power our location-based search. This is another bit of 'massaging' which needs to happen before the data can be imported.

Changes proposed in this pull request

I've therefore decided to refactor the GIAS import process so it more explicitly follows an ETL (extract, transform, load) pattern:

Download the CSV
Transform the CSV (filter out unwanted schools)
Import the CSV

This reduces the responsibility of the 'import' step by extracting the filtering logic into a standalone 'transform' step. The importer now simply needs to import every row of the provided CSV. And it'll make it easier to change and add to the transformation logic in future.

The same end result

This refactor doesn't affect the outcome of the import. The high-level integration test for the entire import still passes as before:

bundle exec rspec spec/jobs/gias/sync_all_schools_job_spec.rb

Other changes of note

One change I have made is in testing for edge cases. Previously the importer handled edge cases such as "invalid" CSV files where schools were missing a name or URN. However in practice this never happens: the GIAS CSV file always has a name for each school. The edge-case handling was also not comprehensive – it didn't handle cases where the CSV was missing expected columns or other field values, for example.

Since we haven't seen "invalid" CSV files in practice, it seems like a low-risk change to remove tests for these edge cases. I'd rather keep things simple. And then if we end up receiving corrupt CSV files in the future, we'll be alerted by Sentry with unhandled exceptions when schools fail to save to the database. This will make the problem more obvious instead of quietly skipping those schools.

Guidance to review

I've tried to create a tidy commit history with useful commit messages. I recommend reviewing this PR commit-by-commit if you want to see the individual steps I took.

Link to Trello card

This is an enabler for a proper implementation of: https://trello.com/c/3rYfhwuq/228-spike-using-the-easting-northing-values-from-gias-to-geocode-schools

I've added a high-level integration test for this job because I intend to split & refactor the underlying service objects, and I want to make sure I don't break the overall GIAS import feature. This tests the end to end happy path, starting with the GIAS CSV file being downloaded, and ending with schools existing in the database.

GIAS CSV files are published with ISO-8859-1 character encoding. However the GIAS web server doesn't specify this in the response headers, meaning we need to set it manually on the downloaded CSV file. Without this, strange characters appear in imported school data. This was originally handled in the 'importer' service, but I think it's actually more appropriate to set the encoding correctly in the 'downloader' service. That way we can keep this encoding quirk 'at the front door' and avoid any other part of our app needing to know the specifics of how that file is encoded.

Currently, school data is imported from GIAS in two steps: 1. Download the CSV 2. Import the CSV (filtering out unwanted schools in the process) We need to perform some 'massaging' of the data. For example, the importer excludes schools which aren't needed within our app. These are filtered out because either the school has closed, or it isn't in England (which is where the ITT policy currently applies). We also have an upcoming need to convert the easting/northing fields provided by GIAS into equivalent latitude/longitude values, which will power our location-based search. This is another bit of 'massaging' which needs to happen before the data can be imported. I've therefore decided to refactor the GIAS import process so it more explicitly follows an [ETL][1] (extract, transform, load) pattern: 1. Download the CSV 2. Transform the CSV (filter out unwanted schools) 3. Import the CSV This reduces the responsibility of the 'import' step by extracting the filtering logic into a standalone 'transform' step. The importer now simply needs to import every row of the provided CSV. And it'll make it easier to change and add to the transformation logic in future. This refactor doesn't affect the outcome of the import. The high-level integration test for the entire import still passes as before: ``` bundle exec rspec spec/jobs/gias/sync_all_schools_job_spec.rb ``` One change I have made is in testing for edge cases. Previously the importer handled edge cases such as "invalid" CSV files where schools were missing a name or URN. However in practice this never happens: the GIAS CSV file always has a name for each school. The edge-case handling was also not comprehensive – it didn't handle cases where the CSV was missing expected columns or other field values, for example. Since we haven't seen "invalid" CSV files in practice, it seems like a low-risk change to remove tests for these edge cases. I'd rather keep things simple. And then if we end up receiving corrupt CSV files in the future, we'll be alerted by Sentry with unhandled exceptions when schools fail to save to the database. This will make the problem more obvious instead of quietly skipping those schools. [1]: https://en.wikipedia.org/wiki/Extract,_transform,_load

The CI builds were missing the environment variable `GIAS_CSV_BASE_URL`

app/services/gias/csv_transformer.rb

ollietreend added 3 commits April 9, 2024 17:27

ollietreend requested review from a team as code owners April 9, 2024 16:55

Add missing environment variable in CI

3232773

The CI builds were missing the environment variable `GIAS_CSV_BASE_URL`

Nitemaeric approved these changes Apr 10, 2024

View reviewed changes

app/services/gias/csv_transformer.rb Show resolved Hide resolved

Kizr approved these changes Apr 10, 2024

View reviewed changes

ollietreend merged commit 160b0c1 into main Apr 10, 2024
7 checks passed

ollietreend deleted the refactor-gias-import branch April 10, 2024 12:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor the GIAS importer to follow ETL pattern #460

Refactor the GIAS importer to follow ETL pattern #460

ollietreend commented Apr 9, 2024

Refactor the GIAS importer to follow ETL pattern #460

Refactor the GIAS importer to follow ETL pattern #460

Conversation

ollietreend commented Apr 9, 2024

Context

Changes proposed in this pull request

The same end result

Other changes of note

Guidance to review

Link to Trello card