Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the GIAS importer to follow ETL pattern #460

Merged
merged 4 commits into from
Apr 10, 2024

Conversation

ollietreend
Copy link
Member

Context

Currently, school data is imported from GIAS in two steps:

  1. Download the CSV
  2. Import the CSV (filtering out unwanted schools in the process)

We need to perform some 'massaging' of the data. For example, the importer excludes schools which aren't needed within our app. These are filtered out because either the school has closed, or it isn't in England (which is where the ITT policy currently applies).

We also have an upcoming need to convert the easting/northing fields provided by GIAS into equivalent latitude/longitude values, which will power our location-based search. This is another bit of 'massaging' which needs to happen before the data can be imported.

Changes proposed in this pull request

I've therefore decided to refactor the GIAS import process so it more explicitly follows an ETL (extract, transform, load) pattern:

  1. Download the CSV
  2. Transform the CSV (filter out unwanted schools)
  3. Import the CSV

This reduces the responsibility of the 'import' step by extracting the filtering logic into a standalone 'transform' step. The importer now simply needs to import every row of the provided CSV. And it'll make it easier to change and add to the transformation logic in future.

The same end result

This refactor doesn't affect the outcome of the import. The high-level integration test for the entire import still passes as before:

bundle exec rspec spec/jobs/gias/sync_all_schools_job_spec.rb

Other changes of note

One change I have made is in testing for edge cases. Previously the importer handled edge cases such as "invalid" CSV files where schools were missing a name or URN. However in practice this never happens: the GIAS CSV file always has a name for each school. The edge-case handling was also not comprehensive – it didn't handle cases where the CSV was missing expected columns or other field values, for example.

Since we haven't seen "invalid" CSV files in practice, it seems like a low-risk change to remove tests for these edge cases. I'd rather keep things simple. And then if we end up receiving corrupt CSV files in the future, we'll be alerted by Sentry with unhandled exceptions when schools fail to save to the database. This will make the problem more obvious instead of quietly skipping those schools.

Guidance to review

I've tried to create a tidy commit history with useful commit messages. I recommend reviewing this PR commit-by-commit if you want to see the individual steps I took.

Link to Trello card

This is an enabler for a proper implementation of: https://trello.com/c/3rYfhwuq/228-spike-using-the-easting-northing-values-from-gias-to-geocode-schools

I've added a high-level integration test for this job because I intend
to split & refactor the underlying service objects, and I want to make
sure I don't break the overall GIAS import feature.

This tests the end to end happy path, starting with the GIAS CSV file
being downloaded, and ending with schools existing in the database.
GIAS CSV files are published with ISO-8859-1 character encoding.

However the GIAS web server doesn't specify this in the response
headers, meaning we need to set it manually on the downloaded CSV file.

Without this, strange characters appear in imported school data.

This was originally handled in the 'importer' service, but I think it's
actually more appropriate to set the encoding correctly in the
'downloader' service. That way we can keep this encoding quirk 'at the
front door' and avoid any other part of our app needing to know the
specifics of how that file is encoded.
Currently, school data is imported from GIAS in two steps:

1. Download the CSV
2. Import the CSV (filtering out unwanted schools in the process)

We need to perform some 'massaging' of the data. For example, the
importer excludes schools which aren't needed within our app. These are
filtered out because either the school has closed, or it isn't in
England (which is where the ITT policy currently applies).

We also have an upcoming need to convert the easting/northing fields
provided by GIAS into equivalent latitude/longitude values, which will
power our location-based search. This is another bit of 'massaging'
which needs to happen before the data can be imported.

I've therefore decided to refactor the GIAS import process so it more
explicitly follows an [ETL][1] (extract, transform, load) pattern:

1. Download the CSV
2. Transform the CSV (filter out unwanted schools)
3. Import the CSV

This reduces the responsibility of the 'import' step by extracting the
filtering logic into a standalone 'transform' step. The importer now
simply needs to import every row of the provided CSV. And it'll make it
easier to change and add to the transformation logic in future.

This refactor doesn't affect the outcome of the import. The high-level
integration test for the entire import still passes as before:

```
bundle exec rspec spec/jobs/gias/sync_all_schools_job_spec.rb
```

One change I have made is in testing for edge cases. Previously the
importer handled edge cases such as "invalid" CSV files where schools
were missing a name or URN. However in practice this never happens: the
GIAS CSV file always has a name for each school. The edge-case handling
was also not comprehensive – it didn't handle cases where the CSV was
missing expected columns or other field values, for example.

Since we haven't seen "invalid" CSV files in practice, it seems like a
low-risk change to remove tests for these edge cases. I'd rather keep
things simple. And then if we end up receiving corrupt CSV files in the
future, we'll be alerted by Sentry with unhandled exceptions when
schools fail to save to the database. This will make the problem more
obvious instead of quietly skipping those schools.

[1]: https://en.wikipedia.org/wiki/Extract,_transform,_load
@ollietreend ollietreend requested review from a team as code owners April 9, 2024 16:55
The CI builds were missing the environment variable `GIAS_CSV_BASE_URL`
@ollietreend ollietreend merged commit 160b0c1 into main Apr 10, 2024
7 checks passed
@ollietreend ollietreend deleted the refactor-gias-import branch April 10, 2024 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants