Refactor the GIAS importer to follow ETL pattern #460
Merged
Context
Currently, school data is imported from GIAS in two steps.
We need to perform some 'massaging' of the data. For example, the importer excludes schools that aren't needed within our app: schools are filtered out either because they have closed or because they aren't in England (where the ITT policy currently applies).
We also have an upcoming need to convert the easting/northing fields provided by GIAS into equivalent latitude/longitude values, which will power our location-based search. This is another bit of 'massaging' which needs to happen before the data can be imported.
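As a sketch of how that conversion could slot into a standalone transform step (the class and column names here are hypothetical, and the actual OSGB36 → WGS84 coordinate maths would come from a geodesy library, since it is non-trivial):

```ruby
# Hypothetical transform step: the converter is injected so the real
# geodesy maths (e.g. a gem that converts easting/northing to WGS84
# latitude/longitude) can be swapped in later.
class AddLatLng
  def initialize(converter)
    # converter responds to .call(easting, northing) -> [lat, lng]
    @converter = converter
  end

  def call(row)
    lat, lng = @converter.call(row["easting"].to_i, row["northing"].to_i)
    row.merge("latitude" => lat, "longitude" => lng)
  end
end

# Wiring check with a stub converter; real values would come from the library.
stub = ->(_easting, _northing) { [51.5, -0.1] }
row = AddLatLng.new(stub).call(
  "urn" => "100000", "easting" => "533498", "northing" => "181201"
)
```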
Changes proposed in this pull request
I've therefore decided to refactor the GIAS import process so it more explicitly follows an ETL (extract, transform, load) pattern.
This reduces the responsibility of the 'import' step by extracting the filtering logic into a standalone 'transform' step. The importer now simply needs to import every row of the provided CSV. This also makes it easier to change and extend the transformation logic in future.
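The shape of the refactor might look something like the following sketch. The function names, column headings, and filter conditions here are illustrative assumptions, not the actual code in this PR:

```ruby
require "csv"

# Extract: parse every row of the GIAS CSV without judgement.
def extract(csv_string)
  CSV.parse(csv_string, headers: true).map(&:to_h)
end

# Transform: all filtering decisions live here, separate from the import.
# (Column names like "EstablishmentStatus (name)" are assumptions about
# the GIAS CSV layout; the real conditions would match the app's needs.)
def transform(rows)
  rows.select do |row|
    row["EstablishmentStatus (name)"] == "Open" &&
      row["GOR (name)"] != "Wales"
  end
end

# Load: save every transformed row; no filtering remains at this stage.
def load_schools(rows, repository)
  rows.each { |row| repository << row }
end

csv = <<~CSV
  URN,EstablishmentName,EstablishmentStatus (name),GOR (name)
  100000,Open School,Open,London
  100001,Closed School,Closed,London
  100002,Welsh School,Open,Wales
CSV

schools = []
load_schools(transform(extract(csv)), schools)
```

Keeping the three stages as separate, composable steps means a new transformation (such as the latitude/longitude conversion) can be added without touching extraction or loading.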
The same end result
This refactor doesn't affect the outcome of the import. The high-level integration test for the entire import still passes as before.
Other changes of note
One change I have made is in testing for edge cases. Previously the importer handled edge cases such as "invalid" CSV files where schools were missing a name or URN. However, in practice this never happens: the GIAS CSV file always has a name for each school. The edge-case handling was also not comprehensive: it didn't handle cases where the CSV was missing expected columns or other field values, for example.
Since we haven't seen "invalid" CSV files in practice, it seems like a low-risk change to remove tests for these edge cases. I'd rather keep things simple. And then if we end up receiving corrupt CSV files in the future, we'll be alerted by Sentry with unhandled exceptions when schools fail to save to the database. This will make the problem more obvious instead of quietly skipping those schools.
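The fail-loud behaviour described above could look something like this minimal sketch, with a hypothetical plain-Ruby `School` class standing in for the app's actual model (in Rails, `save!` raising on invalid data gives the same effect):

```ruby
# Hypothetical stand-in for the persistence layer: save! raises on
# invalid data instead of returning false or silently skipping.
class School
  attr_reader :urn, :name

  def initialize(urn:, name:)
    @urn = urn
    @name = name
  end

  def save!
    # An unhandled error here would surface in Sentry, making a corrupt
    # CSV obvious rather than quietly dropping the affected schools.
    raise ArgumentError, "school #{urn} has no name" if name.nil? || name.empty?
    true
  end
end

School.new(urn: "100000", name: "Open School").save!

error_message = nil
begin
  School.new(urn: "100001", name: "").save!
rescue ArgumentError => e
  error_message = e.message
end
```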
Guidance to review
I've tried to create a tidy commit history with useful commit messages. I recommend reviewing this PR commit-by-commit if you want to see the individual steps I took.
Link to Trello card
This is an enabler for a proper implementation of: https://trello.com/c/3rYfhwuq/228-spike-using-the-easting-northing-values-from-gias-to-geocode-schools