-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Refactor the GIAS importer to follow ETL pattern
Currently, school data is imported from GIAS in two steps: 1. Download the CSV 2. Import the CSV (filtering out unwanted schools in the process) We need to perform some 'massaging' of the data. For example, the importer excludes schools which aren't needed within our app. These are filtered out because either the school has closed, or it isn't in England (which is where the ITT policy currently applies). We also have an upcoming need to convert the easting/northing fields provided by GIAS into equivalent latitude/longitude values, which will power our location-based search. This is another bit of 'massaging' which needs to happen before the data can be imported. I've therefore decided to refactor the GIAS import process so it more explicitly follows an [ETL][1] (extract, transform, load) pattern: 1. Download the CSV 2. Transform the CSV (filter out unwanted schools) 3. Import the CSV This reduces the responsibility of the 'import' step by extracting the filtering logic into a standalone 'transform' step. The importer now simply needs to import every row of the provided CSV. And it'll make it easier to change and add to the transformation logic in future. This refactor doesn't affect the outcome of the import. The high-level integration test for the entire import still passes as before: ``` bundle exec rspec spec/jobs/gias/sync_all_schools_job_spec.rb ``` One change I have made is in testing for edge cases. Previously the importer handled edge cases such as "invalid" CSV files where schools were missing a name or URN. However in practice this never happens: the GIAS CSV file always has a name for each school. The edge-case handling was also not comprehensive – it didn't handle cases where the CSV was missing expected columns or other field values, for example. Since we haven't seen "invalid" CSV files in practice, it seems like a low-risk change to remove tests for these edge cases. I'd rather keep things simple. And then if we end up receiving corrupt CSV files in the future, we'll be alerted by Sentry with unhandled exceptions when schools fail to save to the database. This will make the problem more obvious instead of quietly skipping those schools. [1]: https://en.wikipedia.org/wiki/Extract,_transform,_load
- Loading branch information
1 parent
271a915
commit 9a8fe86
Showing
9 changed files
with
161 additions
and
131 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
module Gias | ||
class CsvTransformer | ||
include ServicePattern | ||
|
||
def initialize(input_file) | ||
@input_file = input_file | ||
@output_file = Tempfile.new | ||
end | ||
|
||
def call | ||
output_csv = CSV.new(output_file) | ||
|
||
CSV.new(input_file, headers: true, return_headers: true).each do |row| | ||
if row.header_row? || school_in_scope?(row) | ||
output_csv << row | ||
end | ||
end | ||
|
||
# Rewind the file so it's ready for reading | ||
output_file.rewind | ||
output_file | ||
end | ||
|
||
private | ||
|
||
attr_reader :input_file, :output_file | ||
|
||
# The 'School Placements' and 'Funding Mentors' services only need to know | ||
# about schools in England. Closed schools and those outside of England can | ||
# be filtered out from the CSV. | ||
def school_in_scope?(row) | ||
school = School.new(row) | ||
school.open? && school.in_england? | ||
end | ||
|
||
class School | ||
OPEN_SCHOOL = "1".freeze | ||
NON_ENGLISH_ESTABLISHMENTS = %w[8 10 25 24 26 27 29 30 32 37 49 56 57].freeze | ||
|
||
attr_reader :row | ||
|
||
def initialize(row) | ||
@row = row | ||
end | ||
|
||
def open? | ||
row.fetch("EstablishmentStatus (code)") == OPEN_SCHOOL | ||
end | ||
|
||
def in_england? | ||
!NON_ENGLISH_ESTABLISHMENTS.include? row.fetch("TypeOfEstablishment (code)") | ||
end | ||
end | ||
end | ||
end |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
URN,LA (code),LA (name),EstablishmentNumber,EstablishmentName,TypeOfEstablishment (code),TypeOfEstablishment (name),EstablishmentTypeGroup (code),EstablishmentTypeGroup (name),EstablishmentStatus (code),EstablishmentStatus (name),ReasonEstablishmentOpened (code),ReasonEstablishmentOpened (name),OpenDate,ReasonEstablishmentClosed (code),ReasonEstablishmentClosed (name),CloseDate,PhaseOfEducation (code),PhaseOfEducation (name),StatutoryLowAge,StatutoryHighAge,Boarders (code),Boarders (name),NurseryProvision (name),OfficialSixthForm (code),OfficialSixthForm (name),Gender (code),Gender (name),ReligiousCharacter (code),ReligiousCharacter (name),ReligiousEthos (name),Diocese (code),Diocese (name),AdmissionsPolicy (code),AdmissionsPolicy (name),SchoolCapacity,SpecialClasses (code),SpecialClasses (name),CensusDate,NumberOfPupils,NumberOfBoys,NumberOfGirls,PercentageFSM,TrustSchoolFlag (code),TrustSchoolFlag (name),Trusts (code),Trusts (name),SchoolSponsorFlag (name),SchoolSponsors (name),FederationFlag (name),Federations (code),Federations (name),UKPRN,FEHEIdentifier,FurtherEducationType (name),OfstedLastInsp,OfstedSpecialMeasures (code),OfstedSpecialMeasures (name),LastChangedDate,Street,Locality,Address3,Town,County (name),Postcode,SchoolWebsite,TelephoneNum,HeadTitle (name),HeadFirstName,HeadLastName,HeadPreferredJobTitle,BSOInspectorateName (name),InspectorateReport,DateOfLastInspectionVisit,NextInspectionVisit,TeenMoth (name),TeenMothPlaces,CCF (name),SENPRU (name),EBD (name),PlacesPRU,FTProv (name),EdByOther (name),Section41Approved (name),SEN1 (name),SEN2 (name),SEN3 (name),SEN4 (name),SEN5 (name),SEN6 (name),SEN7 (name),SEN8 (name),SEN9 (name),SEN10 (name),SEN11 (name),SEN12 (name),SEN13 (name),TypeOfResourcedProvision (name),ResourcedProvisionOnRoll,ResourcedProvisionCapacity,SenUnitOnRoll,SenUnitCapacity,GOR (code),GOR (name),DistrictAdministrative (code),DistrictAdministrative (name),AdministrativeWard (code),AdministrativeWard (name),ParliamentaryConstituency (code),ParliamentaryConstituency (name),UrbanRural (code),UrbanRural (name),GSSLACode (name),Easting,Northing,MSOA (name),LSOA (name),InspectorateName (name),SENStat,SENNoStat,BoardingEstablishment (name),PropsName,PreviousLA (code),PreviousLA (name),PreviousEstablishmentNumber,OfstedRating (name),RSCRegion (name),Country (name),UPRN,SiteName,QABName (code),QABName (name),EstablishmentAccredited (code),EstablishmentAccredited (name),QABReport,CHNumber,MSOA (code),LSOA (code),FSM,AccreditationExpiryDate | ||
100000,201,City of London,3614,The Aldgate School,2,Voluntary aided school,4,Local authority maintained schools,1,Open,0,Not applicable,,0,Not applicable,,2,Primary,3,11,1,No boarders,Has Nursery Classes,2,Does not have a sixth form,3,Mixed,2,Church of England,Does not apply,CE23,Diocese of London,0,Not applicable,271,2,No Special Classes,19-01-2023,271,144,127,18.10,0,Not applicable,,,Not applicable,,Not under a federation,,,10079319,,Not applicable,19-04-2013,0,Not applicable,09-02-2024,St James's Passage,Duke's Place,,London,,EC3A 5DE,www.thealdgateschool.org,2072831147,Miss,Alexandra,Allan,Headteacher,Not applicable,,,,Not applicable,,Not applicable,Not applicable,Not applicable,,,Not applicable,Not applicable,,,,,,,,,,,,,,,,,,,H,London,E09000001,City of London,E05009308,Portsoken,E14000639,Cities of London and Westminster,A1,(England/Wales) Urban major conurbation,E09000001,533498,181201,City of London 001,City of London 001F,,,,,,999,,,Outstanding,North-West London and South-Central England,,200000071925,,0,Not applicable,0,Not applicable,,,E02000001,E01032739,49, | ||
137666,878,Devon,3106,Chudleigh Knighton Church of England Primary School,34,Academy converter,10,Academies,1,Open,10,Academy Converter,01-11-2011,99,,,2,Primary,5,11,1,No boarders,No Nursery Classes,0,Not applicable,3,Mixed,2,Church of England,Does not apply,CE15,Diocese of Exeter,0,Not applicable,105,2,No Special Classes,19-01-2023,100,51,49,19.00,3,Supported by a multi-academy trust,3104,THE FIRST FEDERATION TRUST,Linked to a sponsor,The First Federation Trust,Not applicable,,,10035224,,Not applicable,09-02-2023,0,Not applicable,13-09-2023,Chudleigh Knighton,,,Newton Abbot,Devon,TQ13 0EU,http://www.chudleigh-knighton.devon.sch.uk,1626852314,Mr,Simon,Westwood,Head of Teaching & Learning,Not applicable,,,,Not applicable,,Not applicable,Not applicable,Not applicable,,,Not applicable,Not applicable,,,,,,,,,,,,,,,,,,,K,South West,E07000045,Teignbridge,E05011899,Chudleigh,E14000623,Central Devon,D1,(England/Wales) Rural town and fringe,E10000008,284509,77456,Teignbridge 004,Teignbridge 004F,,,,,,911,Pre LGR (1998) Devon,3106,Good,South-West England,,10032960701,,0,Not applicable,0,Not applicable,,,E02004204,E01020223,19, | ||
124087,860,Staffordshire,2216,Thomas Barnes Primary School,5,Foundation school,4,Local authority maintained schools,1,Open,0,Not applicable,,0,Not applicable,,2,Primary,4,11,1,No boarders,No Nursery Classes,2,Does not have a sixth form,3,Mixed,0,Does not apply,Does not apply,0,Not applicable,0,Not applicable,105,2,No Special Classes,19-01-2023,106,53,53,9.40,1,Supported by a trust,1579,Tame Valley Co-Operative Learning Trust,Not applicable,,Not under a federation,,,10073083,,Not applicable,28-03-2014,0,Not applicable,19-01-2024,School Lane,Hopwas,,Tamworth,Staffordshire,B78 3AD,http://www.thomasbarnes.staffs.sch.uk,1827213840,Mrs,E,Tibbitts,Headteacher,Not applicable,,,,Not applicable,,Not applicable,Not applicable,Not applicable,,,Not applicable,Not applicable,,,,,,,,,,,,,,,,,,,F,West Midlands,E07000194,Lichfield,E05010672,Whittington & Streethay,E14000986,Tamworth,F1,(England/Wales) Rural hamlet and isolated dwellings,E10000028,417927,305209,Lichfield 008,Lichfield 008B,,,,,,934,Pre LGR (1997) Staffordshire,,Outstanding,West Midlands,,100031712411,,0,Not applicable,0,Not applicable,,,E02006153,E01029517,10, |
This file was deleted.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,108 +1,82 @@ | ||
require "rails_helper" | ||
|
||
RSpec.describe Gias::CsvImporter do | ||
subject(:gias_importer) { described_class.call(file_path) } | ||
subject(:gias_importer) { described_class.call(csv_path) } | ||
|
||
let(:csv_path) { "spec/fixtures/gias/import_with_trusts_and_regions.csv" } | ||
|
||
it_behaves_like "a service object" do | ||
let(:params) { { csv_path: "a_path" } } | ||
let(:params) { { csv_path: } } | ||
end | ||
|
||
context "with an invalid row in the csv" do | ||
# CSV contains 4 valid schools and 3 invalid schools | ||
let(:file_path) { "spec/fixtures/gias/import_with_invalid_schools.csv" } | ||
it "creates new schools" do | ||
expect { gias_importer }.to change(School, :count).from(0).to(5) | ||
end | ||
|
||
it "inserts the correct schools" do | ||
expect { gias_importer }.to change(School, :count).from(0).to(5) | ||
end | ||
it "updates existing schools" do | ||
school = create(:school, urn: "123", name: "The wrong name") | ||
expect { gias_importer }.to change { school.reload.name }.to "FringeSchool" | ||
end | ||
|
||
it "updates the correct schools" do | ||
school = create(:school, urn: "123") | ||
expect { gias_importer }.to change { school.reload.name }.to "FringeSchool" | ||
end | ||
it "logs messages to STDOUT" do | ||
expect(Rails.logger).to receive(:info).with("GIAS Data Imported!") | ||
|
||
it "logs messages to STDOUT" do | ||
expect(Rails.logger).to receive(:info).with("Invalid rows - [\"Row 8 is invalid\"]") | ||
expect(Rails.logger).to receive(:info).with("GIAS Data Imported!") | ||
gias_importer | ||
end | ||
|
||
gias_importer | ||
end | ||
it "associates schools to regions" do | ||
gias_importer | ||
inner_london_school = School.find_by(urn: "130") | ||
outer_london_school = School.find_by(urn: "131") | ||
fringe_school = School.find_by(urn: "123") | ||
rest_of_england_school = School.find_by(urn: "132") | ||
|
||
expect(inner_london_school.region.name).to eq("Inner London") | ||
expect(outer_london_school.region.name).to eq("Outer London") | ||
expect(fringe_school.region.name).to eq("Fringe") | ||
expect(rest_of_england_school.region.name).to eq("Rest of England") | ||
end | ||
|
||
describe "regions" do | ||
context "when associating regions" do | ||
it "associates schools to regions" do | ||
gias_importer | ||
inner_london_school = School.find_by(urn: "130") | ||
outer_london_school = School.find_by(urn: "131") | ||
fringe_school = School.find_by(urn: "123") | ||
rest_of_england_school = School.find_by(urn: "132") | ||
|
||
expect(inner_london_school.region.name).to eq("Inner London") | ||
expect(outer_london_school.region.name).to eq("Outer London") | ||
expect(fringe_school.region.name).to eq("Fringe") | ||
expect(rest_of_england_school.region.name).to eq("Rest of England") | ||
end | ||
describe "associating schools with trusts" do | ||
context "when the trust does not exist" do | ||
it "creates the trust" do | ||
expect { gias_importer }.to change(Trust, :count).from(0).to(1) | ||
end | ||
end | ||
|
||
describe "trusts" do | ||
context "when associating trusts" do | ||
context "and the trust does not exist" do | ||
it "creates the trust" do | ||
expect { gias_importer }.to change(Trust, :count).from(0).to(1) | ||
end | ||
|
||
it "associates schools to the trust" do | ||
gias_importer | ||
trust = Trust.find_by(uid: "12345") | ||
school = School.find_by(urn: "140") | ||
|
||
expect(school.trust).to eq(trust) | ||
end | ||
end | ||
|
||
context "and the trust already exists" do | ||
before do | ||
create(:trust, uid: "12345") | ||
end | ||
|
||
it "does not create a new trust" do | ||
expect { gias_importer }.not_to change(Trust, :count) | ||
end | ||
|
||
it "associates schools to the trust" do | ||
gias_importer | ||
trust = Trust.find_by(uid: "12345") | ||
school = School.find_by(urn: "140") | ||
|
||
expect(school.trust).to eq(trust) | ||
end | ||
end | ||
|
||
context "and the school is not associated with a trust" do | ||
it "does not associate the school to a trust" do | ||
gias_importer | ||
school = School.find_by(urn: "132") | ||
|
||
expect(school.trust).to be_nil | ||
end | ||
end | ||
it "associates schools to the trust" do | ||
gias_importer | ||
trust = Trust.find_by(uid: "12345") | ||
school = School.find_by(urn: "140") | ||
|
||
expect(school.trust).to eq(trust) | ||
end | ||
end | ||
end | ||
|
||
context "with all valid rows in the csv" do | ||
# CSV contains 4 valid schools | ||
let(:file_path) { "spec/fixtures/gias/import_with_all_valid_schools.csv" } | ||
context "when the trust already exists" do | ||
before do | ||
create(:trust, uid: "12345") | ||
end | ||
|
||
it "does not create a new trust" do | ||
expect { gias_importer }.not_to change(Trust, :count) | ||
end | ||
|
||
it "associates schools to the trust" do | ||
gias_importer | ||
trust = Trust.find_by(uid: "12345") | ||
school = School.find_by(urn: "140") | ||
|
||
it "inserts the correct schools" do | ||
expect { gias_importer }.to change(School, :count).from(0).to(4) | ||
expect(school.trust).to eq(trust) | ||
end | ||
end | ||
|
||
it "logs messages to STDOUT" do | ||
expect(Rails.logger).not_to receive(:info).with("Invalid rows - [\"Row 8 is invalid\"]") | ||
expect(Rails.logger).to receive(:info).with("GIAS Data Imported!") | ||
context "when the school is not associated with a trust" do | ||
it "does not associate the school to a trust" do | ||
gias_importer | ||
school = School.find_by(urn: "132") | ||
|
||
gias_importer | ||
expect(school.trust).to be_nil | ||
end | ||
end | ||
end | ||
end |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
require "rails_helper" | ||
|
||
RSpec.describe Gias::CsvTransformer do | ||
subject(:output_file) { described_class.call(input_file) } | ||
|
||
let(:input_file_path) { "spec/fixtures/gias/gias_subset.csv" } | ||
let(:input_file) { File.open(input_file_path, "r") } | ||
|
||
it_behaves_like "a service object" do | ||
let(:params) { { 0 => input_file } } | ||
end | ||
|
||
it "transforms the GIAS CSV" do | ||
expected_output = File.read("spec/fixtures/gias/gias_subset_transformed.csv") | ||
actual_output = output_file.read | ||
expect(actual_output).to eq(expected_output) | ||
end | ||
|
||
it "filters out schools which are Closed or not in England" do | ||
output_csv = CSV.read(output_file, headers: true) | ||
urns = output_csv.values_at("URN").flatten | ||
expect(urns).to eq %w[100000 137666 124087] | ||
end | ||
end |