Skip to content

Commit

Permalink
Refactor the GIAS importer to follow ETL pattern
Browse files Browse the repository at this point in the history
Currently, school data is imported from GIAS in two steps:

1. Download the CSV
2. Import the CSV (filtering out unwanted schools in the process)

We need to perform some 'massaging' of the data. For example, the
importer excludes schools which aren't needed within our app. These are
filtered out because either the school has closed, or it isn't in
England (which is where the ITT policy currently applies).

We also have an upcoming need to convert the easting/northing fields
provided by GIAS into equivalent latitude/longitude values, which will
power our location-based search. This is another bit of 'massaging'
which needs to happen before the data can be imported.

I've therefore decided to refactor the GIAS import process so it more
explicitly follows an [ETL][1] (extract, transform, load) pattern:

1. Download the CSV
2. Transform the CSV (filter out unwanted schools)
3. Import the CSV

This reduces the responsibility of the 'import' step by extracting the
filtering logic into a standalone 'transform' step. The importer now
simply needs to import every row of the provided CSV. And it'll make it
easier to change and add to the transformation logic in future.

This refactor doesn't affect the outcome of the import. The high-level
integration test for the entire import still passes as before:

```
bundle exec rspec spec/jobs/gias/sync_all_schools_job_spec.rb
```

One change I have made is in testing for edge cases. Previously the
importer handled edge cases such as "invalid" CSV files where schools
were missing a name or URN. However in practice this never happens: the
GIAS CSV file always has a name for each school. The edge-case handling
was also not comprehensive – it didn't handle cases where the CSV was
missing expected columns or other field values, for example.

Since we haven't seen "invalid" CSV files in practice, it seems like a
low-risk change to remove tests for these edge cases. I'd rather keep
things simple. And then if we end up receiving corrupt CSV files in the
future, we'll be alerted by Sentry with unhandled exceptions when
schools fail to save to the database. This will make the problem more
obvious instead of quietly skipping those schools.

[1]: https://en.wikipedia.org/wiki/Extract,_transform,_load
  • Loading branch information
ollietreend committed Apr 9, 2024
1 parent 271a915 commit 9a8fe86
Show file tree
Hide file tree
Showing 9 changed files with 161 additions and 131 deletions.
13 changes: 9 additions & 4 deletions app/jobs/gias/sync_all_schools_job.rb
Original file line number Diff line number Diff line change
Expand Up @@ -3,14 +3,19 @@ class SyncAllSchoolsJob < ApplicationJob
queue_as :default

def perform
tempfile = Gias::CsvDownloader.call
Rails.logger.info "GIAS CSV Downloaded!"
Rails.logger.info "Downloading GIAS CSV file..."
gias_csv = Gias::CsvDownloader.call

Rails.logger.info "Transforming GIAS CSV file..."
transformed_csv = Gias::CsvTransformer.call(gias_csv)

Rails.logger.info "Importing GIAS data"
Gias::CsvImporter.call(tempfile.path)
Gias::CsvImporter.call(transformed_csv.path)
Geocoder::UpdateAllSchoolsJob.perform_later

tempfile.unlink
gias_csv.unlink
transformed_csv.unlink
Rails.logger.info "Done"
end
end
end
23 changes: 1 addition & 22 deletions app/services/gias/csv_importer.rb
Original file line number Diff line number Diff line change
Expand Up @@ -5,9 +5,7 @@ class CsvImporter
include ServicePattern
include RegionalAreas

OPEN_SCHOOL = "1".freeze
SUPPORTED_BY_A_TRUST = "1".freeze
NON_ENGLISH_ESTABLISHMENTS = %w[8 10 25 24 26 27 29 30 32 37 49 56 57].freeze

attr_reader :csv_path

Expand All @@ -16,17 +14,11 @@ def initialize(csv_path)
end

def call
invalid_records = []
school_records = []
trust_records = []
trust_associations = Hash.new { |h, k| h[k] = [] }

CSV
.foreach(csv_path, headers: true)
.with_index(2) do |school, row_number|
invalid_records << "Row #{row_number} is invalid" if invalid?(school)
next if school_excluded?(school) || invalid?(school)

CSV.foreach(csv_path, headers: true) do |school|
school_records << {
urn: school["URN"],
name: school["EstablishmentName"],
Expand Down Expand Up @@ -72,10 +64,6 @@ def call
end
end

if invalid_records.any?
Rails.logger.info "Invalid rows - #{invalid_records.inspect}"
end

Rails.logger.silence do
School.upsert_all(school_records, unique_by: :urn)
Trust.upsert_all(trust_records.uniq, unique_by: :uid)
Expand Down Expand Up @@ -117,14 +105,5 @@ def associate_schools_to_trusts(trust_data)
School.where(urn: urns).update_all(trust_id: trust.id)
end
end

def invalid?(school)
school["URN"].blank? || school["EstablishmentName"].blank?
end

def school_excluded?(school)
school["EstablishmentStatus (code)"] != OPEN_SCHOOL ||
NON_ENGLISH_ESTABLISHMENTS.include?(school["TypeOfEstablishment (code)"])
end
end
end
55 changes: 55 additions & 0 deletions app/services/gias/csv_transformer.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
module Gias
class CsvTransformer
include ServicePattern

def initialize(input_file)
@input_file = input_file
@output_file = Tempfile.new
end

def call
output_csv = CSV.new(output_file)

CSV.new(input_file, headers: true, return_headers: true).each do |row|
if row.header_row? || school_in_scope?(row)
output_csv << row
end
end

# Rewind the file so it's ready for reading
output_file.rewind
output_file
end

private

attr_reader :input_file, :output_file

# The 'School Placements' and 'Funding Mentors' services only need to know
# about schools in England. Closed schools and those outside of England can
# be filtered out from the CSV.
def school_in_scope?(row)
school = School.new(row)
school.open? && school.in_england?
end

class School
OPEN_SCHOOL = "1".freeze
NON_ENGLISH_ESTABLISHMENTS = %w[8 10 25 24 26 27 29 30 32 37 49 56 57].freeze

attr_reader :row

def initialize(row)
@row = row
end

def open?
row.fetch("EstablishmentStatus (code)") == OPEN_SCHOOL
end

def in_england?
!NON_ENGLISH_ESTABLISHMENTS.include? row.fetch("TypeOfEstablishment (code)")
end
end
end
end
4 changes: 4 additions & 0 deletions spec/fixtures/gias/gias_subset_transformed.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
URN,LA (code),LA (name),EstablishmentNumber,EstablishmentName,TypeOfEstablishment (code),TypeOfEstablishment (name),EstablishmentTypeGroup (code),EstablishmentTypeGroup (name),EstablishmentStatus (code),EstablishmentStatus (name),ReasonEstablishmentOpened (code),ReasonEstablishmentOpened (name),OpenDate,ReasonEstablishmentClosed (code),ReasonEstablishmentClosed (name),CloseDate,PhaseOfEducation (code),PhaseOfEducation (name),StatutoryLowAge,StatutoryHighAge,Boarders (code),Boarders (name),NurseryProvision (name),OfficialSixthForm (code),OfficialSixthForm (name),Gender (code),Gender (name),ReligiousCharacter (code),ReligiousCharacter (name),ReligiousEthos (name),Diocese (code),Diocese (name),AdmissionsPolicy (code),AdmissionsPolicy (name),SchoolCapacity,SpecialClasses (code),SpecialClasses (name),CensusDate,NumberOfPupils,NumberOfBoys,NumberOfGirls,PercentageFSM,TrustSchoolFlag (code),TrustSchoolFlag (name),Trusts (code),Trusts (name),SchoolSponsorFlag (name),SchoolSponsors (name),FederationFlag (name),Federations (code),Federations (name),UKPRN,FEHEIdentifier,FurtherEducationType (name),OfstedLastInsp,OfstedSpecialMeasures (code),OfstedSpecialMeasures (name),LastChangedDate,Street,Locality,Address3,Town,County (name),Postcode,SchoolWebsite,TelephoneNum,HeadTitle (name),HeadFirstName,HeadLastName,HeadPreferredJobTitle,BSOInspectorateName (name),InspectorateReport,DateOfLastInspectionVisit,NextInspectionVisit,TeenMoth (name),TeenMothPlaces,CCF (name),SENPRU (name),EBD (name),PlacesPRU,FTProv (name),EdByOther (name),Section41Approved (name),SEN1 (name),SEN2 (name),SEN3 (name),SEN4 (name),SEN5 (name),SEN6 (name),SEN7 (name),SEN8 (name),SEN9 (name),SEN10 (name),SEN11 (name),SEN12 (name),SEN13 (name),TypeOfResourcedProvision (name),ResourcedProvisionOnRoll,ResourcedProvisionCapacity,SenUnitOnRoll,SenUnitCapacity,GOR (code),GOR (name),DistrictAdministrative (code),DistrictAdministrative (name),AdministrativeWard (code),AdministrativeWard (name),ParliamentaryConstituency (code),ParliamentaryConstituency (name),UrbanRural (code),UrbanRural (name),GSSLACode (name),Easting,Northing,MSOA (name),LSOA (name),InspectorateName (name),SENStat,SENNoStat,BoardingEstablishment (name),PropsName,PreviousLA (code),PreviousLA (name),PreviousEstablishmentNumber,OfstedRating (name),RSCRegion (name),Country (name),UPRN,SiteName,QABName (code),QABName (name),EstablishmentAccredited (code),EstablishmentAccredited (name),QABReport,CHNumber,MSOA (code),LSOA (code),FSM,AccreditationExpiryDate
100000,201,City of London,3614,The Aldgate School,2,Voluntary aided school,4,Local authority maintained schools,1,Open,0,Not applicable,,0,Not applicable,,2,Primary,3,11,1,No boarders,Has Nursery Classes,2,Does not have a sixth form,3,Mixed,2,Church of England,Does not apply,CE23,Diocese of London,0,Not applicable,271,2,No Special Classes,19-01-2023,271,144,127,18.10,0,Not applicable,,,Not applicable,,Not under a federation,,,10079319,,Not applicable,19-04-2013,0,Not applicable,09-02-2024,St James's Passage,Duke's Place,,London,,EC3A 5DE,www.thealdgateschool.org,2072831147,Miss,Alexandra,Allan,Headteacher,Not applicable,,,,Not applicable,,Not applicable,Not applicable,Not applicable,,,Not applicable,Not applicable,,,,,,,,,,,,,,,,,,,H,London,E09000001,City of London,E05009308,Portsoken,E14000639,Cities of London and Westminster,A1,(England/Wales) Urban major conurbation,E09000001,533498,181201,City of London 001,City of London 001F,,,,,,999,,,Outstanding,North-West London and South-Central England,,200000071925,,0,Not applicable,0,Not applicable,,,E02000001,E01032739,49,
137666,878,Devon,3106,Chudleigh Knighton Church of England Primary School,34,Academy converter,10,Academies,1,Open,10,Academy Converter,01-11-2011,99,,,2,Primary,5,11,1,No boarders,No Nursery Classes,0,Not applicable,3,Mixed,2,Church of England,Does not apply,CE15,Diocese of Exeter,0,Not applicable,105,2,No Special Classes,19-01-2023,100,51,49,19.00,3,Supported by a multi-academy trust,3104,THE FIRST FEDERATION TRUST,Linked to a sponsor,The First Federation Trust,Not applicable,,,10035224,,Not applicable,09-02-2023,0,Not applicable,13-09-2023,Chudleigh Knighton,,,Newton Abbot,Devon,TQ13 0EU,http://www.chudleigh-knighton.devon.sch.uk,1626852314,Mr,Simon,Westwood,Head of Teaching & Learning,Not applicable,,,,Not applicable,,Not applicable,Not applicable,Not applicable,,,Not applicable,Not applicable,,,,,,,,,,,,,,,,,,,K,South West,E07000045,Teignbridge,E05011899,Chudleigh,E14000623,Central Devon,D1,(England/Wales) Rural town and fringe,E10000008,284509,77456,Teignbridge 004,Teignbridge 004F,,,,,,911,Pre LGR (1998) Devon,3106,Good,South-West England,,10032960701,,0,Not applicable,0,Not applicable,,,E02004204,E01020223,19,
124087,860,Staffordshire,2216,Thomas Barnes Primary School,5,Foundation school,4,Local authority maintained schools,1,Open,0,Not applicable,,0,Not applicable,,2,Primary,4,11,1,No boarders,No Nursery Classes,2,Does not have a sixth form,3,Mixed,0,Does not apply,Does not apply,0,Not applicable,0,Not applicable,105,2,No Special Classes,19-01-2023,106,53,53,9.40,1,Supported by a trust,1579,Tame Valley Co-Operative Learning Trust,Not applicable,,Not under a federation,,,10073083,,Not applicable,28-03-2014,0,Not applicable,19-01-2024,School Lane,Hopwas,,Tamworth,Staffordshire,B78 3AD,http://www.thomasbarnes.staffs.sch.uk,1827213840,Mrs,E,Tibbitts,Headteacher,Not applicable,,,,Not applicable,,Not applicable,Not applicable,Not applicable,,,Not applicable,Not applicable,,,,,,,,,,,,,,,,,,,F,West Midlands,E07000194,Lichfield,E05010672,Whittington & Streethay,E14000986,Tamworth,F1,(England/Wales) Rural hamlet and isolated dwellings,E10000028,417927,305209,Lichfield 008,Lichfield 008B,,,,,,934,Pre LGR (1997) Staffordshire,,Outstanding,West Midlands,,100031712411,,0,Not applicable,0,Not applicable,,,E02006153,E01029517,10,
5 changes: 0 additions & 5 deletions spec/fixtures/gias/import_with_all_valid_schools.csv

This file was deleted.

Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,4 @@ URN,EstablishmentName,Town,Postcode,EstablishmentStatus (code),TypeOfEstablishme
131,OuterLondonSchool,OuterLondonTown,Postcode,1,7,E09000004,,
123,FringeSchool,FringeTown,Postcode,1,7,E06000039,,
132,RestOfEnglandSchool,RestOfEnglandTown,Postcode,1,7,E09000099,,
124,School,Town,Postcode,2,7,E06000031,,
124,School,Town,Postcode,1,8,E06000031,,
124,,Town,Postcode,1,7,E06000031,,
140,TrustSchool,TrustTown,Postcode,1,7,E09000007,1,12345,Department for Education Trust
23 changes: 10 additions & 13 deletions spec/jobs/gias/sync_all_schools_job_spec.rb
Original file line number Diff line number Diff line change
@@ -1,21 +1,18 @@
require "rails_helper"

RSpec.describe Gias::SyncAllSchoolsJob, type: :job do
describe "#perform" do
it "calls Gias::CsvDownloader and Gias::CsvImporter service" do
# Testing Gias::CsvDownloader
today = Time.zone.today.strftime("%Y%m%d")
gias_filename = "edubasealldata#{today}.csv"
tempfile = Tempfile.new("foo")
it "calls Gias::CsvDownloader, then Gias::CsvTransformer, then Gias::CsvImporter" do
downloaded_csv = instance_double(Tempfile)
transformed_csv = instance_double(Tempfile, path: "/fake/path/to/stubbed/file.csv")

allow(Down).to receive(:download).with(
"#{ENV["GIAS_CSV_BASE_URL"]}/#{gias_filename}",
).and_return(tempfile)
allow(Gias::CsvDownloader).to receive(:call).and_return(downloaded_csv)
allow(Gias::CsvTransformer).to receive(:call).with(downloaded_csv).and_return(transformed_csv)

# Testing Gias::CsvImporter
expect(Gias::CsvImporter).to receive(:call).with(tempfile.path)
described_class.perform_now
end
expect(Gias::CsvImporter).to receive(:call).with(transformed_csv.path).ordered
expect(downloaded_csv).to receive(:unlink).ordered
expect(transformed_csv).to receive(:unlink).ordered

described_class.perform_now
end

describe "integration test" do
Expand Down
142 changes: 58 additions & 84 deletions spec/services/gias/csv_importer_spec.rb
Original file line number Diff line number Diff line change
@@ -1,108 +1,82 @@
require "rails_helper"

RSpec.describe Gias::CsvImporter do
subject(:gias_importer) { described_class.call(file_path) }
subject(:gias_importer) { described_class.call(csv_path) }

let(:csv_path) { "spec/fixtures/gias/import_with_trusts_and_regions.csv" }

it_behaves_like "a service object" do
let(:params) { { csv_path: "a_path" } }
let(:params) { { csv_path: } }
end

context "with an invalid row in the csv" do
# CSV contains 4 valid schools and 3 invalid schools
let(:file_path) { "spec/fixtures/gias/import_with_invalid_schools.csv" }
it "creates new schools" do
expect { gias_importer }.to change(School, :count).from(0).to(5)
end

it "inserts the correct schools" do
expect { gias_importer }.to change(School, :count).from(0).to(5)
end
it "updates existing schools" do
school = create(:school, urn: "123", name: "The wrong name")
expect { gias_importer }.to change { school.reload.name }.to "FringeSchool"
end

it "updates the correct schools" do
school = create(:school, urn: "123")
expect { gias_importer }.to change { school.reload.name }.to "FringeSchool"
end
it "logs messages to STDOUT" do
expect(Rails.logger).to receive(:info).with("GIAS Data Imported!")

it "logs messages to STDOUT" do
expect(Rails.logger).to receive(:info).with("Invalid rows - [\"Row 8 is invalid\"]")
expect(Rails.logger).to receive(:info).with("GIAS Data Imported!")
gias_importer
end

gias_importer
end
it "associates schools to regions" do
gias_importer
inner_london_school = School.find_by(urn: "130")
outer_london_school = School.find_by(urn: "131")
fringe_school = School.find_by(urn: "123")
rest_of_england_school = School.find_by(urn: "132")

expect(inner_london_school.region.name).to eq("Inner London")
expect(outer_london_school.region.name).to eq("Outer London")
expect(fringe_school.region.name).to eq("Fringe")
expect(rest_of_england_school.region.name).to eq("Rest of England")
end

describe "regions" do
context "when associating regions" do
it "associates schools to regions" do
gias_importer
inner_london_school = School.find_by(urn: "130")
outer_london_school = School.find_by(urn: "131")
fringe_school = School.find_by(urn: "123")
rest_of_england_school = School.find_by(urn: "132")

expect(inner_london_school.region.name).to eq("Inner London")
expect(outer_london_school.region.name).to eq("Outer London")
expect(fringe_school.region.name).to eq("Fringe")
expect(rest_of_england_school.region.name).to eq("Rest of England")
end
describe "associating schools with trusts" do
context "when the trust does not exist" do
it "creates the trust" do
expect { gias_importer }.to change(Trust, :count).from(0).to(1)
end
end

describe "trusts" do
context "when associating trusts" do
context "and the trust does not exist" do
it "creates the trust" do
expect { gias_importer }.to change(Trust, :count).from(0).to(1)
end

it "associates schools to the trust" do
gias_importer
trust = Trust.find_by(uid: "12345")
school = School.find_by(urn: "140")

expect(school.trust).to eq(trust)
end
end

context "and the trust already exists" do
before do
create(:trust, uid: "12345")
end

it "does not create a new trust" do
expect { gias_importer }.not_to change(Trust, :count)
end

it "associates schools to the trust" do
gias_importer
trust = Trust.find_by(uid: "12345")
school = School.find_by(urn: "140")

expect(school.trust).to eq(trust)
end
end

context "and the school is not associated with a trust" do
it "does not associate the school to a trust" do
gias_importer
school = School.find_by(urn: "132")

expect(school.trust).to be_nil
end
end
it "associates schools to the trust" do
gias_importer
trust = Trust.find_by(uid: "12345")
school = School.find_by(urn: "140")

expect(school.trust).to eq(trust)
end
end
end

context "with all valid rows in the csv" do
# CSV contains 4 valid schools
let(:file_path) { "spec/fixtures/gias/import_with_all_valid_schools.csv" }
context "when the trust already exists" do
before do
create(:trust, uid: "12345")
end

it "does not create a new trust" do
expect { gias_importer }.not_to change(Trust, :count)
end

it "associates schools to the trust" do
gias_importer
trust = Trust.find_by(uid: "12345")
school = School.find_by(urn: "140")

it "inserts the correct schools" do
expect { gias_importer }.to change(School, :count).from(0).to(4)
expect(school.trust).to eq(trust)
end
end

it "logs messages to STDOUT" do
expect(Rails.logger).not_to receive(:info).with("Invalid rows - [\"Row 8 is invalid\"]")
expect(Rails.logger).to receive(:info).with("GIAS Data Imported!")
context "when the school is not associated with a trust" do
it "does not associate the school to a trust" do
gias_importer
school = School.find_by(urn: "132")

gias_importer
expect(school.trust).to be_nil
end
end
end
end
24 changes: 24 additions & 0 deletions spec/services/gias/csv_transformer_spec.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
require "rails_helper"

RSpec.describe Gias::CsvTransformer do
subject(:output_file) { described_class.call(input_file) }

let(:input_file_path) { "spec/fixtures/gias/gias_subset.csv" }
let(:input_file) { File.open(input_file_path, "r") }

it_behaves_like "a service object" do
let(:params) { { 0 => input_file } }
end

it "transforms the GIAS CSV" do
expected_output = File.read("spec/fixtures/gias/gias_subset_transformed.csv")
actual_output = output_file.read
expect(actual_output).to eq(expected_output)
end

it "filters out schools which are Closed or not in England" do
output_csv = CSV.read(output_file, headers: true)
urns = output_csv.values_at("URN").flatten
expect(urns).to eq %w[100000 137666 124087]
end
end

0 comments on commit 9a8fe86

Please sign in to comment.