This package parses data from the Ed-Fi data standard into standard tabular data models.
Its goals are:
- To be as unopinionated as possible in preparing Ed-Fi data for analysis
- To enforce Ed-Fi's rules on data uniqueness and delete processing
- To make Ed-Fi data easier to use and consume, such as by adding simple surrogate keys to represent Ed-Fi's multi-column natural keys
This package can be used on its own, but is part of the larger EDU analytics framework, and we assume the metadata structure produced by EDU's data loading process.
dbt version required: >=1.0.0, <2.0.0
Include the following in your packages.yml
file:
packages:
- package: edanalytics/edu_edfi_source
version: [">=0.2.6", "<0.3.0"]
Note: if you're using the downstream warehouse package, it already includes this source package, and you don't need to install it again.
This package is free to use for noncommercial purposes. See License.
Importing projects must add their own source definitions. We don't include them here because Ed-Fi can be extended and new resources are added over time, so each project should generate their sources from their own Ed-Fi API.
We have code to generate source definitions from an Ed-Fi API's swagger definition here.
This will generate a src_edfi_3.yml
file containing the location and definitions
of all raw raw resources coming from Ed-Fi APIs which must be included in your project.
Any resources not in use can be disabled here to exclude them from dbt's graph.
Resources that have custom columns added via extensions can be added into any model. Ed-Fi extensions tend to come in two flavors: new columns on existing resources, or new resources. This section deals with new columns, while new/custom resources can be handled by adding base/stage models in the importing project.
Extension columns are stored in the API payloads under an _ext
element in the
JSON, which contains a dictionary of named extensions, each containing
their extended columns.
To add these to your tables, you need a configuration block in dbt_project.yml
specifying the name of the extended resource, the desired warehouse column name,
a SQL snippet necessary to extract the element, and the data type you want it to have.
# general structure
extensions:
{stg_table_name}:
{desired_column_name}:
name: '{extension_name}:{element_name}'
dtype: '{SQL data type}'
# example:
extensions:
stg_ef3__student_special_education_program_associations:
bps_iep_exit_date:
name: 'myBPS:iepExitDate'
dtype: 'timestamp'
Base models are views on raw Ed-Fi data, and show the data as it was loaded incrementally. This means uniqueness and deletes are not enforced, so base tables contain the full incremental change history since the sync began.
Base models enforce the following rules:
- Table names are prefixed
base_ef3__
, and follow the API resource names, converted to snake case - Base tables only source data from raw tables
- Base tables represent only one raw object, and do not perform joins
- Metadata from the loading process (such as timestamps, tenant codes, and data lake file names) are preserved
- Ed-Fi deletion records are joined in to produce an
is_deleted
flag. - Data type coercions are enforced
- Column names are converted to snake case
- Descriptors are extracted to simple strings (rather than URIs)
- Nested lists are passed forward untouched
- Reference objects are preserved for key generation
- Columns are re-ordered semantically rather than alphabetically
Base models are a clean, tabular look into the data as retrieved from Ed-Fi incrementally. They should not be used directly, but are useful for understanding how records changed over time, or how the data loading process operates.
Stage models are (by default) materialized tables enforcing Ed-Fi's uniqueness constraints, removing records marked as deleted, and unpacking nested lists. The goal is to present Ed-Fi data in a flat tabular format for further analytics.
Stage models enforce the rules:
- Table names are prefixed
stg_ef3__
, and tables with nested lists are unpacked into sub-tables with the formatresource_name__list_name
- Stage tables generally have a one-to-one relationship with base tables, except in the case of unpacked lists.
- Surrogate keys are generated to create links between objects
- Some tables are 'annualized', to preserve year-over-year changes
- e.g.
courses
are not annualized in Ed-Fi (school_year
is not part of the unique key), but we preserve a separate record for each school year so that course characteristics can change over time.
- e.g.
- Deleted records are removed
- Uniqueness constraints are enforced, and only the most recent version of any element is preserved.
- Ed-Fi extensions are unpacked into the table
Stage models are a true representation of Ed-Fi data as of the last sync, ready for further use directly.
Ed-Fi uses multiple natural keys for joins, and our goal is to simplify this down to a single column key with consistent naming.
Surrogate keys follow these conventions:
- All surrogate keys are prefixed with
k_
- The key name should align to a resource specified in the singular, such as
k_school
ork_student
- If a table must contain multiple foreign keys to the same object, they are disambiguated
with double underscores, such as
k_school__responsibility
andk_school__assignment
- Keys (or other columns contributing to table uniqueness) should be the leftmost columns in a table, generally from least to most distinct values.
- Keys are
md5
hashes of the underlying natural keys concatanated together
gen_skey
converts Ed-Fi reference objects to surrogate keys. There are several
complexities inherent in doing this, which the macro helps to make consistent.
dbt surrogate keys are md5
hashes of the concatenated natural keys.
- We add
tenant_code
to all natural keys to facilitate multi-tenant environments, such as district collaboratives or states. - In some cases we add the API year to annualize the resource, allowing for example each school year to have its own definition of courses.
- Because some Ed-Fi implementations use Case Insensitive Collations (allowing
any casing of a value to be considered identical) we
lower
all natural keys to force them to be byte-identical. - In some cases Ed-Fi is inconsistent in whether dates are treated as
date
ortimestamp
in different places, so we coerce date fields to be strictlydate
to drop the time component. - When natural keys contain descriptors, we drop the namespace/URI components and use only the descriptor code itself.
Also note that reference objects are not included in the base object: while
studentSchoolAssociations
contains a schoolReference
, schools
itself does not.
This means the key must be generated separately in defining objects from referencing
objects. To make things simpler, we always include tenant_code
and (when annualized)
api_year
first, and then remaining natural keys in alphabetical order.
extract_extension
pulls Ed-Fi extension objects out into stand-alone columns.
See the Configuration section for usage.
extract_descriptor
pulls the descriptor_code
from a descriptor URI.
source_edfi3
replaces the standard source
macro, joining in the _deletes
raw table to create the is_deleted
flag.
edorg_ref
breaks apart educationOrganizationReference
objects into a pair
of key columns k_lea
and k_school
, disambiguating such references.
This package depends on dbt_utils
. If your project also makes use of dbt_utils
,
we recommend you remove it from your root packages.yml
to avoid package
version conflicts.
The Education Analytics team only maintains the latest version of the package. We recommend that you stay consistent with the latest version of the package and refer to the CHANGELOG and release notes for more information on changes across versions.
Currently only Snowflake is supported.
We are working on adding the scaffolding for multi-platform support, and once this is in place would welcome contributions.
Contact us if you're interested in support in another platform or contributing to this effort.