chore(pipeline): reuse schema validation in pipeline #290
base: main
Conversation
The trick is to leverage plpython to do the geocoding inside the database. Doing so, geocoding can be modelled as a dbt model and orchestrated as such. The geocoding implementation has been moved to an actual package maintained next to the datawarehouse image; the plpython UDF simply wraps the call.

Is it worth it?

* It heavily simplifies the flow and sets clearer concerns between Airflow and dbt: dbt does the transformation, Airflow orchestrates it.
* It is less error-prone, since we no longer have to pull data from the db and map it to Python ourselves.
* Geocoding can now simply be done per source, which would otherwise have been terribly cumbersome.
* We can even leverage dbt to do the geocoding incrementally, only on inputs that have not been seen before. This will drastically reduce our carbon footprint...

There are a few enhancements we would probably want:

* obviously, clean up
* define a macro with the geocoding model that can be used for all sources
* re-do geocodings for relatively old inputs
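As a rough illustration of the wrapping described above, a plpython UDF around the geocoding package could look like the sketch below. The function name, signature and the `dwh_geocoding` package are assumptions, not the actual implementation.

```sql
-- Minimal sketch only: the function name, signature and the python package
-- ("dwh_geocoding") are assumptions, not the actual implementation.
CREATE OR REPLACE FUNCTION geocode(
    adresse TEXT,
    code_postal TEXT,
    commune TEXT
)
RETURNS JSONB
LANGUAGE plpython3u
AS $$
    import json

    # the geocoding logic lives in a regular python package installed in the
    # datawarehouse image; the UDF is only a thin wrapper around the call
    from dwh_geocoding import geocode_one  # hypothetical package and function

    result = geocode_one(adresse=adresse, code_postal=code_postal, commune=commune)

    # PL/Python passes the returned string through jsonb's input function
    return json.dumps(result)
$$;
```

A dbt model can then call such a UDF row by row, which is what makes per-source and incremental geocoding straightforward to express in SQL.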
Currently, validation differs between the pipeline and the api. The api relies on the schema validation, using pydantic, but the pipeline has its own SQL-based validation. This can lead to inconsistencies between the pipeline and the api, and duplicates code. This commit leverages plpython to reuse the schema validation in the pipeline. Validation is done at the source level. This commit also materializes validation errors in a dedicated model, rather than using dbt data tests, which are less convenient for product teams to use.
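A minimal sketch of what the `VALIDATE` wrapper could look like follows, assuming the pydantic models used by the api are importable inside the datawarehouse image. Only the call shape `VALIDATE('structure', TO_JSONB(structures))` comes from the diff below; the module name, returned columns and error mapping are assumptions.

```sql
-- Minimal sketch only: the imported module, the returned columns and the
-- error mapping are assumptions, not the actual implementation.
CREATE OR REPLACE FUNCTION validate(resource_type TEXT, document JSONB)
RETURNS TABLE(field TEXT, message TEXT, input JSONB)
LANGUAGE plpython3u
AS $$
    import json

    import pydantic

    # hypothetical mapping to the pydantic models shared with the api
    from api_schema import MODELS_BY_RESOURCE_TYPE

    model = MODELS_BY_RESOURCE_TYPE[resource_type]

    try:
        # jsonb arguments reach PL/Python as strings by default
        model(**json.loads(document))
    except pydantic.ValidationError as exc:
        # one row per validation error, ready to be materialized in a model
        return [
            {
                "field": ".".join(str(loc) for loc in error["loc"]),
                "message": error["msg"],
                "input": document,
            }
            for error in exc.errors()
        ]

    # no rows for valid documents
    return []
$$;
```

Because the function returns a set of rows, it can be used in a `LATERAL` join per source, as shown in the diff fragments below.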
	$(PIP_COMPILE) --extra=dev --output-file=requirements/dev-requirements.txt

test:
	$(PIP_COMPILE) --extra=test --output-file=requirements/test-requirements.txt
see requirements.mk
^^
FROM
    structures,
    LATERAL (SELECT * FROM VALIDATE('structure', TO_JSONB(structures))) AS errors
),
I'm going to be so happy to see these tables, it's such a pain to debug this kind of thing at the moment...
            OR {{ this }}.input != ARRAY[adresses.adresse, adresses.code_postal, adresses.commune]
        {% endif %}
    )
) AS geocodings ON adresses._di_surrogate_id = geocodings.adresse_id
As mentioned elsewhere, a unit test of this code would greatly reassure me.
    data_tests:
      - check_structure:
          config:
            severity: warn
We mustn't forget to remove the associated macros. What about check_adresses?
    errors.input AS "input"
FROM
    structures,
    LATERAL (SELECT * FROM VALIDATE('structure', TO_JSONB(structures))) AS errors
Question for later: when we have more advanced validation, how do you see things? Do we put everything in the plpython validate, which would call, for example, the pydantic validation plus HTTP validations? Or rather additional steps like the geocoding (personally, I'd lean more towards that)?
depends on #270