
chore(pipeline): reuse schema validation in pipeline #290

Draft: wants to merge 2 commits into main

Conversation

@vmttn (Contributor) commented Sep 11, 2024

depends on #270

The trick is to leverage plpython to do the geocoding
inside the database. Doing so, geocoding can be
modelled as a dbt model and orchestrated as such.

The geocoding implementation has been moved to an
actual package, maintained next to the datawarehouse
image. The plpython UDF simply wraps the call.
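To make the idea concrete, here is a minimal plain-Python sketch of what such a plpython UDF body would reduce to. The function and package names (`geocode`, `fake_geocoder`) are assumptions for illustration; the real packaged implementation would call an actual geocoding service.

```python
# Sketch of what a plpython geocoding UDF wrapper could boil down to.
# `fake_geocoder` stands in for the hypothetical geocoding package shipped
# with the datawarehouse image; the real package would do the API calls.

from typing import Iterable


def fake_geocoder(addresses: Iterable[dict]) -> list[dict]:
    """Stub: the real package would call a geocoding service here."""
    return [
        {
            "adresse_id": a["adresse_id"],
            "latitude": 48.85,
            "longitude": 2.35,
            "score": 0.9,
        }
        for a in addresses
    ]


def geocode(addresses: list[dict]) -> list[dict]:
    """The UDF body itself: hand the input rows to the packaged
    implementation and return the result rows to SQL."""
    return fake_geocoder(addresses)


rows = geocode(
    [{"adresse_id": "s1", "adresse": "1 rue x",
      "code_postal": "75001", "commune": "Paris"}]
)
```

The point is that the UDF stays a thin wrapper: all the geocoding logic lives in the versioned package, next to the datawarehouse image.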

Is it worth it?

* it heavily simplifies the flow and sets clearer boundaries
between Airflow and dbt: dbt does the transformation, Airflow
orchestrates it.
* it is less error-prone, since we no longer have to pull data from
the db and map it to Python ourselves.
* geocoding can now simply be done per source. This would otherwise
have been terribly cumbersome.
* we can even leverage dbt to do the geocoding incrementally, on
inputs that have not been seen before. This will drastically
reduce our carbon footprint...

There are a few enhancements we would probably want:

* obviously, clean-up
* define a macro with the geocoding model that can be used for all
sources
* redo the geocoding for relatively old inputs
Currently, validation differs between the pipeline and the api.
The api relies on the schema validation, using pydantic, but
the pipeline has its own sql-based validation. This can lead
to inconsistencies between pipeline and api, and duplicates code.

This commit leverages plpython to reuse the schema validation
in the pipeline.

Validation is done at the source level. This commit also materializes
validation errors in a dedicated model, rather than using dbt data
tests, which are less convenient for product use.
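As a rough illustration of the error rows such a model could materialize, here is a stand-in validator using only the standard library. The real pipeline would call the pydantic schema through the plpython VALIDATE UDF; the field names (`nom`, `code_postal`) and the error-row shape below are illustrative assumptions.

```python
# Illustrative stand-in for the reused schema validation: one error row
# per violated constraint, which is the shape a dedicated "validation
# errors" model can then materialize and expose to product teams.


def validate_structure(record: dict) -> list[dict]:
    errors = []
    # Hypothetical constraint: a structure must have a non-empty name.
    if not record.get("nom"):
        errors.append({"field": "nom",
                       "message": "field required",
                       "input": record.get("nom")})
    # Hypothetical constraint: postal codes are 5-digit strings.
    code_postal = record.get("code_postal", "")
    if not (isinstance(code_postal, str)
            and len(code_postal) == 5
            and code_postal.isdigit()):
        errors.append({"field": "code_postal",
                       "message": "must be a 5-digit string",
                       "input": code_postal})
    return errors


errs = validate_structure({"nom": "", "code_postal": "7500"})
```

Materializing these rows, instead of burying them in dbt test failures, is what makes the errors queryable per source.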
$(PIP_COMPILE) --extra=dev --output-file=requirements/dev-requirements.txt

test:
$(PIP_COMPILE) --extra=test --output-file=requirements/test-requirements.txt
Contributor commented:

see requirements.mk ^^

FROM
structures,
LATERAL (SELECT * FROM VALIDATE('structure', TO_JSONB(structures))) AS errors
),
Contributor commented:

I'm going to be so happy to see these tables; it's so painful to debug this kind of thing at the moment...

OR {{ this }}.input != ARRAY[adresses.adresse, adresses.code_postal, adresses.commune]
{% endif %}
)
) AS geocodings ON adresses._di_surrogate_id = geocodings.adresse_id
Contributor commented:

as said elsewhere, a unit test for this code would greatly reassure me
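The incremental condition in the snippet above (re-geocode when the stored input array differs) can be sketched as a plain function. The names `rows_to_geocode`, `current` and `previous` are illustrative; the real logic lives in the dbt model's `is_incremental()` branch.

```python
# Sketch of the incremental selection the dbt model expresses in SQL:
# geocode a row when it has never been seen, or when its
# (adresse, code_postal, commune) input tuple changed since last run.


def rows_to_geocode(current: dict[str, tuple],
                    previous: dict[str, tuple]) -> set[str]:
    return {
        adresse_id
        for adresse_id, inp in current.items()
        if adresse_id not in previous or previous[adresse_id] != inp
    }


previous = {"a1": ("1 rue x", "75001", "Paris")}
current = {"a1": ("1 rue x", "75001", "Paris"),   # unchanged -> skipped
           "a2": ("2 rue y", "69001", "Lyon")}    # new -> geocoded
todo = rows_to_geocode(current, previous)
```

Only the changed or unseen rows hit the geocoder, which is what makes the incremental run cheap.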

data_tests:
- check_structure:
config:
severity: warn
Contributor commented:

we shouldn't forget to remove the associated macros. What about check_adresses?

errors.input AS "input"
FROM
structures,
LATERAL (SELECT * FROM VALIDATE('structure', TO_JSONB(structures))) AS errors
Contributor commented:

Question for later: once we have more advanced validation, how do you see things? Do we put everything into the plpython validate, which would then call, say, the pydantic validation plus some HTTP validations? Or rather additional steps, like the geocoding (I'd lean towards the latter)?
