Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to DB native export format #357

Merged
merged 7 commits into from
Mar 4, 2025
Merged

Switch to DB native export format #357

merged 7 commits into from
Mar 4, 2025

Conversation

dogversioning
Copy link
Contributor

This changes exports to use whatever the db-appropriate method of exporting while preserving types is, as opposed to trying to cast columns based on an observed type converted to a parquet value (which can be a little lossy).

Checklist

  • Consider if documentation in docs/ needs to be updated
    • If you've changed the structure of a table, you may need to run generate-md
    • If you've added/removed core study fields that not in US Core, update our list of those in core-study-details.md
  • Consider if tests should be added
  • Update template repo if there are changes to study configuration in manifest.toml

Comment on lines +164 to +168
self.connection.cursor().execute(f"""UNLOAD
(SELECT * FROM {table_name})
TO '{s3_path}'
WITH (format='PARQUET', compression='SNAPPY')
""") # noqa: S608
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could? create a centralized jinja template for these DB specific queires. I don't know how much this export mechanism needs the injection protection, since it's not in queries that are being distributed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table name comes from the manifest yeah? So it is user input, which does mean it's subject to chicanery, by the user or study author or malicious 3rd party app modifying files in $HOME... (btw: do we have any sanitizing of table names when reading the manifest?)

But I'm not overly stressed - I'll leave it up to you on risk assessment here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do have a regex validation for custom prefixes, but not for table names. if we did that, I'd be fine saying this is safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but... on the other hand, a user can run a DROP TABLE query if they edit the manifest and it would be valid, so... maybe we're just not safe from a malicious user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a malicious script on the user's machine

Copy link

github-actions bot commented Mar 3, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
2439 2439 100% 100% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
cumulus_library/actions/exporter.py 100% 🟢
cumulus_library/databases/athena.py 100% 🟢
cumulus_library/databases/base.py 100% 🟢
cumulus_library/databases/duckdb.py 100% 🟢
TOTAL 100% 🟢

updated for commit: e4131b4 by action🐍

@@ -208,6 +205,16 @@ def upload_file(
have an API for file upload (i.e. cloud databases)"""
return None

@abc.abstractmethod
def export_table_as_parquet(
self, table_name: str, table_type: str, location: pathlib.Path, *args, **kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: table_type seems like an odd addition here. In both implementations, it's just used to create the output file name, which felt like a bit of duplicated business logic.

What if you used a "input/output, did it write anything" pattern like (totally ignorable suggestion, just brainstorming):

def export_table_as_parquet(self, table_name: str, output_path: pathlib.Path) -> bool:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you need the bare table name for the actual sql query, and then this concat with table_type to tell you if it's flat/cube/something else. I like that a :little: bit more than having a split for readability.

And I need the path downstream so that I have one place to handle the 'take the parquet and create a csv from it once it's downloaded' logic. So... I think I'm going to soft advocate for as is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, as discussed offline, this now takes a filename arg and returns a bool

Comment on lines +164 to +168
self.connection.cursor().execute(f"""UNLOAD
(SELECT * FROM {table_name})
TO '{s3_path}'
WITH (format='PARQUET', compression='SNAPPY')
""") # noqa: S608
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table name comes from the manifest yeah? So it is user input, which does mean it's subject to chicanery, by the user or study author or malicious 3rd party app modifying files in $HOME... (btw: do we have any sanitizing of table names when reading the manifest?)

But I'm not overly stressed - I'll leave it up to you on risk assessment here.

16,,"2018-08-01",,
16,,2018-08-01,,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that Vlad's dashboard sometimes makes assumptions about the quoting scheme in use. Have you tested whether this scheme works correctly? You might have to add some flags to the csv exporter to keep the same quoting behavior if not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed this format works.

@dogversioning dogversioning merged commit 91eb861 into main Mar 4, 2025
7 checks passed
@dogversioning dogversioning deleted the mg/db_based_export branch March 4, 2025 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants