Switch to DB native export format #357

dogversioning · 2025-03-03T20:11:12Z

This changes exports to use each database's native, type-preserving export method, as opposed to trying to cast columns based on an observed type converted to a parquet value (which can be a little lossy).

Checklist

Consider if documentation in docs/ needs to be updated
- If you've changed the structure of a table, you may need to run generate-md
- If you've added/removed core study fields that are not in US Core, update our list of those in core-study-details.md
Consider if tests should be added
Update template repo if there are changes to study configuration in manifest.toml

dogversioning · 2025-03-03T20:35:45Z

cumulus_library/databases/athena.py

+        self.connection.cursor().execute(f"""UNLOAD
+            (SELECT * FROM {table_name})
+            TO '{s3_path}'
+            WITH (format='PARQUET', compression='SNAPPY')
+            """)  # noqa: S608


I could? create a centralized jinja template for these DB-specific queries. I don't know how much this export mechanism needs the injection protection, since it's not in queries that are being distributed.

The table name comes from the manifest yeah? So it is user input, which does mean it's subject to chicanery, by the user or study author or malicious 3rd party app modifying files in $HOME... (btw: do we have any sanitizing of table names when reading the manifest?)

But I'm not overly stressed - I'll leave it up to you on risk assessment here.

we do have a regex validation for custom prefixes, but not for table names. if we did that, I'd be fine saying this is safe.

but... on the other hand, a user can run a DROP TABLE query if they edit the manifest and it would be valid, so... maybe we're just not safe from a malicious user.

Or a malicious script on the user's machine
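For reference, the table-name validation floated above could look something like the following. This is only a sketch: the identifier regex, `validate_table_name`, and `build_unload_query` are hypothetical names and are not part of cumulus-library, and the allowed character set is an assumption about what manifest table names look like.

```python
import re

# Assumption: valid table names are plain identifiers (letters, digits,
# underscores), e.g. "core__count_allergyintolerance_month".
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_table_name(table_name: str) -> str:
    """Return the name unchanged if it is a plain identifier, else raise."""
    if not _TABLE_NAME.fullmatch(table_name):
        raise ValueError(f"Unsafe table name: {table_name!r}")
    return table_name


def build_unload_query(table_name: str, s3_path: str) -> str:
    """Build the UNLOAD statement only after both inputs pass basic checks."""
    safe_name = validate_table_name(table_name)
    # The path is wrapped in single quotes in the query, so reject any
    # value that could close that string literal early.
    if "'" in s3_path:
        raise ValueError(f"Unsafe export path: {s3_path!r}")
    return (
        f"UNLOAD (SELECT * FROM {safe_name}) "
        f"TO '{s3_path}' "
        "WITH (format='PARQUET', compression='SNAPPY')"
    )
```

This only narrows what can be interpolated into the export query. As noted above, it does nothing about a user or script that edits the manifest to run other statements.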

github-actions · 2025-03-03T20:47:42Z

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
2439	2439	100%	100%	🟢

New Files

No new covered files...

Modified Files

File	Coverage	Status
cumulus_library/actions/exporter.py	100%	🟢
cumulus_library/databases/athena.py	100%	🟢
cumulus_library/databases/base.py	100%	🟢
cumulus_library/databases/duckdb.py	100%	🟢
TOTAL	100%	🟢

updated for commit: e4131b4 by action🐍

mikix · 2025-03-04T14:31:48Z

cumulus_library/databases/base.py

@@ -208,6 +205,16 @@ def upload_file(
        have an API for file upload (i.e. cloud databases)"""
        return None

+    @abc.abstractmethod
+    def export_table_as_parquet(
+        self, table_name: str, table_type: str, location: pathlib.Path, *args, **kwargs


nit: table_type seems like an odd addition here. In both implementations, it's just used to create the output file name, which felt like a bit of duplicated business logic.

What if you used an "input/output, did it write anything" pattern like (totally ignorable suggestion, just brainstorming):

def export_table_as_parquet(self, table_name: str, output_path: pathlib.Path) -> bool:

Well, you need the bare table name for the actual sql query, and then this concat with table_type to tell you if it's flat/cube/something else. I like that a :little: bit more than having a split for readability.

And I need the path downstream so that I have one place to handle the 'take the parquet and create a csv from it once it's downloaded' logic. So... I think I'm going to soft advocate for as is?

ok, as discussed offline, this now takes a filename arg and returns a bool
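To illustrate the shape described in that last comment (a filename argument in, a bool out), here is a rough sketch. It is not the merged code: `DatabaseBackend`, `InMemoryBackend`, the exact parameter names, and the `.cube.parquet` file name are assumptions, and the stand-in backend writes raw bytes rather than real parquet.

```python
import abc
import pathlib


class DatabaseBackend(abc.ABC):
    """Hypothetical base class; only the export method is sketched."""

    @abc.abstractmethod
    def export_table_as_parquet(
        self, table_name: str, file_name: str, location: pathlib.Path
    ) -> bool:
        """Write table_name to location/file_name.

        Returns True if a file was written, False if there was nothing
        to export.
        """


class InMemoryBackend(DatabaseBackend):
    """Stand-in backend: 'tables' are just byte payloads keyed by name."""

    def __init__(self, tables: dict[str, bytes]):
        self.tables = tables

    def export_table_as_parquet(
        self, table_name: str, file_name: str, location: pathlib.Path
    ) -> bool:
        payload = self.tables.get(table_name, b"")
        if not payload:
            # Nothing to export, so tell the caller no file exists.
            return False
        location.mkdir(parents=True, exist_ok=True)
        (location / file_name).write_bytes(payload)
        return True
```

With this shape the caller composes the file name (for example `f"{table_name}.cube.parquet"`) and already knows the output path, so the "convert the parquet to csv" step can stay in one place and run only when the call returns True.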

mikix · 2025-03-04T14:43:24Z

mikix · 2025-03-04T14:53:31Z

tests/test_data/duckdb_data/expected_export/core/core__count_allergyintolerance_month.cube.csv

-16,,"2018-08-01",,
+16,,2018-08-01,,


I know that Vlad's dashboard sometimes makes assumptions about the quoting scheme in use. Have you tested whether this scheme works correctly? You might have to add some flags to the csv exporter to keep the same quoting behavior if not.

Confirmed this format works.
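As a quick sanity check on the quoting change, the sketch below uses Python's standard `csv` module to show that the old and new lines parse to the same fields, and that minimal quoting produces the new, unquoted form. It says nothing about how the dashboard's own parser behaves; the row values are taken from the diff above, and treating the empty columns as `None` is an assumption.

```python
import csv
import io

old_line = '16,,"2018-08-01",,'
new_line = "16,,2018-08-01,,"

# A quote-aware CSV parser strips the optional quotes, so both lines
# yield identical fields.
old_row = next(csv.reader([old_line]))
new_row = next(csv.reader([new_line]))

# QUOTE_MINIMAL only quotes fields that contain the delimiter, the quote
# character, or a line break, so the date is written without quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow([16, None, "2018-08-01", None, None])
minimal_line = buffer.getvalue().strip()

print(old_row == new_row)
print(minimal_line)
```

A consumer that splits on commas without handling quotes would see different values for the two lines, which is the kind of assumption the question above is about.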

dogversioning force-pushed the mg/db_based_export branch from fbaccb2 to 338d165 Compare March 3, 2025 20:23

Switch to DB native export format

f012522

dogversioning force-pushed the mg/db_based_export branch from 338d165 to f012522 Compare March 3, 2025 20:31

base db docstring tweak

bc33066

Switched duckdb sort to be pandas based

29174e4

dogversioning force-pushed the mg/db_based_export branch from 8fb78ba to 29174e4 Compare March 3, 2025 21:07

cleanup dangling chunk ref

a7051c9

mikix approved these changes Mar 4, 2025

dogversioning added 2 commits March 4, 2025 13:33

Coverage, PR feedback

33a5ba3

moved sort out of export

a6b7768

dogversioning force-pushed the mg/db_based_export branch from 9b9e7bf to 6bc5bbf Compare March 4, 2025 19:41

Added credentials to client

e4131b4

dogversioning force-pushed the mg/db_based_export branch from 6bc5bbf to e4131b4 Compare March 4, 2025 19:58

dogversioning merged commit 91eb861 into main Mar 4, 2025
7 checks passed

dogversioning deleted the mg/db_based_export branch March 4, 2025 20:58
