-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to DB native export format #357
Conversation
fbaccb2
to
338d165
Compare
338d165
to
f012522
Compare
self.connection.cursor().execute(f"""UNLOAD | ||
(SELECT * FROM {table_name}) | ||
TO '{s3_path}' | ||
WITH (format='PARQUET', compression='SNAPPY') | ||
""") # noqa: S608 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could? create a centralized jinja template for these DB specific queires. I don't know how much this export mechanism needs the injection protection, since it's not in queries that are being distributed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table name comes from the manifest yeah? So it is user input, which does mean it's subject to chicanery, by the user or study author or malicious 3rd party app modifying files in $HOME... (btw: do we have any sanitizing of table names when reading the manifest?)
But I'm not overly stressed - I'll leave it up to you on risk assessment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do have a regex validation for custom prefixes, but not for table names. if we did that, I'd be fine saying this is safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but... on the other hand, a user can run a DROP TABLE query if they edit the manifest and it would be valid, so... maybe we're just not safe from a malicious user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or a malicious script on the user's machine
☂️ Python Coverage
Overall Coverage
New FilesNo new covered files... Modified Files
|
8fb78ba
to
29174e4
Compare
cumulus_library/databases/base.py
Outdated
@@ -208,6 +205,16 @@ def upload_file( | |||
have an API for file upload (i.e. cloud databases)""" | |||
return None | |||
|
|||
@abc.abstractmethod | |||
def export_table_as_parquet( | |||
self, table_name: str, table_type: str, location: pathlib.Path, *args, **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: table_type
seems like an odd addition here. In both implementations, it's just used to create the output file name, which felt like a bit of duplicated business logic.
What if you used a "input/output, did it write anything" pattern like (totally ignorable suggestion, just brainstorming):
def export_table_as_parquet(self, table_name: str, output_path: pathlib.Path) -> bool:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you need the bare table name for the actual sql query, and then this concat with table_type to tell you if it's flat/cube/something else. I like that a :little: bit more than having a split for readability.
And I need the path downstream so that I have one place to handle the 'take the parquet and create a csv from it once it's downloaded' logic. So... I think I'm going to soft advocate for as is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, as discussed offline, this now takes a filename
arg and returns a bool
self.connection.cursor().execute(f"""UNLOAD | ||
(SELECT * FROM {table_name}) | ||
TO '{s3_path}' | ||
WITH (format='PARQUET', compression='SNAPPY') | ||
""") # noqa: S608 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table name comes from the manifest yeah? So it is user input, which does mean it's subject to chicanery, by the user or study author or malicious 3rd party app modifying files in $HOME... (btw: do we have any sanitizing of table names when reading the manifest?)
But I'm not overly stressed - I'll leave it up to you on risk assessment here.
16,,"2018-08-01",, | ||
16,,2018-08-01,, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that Vlad's dashboard sometimes makes assumptions about the quoting scheme in use. Have you tested whether this scheme works correctly? You might have to add some flags to the csv exporter to keep the same quoting behavior if not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed this format works.
9b9e7bf
to
6bc5bbf
Compare
6bc5bbf
to
e4131b4
Compare
This changes exports to use whatever the db-appropriate method of exporting while preserving types is, as opposed to trying to cast columns based on an observed type converted to a parquet value (which can be a little lossy).
Checklist
docs/
needs to be updatedgenerate-md
core
study fields that not in US Core, update our list of those incore-study-details.md
manifest.toml