-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Polars a first class citizen for pulling and loading data to snowflake #295
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall good work with the logic.
I have a few requested changes regarding undeleted comments.
Then a few questions/comments on potentially using more dispatching and removing support for LazyFrame that would love to hear your thoughts on.
pyproject.toml
Outdated
@@ -6,7 +6,7 @@ authors = [ | |||
{ name="Faisal Dosani", email="[email protected]" }, | |||
] | |||
license = {text = "Apache Software License"} | |||
dependencies = ["boto3<=1.35.9,>=1.9.92", "PyYAML<=6.0.1,>=5.1", "pandas<=2.2.2,>=0.25.2", "numpy<=2.0.2,>=1.22.0"] | |||
dependencies = ["boto3<=1.35.9,>=1.9.92", "PyYAML<=6.0.1,>=5.1", "pandas<=2.2.2,>=0.25.2", "numpy<=2.0.2,>=1.22.0", "polars>=1.5.0"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 1.5 actually the min version that works? I think it's worth looking into this as I think people are still using versions before 1.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the minimum version needed is 1.0 for one of the methods used (collect_schema) which is necessary for supporting both lazyframe and dataframe. It really draws down to if we want to support both. I personally prefer setting it to at least 1.0.0 because of the significant upgrades from 0.20.0 to 1.0.0.
locopy/snowflake.py
Outdated
|
||
if columns: | ||
dataframe = dataframe[columns] | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the method, would it be possible to use a single dispatch helper here for the first part of this function? E.g. a dispatched function for line 431 to 456 - creating the to_insert
. Once we have the list of tuples, the rest is the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is possible, my only concern was readability for future developers, but happy to refactor it to dispatched function if there is any significant upside
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more readable to use the isinstance checks below to select columns from dataframe. I don't think anything between requires this to be done first. That or doing the dispatch. Lmk if I'm wrong here. Like moving the dataframe=dataframe[columns]
into the pandas branch and the .select(columns)
into the polars branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually polars dataframe can read dataframe[columns]
as of version 1.0.0, the .select(columns)
was intended for LazyFrame. Since we are not supporting LazyFrame now, it can be removed completely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
locopy/redshift.py
Outdated
try: | ||
length = len(dataframe) | ||
except TypeError: | ||
length = dataframe.select(pl.len()).collect().item() # for polars lazyframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here on LazyFrame. If we want to consider not supporting that.
Same comment here on considering dispatch helper method to get to_insert
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last comment that I missed last time, everything else looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, good work!
pyproject.toml
&requirements.txt
add polars as dependencyFor pulling data from snowflake:
to_dataframe
in database.py and snowflake.py: added an argument calleddf_type
for users to specify the dataframe type of output, defaults to "pandas" to prevent any breaking changes for exisiting users. snowflake.py version uses pyarrow form instead of native fetching to improve performanceFor loading data to snowflake:
insert_dataframe_to_table
in snowflake.py and redshift.py: allowdataframe
argument to take inpolars.DataFrame
andpolars.LazyFrame
find_column_type
in utility.py uses singledispatch to extend function to have polars functionality.