Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Polars a first class citizen for pulling and loading data to snowflake #295

Merged
merged 17 commits into from
Sep 5, 2024

Conversation

gladysteh99
Copy link
Contributor

pyproject.toml & requirements.txt add polars as dependency

For pulling data from snowflake:

  • to_dataframe in database.py and snowflake.py: added an argument called df_type for users to specify the dataframe type of output, defaults to "pandas" to prevent any breaking changes for exisiting users. snowflake.py version uses pyarrow form instead of native fetching to improve performance

For loading data to snowflake:

  • insert_dataframe_to_table in snowflake.py and redshift.py: allow dataframe argument to take in polars.DataFrame and polars.LazyFrame
  • find_column_type in utility.py uses singledispatch to extend function to have polars functionality.

Copy link
Contributor

@jdawang jdawang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall good work with the logic.

I have a few requested changes regarding undeleted comments.

Then a few questions/comments on potentially using more dispatching and removing support for LazyFrame that would love to hear your thoughts on.

locopy/utility.py Outdated Show resolved Hide resolved
locopy/snowflake.py Outdated Show resolved Hide resolved
pyproject.toml Outdated
@@ -6,7 +6,7 @@ authors = [
{ name="Faisal Dosani", email="[email protected]" },
]
license = {text = "Apache Software License"}
dependencies = ["boto3<=1.35.9,>=1.9.92", "PyYAML<=6.0.1,>=5.1", "pandas<=2.2.2,>=0.25.2", "numpy<=2.0.2,>=1.22.0"]
dependencies = ["boto3<=1.35.9,>=1.9.92", "PyYAML<=6.0.1,>=5.1", "pandas<=2.2.2,>=0.25.2", "numpy<=2.0.2,>=1.22.0", "polars>=1.5.0"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 1.5 actually the min version that works? I think it's worth looking into this as I think people are still using versions before 1.0.

Copy link
Contributor Author

@gladysteh99 gladysteh99 Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the minimum version needed is 1.0 for one of the methods used (collect_schema) which is necessary for supporting both lazyframe and dataframe. It really draws down to if we want to support both. I personally prefer setting it to at least 1.0.0 because of the significant upgrades from 0.20.0 to 1.0.0.

locopy/snowflake.py Outdated Show resolved Hide resolved

if columns:
dataframe = dataframe[columns]
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the method, would it be possible to use a single dispatch helper here for the first part of this function? E.g. a dispatched function for line 431 to 456 - creating the to_insert. Once we have the list of tuples, the rest is the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is possible, my only concern was readability for future developers, but happy to refactor it to dispatched function if there is any significant upside

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more readable to use the isinstance checks below to select columns from dataframe. I don't think anything between requires this to be done first. That or doing the dispatch. Lmk if I'm wrong here. Like moving the dataframe=dataframe[columns] into the pandas branch and the .select(columns) into the polars branch.

Copy link
Contributor Author

@gladysteh99 gladysteh99 Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually polars dataframe can read dataframe[columns] as of version 1.0.0, the .select(columns) was intended for LazyFrame. Since we are not supporting LazyFrame now, it can be removed completely

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me

try:
length = len(dataframe)
except TypeError:
length = dataframe.select(pl.len()).collect().item() # for polars lazyframe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here on LazyFrame. If we want to consider not supporting that.

Same comment here on considering dispatch helper method to get to_insert.

locopy/database.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jdawang jdawang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last comment that I missed last time, everything else looks good to me!

locopy/database.py Show resolved Hide resolved
@jdawang jdawang self-requested a review September 5, 2024 17:21
Copy link
Contributor

@jdawang jdawang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good work!

@gladysteh99 gladysteh99 merged commit 57568fa into develop Sep 5, 2024
7 checks passed
@gladysteh99 gladysteh99 deleted the polars branch September 5, 2024 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants