Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Dask Support Implementation #484

Closed
wants to merge 22 commits into from
Closed

feat: Dask Support Implementation #484

wants to merge 22 commits into from

Conversation

benrutter
Copy link
Contributor

@benrutter benrutter commented Jul 10, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes (I don't think so, is there a changelog to add to?)

If you have comments or can explain your changes, please do so below.

Hello,

This is still a draft PR (although everything is in a working state, there's a very unperformance step in it I've put in that I do need to take out) but I'd love to get some feedback around making integration as sleek as possible!

I've implemented a bunch of logic with something along the lines of:

if self._integration == "dask":
    return do_some_bespoke_thing()
return the_happy_path()

Which I think for the most part is fine because that's generally how other implementations like modin etc are done? Not sure if there's any guidance around another way of doing this? (or maybe I should just put in some friendly comments where that happens to explain why dask is implemented differently). I actually found it pretty clear to follow along, but the alternative I could think of woud be subclassing or something.

The other thing I've done (mentioned over on the Narwhals discord) is implement .collect() to .compute() a dask dataframe, which seems like good behaviour. Dask dataframes are a little like polars lazyframes, and computing them forces all data into a single noded dataframe.

There are also a few things where something just isn't possible with dask, for instance cross-joins. Which dask doesn't support natively. Theoretically you could hack something in by doing something like adding two columns that are always true, joining on that, then dropping the column, but I thought it was a better move to maintain the implementations of dask as mark as "not implemented" (particularly as my understanding is dask intentionally doesn't implement cross joins for performance reasons).

I was thinking it might be nice to put in a decorator along the lines of:

@not_implemented_in(["dask", "modin"])
def self.some_method(x: Any) -> Any:
   ...

To nicely raise a NotImplementedError if that method gets called with one of those backends? Any thoughts?

It's my first time working with the Narwhal's codebase - I've had a blast running through it! I'm probably not familiar with all the conventions etc, so please do let me know if I've missed something major.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome effort, thanks @benrutter !

a few comments:

  • I think a subclass or a decorator should be fine. I think there's not that many if/then statements you had to add? maybe a decorator is easier then. for shape we can unify the implementations under len(self._native_series)
  • merge conflicts: there are some merge conflicts, but it's mostly down to some things having been renamed:
    • PandasDataFrame -> PandasLikeDataFrame
    • ._series -> ._native_series
    • implementation == "dask" => implementation is Implementation.DASK

The major refactors are over (for pandas-like), I promise not to do this again πŸ˜„ 😳 Fancy fixing up the merge conflicts?

narwhals/_pandas_like/series.py Outdated Show resolved Hide resolved
@benrutter
Copy link
Contributor Author

@MarcoGorelli absolutely! I'll do some "tidy up" changes and then work through the merge conflicts.

@benrutter
Copy link
Contributor Author

Ok, I thought I'd sorted out the merge conflicts, and at least test_common is working for both. But looks like I've introduced some kind of syntax error somewhere else or something. There's a lot of failing tests, oh dear. . .

I'll have a bit more of a look some time soon.

@MarcoGorelli
Copy link
Member

awesome stuff! love the decorator

a lot of narwhals/_pandas_like/utils.py was moved to narwhals/_expression_parsing.py, as it's reused between pandas-like APIs and pyarrow

@benrutter
Copy link
Contributor Author

Ah thanks! That probably explains it then- I might just need to implement a few small bits over there then.

@DeaMariaLeon DeaMariaLeon changed the title Dask Support Implementation (still draft, don't merge yet!) feat: Dask Support Implementation (still draft, don't merge yet!) Jul 14, 2024
@github-actions github-actions bot added the enhancement New feature or request label Jul 14, 2024
@benrutter benrutter changed the title feat: Dask Support Implementation (still draft, don't merge yet!) feat: Dask Support Implementation Jul 15, 2024
@benrutter
Copy link
Contributor Author

Ok, all tests passing bar one which I'll mention in a sec (and sorted out the latest conflicts).

Looks like one of the tests which checks the API docs for stable.v1 is passing, but I'm not 100% sure of the intention behind it so don't want to mess too much without checking.

Is the idea that I should go in and update the documentation so that v1 matches elsewhere?

@MarcoGorelli
Copy link
Member

Is the idea that I should go in and update the documentation so that v1 matches elsewhere?

yup, if you update the main namespace docs, you should also update the v1 docs

@benrutter
Copy link
Contributor Author

Awesome - thanks @MarcoGorelli for all the guidance! Looks like all the tests are passing now (I spotted they're failing in 3.8 python but that just looks like the old type hinting API not being supported rather than anything new here)

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @benrutter !

left some comments, there's some pre-commit checks to fixup too, but awesome work here

narwhals/_pandas_like/dataframe.py Outdated Show resolved Hide resolved
narwhals/_pandas_like/group_by.py Outdated Show resolved Hide resolved
Comment on lines +62 to +66
dataframe_is_empty = (
self._df._native_dataframe.empty
if self._df._implementation != Implementation.DASK
else len(self._df._native_dataframe) == 0
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just use self._df.is_empty()?

narwhals/_pandas_like/series.py Outdated Show resolved Hide resolved
Comment on lines 56 to 63
if other._native_series.index is not index:
if (
other._native_series.index is not index
and other._implementation != Implementation.DASK
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if other._native_series.index is not index and other._implementation is Implementation.DASK? I think we need to raise an error message in that case

narwhals/_pandas_like/utils.py Outdated Show resolved Hide resolved
@@ -218,6 +298,9 @@ def set_axis(
kwargs["copy"] = False
else: # pragma: no cover
pass
if implementation == "dask":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -449,6 +532,8 @@ def to_datetime(implementation: Implementation) -> Any:
return get_modin().to_datetime
if implementation is Implementation.CUDF:
return get_cudf().to_datetime
if implementation == "dask":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

narwhals/dependencies.py Outdated Show resolved Hide resolved
@@ -51,6 +57,17 @@ def maybe_get_modin_df(df_pandas: pd.DataFrame) -> Any:
return mpd.DataFrame(df_pandas.to_dict(orient="list"))


def maybe_get_dask_df(df_pandas: pd.DataFrame) -> Any:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this used?

@benrutter
Copy link
Contributor Author

thanks @benrutter !

left some comments, there's some pre-commit checks to fixup too, but awesome work here

Thanks! Yeah just spotted precommit and it turned up a bunch of issues, I'll work through them plus your comments

@benrutter
Copy link
Contributor Author

Ah quick update on this, I actually spotted that "get_dask()" (and get_modin()) obviously only imports dask if it's already imported. Which makes total sense, but also meant that dask wasn't being included in the new tests at all, since I hadn't added an initial import statement anywhere.

I've put one + an ImportError suppression in the conftest which has worked nicely (yaaay! πŸ™Œ) but that's also turned up a whole bunch of errors in the test.

I suspect its a mix of:

  • Stuff that's not supported in dask so the test just needs to be skippped
  • Stuff where I've not implemented something correctly / at all yet and just hadn't realised till now
  • Stuff where I've borked the implementation in the merge conflict

I'll work through, and the failing tests actually give me better confidence that the final integration should be reliable and nicely tested. Might take a little time though, let me know if it'd be easier for me to close off the PR while I work on it and reopen another later or not.

@MarcoGorelli
Copy link
Member

thanks @benrutter for the update! up to you if you want to keep this PR or open a new one, whichever's easiest for you

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jul 19, 2024

i'll work on updating this, would quite like to see it, but I think it should be back lazyframe, not dataframe?

EDIT ok, turns out this is going to be really complicated. putting it off for now, gonna see if I can catch up with a dask maintainer first

@benrutter
Copy link
Contributor Author

Oh yikes! Thanks for taking a look, I think it might look worse than it is because some of the testing stuff like .to_dict isn't supported. Maybe I'm being naively optimistic πŸ˜…

I haven't taken a look at this for a little but but I'm hoping to spend some time hacking on it again soon.

@benrutter
Copy link
Contributor Author

Closing off for now so that dask functionality can be implemented in line with #566 later.

I'll keep playing around with the fork though.

@benrutter benrutter closed this Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enh]: Add support for dask
2 participants