Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add methods to Dask backend #637

Open
MarcoGorelli opened this issue Jul 27, 2024 · 28 comments
Open

feat: add methods to Dask backend #637

MarcoGorelli opened this issue Jul 27, 2024 · 28 comments
Labels
enhancement New feature or request needs discussion

Comments

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jul 27, 2024

Since #272 we have really minimal support for Dask DataFrame

Expr methods should go in narwhals/_dask/expr.py. For tests, you should use existing tests, and just remove the

    if "dask" in str(constructor):
        request.applymarker(pytest.mark.xfail)

part. If you can remove that, and the test passes, it means you've done it correctly

Not too hard

Examples of Expr methods which we should add (see here for the full list):

  • Expr.__sub__
  • Expr.__mul__
  • Expr.shift
  • Expr.cum_sum
  • Expr.is_between

Note: we should not add anything which modifies the index. So, the following should not be added, even though they appear on the list in the link above:

  • Expr.sort
  • Expr.head
  • Expr.tail

Harder

  • DataFrame.group_by
  • DataFrame.filter
  • get things started with namespaces (e.g. Expr.dt, Expr.str, ...)

General guidelines:

  • please don't ask for the issue to be assigned to you
  • please don't ask for permission to work on this issue
  • if you're a first time contribute, please choose 1 method at a time to implement, and leave a comment noting which method you're working on (if you've contributed to Narwhals before, feel free to choose multiple)
  • have fun 🥳


An easy way to check if something still needs doing is to look for tests with

    if "dask" in str(constructor):
        request.applymarker(pytest.mark.xfail)

Then:

  1. try removing those two lines
  2. run the test, check that it fails
  3. implement this functionality for Dask
  4. check the test passes
@MarcoGorelli MarcoGorelli added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Jul 27, 2024
@anopsy
Copy link
Member

anopsy commented Jul 27, 2024

I'll take Expr.__sub__

@aidoskanapyanov
Copy link
Contributor

I'll take Expr.is_between

@anopsy
Copy link
Member

anopsy commented Jul 27, 2024

Taking Expr.__mul__

@DeaMariaLeon
Copy link
Member

Take Expr.sum

@FBruzzesi
Copy link
Member

@MarcoGorelli I have a couple of questions:

  • Could you expand a little bit more, here or somewhere else on the following:

    we should not add anything which modifies the index

  • Should we refer to this issue also for dataframe methods? i.e. not Expr only?

@MarcoGorelli
Copy link
Member Author

hey!

in pandas, some Series methods such as Series.sort_values change the index:

In [3]: s
Out[3]:
0    3
1    2
2    1
dtype: int64

In [4]: s.sort_values()
Out[4]:
2    1
1    2
0    3
dtype: int64

pandas / Dask would then auto-align on the index, which is what we want to avoid. in pandas we can just check the index values, as it's already eager, but in dask we don't have a way to do this (though I have opened an issue about this dask/dask-expr#1112)

yup, dataframe methods too 😎

@anopsy
Copy link
Member

anopsy commented Jul 29, 2024

I'm assuming Expr.min/Expr.max have also to be done, so I'll take those

@FBruzzesi
Copy link
Member

(Asking for a friend 👀) how much cheating are we allowed to? Specifically, pandas-like translate_dtype should apply one-to-one for dask.

@MarcoGorelli
Copy link
Member Author

😄 should be fine to reuse that one

@FBruzzesi
Copy link
Member

FBruzzesi commented Aug 11, 2024

Just to give a sense of how far we have gone and what's still missing.

DaskExpr methods (those marked with * at the end means that modify the index and require evaluation of #743 before implementing):

  • cast
  • quantile
  • over
  • diff
  • is_duplicated
  • is_in
  • is_unique
  • null_count
  • arg_true (*)
  • cat (* as .cat.get_categories() changes the size/index, and depends on ExprCatNamespace)
  • drop_null (*) - Currently implemented as NotImplementedError
  • filter (*)
  • gather_every (*) - Currently implemented as NotImplementedError
  • head (*) - Currently implemented as NotImplementedError
  • sample (*)
  • sort (*) - Currently implemented as NotImplementedError
  • tail (*) - Currently implemented as NotImplementedError
  • unique (*)

ExprDateTimeNamespace methods:

  • to_string
  • total_microseconds
  • total_milliseconds
  • total_minutes
  • total_nanoseconds
  • total_seconds

ExprCatNamespace is missing entirely.

@anopsy
Copy link
Member

anopsy commented Aug 12, 2024

I'm working on null_count and quantile.

@lucianosrp
Copy link
Member

will work on dt.to_string

@luke396
Copy link
Member

luke396 commented Aug 15, 2024

will work on total_microseconds

@raisadz
Copy link
Contributor

raisadz commented Aug 15, 2024

I will take diff

@benrutter
Copy link
Contributor

benrutter commented Aug 16, 2024

I'll pick up quantile! Edit: No I won't, sorry @anopsy!

Will look at is_duplicated instead 😅

Double edit: I think is_duplicated might be a candidate for "not_implemented", looking at these github issues, it was initially deemed out of scope (since it a tricky thing to check in parallel) and now looks like it could be tabled again soon.

GH Issues:

Theoretically it'd be possible to write an implementation of some kind, but that's probably outside of the scope of Narwhals.

I think the same goes for is_unique as well.

Triple edit: Have taken is_in instead, PR incoming!

@aidoskanapyanov
Copy link
Contributor

I'll take cast

@FBruzzesi
Copy link
Member

FBruzzesi commented Aug 21, 2024

As there is not much left: for anyone interested, we could use the DaskNamespace concat implementation 😇

@benrutter
Copy link
Contributor

I'll take concat if nobody else has yet!

@luke396 luke396 unpinned this issue Aug 21, 2024
@luke396 luke396 pinned this issue Aug 21, 2024
@anopsy anopsy mentioned this issue Aug 21, 2024
10 tasks
@benrutter
Copy link
Contributor

Woah, looks like a lot is done! Is any stuff left around this? Looks like the remaining expression implementations are dependent on #743. I know a lot have dataframe equivalent implementations though (i.e. filtering a single expression is risky, but there's already a filter method that applies to the whole frame).

Also, happy to volunteer myself to compile a list if it'd be handy! 😁

@FBruzzesi FBruzzesi removed help wanted Extra attention is needed good first issue Good for newcomers labels Sep 21, 2024
@FBruzzesi FBruzzesi unpinned this issue Sep 21, 2024
@benrutter
Copy link
Contributor

benrutter commented Oct 22, 2024

Ok, here's the list as good as I can figure for 'done' vs 'outstanding'. I've copied @FBruzzesi's one as well just so that they're all in one place! (stuck with the * for things waiting on #743 decision)

Seems like almost everything (aside from index-tricky stuff) is done! 🎉

DaskExpr

  • cast
  • quantile
  • over
  • diff
  • is_duplicated
  • is_in
  • is_unique
  • null_count
  • arg_true (*)
  • cat (* as .cat.get_categories() changes the size/index, and depends on ExprCatNamespace)
  • drop_null (*) - Currently implemented as NotImplementedError
  • filter ()
    gather_every (
    ) - Currently implemented as NotImplementedError
  • head (*) - Currently implemented as NotImplementedError
  • sample (*)
  • sort (*) - Currently implemented as NotImplementedError
  • tail (*) - Currently implemented as NotImplementedError
  • unique (*)

ExprDateTimeNamespace

  • to_string
  • total_microseconds
  • total_milliseconds
  • total_minutes
  • total_nanoseconds
  • total_seconds

ExprCatNamespace

  • get_categories*

Dataframe

  • columns
  • schema
  • collect_schema
  • collect
  • group_by
  • with_columns
  • with_row_index*
  • drop
  • drop_nulls
  • unique
  • sort
  • unpivot
  • join
  • join_asof
  • filter
  • select

DaskNamespace

  • nth
  • len
  • concat
  • concat_str
  • lit
  • mean
  • mean_horizontal
  • cat*

Edit: Just looked at this again, and realised that:

  • drop is actually implemented and being tested
  • categories are eager only looking at tests, which makes sense, so probably requiring more discussion to implement.

So I think everything that doesn't require more discussion/thought is implemented?

PeanutsLucyGIF

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request needs discussion
Projects
None yet
Development

No branches or pull requests

10 participants