Commit

reword, add groupby section
MarcoGorelli committed May 25, 2024
2 parents e15b5ef + 04054b6 commit 4aa875d
Showing 35 changed files with 1,657 additions and 149 deletions.
6 changes: 5 additions & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -28,4 +28,8 @@ repos:
entry: python -m utils.check_api_reference
language: python
additional_dependencies: [polars]

- id: imports-are-banned
name: imports are banned (use `get_pandas` instead of `import pandas`)
entry: (?<!>>> )import (pandas|polars|modin|cudf)
language: pygrep
files: ^narwhals/
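
To see what the `imports-are-banned` pattern catches, the sketch below applies the same regular expression line by line with Python's `re` module. The helper name `find_banned_imports` and the sample lines are hypothetical, and the line-by-line search is an assumption about how a `pygrep` hook evaluates its `entry`.

```python
from __future__ import annotations

import re

# Pattern copied from the hook's `entry`. The negative lookbehind skips
# imports that directly follow a doctest prompt (">>> ").
BANNED_IMPORT = re.compile(r"(?<!>>> )import (pandas|polars|modin|cudf)")


def find_banned_imports(source: str) -> list[int]:
    """Return the 1-based numbers of lines that match the banned pattern."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if BANNED_IMPORT.search(line)
    ]


example = "\n".join(
    [
        "import pandas as pd",      # flagged
        "    import polars as pl",  # flagged: indentation does not matter
        ">>> import pandas as pd",  # allowed: doctest example
        "pd = get_pandas()",        # allowed: no import statement
    ]
)
print(find_banned_imports(example))  # [1, 2]
```

Imports inside docstring examples stay allowed, which fits the hook being restricted to files under `narwhals/`.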
5 changes: 3 additions & 2 deletions README.md
@@ -22,14 +22,15 @@ Seamlessly support all, without depending on any!
- ✅ Use **Expressions**
- ✅ 100% branch coverage, tested against pandas and Polars nightly builds!
- ✅ Preserve your Index (if present) without it getting in the way!
- ✅ **Zero 3rd party imports**, Narwhals only uses what you already have!

## Used by

Join the party!

- [timebasedcv](https://github.com/FBruzzesi/timebasedcv)
- [scikit-lego](https://github.com/koaning/scikit-lego) (work-in-progress, in `narwhals-development` branch)
- [scikit-lego](https://github.com/koaning/scikit-lego)
- [scikit-playtime](https://github.com/koaning/scikit-playtime)
- [timebasedcv](https://github.com/FBruzzesi/timebasedcv)

## Installation

5 changes: 5 additions & 0 deletions docs/api-reference/expressions.md
@@ -14,11 +14,16 @@
- fill_null
- filter
- is_between
- is_duplicated
- is_first_distinct
- is_in
- is_last_distinct
- is_null
- is_unique
- max
- mean
- min
- null_count
- n_unique
- over
- unique
11 changes: 11 additions & 0 deletions docs/api-reference/selectors.md
@@ -0,0 +1,11 @@
# `narwhals.selectors`

::: narwhals.selectors
    handler: python
    options:
      members:
        - by_dtype
        - numeric
      show_root_heading: false
      show_source: false
      show_bases: false
8 changes: 8 additions & 0 deletions docs/api-reference/series.md
@@ -15,12 +15,19 @@
- fill_null
- filter
- is_between
- is_duplicated
- is_empty
- is_first_distinct
- is_in
- is_last_distinct
- is_null
- is_sorted
- is_unique
- max
- mean
- min
- name
- null_count
- n_unique
- sample
- shape
@@ -32,5 +39,6 @@
- to_numpy
- to_pandas
- unique
- value_counts
show_source: false
show_bases: false
50 changes: 44 additions & 6 deletions docs/how_it_works.md
@@ -43,7 +43,8 @@ def sum_horizontal_a_b(df):
```

Note that although an expression may have multiple columns as input,
those columns must all have been derived from the same dataframe.
those columns must all have been derived from the same dataframe. This last sentence is
quite important; you might want to re-read it to make sure it has sunk in.

By itself, an expression doesn't produce a value. It only produces a value once you give it to a
DataFrame context. What happens to the value(s) it produces depends on which context you hand
@@ -70,7 +71,7 @@ df.select(nw.col('a')+1)
```

`nw.col('a')` produces a `narwhals.expression.Expr` object, which has a private `_call` method.
We can call `nw.col('a')._call(pl)`, then the result is actually `pl.col('a')`.
Inside `DataFrame.select`, we call `nw.col('a')._call(pl)`, which produces `pl.col('a')`.

We then let Polars do its thing. Which is nice, but also not particularly interesting.
How about translating expressions to pandas? Well, it's
@@ -94,13 +95,13 @@ The result from the last line above is the same as we'd get from `pn.col('a')`,
a `narwhals._pandas_like.expression.PandasExpr` object, which we'll call `PandasExpr` for
short.

`PandasExpr` also have a `_call` method - but this one expects a `PandasDataFrame` as input.
`PandasExpr` also has a `_call` method - but this one expects a `PandasDataFrame` as input.
Recall from above that an expression is a function from a dataframe to a sequence of series.
The `_call` method gives us that function! Let's see it in action.

Note: the following examples uses `PandasDataFrame` and `PandasSeries`. These are wrappers
around pandas DataFrame and pandas Series, which are Narwhals-compliant. To get the native
pandas objects out from inside them, we access `PandasDataFrame._dataframe` and `PandasSeries._series`.
Note: the following examples use `PandasDataFrame` and `PandasSeries`. These are backed
by actual `pandas.DataFrame`s and `pandas.Series` respectively and are Narwhals-compliant. We can access the
underlying pandas objects via `PandasDataFrame._dataframe` and `PandasSeries._series`.

```python
import narwhals as nw
@@ -140,3 +141,40 @@ than running pandas directly.
Further attempts at demystifying Narwhals, refactoring code so it's clearer, and explaining
this section better are 110% welcome.

## Group-by

Group-by is probably one of Polars' most significant innovations (on the syntax side) with respect
to pandas. We can write something like
```python
df: pl.DataFrame
df.group_by('a').agg((pl.col('c') > pl.col('b').mean()).max())
```
To do this in pandas, we need to either use `GroupBy.apply` (sloooow), or do some crazy manual
optimisations to get it to work.
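
To make the semantics of that aggregation concrete, here is a plain-Python sketch of what `(pl.col('c') > pl.col('b').mean()).max()` computes for each group of `a`. The toy rows and the helper name `complex_agg` are hypothetical, and the sketch ignores null handling.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data, one dict per row.
rows = [
    {"a": "x", "b": 1, "c": 4},
    {"a": "x", "b": 5, "c": 2},
    {"a": "y", "b": 4, "c": 1},
    {"a": "y", "b": 6, "c": 5},
]


def complex_agg(rows):
    """Per group of 'a': is any 'c' greater than that group's mean of 'b'?"""
    groups = defaultdict(list)
    for row in rows:
        groups[row["a"]].append(row)

    result = {}
    for key, members in groups.items():
        b_mean = mean(r["b"] for r in members)  # pl.col('b').mean()
        # Compare every 'c' against the group mean, then reduce with max.
        result[key] = max(r["c"] > b_mean for r in members)
    return result


print(complex_agg(rows))  # {'x': True, 'y': False}
```

The comparison needs the group mean before it can look at individual rows, which is why a single elementary pandas group-by aggregation cannot express it directly.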

In Narwhals, here's what we do:

- if somebody uses a simple group-by aggregation (e.g. `df.group_by('a').agg(nw.col('b').mean())`),
then on the pandas side we translate it to
```python

df: pd.DataFrame
df.groupby('a').agg({'b': ['mean']})
```
- if somebody passes a complex group-by aggregation, then we use `apply` and raise a `UserWarning`, warning
users of the performance penalty and advising them to refactor their code so that the aggregation they perform
ends up being a simple one.
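
The two branches above can be sketched as a small dispatcher. This is an illustration only: `ToyExpr`, `choose_strategy`, the warning text, and the `depth < 2` rule are assumptions made for the example, not the code Narwhals ships.

```python
from __future__ import annotations

import warnings
from dataclasses import dataclass


@dataclass
class ToyExpr:
    """Hypothetical stand-in for an expression; only its depth matters here."""

    name: str
    depth: int


def choose_strategy(exprs: list[ToyExpr]) -> str:
    """Return 'fast' if every expression is a simple aggregation, else 'apply'."""
    # Assumption: "simple" means one aggregation applied directly to a column.
    if all(expr.depth < 2 for expr in exprs):
        return "fast"
    warnings.warn(
        "Complex group-by aggregation found; falling back to `apply`, "
        "which may be slow. Consider refactoring into simple aggregations.",
        UserWarning,
        stacklevel=2,
    )
    return "apply"


print(choose_strategy([ToyExpr("b_mean", depth=1)]))  # fast
```

A single complex expression is enough to send the whole aggregation down the slow path in this sketch.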

In order to tell whether an aggregation is simple, Narwhals uses the private `_depth` attribute of `PandasExpr`:

```python
>>> pn.col('a').mean()
PandasExpr(depth=1, function_name=col->mean, root_names=['a'], output_names=['a']
>>> (pn.col('a')+1).mean()
PandasExpr(depth=2, function_name=col->__add__->mean, root_names=['a'], output_names=['a']
>>> pn.mean('a')
PandasExpr(depth=1, function_name=col->mean, root_names=['a'], output_names=['a']
```

For simple aggregations, Narwhals can just look at `_depth` and `function_name` and figure out
which (efficient) elementary operation this corresponds to in pandas.
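
As a rough illustration of how such bookkeeping could work, the toy class below increments a depth counter and extends a function-name chain on every operation. `ToyExpr`, `col`, and `is_simple` are hypothetical names, and the `depth < 2` rule is an assumption chosen to match the outputs shown above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyExpr:
    depth: int
    function_name: str

    def _chain(self, name: str) -> ToyExpr:
        # Every operation adds one level and records its name.
        return ToyExpr(self.depth + 1, f"{self.function_name}->{name}")

    def __add__(self, other: object) -> ToyExpr:
        return self._chain("__add__")

    def mean(self) -> ToyExpr:
        return self._chain("mean")


def col(name: str) -> ToyExpr:
    # The column name is not tracked in this sketch, only the depth.
    return ToyExpr(depth=0, function_name="col")


def is_simple(expr: ToyExpr) -> bool:
    return expr.depth < 2


simple = col("a").mean()
nested = (col("a") + 1).mean()
print(simple.depth, simple.function_name)  # 1 col->mean
print(nested.depth, nested.function_name)  # 2 col->__add__->mean
```

With this rule `is_simple(simple)` is true and `is_simple(nested)` is false, mirroring the split between the fast path and the `apply` fallback.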
1 change: 1 addition & 0 deletions docs/index.md
@@ -12,6 +12,7 @@ Seamlessly support both, without depending on either!
- ✅ Use **Expressions**
- ✅ Tested against pandas and Polars nightly builds!
- ✅ Preserve your Index (if present) without it getting in the way!
- ✅ **Zero 3rd party imports**, Narwhals only uses what you already have!

## Who's this for?

2 changes: 1 addition & 1 deletion docs/installation.md
@@ -11,6 +11,6 @@ Then, if you start the Python REPL and see the following:
```python
>>> import narwhals
>>> narwhals.__version__
'0.8.15'
'0.8.18'
```
then installation worked correctly!
6 changes: 3 additions & 3 deletions docs/related.md
@@ -23,11 +23,11 @@ The projects are not in competition and have different goals.
[Presents itself as a dataframe standard](https://voltrondata.com/resources/open-source-standards), and
dispatches to 20+ backends. Some differences with Narwhals are:

- Narwhals is ~1000 times lighter
- Narwhals is ~1000 times lighter and is aimed at library maintainers as opposed to end users
- Narwhals only supports 4 backends, Ibis more than 20
- Narwhals is limited to fundamental dataframe operations, Ibis includes more advanced and niche ones.
- Narwhals is focused on fundamental dataframe operations, Ibis on SQL backends

Again, the projects are not in competition and have different goals.
The projects are not in competition and have different goals.

## Array API

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -28,6 +28,7 @@ nav:
- api-reference/expressions_str.md
- api-reference/dtypes.md
- api-reference/dependencies.md
- api-reference/selectors.md
theme:
name: material
font: false
2 changes: 1 addition & 1 deletion narwhals/__init__.py
@@ -31,7 +31,7 @@
from narwhals.utils import maybe_align_index
from narwhals.utils import maybe_set_index

__version__ = "0.8.15"
__version__ = "0.8.18"

__all__ = [
"concat",
29 changes: 27 additions & 2 deletions narwhals/_pandas_like/expr.py
@@ -184,8 +184,12 @@ def fill_null(self, value: Any) -> Self:
def is_in(self, other: Any) -> Self:
return register_expression_call(self, "is_in", other)

def filter(self, other: Any) -> Self:
return register_expression_call(self, "filter", other)
def filter(self, *predicates: Any) -> Self:
from narwhals._pandas_like.namespace import PandasNamespace

plx = PandasNamespace(self._implementation)
expr = plx.all_horizontal(*predicates)
return register_expression_call(self, "filter", expr)

def drop_nulls(self) -> Self:
return register_expression_call(self, "drop_nulls")
@@ -253,6 +257,21 @@ def func(df: PandasDataFrame) -> list[PandasSeries]:
implementation=self._implementation,
)

def is_duplicated(self) -> Self:
return register_expression_call(self, "is_duplicated")

def is_unique(self) -> Self:
return register_expression_call(self, "is_unique")

def null_count(self) -> Self:
return register_expression_call(self, "null_count")

def is_first_distinct(self) -> Self:
return register_expression_call(self, "is_first_distinct")

def is_last_distinct(self) -> Self:
return register_expression_call(self, "is_last_distinct")

@property
def str(self) -> PandasExprStringNamespace:
return PandasExprStringNamespace(self)
@@ -313,6 +332,12 @@ def minute(self) -> PandasExpr:
def second(self) -> PandasExpr:
return register_namespace_expression_call(self._expr, "dt", "second")

def millisecond(self) -> PandasExpr:
return register_namespace_expression_call(self._expr, "dt", "millisecond")

def microsecond(self) -> PandasExpr:
return register_namespace_expression_call(self._expr, "dt", "microsecond")

def ordinal_day(self) -> PandasExpr:
return register_namespace_expression_call(self._expr, "dt", "ordinal_day")
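
The reworked `filter` in this file accepts several predicates and combines them with `all_horizontal` before filtering. A minimal plain-Python sketch of that idea follows, using lists of booleans in place of series; the helper names and sample data are hypothetical and this is not the `PandasNamespace` implementation.

```python
from __future__ import annotations

from functools import reduce
from operator import and_


def all_horizontal(*masks: list[bool]) -> list[bool]:
    """Element-wise AND across boolean masks of equal length."""
    return [reduce(and_, values) for values in zip(*masks)]


def filter_values(values: list, *masks: list[bool]) -> list:
    """Keep the values for which every mask is True."""
    keep = all_horizontal(*masks)
    return [value for value, flag in zip(values, keep) if flag]


values = [1, 2, 3, 4, 5]
is_odd = [v % 2 == 1 for v in values]
is_big = [v > 2 for v in values]
print(filter_values(values, is_odd, is_big))  # [3, 5]
```

Passing a single mask behaves like the old one-argument form, so the variadic signature extends the previous behaviour rather than replacing it.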

16 changes: 8 additions & 8 deletions narwhals/_pandas_like/group_by.py
@@ -12,6 +12,8 @@
from narwhals._pandas_like.utils import is_simple_aggregation
from narwhals._pandas_like.utils import item
from narwhals._pandas_like.utils import parse_into_exprs
from narwhals._pandas_like.utils import series_from_iterable
from narwhals.dependencies import get_pandas
from narwhals.utils import parse_version
from narwhals.utils import remove_prefix

@@ -88,9 +90,7 @@ def agg_pandas( # noqa: PLR0913
- https://github.com/rapidsai/cudf/issues/15118
- https://github.com/rapidsai/cudf/issues/15084
"""
import pandas as pd

from narwhals._pandas_like.namespace import PandasNamespace
pd = get_pandas()

all_simple_aggs = True
for expr in exprs:
@@ -140,8 +140,6 @@ def agg_pandas( # noqa: PLR0913
stacklevel=2,
)

plx = PandasNamespace(implementation=implementation)

def func(df: Any) -> Any:
out_group = []
out_names = []
@@ -150,14 +148,16 @@ def func(df: Any) -> Any:
for result_keys in results_keys:
out_group.append(item(result_keys._series))
out_names.append(result_keys.name)
return plx.make_native_series(name="", data=out_group, index=out_names)
return series_from_iterable(
out_group, index=out_names, name="", implementation=implementation
)

if implementation == "pandas":
import pandas as pd
pd = get_pandas()

if parse_version(pd.__version__) < parse_version("2.2.0"): # pragma: no cover
result_complex = grouped.apply(func)
else: # pragma: no cover
else:
result_complex = grouped.apply(func, include_groups=False)
else: # pragma: no cover
result_complex = grouped.apply(func)
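
This diff replaces `import pandas as pd` with `pd = get_pandas()`, but the body of `get_pandas` is not shown here. One plausible way to write such a helper, consistent with the "zero 3rd party imports" goal, is to look the module up in `sys.modules` so that it is only used if the caller has already imported it. The sketch below assumes that approach; `get_module` is a hypothetical name, and `json` stands in for a dataframe library.

```python
import json  # stands in for a library the user has already imported
import sys
from types import ModuleType
from typing import Optional


def get_module(name: str) -> Optional[ModuleType]:
    """Return the module if it has already been imported, else None."""
    # Assumption: a lookup in sys.modules never triggers an import itself.
    return sys.modules.get(name)


print(get_module("json") is json)  # True
print(get_module("surely_not_imported_module_xyz"))  # None
```

Callers of a helper like this have to handle the `None` case before touching attributes such as `__version__`.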
18 changes: 4 additions & 14 deletions narwhals/_pandas_like/namespace.py
@@ -9,6 +9,7 @@
from narwhals import dtypes
from narwhals._pandas_like.dataframe import PandasDataFrame
from narwhals._pandas_like.expr import PandasExpr
from narwhals._pandas_like.selectors import PandasSelector
from narwhals._pandas_like.series import PandasSeries
from narwhals._pandas_like.utils import horizontal_concat
from narwhals._pandas_like.utils import parse_into_exprs
@@ -36,20 +37,9 @@ class PandasNamespace:
String = dtypes.String
Datetime = dtypes.Datetime

def make_native_series(self, name: str, data: list[Any], index: Any) -> Any:
if self._implementation == "pandas":
import pandas as pd

return pd.Series(name=name, data=data, index=index)
if self._implementation == "modin": # pragma: no cover
import modin.pandas as mpd

return mpd.Series(name=name, data=data, index=index)
if self._implementation == "cudf": # pragma: no cover
import cudf

return cudf.Series(name=name, data=data, index=index)
raise NotImplementedError # pragma: no cover
@property
def selectors(self) -> PandasSelector:
return PandasSelector(self._implementation)

# --- not in spec ---
def __init__(self, implementation: str) -> None:
28 changes: 28 additions & 0 deletions narwhals/_pandas_like/selectors.py
@@ -0,0 +1,28 @@
from __future__ import annotations

from typing import TYPE_CHECKING

from narwhals._pandas_like.expr import PandasExpr

if TYPE_CHECKING:
    from narwhals._pandas_like.dataframe import PandasDataFrame
    from narwhals._pandas_like.series import PandasSeries
    from narwhals.dtypes import DType


class PandasSelector:
    def __init__(self, implementation: str) -> None:
        self._implementation = implementation

    def by_dtype(self, dtypes: list[DType]) -> PandasExpr:
        def func(df: PandasDataFrame) -> list[PandasSeries]:
            return [df[col] for col in df.columns if df.schema[col] in dtypes]

        return PandasExpr(
            func,
            depth=0,
            function_name="type_selector",
            root_names=None,
            output_names=None,
            implementation=self._implementation,
        )
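
The `by_dtype` selector above builds a function that, given a dataframe, returns the columns whose dtype is in the requested list. The same idea is sketched below without pandas or Narwhals: the toy frame layout (column name mapped to a dtype label and values) and the string dtype labels are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical toy frame: column name -> (dtype label, values).
frame = {
    "a": ("Int64", [1, 2]),
    "b": ("String", ["x", "y"]),
    "c": ("Float64", [0.5, 1.5]),
}


def by_dtype(dtypes: list[str]) -> Callable[[dict], list[tuple[str, list]]]:
    """Build a selector that keeps the columns whose dtype is in `dtypes`."""

    def func(df: dict) -> list[tuple[str, list]]:
        # Evaluation is deferred until a frame is supplied, like an expression.
        return [(name, values) for name, (dtype, values) in df.items() if dtype in dtypes]

    return func


select_numeric = by_dtype(["Int64", "Float64"])
print([name for name, _ in select_numeric(frame)])  # ['a', 'c']
```

Because the selector is just a function of the frame, it can be created once and reused across frames with different columns.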