reword, add groupby section

narwhals-dev · May 25, 2024 · 4aa875d
2 parents e15b5ef + 04054b6
Showing 35 changed files with 1,657 additions and 149 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -28,4 +28,8 @@ repos:
       entry: python -m utils.check_api_reference
       language: python
       additional_dependencies: [polars]
-
+    - id: imports-are-banned
+      name: import are banned (use `get_pandas` instead of `import pandas`)
+      entry: (?<!>>> )import (pandas|polars|modin|cudf)
+      language: pygrep
+      files: ^narwhals/
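
Note on the `imports-are-banned` hook above: the `entry` pattern uses a negative lookbehind, `(?<!>>> )`, so that doctest lines such as `>>> import pandas` stay allowed while real imports are flagged. The sketch below reproduces that check with Python's `re` module. It assumes `pygrep` searches each line independently, and `is_banned` is a hypothetical helper written for illustration, not part of Narwhals or pre-commit.

```python
import re

# Same pattern as the hook's `entry`; the lookbehind skips doctest prompts.
BANNED_IMPORT = re.compile(r"(?<!>>> )import (pandas|polars|modin|cudf)")


def is_banned(line: str) -> bool:
    """Return True if the line contains a banned top-level or nested import."""
    return BANNED_IMPORT.search(line) is not None


if __name__ == "__main__":
    for line in (
        "import pandas as pd",
        "    import polars as pl",
        ">>> import pandas as pd",
        "from narwhals.dependencies import get_pandas",
    ):
        print(f"{line!r}: banned={is_banned(line)}")
```

The last sample line passes because the text after `import ` is `get_pandas`, which does not start with any of the four banned module names.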
diff --git a/README.md b/README.md
@@ -22,14 +22,15 @@ Seamlessly support all, without depending on any!
 - ✅ Use **Expressions**
 - ✅ 100% branch coverage, tested against pandas and Polars nightly builds!
 - ✅ Preserve your Index (if present) without it getting in the way!
+- ✅ **Zero 3rd party imports**, Narwhals only uses what you already have!
 
 ## Used by
 
 Join the party!
 
-- [timebasedcv](https://github.com/FBruzzesi/timebasedcv)
-- [scikit-lego](https://github.com/koaning/scikit-lego) (work-in-progress, in `narwhals-development` branch)
+- [scikit-lego](https://github.com/koaning/scikit-lego)
 - [scikit-playtime](https://github.com/koaning/scikit-playtime)
+- [timebasedcv](https://github.com/FBruzzesi/timebasedcv)
 
 ## Installation
 

diff --git a/docs/api-reference/expressions.md b/docs/api-reference/expressions.md
@@ -14,11 +14,16 @@
         - fill_null
         - filter
         - is_between
+        - is_duplicated
+        - is_first_distinct
         - is_in
+        - is_last_distinct
         - is_null
+        - is_unique
         - max
         - mean
         - min
+        - null_count
         - n_unique
         - over
         - unique

diff --git a/docs/api-reference/selectors.md b/docs/api-reference/selectors.md
@@ -0,0 +1,11 @@
+# `narwhals.selectors`
+
+::: narwhals.selectors
+    handler: python
+    options:
+      members:
+        - by_dtype
+        - numeric
+      show_root_heading: false
+      show_source: false
+      show_bases: false
diff --git a/docs/api-reference/series.md b/docs/api-reference/series.md
@@ -15,12 +15,19 @@
         - fill_null
         - filter
         - is_between
+        - is_duplicated
+        - is_empty
+        - is_first_distinct
         - is_in
+        - is_last_distinct
         - is_null
+        - is_sorted
+        - is_unique
         - max
         - mean
         - min
         - name
+        - null_count
         - n_unique
         - sample
         - shape
@@ -32,5 +39,6 @@
         - to_numpy
         - to_pandas
         - unique
+        - value_counts
       show_source: false
       show_bases: false
diff --git a/docs/how_it_works.md b/docs/how_it_works.md
@@ -43,7 +43,8 @@ def sum_horizontal_a_b(df):
 ```
 
 Note that although an expression may have multiple columns as input,
-those columns must all have been derived from the same dataframe.
+those columns must all have been derived from the same dataframe. This last sentence is
+quite important; you might want to re-read it to make sure it has sunk in.
 
 By itself, an expression doesn't produce a value. It only produces a value once you give it to a
 DataFrame context. What happens to the value(s) it produces depends on which context you hand
@@ -70,7 +71,7 @@ df.select(nw.col('a')+1)
 ```
 
 `nw.col('a')` produces a `narwhals.expression.Expr` object, which has a private `_call` method.
-We can call `nw.col('a')._call(pl)`, then the result is actually `pl.col('a')`.
+Inside `DataFrame.select`, we call `nw.col('a')._call(pl)`, which produces `pl.col('a')`.
 
 We then let Polars do its thing. Which is nice, but also not particularly interesting.
 How about translating expressions to pandas? Well, it's
@@ -94,13 +95,13 @@ The result from the last line above is the same as we'd get from `pn.col('a')`,
 a `narwhals._pandas_like.expression.PandasExpr` object, which we'll call `PandasExpr` for
 short.
 
-`PandasExpr` also have a `_call` method - but this one expects a `PandasDataFrame` as input.
+`PandasExpr` also has a `_call` method - but this one expects a `PandasDataFrame` as input.
 Recall from above that an expression is a function from a dataframe to a sequence of series.
 The `_call` method gives us that function! Let's see it in action.
 
-Note: the following examples uses `PandasDataFrame` and `PandasSeries`. These are wrappers
-around pandas DataFrame and pandas Series, which are Narwhals-compliant. To get the native
-pandas objects out from inside them, we access `PandasDataFrame._dataframe` and `PandasSeries._series`.
+Note: the following examples use `PandasDataFrame` and `PandasSeries`. These are backed
+by actual `pandas.DataFrame`s and `pandas.Series` respectively and are Narwhals-compliant. We can access the 
+underlying pandas objects via `PandasDataFrame._dataframe` and `PandasSeries._series`.
 
 ```python
 import narwhals as nw
@@ -140,3 +141,40 @@ than running pandas directly.
 Further attempts at demystifying Narwhals, refactoring code so it's clearer, and explaining
 this section better are 110% welcome.
 
+## Group-by
+
+Group-by is probably one of Polars' most significant innovations (on the syntax side) with respect
+to pandas. We can write something like
+```python
+df: pl.DataFrame
+df.group_by('a').agg((pl.col('c') > pl.col('b').mean()).max())
+```
+To do this in pandas, we need to either use `GroupBy.apply` (sloooow), or do some crazy manual
+optimisations to get it to work.
+
+In Narwhals, here's what we do:
+
+- if somebody uses a simple group-by aggregation (e.g. `df.group_by('a').agg(nw.col('b').mean())`),
+  then on the pandas side we translate it to
+  ```python
+
+  df: pd.DataFrame
+  df.groupby('a').agg({'b': ['mean']})
+  ```
+- if somebody passes a complex group-by aggregation, then we use `apply` and raise a `UserWarning`, warning
+  users of the performance penalty and advising them to refactor their code so that the aggregation they perform
+  ends up being a simple one.
+
+In order to tell whether an aggregation is simple, Narwhals uses the private `_depth` attribute of `PandasExpr`:
+
+```python
+>>> pn.col('a').mean()
+PandasExpr(depth=1, function_name=col->mean, root_names=['a'], output_names=['a']
+>>> (pn.col('a')+1).mean()
+PandasExpr(depth=2, function_name=col->__add__->mean, root_names=['a'], output_names=['a']
+>>> pn.mean('a')
+PandasExpr(depth=1, function_name=col->mean, root_names=['a'], output_names=['a']
+```
+
+For simple aggregations, Narwhals can just look at `_depth` and `function_name` and figure out
+which (efficient) elementary operation this corresponds to in pandas.
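
The simple-versus-complex decision described in the new Group-by section can be sketched roughly as follows. This is an illustration only: `ExprInfo`, `is_simple`, and `pandas_agg_spec` are hypothetical names, and the rule "depth below 2 means simple" is an assumption inferred from the `depth=1` and `depth=2` examples in the docs, not the actual Narwhals implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExprInfo:
    """Hypothetical stand-in for the metadata shown in the PandasExpr repr."""

    depth: int
    function_name: str
    root_names: list[str]


def is_simple(expr: ExprInfo) -> bool:
    # Assumption: one aggregation applied directly to a column has depth 1.
    return expr.depth < 2


def pandas_agg_spec(expr: ExprInfo) -> dict[str, list[str]] | None:
    """Build a dict usable with `groupby(...).agg(...)`, or None if complex."""
    if not is_simple(expr):
        return None  # the caller would fall back to `apply` and warn
    agg_name = expr.function_name.split("->")[-1]
    return {name: [agg_name] for name in expr.root_names}


if __name__ == "__main__":
    print(pandas_agg_spec(ExprInfo(1, "col->mean", ["b"])))
    print(pandas_agg_spec(ExprInfo(2, "col->__add__->mean", ["a"])))
```

For the simple case, the returned dict has the same shape as the `{'b': ['mean']}` argument in the pandas translation shown in the docs.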
diff --git a/docs/index.md b/docs/index.md
@@ -12,6 +12,7 @@ Seamlessly support both, without depending on either!
 - ✅ Use **Expressions**
 - ✅ Tested against pandas and Polars nightly builds!
 - ✅ Preserve your Index (if present) without it getting in the way!
+- ✅ **Zero 3rd party imports**, Narwhals only uses what you already have!
 
 ## Who's this for?
 

diff --git a/docs/installation.md b/docs/installation.md
@@ -11,6 +11,6 @@ Then, if you start the Python REPL and see the following:
 ```python
 >>> import narwhals
 >>> narwhals.__version__
-'0.8.15'
+'0.8.18'
 ```
 then installation worked correctly!
diff --git a/docs/related.md b/docs/related.md
@@ -23,11 +23,11 @@ The projects are not in competition and have different goals.
 [Presents itself as a dataframe standard](https://voltrondata.com/resources/open-source-standards), and
 dispatches to 20+ backends. Some differences with Narwhals are:
 
-- Narwhals is ~1000 times lighter
+- Narwhals is ~1000 times lighter and is aimed at library maintainers as opposed to end users
 - Narwhals only supports 4 backends, Ibis more than 20
-- Narwhals is limited to fundamental dataframe operations, Ibis includes more advanced and niche ones.
+- Narwhals is focused on fundamental dataframe operations, Ibis on SQL backends
 
-Again, the projects are not in competition and have different goals.
+The projects are not in competition and have different goals.
 
 ## Array API
 

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -28,6 +28,7 @@ nav:
     - api-reference/expressions_str.md
     - api-reference/dtypes.md
     - api-reference/dependencies.md
+    - api-reference/selectors.md
 theme:
   name: material
   font: false

diff --git a/narwhals/__init__.py b/narwhals/__init__.py
@@ -31,7 +31,7 @@
 from narwhals.utils import maybe_align_index
 from narwhals.utils import maybe_set_index
 
-__version__ = "0.8.15"
+__version__ = "0.8.18"
 
 __all__ = [
     "concat",

diff --git a/narwhals/_pandas_like/expr.py b/narwhals/_pandas_like/expr.py
@@ -184,8 +184,12 @@ def fill_null(self, value: Any) -> Self:
     def is_in(self, other: Any) -> Self:
         return register_expression_call(self, "is_in", other)
 
-    def filter(self, other: Any) -> Self:
-        return register_expression_call(self, "filter", other)
+    def filter(self, *predicates: Any) -> Self:
+        from narwhals._pandas_like.namespace import PandasNamespace
+
+        plx = PandasNamespace(self._implementation)
+        expr = plx.all_horizontal(*predicates)
+        return register_expression_call(self, "filter", expr)
 
     def drop_nulls(self) -> Self:
         return register_expression_call(self, "drop_nulls")
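
The new `filter(*predicates)` signature in the hunk above reduces several predicates to one mask via `all_horizontal` before filtering. The idea can be sketched with plain lists standing in for series; `all_horizontal` and `filter_values` below are hypothetical helpers that only mirror the names in the diff and are not the Narwhals code.

```python
from typing import Any


def all_horizontal(*masks: list[bool]) -> list[bool]:
    """Element-wise AND across equally long boolean masks."""
    return [all(row) for row in zip(*masks)]


def filter_values(values: list[Any], *predicates: list[bool]) -> list[Any]:
    """Keep only the values for which every predicate mask is True."""
    mask = all_horizontal(*predicates)
    return [value for value, keep in zip(values, mask) if keep]


if __name__ == "__main__":
    values = [1, 2, 3, 4]
    is_not_three = [True, True, False, True]
    is_not_two = [True, False, True, True]
    print(filter_values(values, is_not_three, is_not_two))
```

Combining the masks first means the underlying `filter` still receives a single predicate, which is why the final `register_expression_call` in the diff is unchanged apart from its argument.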
@@ -253,6 +257,21 @@ def func(df: PandasDataFrame) -> list[PandasSeries]:
             implementation=self._implementation,
         )
 
+    def is_duplicated(self) -> Self:
+        return register_expression_call(self, "is_duplicated")
+
+    def is_unique(self) -> Self:
+        return register_expression_call(self, "is_unique")
+
+    def null_count(self) -> Self:
+        return register_expression_call(self, "null_count")
+
+    def is_first_distinct(self) -> Self:
+        return register_expression_call(self, "is_first_distinct")
+
+    def is_last_distinct(self) -> Self:
+        return register_expression_call(self, "is_last_distinct")
+
     @property
     def str(self) -> PandasExprStringNamespace:
         return PandasExprStringNamespace(self)
@@ -313,6 +332,12 @@ def minute(self) -> PandasExpr:
     def second(self) -> PandasExpr:
         return register_namespace_expression_call(self._expr, "dt", "second")
 
+    def millisecond(self) -> PandasExpr:
+        return register_namespace_expression_call(self._expr, "dt", "millisecond")
+
+    def microsecond(self) -> PandasExpr:
+        return register_namespace_expression_call(self._expr, "dt", "microsecond")
+
     def ordinal_day(self) -> PandasExpr:
         return register_namespace_expression_call(self._expr, "dt", "ordinal_day")
 

diff --git a/narwhals/_pandas_like/group_by.py b/narwhals/_pandas_like/group_by.py
@@ -12,6 +12,8 @@
 from narwhals._pandas_like.utils import is_simple_aggregation
 from narwhals._pandas_like.utils import item
 from narwhals._pandas_like.utils import parse_into_exprs
+from narwhals._pandas_like.utils import series_from_iterable
+from narwhals.dependencies import get_pandas
 from narwhals.utils import parse_version
 from narwhals.utils import remove_prefix
 
@@ -88,9 +90,7 @@ def agg_pandas(  # noqa: PLR0913
     - https://github.com/rapidsai/cudf/issues/15118
     - https://github.com/rapidsai/cudf/issues/15084
     """
-    import pandas as pd
-
-    from narwhals._pandas_like.namespace import PandasNamespace
+    pd = get_pandas()
 
     all_simple_aggs = True
     for expr in exprs:
@@ -140,8 +140,6 @@ def agg_pandas(  # noqa: PLR0913
         stacklevel=2,
     )
 
-    plx = PandasNamespace(implementation=implementation)
-
     def func(df: Any) -> Any:
         out_group = []
         out_names = []
@@ -150,14 +148,16 @@ def func(df: Any) -> Any:
             for result_keys in results_keys:
                 out_group.append(item(result_keys._series))
                 out_names.append(result_keys.name)
-        return plx.make_native_series(name="", data=out_group, index=out_names)
+        return series_from_iterable(
+            out_group, index=out_names, name="", implementation=implementation
+        )
 
     if implementation == "pandas":
-        import pandas as pd
+        pd = get_pandas()
 
         if parse_version(pd.__version__) < parse_version("2.2.0"):  # pragma: no cover
             result_complex = grouped.apply(func)
-        else:  # pragma: no cover
+        else:
             result_complex = grouped.apply(func, include_groups=False)
     else:  # pragma: no cover
         result_complex = grouped.apply(func)
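
The hunks above replace `import pandas as pd` with `pd = get_pandas()`, in line with the "zero 3rd party imports" claim added to the README. The body of `get_pandas` is not part of this diff, so the following is only a sketch of one way such a helper could work, assuming it returns a module that the user has already imported and never triggers an import itself. `get_module` is a hypothetical name.

```python
import json  # stands in for a library the user has already imported
import sys
from types import ModuleType
from typing import Optional


def get_module(name: str) -> Optional[ModuleType]:
    """Return the module if it was already imported, else None (no import)."""
    return sys.modules.get(name)


if __name__ == "__main__":
    print(get_module("json") is json)
    print(get_module("surely_not_an_installed_module_xyz"))
```

Under this assumption, a library that receives a pandas object can rely on pandas already being present in `sys.modules`, so the lookup succeeds without adding a hard dependency.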

diff --git a/narwhals/_pandas_like/namespace.py b/narwhals/_pandas_like/namespace.py
@@ -9,6 +9,7 @@
 from narwhals import dtypes
 from narwhals._pandas_like.dataframe import PandasDataFrame
 from narwhals._pandas_like.expr import PandasExpr
+from narwhals._pandas_like.selectors import PandasSelector
 from narwhals._pandas_like.series import PandasSeries
 from narwhals._pandas_like.utils import horizontal_concat
 from narwhals._pandas_like.utils import parse_into_exprs
@@ -36,20 +37,9 @@ class PandasNamespace:
     String = dtypes.String
     Datetime = dtypes.Datetime
 
-    def make_native_series(self, name: str, data: list[Any], index: Any) -> Any:
-        if self._implementation == "pandas":
-            import pandas as pd
-
-            return pd.Series(name=name, data=data, index=index)
-        if self._implementation == "modin":  # pragma: no cover
-            import modin.pandas as mpd
-
-            return mpd.Series(name=name, data=data, index=index)
-        if self._implementation == "cudf":  # pragma: no cover
-            import cudf
-
-            return cudf.Series(name=name, data=data, index=index)
-        raise NotImplementedError  # pragma: no cover
+    @property
+    def selectors(self) -> PandasSelector:
+        return PandasSelector(self._implementation)
 
     # --- not in spec ---
     def __init__(self, implementation: str) -> None:

diff --git a/narwhals/_pandas_like/selectors.py b/narwhals/_pandas_like/selectors.py
@@ -0,0 +1,28 @@
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from narwhals._pandas_like.expr import PandasExpr
+
+if TYPE_CHECKING:
+    from narwhals._pandas_like.dataframe import PandasDataFrame
+    from narwhals._pandas_like.series import PandasSeries
+    from narwhals.dtypes import DType
+
+
+class PandasSelector:
+    def __init__(self, implementation: str) -> None:
+        self._implementation = implementation
+
+    def by_dtype(self, dtypes: list[DType]) -> PandasExpr:
+        def func(df: PandasDataFrame) -> list[PandasSeries]:
+            return [df[col] for col in df.columns if df.schema[col] in dtypes]
+
+        return PandasExpr(
+            func,
+            depth=0,
+            function_name="type_selector",
+            root_names=None,
+            output_names=None,
+            implementation=self._implementation,
+        )
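
The `by_dtype` selector above keeps every column whose schema dtype is in the requested list, preserving column order. A dependency-free sketch of the same selection logic, using a plain dict as the schema and strings as dtypes (both simplifications chosen for illustration, with the hypothetical helper name `select_by_dtype`):

```python
def select_by_dtype(schema: dict[str, str], dtypes: list[str]) -> list[str]:
    """Return the column names whose dtype is one of `dtypes`, in schema order."""
    return [col for col in schema if schema[col] in dtypes]


if __name__ == "__main__":
    schema = {"a": "Int64", "b": "String", "c": "Float64"}
    print(select_by_dtype(schema, ["Int64", "Float64"]))
```

In the real selector the result is wrapped in a `PandasExpr` with `root_names=None` and `output_names=None`, since the selected columns are only known once a dataframe is supplied.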