chore: Factor out _polars from top-level Narwhals modules (i.e.: the mother of all refactors!) (#561)
MarcoGorelli authored Jul 20, 2024
1 parent 2f60044 commit fb80986
Showing 24 changed files with 890 additions and 377 deletions.
9 changes: 6 additions & 3 deletions .github/workflows/downstream_tests.yml
@@ -9,7 +9,7 @@ jobs:
   altair:
     strategy:
       matrix:
-        python-version: ["3.11"]
+        python-version: ["3.12"]
         os: [ubuntu-latest]

     runs-on: ${{ matrix.os }}
@@ -27,7 +27,10 @@ jobs:
             ~\AppData\Local\pip\Cache
           key: ${{ runner.os }}-build-${{ matrix.python-version }}
       - name: clone-altair
-        run: git clone https://github.com/vega/altair.git
+        run: |
+          git clone https://github.com/vega/altair.git --depth=1
+          cd altair
+          git log
       - name: install-basics
         run: python -m pip install --upgrade tox virtualenv setuptools pip
       - name: install-altair-dev
@@ -70,7 +73,7 @@ jobs:
             ~\AppData\Local\pip\Cache
           key: ${{ runner.os }}-build-${{ matrix.python-version }}
       - name: clone-scikit-lego
-        run: git clone https://github.com/koaning/scikit-lego.git
+        run: git clone https://github.com/koaning/scikit-lego.git --depth 1
       - name: install-basics
         run: python -m pip install --upgrade tox virtualenv setuptools pip
       - name: install-scikit-lego-dev
38 changes: 18 additions & 20 deletions docs/how_it_works.md
@@ -58,28 +58,8 @@ it to:

 Now let's turn our attention to the implementation.

-## Polars implementation
-
-For Polars, Narwhals just "passes everything through". For example consider the following:
-```python exec="1"
-import polars as pl
-import narwhals as nw
-
-df_pl = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
-df = nw.from_native(df_pl)
-df.select(nw.col('a')+1)
-```
-
-`nw.col('a')` produces a `narwhals.expression.Expr` object, which has a private `_call` method.
-Inside `DataFrame.select`, we call `nw.col('a')._call(pl)`, which produces `pl.col('a')`.
-
-We then let Polars do its thing. Which is nice, but also not particularly interesting.
-How about translating expressions to pandas? Well, it's
-interesting to us, and you're still reading, so maybe it'll be interesting to you too.
-
 ## pandas implementation

-When we called `nw.col('a')._call(pl)`, we got a Narwhals-compliant Polars namespace.
 The pandas namespace (`pd`) isn't Narwhals-compliant, as the pandas API is very different
 from Polars'. So...Narwhals implements a `PandasLikeNamespace`, which includes the top-level
 Polars functions included in the Narwhals API:
@@ -151,6 +131,24 @@ than running pandas directly.
 Further attempts at demystifying Narwhals, refactoring code so it's clearer, and explaining
 this section better are 110% welcome.

+## Polars and other implementations
+
+Other implementations are similar to the above: they define their own Narwhals-compliant
+objects. So, all-in-all, there are a couple of layers here:
+
+- `nw.DataFrame` is backed by a Narwhals-compliant DataFrame, such as:
+    - `narwhals._pandas_like.dataframe.PandasLikeDataFrame`
+    - `narwhals._arrow.dataframe.ArrowDataFrame`
+    - `narwhals._polars.dataframe.PolarsDataFrame`
+- each Narwhals-compliant DataFrame is backed by a native DataFrame, for example:
+    - `narwhals._pandas_like.dataframe.PandasLikeDataFrame` is backed by a pandas DataFrame
+    - `narwhals._arrow.dataframe.ArrowDataFrame` is backed by a PyArrow Table
+    - `narwhals._polars.dataframe.PolarsDataFrame` is backed by a Polars DataFrame
+
+Each implementation defines its own objects in subfolders such as `narwhals._pandas_like`,
+`narwhals._arrow`, `narwhals._polars`, whereas the top-level modules such as `narwhals.dataframe`
+and `narwhals.series` coordinate how to dispatch the Narwhals API to each backend.
+
 ## Group-by

 Group-by is probably one of Polars' most significant innovations (on the syntax side) with respect
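
Note on the `docs/how_it_works.md` change: the layering it describes (a top-level Narwhals object backed by a Narwhals-compliant object, which is in turn backed by a native object) can be illustrated with a minimal sketch. The classes `ToyNativeFrame`, `ToyCompliantFrame`, and `ToyNarwhalsFrame` are hypothetical stand-ins invented for this note, not Narwhals classes; only the `__narwhals_dataframe__` method name is taken from this commit.

```python
from typing import Any


class ToyNativeFrame:
    """Hypothetical stand-in for a native dataframe (e.g. a Polars DataFrame)."""

    def __init__(self, data: dict[str, list[Any]]) -> None:
        self.data = data

    @property
    def columns(self) -> list[str]:
        return list(self.data)


class ToyCompliantFrame:
    """Hypothetical backend-level wrapper, playing the role of `PolarsDataFrame`."""

    def __init__(self, native: ToyNativeFrame) -> None:
        self._native_dataframe = native

    def __narwhals_dataframe__(self) -> "ToyCompliantFrame":
        # Same protocol method name as in the commit: it marks the object as compliant.
        return self

    @property
    def columns(self) -> list[str]:
        # The backend layer is the only one that talks to the native object.
        return self._native_dataframe.columns


class ToyNarwhalsFrame:
    """Hypothetical top-level object, playing the role of `nw.DataFrame`."""

    def __init__(self, compliant: Any) -> None:
        if not hasattr(compliant, "__narwhals_dataframe__"):
            msg = "expected an object implementing __narwhals_dataframe__"
            raise TypeError(msg)
        self._compliant_frame = compliant.__narwhals_dataframe__()

    @property
    def columns(self) -> list[str]:
        # The top level only dispatches to whichever backend wrapper it holds.
        return self._compliant_frame.columns


native = ToyNativeFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
frame = ToyNarwhalsFrame(ToyCompliantFrame(native))
print(frame.columns)
```

In this sketch the top-level class never touches the native object directly, so supporting another backend would only require another compliant wrapper.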
Empty file added narwhals/_polars/__init__.py
188 changes: 188 additions & 0 deletions narwhals/_polars/dataframe.py
@@ -0,0 +1,188 @@
from __future__ import annotations

from typing import TYPE_CHECKING
from typing import Any

from narwhals._polars.namespace import PolarsNamespace
from narwhals._polars.utils import extract_args_kwargs
from narwhals._polars.utils import translate_dtype
from narwhals.dependencies import get_polars

if TYPE_CHECKING:
    from typing_extensions import Self


class PolarsDataFrame:
    def __init__(self, df: Any, *, backend_version: tuple[int, ...]) -> None:
        self._native_dataframe = df
        self._backend_version = backend_version

    def __repr__(self) -> str:  # pragma: no cover
        return "PolarsDataFrame"

    def __narwhals_dataframe__(self) -> Self:
        return self

    def __narwhals_namespace__(self) -> PolarsNamespace:
        return PolarsNamespace(backend_version=self._backend_version)

    def __native_namespace__(self) -> Any:
        return get_polars()

    def _from_native_dataframe(self, df: Any) -> Self:
        return self.__class__(df, backend_version=self._backend_version)

    def _from_native_object(self, obj: Any) -> Any:
        pl = get_polars()
        if isinstance(obj, pl.Series):
            from narwhals._polars.series import PolarsSeries

            return PolarsSeries(obj, backend_version=self._backend_version)
        if isinstance(obj, pl.DataFrame):
            return self._from_native_dataframe(obj)
        # scalar
        return obj

    def __getattr__(self, attr: str) -> Any:
        if attr == "collect":  # pragma: no cover
            raise AttributeError

        def func(*args: Any, **kwargs: Any) -> Any:
            args, kwargs = extract_args_kwargs(args, kwargs)  # type: ignore[assignment]
            return self._from_native_object(
                getattr(self._native_dataframe, attr)(*args, **kwargs)
            )

        return func

    @property
    def schema(self) -> dict[str, Any]:
        schema = self._native_dataframe.schema
        return {name: translate_dtype(dtype) for name, dtype in schema.items()}

    def collect_schema(self) -> dict[str, Any]:
        if self._backend_version < (1,):  # pragma: no cover
            schema = self._native_dataframe.schema
        else:
            schema = dict(self._native_dataframe.collect_schema())
        return {name: translate_dtype(dtype) for name, dtype in schema.items()}

    @property
    def shape(self) -> tuple[int, int]:
        return self._native_dataframe.shape  # type: ignore[no-any-return]

    def __getitem__(self, item: Any) -> Any:
        pl = get_polars()
        result = self._native_dataframe.__getitem__(item)
        if isinstance(result, pl.Series):
            from narwhals._polars.series import PolarsSeries

            return PolarsSeries(result, backend_version=self._backend_version)
        return self._from_native_object(result)

    def get_column(self, name: str) -> Any:
        from narwhals._polars.series import PolarsSeries

        return PolarsSeries(
            self._native_dataframe.get_column(name), backend_version=self._backend_version
        )

    def is_empty(self) -> bool:
        return len(self._native_dataframe) == 0

    @property
    def columns(self) -> list[str]:
        return self._native_dataframe.columns  # type: ignore[no-any-return]

    def lazy(self) -> PolarsLazyFrame:
        return PolarsLazyFrame(
            self._native_dataframe.lazy(), backend_version=self._backend_version
        )

    def to_dict(self, *, as_series: bool) -> Any:
        df = self._native_dataframe

        if as_series:
            from narwhals._polars.series import PolarsSeries

            return {
                name: PolarsSeries(col, backend_version=self._backend_version)
                for name, col in df.to_dict(as_series=True).items()
            }
        else:
            return df.to_dict(as_series=False)

    def group_by(self, by: list[str]) -> Any:
        from narwhals._polars.group_by import PolarsGroupBy

        return PolarsGroupBy(self, by)

    def with_row_index(self, name: str) -> Any:
        if self._backend_version < (0, 20, 4):  # pragma: no cover
            return self._from_native_dataframe(
                self._native_dataframe.with_row_count(name)
            )
        return self._from_native_dataframe(self._native_dataframe.with_row_index(name))
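
Note on `PolarsDataFrame.__getattr__`: any method not defined explicitly on the wrapper is forwarded to the native object, with arguments unwrapped on the way in and the result re-wrapped on the way out. The sketch below shows that pattern in isolation. `ToyNative`, `ToyWrapper`, and `unwrap` are hypothetical names invented for this note (`unwrap` only approximates what `extract_args_kwargs` presumably does), and the guard on underscore-prefixed names is an addition of this sketch, not something the commit does.

```python
from typing import Any


class ToyNative:
    """Hypothetical stand-in for a native Polars object; not a real Polars class."""

    def __init__(self, values: list[int]) -> None:
        self.values = values

    def head(self, n: int) -> "ToyNative":
        return ToyNative(self.values[:n])

    def concat(self, other: "ToyNative") -> "ToyNative":
        return ToyNative(self.values + other.values)

    def sum(self) -> int:
        return sum(self.values)


def unwrap(obj: Any) -> Any:
    """Hypothetical helper: turn a wrapper back into its native object."""
    return obj._native if isinstance(obj, ToyWrapper) else obj


class ToyWrapper:
    def __init__(self, native: ToyNative) -> None:
        self._native = native

    def _from_native_object(self, obj: Any) -> Any:
        # Re-wrap native results; let scalars through unchanged.
        return ToyWrapper(obj) if isinstance(obj, ToyNative) else obj

    def __getattr__(self, attr: str) -> Any:
        # Python only calls this when normal attribute lookup has failed.
        if attr.startswith("_"):
            # Sketch-only guard: avoids recursion if `_native` is not set yet.
            raise AttributeError(attr)

        def func(*args: Any, **kwargs: Any) -> Any:
            args = tuple(unwrap(a) for a in args)
            kwargs = {k: unwrap(v) for k, v in kwargs.items()}
            return self._from_native_object(getattr(self._native, attr)(*args, **kwargs))

        return func


w = ToyWrapper(ToyNative([1, 2, 3]))
first_two = w.head(2)              # forwarded to ToyNative.head, result re-wrapped
total = w.concat(first_two).sum()  # wrapper argument unwrapped, scalar passed through
print(type(first_two).__name__, first_two._native.values, total)
```

One consequence of this design, which the sketch shares with the code above: looking up a misspelled public method name succeeds and returns `func`, so the `AttributeError` only surfaces when the result is called.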


class PolarsLazyFrame:
    def __init__(self, df: Any, *, backend_version: tuple[int, ...]) -> None:
        self._native_dataframe = df
        self._backend_version = backend_version

    def __repr__(self) -> str:  # pragma: no cover
        return "PolarsLazyFrame"

    def __narwhals_lazyframe__(self) -> Self:
        return self

    def __narwhals_namespace__(self) -> PolarsNamespace:
        return PolarsNamespace(backend_version=self._backend_version)

    def __native_namespace__(self) -> Any:  # pragma: no cover
        return get_polars()

    def _from_native_dataframe(self, df: Any) -> Self:
        return self.__class__(df, backend_version=self._backend_version)

    def __getattr__(self, attr: str) -> Any:
        def func(*args: Any, **kwargs: Any) -> Any:
            args, kwargs = extract_args_kwargs(args, kwargs)  # type: ignore[assignment]
            return self._from_native_dataframe(
                getattr(self._native_dataframe, attr)(*args, **kwargs)
            )

        return func

    @property
    def columns(self) -> list[str]:
        return self._native_dataframe.columns  # type: ignore[no-any-return]

    @property
    def schema(self) -> dict[str, Any]:
        schema = self._native_dataframe.schema
        return {name: translate_dtype(dtype) for name, dtype in schema.items()}

    def collect_schema(self) -> dict[str, Any]:
        if self._backend_version < (1,):  # pragma: no cover
            schema = self._native_dataframe.schema
        else:
            schema = dict(self._native_dataframe.collect_schema())
        return {name: translate_dtype(dtype) for name, dtype in schema.items()}

    def collect(self) -> PolarsDataFrame:
        return PolarsDataFrame(
            self._native_dataframe.collect(), backend_version=self._backend_version
        )

    def group_by(self, by: list[str]) -> Any:
        from narwhals._polars.group_by import PolarsLazyGroupBy

        return PolarsLazyGroupBy(self, by)

    def with_row_index(self, name: str) -> Any:
        if self._backend_version < (0, 20, 4):  # pragma: no cover
            return self._from_native_dataframe(
                self._native_dataframe.with_row_count(name)
            )
        return self._from_native_dataframe(self._native_dataframe.with_row_index(name))
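
Note on the `backend_version` checks: both classes gate behaviour on comparisons such as `self._backend_version < (0, 20, 4)`, which relies on Python comparing tuples element by element. The sketch below shows why a tuple of integers is a safer key than the raw version string. `to_version_tuple` and `row_index_method` are hypothetical helpers written for this note; how Narwhals actually builds `backend_version` is not shown in this part of the diff.

```python
import re


def to_version_tuple(version: str) -> tuple[int, ...]:
    """Hypothetical helper: '0.20.31' -> (0, 20, 31); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def row_index_method(backend_version: tuple[int, ...]) -> str:
    # Mirrors the gate in `with_row_index` above: below (0, 20, 4) the code
    # falls back to `with_row_count`.
    if backend_version < (0, 20, 4):
        return "with_row_count"
    return "with_row_index"


old = to_version_tuple("0.20.3")
new = to_version_tuple("0.20.31")
print(row_index_method(old), row_index_method(new))

# String comparison would order these two versions the wrong way round.
print("0.20.31" < "0.20.4", new < (0, 20, 4))
```

The same reasoning covers the `(1,)` gate in `collect_schema`: `(0, 20, 31) < (1,)` holds because the first elements already differ, while `(1, 0, 0) < (1,)` does not.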