add docs page
MarcoGorelli committed Mar 16, 2024
1 parent 0170195 commit 9ce6f21
Showing 11 changed files with 396 additions and 0 deletions.
32 changes: 32 additions & 0 deletions .github/workflows/mkdocs.yml
@@ -0,0 +1,32 @@
name: mkdocs

on:
  push:
    branches:
      - main
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure Git Credentials
        run: |
          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
      - uses: actions/setup-python@v4
        with:
          python-version: 3.x
      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV


      - uses: actions/cache@v3
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
      - run: pip install -r docs/requirements-docs.txt -e . pandas polars

      - run: mkdocs gh-deploy --force
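
The `cache_id` step in the workflow above runs `date --utc '+%V'`, which prints the ISO 8601 week number, so the cache key changes once a week. A rough Python equivalent, for illustration only (the function name `weekly_cache_id` is hypothetical and is not part of the workflow):

```python
from datetime import datetime, timezone


def weekly_cache_id(now=None):
    """Return the ISO 8601 week number as a zero-padded string, like `date --utc '+%V'`."""
    if now is None:
        now = datetime.now(timezone.utc)
    # isocalendar() returns (ISO year, ISO week number, ISO weekday)
    week = now.isocalendar()[1]
    return f"{week:02d}"


# Build a cache key for the date of this commit, 16 March 2024
print("mkdocs-material-" + weekly_cache_id(datetime(2024, 3, 16, tzinfo=timezone.utc)))
```

Because the `restore-keys` prefix is `mkdocs-material-`, a cache from an earlier week can still be restored when the current week's key has no exact match.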
1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@
*.pyc
todo.md
.coverage
site/
94 changes: 94 additions & 0 deletions docs/basics/column.md
@@ -0,0 +1,94 @@
# Column

In [dataframe.md](dataframe.md), you learned how to write a dataframe-agnostic function.

We only used DataFrame methods there - but what if we need to operate on its columns?

## Extracting a column


## Example 1: filter based on a column's values

```python exec="1" source="above" session="ex1"
import narwhals as nw

def my_func(df):
    df_s = nw.DataFrame(df)
    df_s = df_s.filter(nw.col('a') > 0)
    return nw.to_native(df_s)
```

=== "pandas"
    ```python exec="true" source="material-block" result="python" session="ex1"
    import pandas as pd

    df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
    print(my_func(df))
    ```

=== "Polars"
    ```python exec="true" source="material-block" result="python" session="ex1"
    import polars as pl

    df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
    print(my_func(df))
    ```
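
For intuition, here is a library-free sketch of what the filter computes on the sample data, with the frame modelled as a plain dict of lists. It only illustrates the expected result and is not how Narwhals, pandas, or Polars implement `filter`; the helper name `filter_rows` is hypothetical.

```python
def filter_rows(columns, keep):
    """Keep the rows of a dict-of-lists table for which `keep(row)` is true."""
    names = list(columns)
    # Turn the column-oriented table into a sequence of row tuples
    rows = zip(*(columns[name] for name in names))
    kept = [row for row in rows if keep(dict(zip(names, row)))]
    # Rebuild the column-oriented layout from the surviving rows
    return {name: [row[i] for row in kept] for i, name in enumerate(names)}


data = {'a': [-1, 1, 3], 'b': [3, 5, -3]}
print(filter_rows(data, lambda row: row['a'] > 0))
```

Only the rows where `'a'` is positive survive, and the values in `'b'` are kept in step with them.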


## Example 2: multiply a column's values by a constant

Let's write a dataframe-agnostic function which multiplies the values in column
`'a'` by 2.

```python exec="1" source="above" session="ex2"
import narwhals as nw

def my_func(df):
    df_s = nw.DataFrame(df)
    df_s = df_s.with_columns(nw.col('a')*2)
    return nw.to_native(df_s)
```

=== "pandas"
    ```python exec="true" source="material-block" result="python" session="ex2"
    import pandas as pd

    df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
    print(my_func(df))
    ```

=== "Polars"
    ```python exec="true" source="material-block" result="python" session="ex2"
    import polars as pl

    df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
    print(my_func(df))
    ```

Note that column `'a'` was overwritten. If we had wanted to add a new column called `'c'` containing column `'a'`'s
values multiplied by 2, we could have used `alias`:

```python exec="1" source="above" session="ex2.1"
import narwhals as nw

def my_func(df):
    df_s = nw.DataFrame(df)
    df_s = df_s.with_columns((nw.col('a')*2).alias('c'))
    return nw.to_native(df_s)
```

=== "pandas"
    ```python exec="true" source="material-block" result="python" session="ex2.1"
    import pandas as pd

    df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
    print(my_func(df))
    ```

=== "Polars"
    ```python exec="true" source="material-block" result="python" session="ex2.1"
    import polars as pl

    df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
    print(my_func(df))
    ```
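
The difference between overwriting a column and adding one under a new name can be sketched without any dataframe library. This is an illustration of the semantics only, with the frame modelled as a dict of lists; the helper `with_column` is hypothetical and is not a Narwhals function.

```python
def with_column(columns, name, values):
    """Return a new dict-of-lists table with column `name` set to `values`."""
    result = dict(columns)  # shallow copy, so the input table is left untouched
    result[name] = values
    return result


data = {'a': [-1, 1, 3], 'b': [3, 5, -3]}
doubled = [value * 2 for value in data['a']]

# Reusing the name 'a' replaces the existing column...
print(with_column(data, 'a', doubled))
# ...whereas a new name, as with `.alias('c')`, adds a column
print(with_column(data, 'c', doubled))
```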
106 changes: 106 additions & 0 deletions docs/basics/complete_example.md
@@ -0,0 +1,106 @@
# Complete example

We're going to write a dataframe-agnostic "Standard Scaler". This class will have
`fit` and `transform` methods (like `scikit-learn` transformers), and will work
agnostically for pandas and Polars.

We'll need to write two methods:

- `fit`: find the mean and standard deviation for each column from a given training set;
- `transform`: scale a given dataset with the mean and standard deviations calculated
during `fit`.

The `fit` method is a bit complicated, so let's start with `transform`.
Suppose we've already calculated the mean and standard deviation of each column, and have
stored them in attributes `self._means` and `self._std_devs`.

## Transform method

The general strategy will be:

1. Initialise a Narwhals DataFrame by passing your dataframe to `nw.DataFrame`.
2. Express your logic using the subset of the Polars API supported by Narwhals.
3. If you need to return a dataframe to the user in its original library, call `narwhals.to_native`.

```python
import narwhals as nw

class StandardScaler:
    def transform(self, df):
        df = nw.DataFrame(df)
        df = df.with_columns(
            (nw.col(col) - self._means[col]) / self._std_devs[col]
            for col in df.columns
        )
        return nw.to_native(df)
```

Note that all the calculations here can stay lazy if the underlying library permits it.
For example, if you pass in a `polars.LazyFrame`, the return value is also a `polars.LazyFrame` - it is the caller's responsibility to
call `.collect()` on the result if they want to materialise its values.

## Fit method

Unlike the `transform` method, `fit` cannot stay lazy, as we need to compute concrete values
for the means and standard deviations.

To be able to get `Series` out of our `DataFrame`, we'll need the `DataFrame` to be an
eager one, as Polars doesn't have a concept of lazy `Series`.
To do that, when we instantiate our `narwhals.DataFrame`, we pass `features=['eager']`,
which lets us access eager-only features.

```python
import narwhals as nw

class StandardScaler:
    def fit(self, df):
        df = nw.DataFrame(df, features=['eager'])
        # Map each column name to its statistic so `transform` can look it up by name
        self._means = {col: df[col].mean() for col in df.columns}
        self._std_devs = {col: df[col].std() for col in df.columns}
```

## Putting it all together

Here is our dataframe-agnostic standard scaler:
```python exec="1" source="above" session="tute-ex1"
import narwhals as nw

class StandardScaler:
    def fit(self, df):
        df = nw.DataFrame(df, features=["eager"])
        self._means = {col: df[col].mean() for col in df.columns}
        self._std_devs = {col: df[col].std() for col in df.columns}

    def transform(self, df):
        df = nw.DataFrame(df)
        df = df.with_columns(
            (nw.col(col) - self._means[col]) / self._std_devs[col]
            for col in df.columns
        )
        return nw.to_native(df)
```

Next, let's try running it. Notice how, as `transform` doesn't use
`features=['eager']`, we can pass a `polars.LazyFrame` to it without issues!

=== "pandas"
    ```python exec="true" source="material-block" result="python" session="tute-ex1"
    import pandas as pd

    df_train = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 7]})
    df_test = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 7]})
    scaler = StandardScaler()
    scaler.fit(df_train)
    print(scaler.transform(df_test))
    ```

=== "Polars"
    ```python exec="true" source="material-block" result="python" session="tute-ex1"
    import polars as pl

    df_train = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 7]})
    df_test = pl.LazyFrame({'a': [1, 2, 3], 'b': [4, 5, 7]})
    scaler = StandardScaler()
    scaler.fit(df_train)
    print(scaler.transform(df_test).collect())
    ```
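
To sanity-check the scaled values, the same arithmetic can be reproduced with only the standard library. This is a sketch for comparison, not part of Narwhals; it assumes `.std()` is the sample standard deviation (`ddof=1`), which is the default in both pandas and Polars, and the function name `reference_standard_scale` is hypothetical.

```python
import statistics


def reference_standard_scale(train, test):
    """Scale each column of `test` using the mean and sample std of `train`."""
    scaled = {}
    for name, values in test.items():
        mean = statistics.mean(train[name])
        # statistics.stdev is the sample standard deviation (divides by n - 1)
        std = statistics.stdev(train[name])
        scaled[name] = [(value - mean) / std for value in values]
    return scaled


train = {'a': [1, 2, 3], 'b': [4, 5, 7]}
print(reference_standard_scale(train, train))
```

Column `'a'` has mean 2 and sample standard deviation 1, so it should scale to -1, 0 and 1.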
41 changes: 41 additions & 0 deletions docs/basics/dataframe.md
@@ -0,0 +1,41 @@
# DataFrame

To write a dataframe-agnostic function, the steps you'll want to follow are:

1. Initialise a Narwhals DataFrame by passing your dataframe to `nw.DataFrame`.
2. Express your logic using the subset of the Polars API supported by Narwhals.
3. If you need to return a dataframe to the user in its original library, call `narwhals.to_native`.

Let's try writing a simple example.

## Example 1: group-by and mean

Make a Python file `t.py` with the following content:
```python exec="1" source="above" session="df_ex1"
import narwhals as nw

def func(df):
    # 1. Create a Narwhals dataframe
    df_s = nw.DataFrame(df)
    # 2. Use the subset of the Polars API supported by Narwhals
    df_s = df_s.group_by('a').agg(nw.col('b').mean())
    # 3. Return a dataframe in the user's original library
    return nw.to_native(df_s)
```
Let's try it out:

=== "pandas"
    ```python exec="true" source="material-block" result="python" session="df_ex1"
    import pandas as pd

    df = pd.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
    print(func(df))
    ```

=== "Polars"
    ```python exec="true" source="material-block" result="python" session="df_ex1"
    import polars as pl

    df = pl.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
    print(func(df))
    ```
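
As a reference for what the group-by computes, here is a plain-Python sketch of "group by `'a'`, then take the mean of `'b'`". It is an illustration only and says nothing about how the underlying libraries implement it; the helper name `group_mean` is hypothetical.

```python
from collections import defaultdict


def group_mean(keys, values):
    """Return {key: mean of the values that share that key}."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return {key: sum(group) / len(group) for key, group in groups.items()}


# Same data as the example above: column 'a' holds the keys, column 'b' the values
print(group_mean([1, 1, 2], [4, 5, 6]))
```

Whichever library is used, group `1` should have mean 4.5 and group `2` should have mean 6.0, although the row order of a group-by result may differ between libraries.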
19 changes: 19 additions & 0 deletions docs/index.md
@@ -0,0 +1,19 @@
# Narwhals

Extremely lightweight compatibility layer between pandas and Polars:

- ✅ No dependencies.
- ✅ Lightweight: wheel is smaller than 30 kB.
- ✅ Simple, minimal, and predictable.

No need to choose - support both with ease!

## Who's this for?

Anyone wishing to write a library/application/service which consumes dataframes, and wishing to make it
completely dataframe-agnostic.

## Let's get started!

- [Installation](installation.md)
- [Quick start](quick_start.md)
16 changes: 16 additions & 0 deletions docs/installation.md
@@ -0,0 +1,16 @@
# Installation

First, make sure you have [created and activated](https://docs.python.org/3/library/venv.html) a Python 3.8+ virtual environment.

Then, run
```console
python -m pip install narwhals
```

Then, if you start the Python REPL and see the following:
```python
>>> import narwhals
>>> narwhals.__version__
'0.4.1'
```
then installation worked correctly!
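
If you would rather check from a script than from the REPL, the standard library's `importlib.metadata` (available from Python 3.8) can report the installed version of a distribution. A minimal sketch, in which the helper name `installed_version` is hypothetical:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if Narwhals is installed in this environment, else None
print(installed_version("narwhals"))
```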
43 changes: 43 additions & 0 deletions docs/quick_start.md
@@ -0,0 +1,43 @@
# Quick start

## Prerequisites

Please start by following the [installation instructions](installation.md).

Then, please install the following:

- [pandas](https://pandas.pydata.org/docs/getting_started/install.html)
- [Polars](https://pola-rs.github.io/polars/user-guide/installation/)

## Simple example

Create a Python file `t.py` with the following content:

```python
import pandas as pd
import polars as pl
import narwhals as nw


def my_function(df_any):
    df = nw.DataFrame(df_any)
    column_names = df.columns
    return column_names


df_pandas = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_polars = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

print('pandas result: ', my_function(df_pandas))
print('Polars result: ', my_function(df_polars))
```

If you run `python t.py` and your output looks like this:
```
pandas result: ['a', 'b']
Polars result: ['a', 'b']
```

then all your installations worked perfectly.

Let's learn about what you just did, and what Narwhals can do for you.
12 changes: 12 additions & 0 deletions docs/reference.md
@@ -0,0 +1,12 @@
# Reference

Here are some related projects.

## Dataframe Interchange Protocol

Standardised way of interchanging data between libraries, see
[here](https://data-apis.org/dataframe-protocol/latest/index.html).

## Array API

Array counterpart to the DataFrame API, see [here](https://data-apis.org/array-api/2022.12/index.html).
5 changes: 5 additions & 0 deletions docs/requirements-docs.txt
@@ -0,0 +1,5 @@
markdown-exec[ansi]
mkdocs
mkdocs-material
mkdocstrings
mkdocstrings[python]