docs(python): Update concatenation page to include relaxed and changes to rechunk #16775

5 changes: 4 additions & 1 deletion docs/development/contributing/index.md
@@ -253,7 +253,10 @@ The snippet is delimited by `--8<-- [start:<snippet_name>]` and `--8<-- [end:<sn

#### Linting

Before committing, install `dprint` (see above) and run `dprint fmt` from the `docs` directory to lint the markdown files.
Before committing:

- install `dprint` (see above) and run `dprint fmt` from the `docs` directory to lint the markdown files
- run `cargo fmt` for the `docs` directory to format the Rust code snippets

### API reference

39 changes: 39 additions & 0 deletions docs/src/python/user-guide/transformations/concatenation.py
@@ -26,6 +26,30 @@
print(df_vertical_concat)
# --8<-- [end:vertical]

# --8<-- [start:vertical_relaxed]
df_v1 = pl.DataFrame(
{
"a": [1.0],
"b": [3],
},
)
df_v2 = pl.DataFrame(
{
"a": [2],
"b": [4],
},
)
df_vertical_relaxed_concat = pl.concat(
[
df_v1,
df_v2,
],
how="vertical_relaxed",
)
print(df_vertical_relaxed_concat)
# --8<-- [end:vertical_relaxed]


# --8<-- [start:horizontal]
df_h1 = pl.DataFrame(
{
@@ -73,6 +97,21 @@
print(df_horizontal_concat)
# --8<-- [end:horizontal_different_lengths]

# --8<-- [start:horizontal_align]
df_h1 = pl.DataFrame({"a": ["a", "b", "d", "e", "e"], "b": [1, 2, 4, 5, 6]})
df_h2 = pl.DataFrame({"a": ["a", "b", "c", "d", "e"], "d": ["w", "x", "y", "z", None]})
df_align = pl.concat(
[
df_h1,
df_h2,
],
how="align",
)
print(df_align)

# --8<-- [end:horizontal_align]


# --8<-- [start:cross]
df_d1 = pl.DataFrame(
{
22 changes: 22 additions & 0 deletions docs/src/rust/user-guide/transformations/concatenation.rs
@@ -20,6 +20,23 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("{}", &df_vertical_concat);
// --8<-- [end:vertical]

// --8<-- [start:vertical_relaxed]
let df_v1 = df!(
"a"=> &[1.0],
"b"=> &[3],
)?;
let df_v2 = df!(
"a"=> &[2],
"b"=> &[4],
)?;
let df_vertical_relaxed_concat = concat(
[df_v1.clone().lazy(), df_v2.clone().lazy()],
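// `to_supertypes: true` requests the relaxed behaviour; the defaults alone do a strict concat
UnionArgs {
to_supertypes: true,
..Default::default()
},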
)?
.collect()?;
println!("{}", &df_vertical_relaxed_concat);
// --8<-- [end:vertical_relaxed]

// --8<-- [start:horizontal]
let df_h1 = df!(
"l1"=> &[1, 2],
@@ -47,6 +64,11 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("{}", &df_horizontal_concat);
// --8<-- [end:horizontal_different_lengths]

// --8<-- [start:horizontal_align]
println!("Not available in Rust");

// --8<-- [end:horizontal_align]

// --8<-- [start:cross]
let df_d1 = df!(
"a"=> &[1],
26 changes: 21 additions & 5 deletions docs/user-guide/transformations/concatenation.md
@@ -17,7 +17,15 @@ In a vertical concatenation you combine all of the rows from a list of `DataFram
--8<-- "python/user-guide/transformations/concatenation.py:vertical"
```

Vertical concatenation fails when the dataframes do not have the same column names.
Vertical concatenation fails when the dataframes do not have the same column names and dtypes.

For certain differences in dtypes, Polars can do a relaxed vertical concatenation, in which columns that share a name but differ in dtype are cast to a common _supertype_. For example, if column `'a'` in the first `DataFrame` is `Float32` but column `'a'` in the second `DataFrame` is `Int64`, then both columns are cast to their supertype `Float64` before concatenation. If the set of dtypes for a column does not have a supertype, the concatenation fails. The supertype mappings are defined internally in Polars.

{{code_block('user-guide/transformations/concatenation','vertical_relaxed',['concat'])}}

```python exec="on" result="text" session="user-guide/transformations/concatenation"
--8<-- "python/user-guide/transformations/concatenation.py:vertical_relaxed"
```

## Horizontal concatenation - getting wider

@@ -40,21 +48,29 @@ columns will be padded with `null` values at the end up to the maximum length.
--8<-- "python/user-guide/transformations/concatenation.py:horizontal_different_lengths"
```

An alternative horizontal concatenation method is `align`, where Polars combines the frames horizontally by determining the key columns they have in common and aligning rows on those keys.

{{code_block('user-guide/transformations/concatenation','horizontal_align',['concat'])}}

```python exec="on" result="text" session="user-guide/transformations/concatenation"
--8<-- "python/user-guide/transformations/concatenation.py:horizontal_align"
```

## Diagonal concatenation - getting longer, wider and `null`ier

In a diagonal concatenation you combine all of the row and columns from a list of `DataFrames` into a single longer and/or wider `DataFrame`.
In a diagonal concatenation you combine all of the rows and columns from a list of `DataFrames` into a single longer and/or wider `DataFrame`.

{{code_block('user-guide/transformations/concatenation','cross',['concat'])}}

```python exec="on" result="text" session="user-guide/transformations/concatenation"
--8<-- "python/user-guide/transformations/concatenation.py:cross"
```

Diagonal concatenation generates nulls when the column names do not overlap.
Diagonal concatenation generates nulls when the column names do not overlap, but fails if the dtypes do not match for columns with the same name. As with vertical concatenation, there is an alternative `diagonal_relaxed` method that tries to cast columns to a supertype if columns with the same name have different dtypes.

When the dataframe shapes do not match and we have an overlapping semantic key then [we can join the dataframes](joins.md) instead of concatenating them.

## Rechunking

Before a concatenation we have two dataframes `df1` and `df2`. Each column in `df1` and `df2` is in one or more chunks in memory. By default, during concatenation the chunks in each column are copied to a single new chunk - this is known as **rechunking**. Rechunking is an expensive operation, but is often worth it because future operations will be faster.
If you do not want Polars to rechunk the concatenated `DataFrame` you specify `rechunk = False` when doing the concatenation.
We have a `list` of `DataFrames` and we want to concatenate them. Each column in each `DataFrame` is stored in one or more chunks in memory. When we concatenate the `DataFrames`, the data for each column can be copied into a single location in memory; this is known as **rechunking**. Rechunking is expensive because it requires copying data from one location to another. However, it can make subsequent operations faster because the data for each column then sits in a single location in memory.

By default, rechunking does not happen when we concatenate in eager mode. If we want Polars to rechunk the concatenated `DataFrame`, we specify `rechunk=True` when doing the concatenation. In lazy mode, the query optimizer assesses whether to rechunk based on the query plan.