feat: Add `str.normalize()` #20483

etiennebacher · 2024-12-27T20:18:27Z

Contributing to the Rust part for the first time so there are probably some quirks here and there. I used the suggestion in #11455 to use the unicode_normalization crate and mostly followed #12878. I don't know if you want to add this function or to implement it that way but it was good training for me anyway.

Note that I'm not very familiar with this method so double-checking the output and maybe adding more corner cases to the test suite would be nice.
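
As a rough starting point for such corner cases, here is a minimal sketch that uses Python's standard-library `unicodedata.normalize` as a reference oracle. It does not exercise the Polars implementation, and the helper name `check_corner_cases` is hypothetical. It covers empty and ASCII input, idempotence, canonical equivalence, combining-mark ordering, and Hangul syllables.

```python
import unicodedata

FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def check_corner_cases() -> None:
    """Reference checks against the stdlib; hypothetical helper, not Polars code."""
    for form in FORMS:
        # Empty and pure-ASCII strings are unchanged by every form.
        assert unicodedata.normalize(form, "") == ""
        assert unicodedata.normalize(form, "polars") == "polars"
        # Normalizing twice gives the same result as normalizing once.
        once = unicodedata.normalize(form, "e\u0301 \ufb01 \u00b2")
        assert unicodedata.normalize(form, once) == once

    # Canonically equivalent spellings of "é" map onto each other.
    assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
    assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"

    # Combining marks typed in different orders end up in one canonical order.
    dot_above_first = unicodedata.normalize("NFD", "a\u0307\u0323")
    dot_below_first = unicodedata.normalize("NFD", "a\u0323\u0307")
    assert dot_above_first == dot_below_first

    # Hangul syllables decompose into jamo and compose back.
    assert unicodedata.normalize("NFD", "\uac00") == "\u1100\u1161"
    assert unicodedata.normalize("NFC", "\u1100\u1161") == "\uac00"


check_corner_cases()
```

The same inputs could be fed to `str.normalize()` in the test suite and compared against these reference results.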

Quick performance check after `make build-release`:

import polars as pl
import time
import pandas as pd

N = 20_000_000  # the 4 strings below are repeated N times (80 million values)
txt = ["01²3", "株式会社", "ሎ", "ＫＡＤＯＫＡＷＡ Ｆｕｔｕｒｅ"]

ser = pd.Series(txt * N)
start = time.time()
ser.str.normalize('NFKC')
print("Pandas:", time.time() - start)

ser = pl.Series(txt * N)
start = time.time()
ser.str.normalize('NFKC')
print("Polars:", time.time() - start)

Pandas: 11.836752653121948
Polars: 11.922921657562256

A bit disappointed with the performance, maybe I missed something obvious. There are also a couple of issues on performance in the Rust crate used: https://github.com/unicode-rs/unicode-normalization/issues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen+performance

Fixes #5799
Fixes #11455

codecov · 2024-12-27T21:22:07Z

Codecov Report

Attention: Patch coverage is 84.05797% with 11 lines in your changes missing coverage. Please review.

Project coverage is 78.85%. Comparing base (c4b704b) to head (13c49f0).
Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
.../polars-python/src/lazyframe/visitor/expr_nodes.rs	0.00%	8 Missing ⚠️
.../polars-ops/src/chunked_array/strings/normalize.rs	91.30%	2 Missing ⚠️
...rates/polars-plan/src/dsl/function_expr/strings.rs	85.71%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #20483      +/-   ##
==========================================
- Coverage   78.96%   78.85%   -0.12%     
==========================================
  Files        1557     1559       +2     
  Lines      220743   221124     +381     
  Branches     2527     2527              
==========================================
+ Hits       174318   174363      +45     
- Misses      45847    46183     +336     
  Partials      578      578

orlp · 2024-12-27T21:34:43Z

This kernel should not be written by collecting to a temporary String for each string. It should instead be something like this to re-use the allocation:

pub fn normalize_with<F: Fn(&str, &mut String)>(ca: &StringChunked, normalizer: F) -> StringChunked {
    let mut buffer = String::new();
    let mut builder = StringChunkedBuilder::new(ca.name().clone(), ca.len());
    for opt_s in ca.iter() {
        if let Some(s) = opt_s {
            buffer.clear();
            normalizer(s, &mut buffer);
            builder.append_value(&buffer);
        } else {
            builder.append_null();
        }
    }
    builder.finish()
}

pub fn normalize(ca: &StringChunked, form: UnicodeForm) -> StringChunked {
    match form {
        UnicodeForm::NFC => normalize_with(ca, |s, b| b.extend(s.nfc())),
        UnicodeForm::NFKC => normalize_with(ca, |s, b| b.extend(s.nfkc())),
        UnicodeForm::NFD => normalize_with(ca, |s, b| b.extend(s.nfd())),
        UnicodeForm::NFKD => normalize_with(ca, |s, b| b.extend(s.nfkd())),
    }
}
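
For readers less familiar with the four forms, here is a minimal Python stand-in for the shape of this kernel: nulls pass through untouched and the form selects the normalizer. It relies on the standard-library `unicodedata` module and is only an illustration. The function name `normalize_column` is hypothetical and this is not the Polars implementation.

```python
import unicodedata

VALID_FORMS = {"NFC", "NFKC", "NFD", "NFKD"}


def normalize_column(values, form="NFC"):
    """Normalize each string in `values`, keeping None (null) entries as None.

    Hypothetical helper that mirrors the null handling and form dispatch of
    the Rust sketch above using plain Python lists.
    """
    if form not in VALID_FORMS:
        raise ValueError(f"unknown normalization form: {form!r}")
    return [None if v is None else unicodedata.normalize(form, v) for v in values]


# "é" as one code point, the "fi" ligature, and a null value.
column = ["\u00e9", "\ufb01", None]
for form in sorted(VALID_FORMS):
    result = normalize_column(column, form)
    # Print code points so that composed and decomposed results are distinguishable.
    print(form, [None if v is None else [hex(ord(c)) for c in v] for v in result])
```

Only the compatibility forms (NFKC, NFKD) replace the ligature with the two letters "fi"; the canonical forms (NFC, NFD) leave it as is and differ only in whether "é" is composed or decomposed.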

etiennebacher · 2024-12-27T23:18:48Z

Thanks @orlp, I naively followed unicode_normalization's example but should have given more thought to this.

Updated benchmark:

Pandas: 20.463711977005005
Polars: 16.712544441223145

(I can't really explain the change in magnitude compared to the first run, but the gap between Polars and pandas is now consistent.)
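
Single-shot timings can vary a lot between runs, so one way to make such a comparison more stable is to repeat each measurement and keep the best time. The following is a minimal sketch under that assumption. The helper `best_of` is hypothetical, and the workload uses the standard-library `unicodedata` on a small list as a stand-in for the pandas and Polars calls.

```python
import time
import unicodedata


def best_of(fn, repeat=5):
    """Run `fn` `repeat` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Small stand-in workload: superscript two and two fullwidth letters.
data = ["01\u00b23", "\uff2b\uff21"] * 10_000

elapsed = best_of(lambda: [unicodedata.normalize("NFKC", s) for s in data])
print(f"best of 5: {elapsed:.4f}s")
```

In the benchmark above, the lambda would wrap `ser.str.normalize('NFKC')` for each library instead.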

ritchie46 · 2024-12-29T09:34:45Z

Thanks for your first contributions @etiennebacher. Before implementing features, we should first decide whether we want them (this is indicated by the accepted tag).

For one, I am not entirely sure that we do want this in the main library. It seems quite a large dependency (with all the unicode tables), which might be better suited for a plugin.

Let me get back to this, I want to see how much this dependency adds and how important of a feature this is.

etiennebacher · 2024-12-29T12:49:31Z

Sure, no problem with letting this be a plugin functionality.

I don't mind this being closed, but no matter the outcome the two issues mentioned in the original post should be updated.

drumtorben · 2025-01-06T09:31:59Z

This functionality is already in the polars-ds extension:
https://polars-ds-extension.readthedocs.io/en/latest/string.html#polars_ds.string.normalize_string

ritchie46

Alright, I have looked at the wheel size, and at how core this is and I think this is worth it.

The PR looks great @etiennebacher. Thanks. 👍

ritchie46 · 2025-01-11T14:43:01Z

Ah, I see there is 1 mypy lint. Can that be fixed?

etiennebacher · 2025-01-11T15:41:47Z

Thanks @ritchie46, the mypy failure is fixed

init

9ca433c

etiennebacher requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa, wence- and orlp as code owners December 27, 2024 20:18

github-actions bot added the enhancement, python, and rust labels Dec 27, 2024

etiennebacher added 4 commits December 27, 2024 21:20

fmt

2eb7ae2

lint

9ac69ce

try fix mypy

bff1a9e

fix series docs

dae704c

improve following comment

18e903c

etiennebacher marked this pull request as draft December 28, 2024 08:19

better docs

cb023f8

etiennebacher marked this pull request as ready for review December 28, 2024 09:12

etiennebacher added 4 commits January 10, 2025 14:55

init

666cdb4

fmt

3544cec

lint

b4bff5b

try fix mypy

95ce74a

etiennebacher added 3 commits January 10, 2025 14:55

fix series docs

6b41fef

improve following comment

29fb5d1

better docs

8497bb9

ritchie46 force-pushed the str_normalize branch from cb023f8 to 8497bb9 January 10, 2025 13:55

ritchie46 approved these changes Jan 11, 2025

etiennebacher added 2 commits January 11, 2025 15:48

Merge branch 'str_normalize' of https://github.com/etiennebacher/polars…

4a3f28c

… into str_normalize

fix mypy

13c49f0
