Polars Extension for General Data Science Use

Currently in Alpha. Feel free to submit feature requests in the issues section of the repo.

The goal of this package is to give data scientists, analysts, engineers, and quants more tools to manipulate, transform, and make sense of data without leaving DataFrame land (aka Wonderland).

Performance is a focus, but sometimes it's impossible to beat NumPy/SciPy performance for a single operation on a single array. There can be many reasons: interop cost, null checks, lack of support for complex numbers (e.g. the FFT implementation has to make multiple copies), or simply that we haven't yet found the most optimized way to write some algorithm.

However, there are greater benefits to staying in DataFrame land:

It works with the Polars expression engine, so multiple expressions can be executed in parallel. E.g. running an FFT on a single series may be slower than NumPy, but if you run an FFT together with other non-trivial operations, the story changes completely.
It works in a group_by context. E.g. you can run multiple linear regressions in parallel, one per group.
Staying in DataFrame land typically keeps code cleaner and less confusing.

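import polars as pl  # the extension package must also be imported so that `num_ext` is registered

# Fit y on a and b (plus an intercept) separately within each "dummy" group.
df.group_by("dummy").agg(
    pl.col("y").num_ext.lstsq(pl.col("a"), pl.col("b"), add_bias=True)
)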
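
To make the group_by example above more concrete, here is a plain-Python sketch of what "one least-squares fit per group" computes. It is not the package's implementation (which runs inside the Polars engine); the helper names `solve` and `lstsq`, the sample rows, and the choice to put the intercept last are all assumptions made for illustration.

```python
from collections import defaultdict


def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * p for x, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def lstsq(X, y, add_bias=True):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y.

    Assumes X has full column rank. With add_bias=True a constant column is
    appended, so the intercept is the LAST coefficient (an assumption of this sketch).
    """
    if add_bias:
        X = [list(row) + [1.0] for row in X]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


# Hypothetical rows of (dummy, a, b, y); y is an exact linear function within each group.
rows = [
    ("g1", 1.0, 0.0, 3.0), ("g1", 0.0, 1.0, 4.0), ("g1", 1.0, 1.0, 6.0), ("g1", 2.0, 1.0, 8.0),
    ("g2", 1.0, 0.0, 3.0), ("g2", 0.0, 1.0, 4.5), ("g2", 1.0, 1.0, 3.5), ("g2", 2.0, 3.0, 3.5),
]

# Collect the feature rows and targets of each group, then fit each group independently.
groups = defaultdict(lambda: ([], []))
for key, a, b, y in rows:
    groups[key][0].append([a, b])
    groups[key][1].append(y)

coefficients = {key: lstsq(X, y) for key, (X, y) in groups.items()}
for key, beta in coefficients.items():
    print(key, [round(c, 6) for c in beta])
```

Because each group's fit is independent of the others, this is exactly the kind of work an expression engine can hand to separate threads.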

The package right now contains two extensions:

Numeric Extension

Existing Features

GCD, LCM for integers.
Harmonic mean, geometric mean, and other common, simple metrics used in industry.
Common loss functions, e.g. L1, L2, L infinity, Huber loss, MAPE, SMAPE, wMAPE, etc.
Common mini-models, e.g. lstsq and conditional entropy.
Discrete Fourier Transform, returning the real and imaginary parts of the new series.
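
For readers unfamiliar with some of the quantities listed above, the following standard-library sketch spells out a few of them. These are textbook reference definitions, not the package's API: the function names (`l_inf`, `mape`, `huber`, `dft`) and the conventions chosen (MAPE as a fraction, Huber with `delta=1.0`, DFT with the `exp(-2πi·kt/n)` sign) are assumptions, and the package may use different ones.

```python
import cmath
import math
import statistics


def l_inf(actual, pred):
    """L-infinity loss: the largest absolute error."""
    return max(abs(a - p) for a, p in zip(actual, pred))


def mape(actual, pred):
    """Mean absolute percentage error, returned as a fraction (not multiplied by 100)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)


def huber(actual, pred, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear beyond delta."""
    total = 0.0
    for a, p in zip(actual, pred):
        r = abs(a - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(actual)


def dft(xs):
    """Naive O(n^2) discrete Fourier transform; returns (real parts, imaginary parts)."""
    n = len(xs)
    out = [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(xs))
        for k in range(n)
    ]
    return [z.real for z in out], [z.imag for z in out]


# Integer helpers and simple means come straight from the standard library.
gcd_val = math.gcd(12, 18)
lcm_val = math.lcm(4, 6)  # requires Python 3.9+
h_mean = statistics.harmonic_mean([1.0, 2.0, 4.0])
g_mean = statistics.geometric_mean([1.0, 2.0, 4.0])

# Hypothetical actual/predicted values for the loss functions.
actual = [100.0, 200.0, 400.0]
pred = [110.0, 180.0, 400.0]
losses = {"l_inf": l_inf(actual, pred), "mape": mape(actual, pred), "huber": huber(actual, pred)}

# A constant signal puts all of its energy into the zero-frequency bin.
re_part, im_part = dft([1.0, 1.0, 1.0, 1.0])
```

Returning the DFT as two real-valued lists mirrors the README's note that the real and imaginary parts come back separately, which sidesteps the lack of a complex number type.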

String Extension

Existing Features

Levenshtein distance, Hamming distance, string Jaccard similarity
Simple tokenizer
Stemming (right now only the Snowball stemmer for English)
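
As a reference for what the string metrics above measure, here is a minimal pure-Python sketch. It is not the package's implementation or API. In particular, computing Jaccard similarity over sets of character bigrams and tokenizing with a `\w+` regex are assumptions of this sketch; the package may define both differently.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity of the two strings' sets of character n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    # Convention of this sketch: two strings with no n-grams have similarity 0.
    return len(grams_a & grams_b) / len(union) if union else 0.0


def simple_tokenize(text: str) -> list:
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


lev = levenshtein("kitten", "sitting")
ham = hamming("karolin", "kathrin")
jac = jaccard("night", "nacht")
tokens = simple_tokenize("Hello, world! Hello.")
```

Note that the first two are distances (0 means identical) while Jaccard is a similarity (1 means identical sets).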

Todo list

Longest common subsequence as a string distance metric
Vectorizers (Count + TF-IDF)?
Similarity versions of the distances, plus more variations and parameters.

Other Extensions ?

E.g. stats_ext, dist_ext (L^p distance for vectors; the scalar version is already implemented), etc.

Simple unsupervised clustering can also be done. It is simply a matter of willingness and market demand.

Disclaimer

The Rust Snowball stemmer is taken from Tsoding's Seroost project (MIT license).
