Set operations with a fixed set #10908

mmantyla · 2023-09-04T16:39:17Z

Problem description

Set operations available for lists can be handy for many NLP tasks. However, they assume that both operands are Series of the same length. I would like to use a single fixed list as the other operand in set operations.

Below we have NLP data.

>>> import polars as pl
>>> sentences = pl.DataFrame(
...     {
...         "sentence": [
...             "This is my first sentence.",
...             "This cannot be true but we need another sentence",
...             "What is this sentence business",
...             "These sentences could be tweets or even paragraphs of text",
...             "This is my final sentence",
...         ],
...     }
... )
>>> # Convert sentences to word lists
>>> sentences = sentences.with_columns(pl.col("sentence").str.split(by=" ").alias("words"))
>>> print(sentences)
shape: (5, 2)
┌───────────────────────────────────┬──────────────────────────────────┐
│ sentence                          ┆ words                            │
│ ---                               ┆ ---                              │
│ str                               ┆ list[str]                        │
╞═══════════════════════════════════╪══════════════════════════════════╡
│ This is my first sentence.        ┆ ["This", "is", … "sentence."]    │
│ This cannot be true but we need … ┆ ["This", "cannot", … "sentence"] │
│ What is this sentence business    ┆ ["What", "is", … "business"]     │
│ These sentences could be tweets … ┆ ["These", "sentences", … "text"] │
│ This is my final sentence         ┆ ["This", "is", … "sentence"]     │
└───────────────────────────────────┴──────────────────────────────────┘

Next, we have some other source of NLP data that gives us a word list we would like to compare our sentences with. This external word list could be a stop-word list or list of words with positive / negative sentiment or many other things as well. In the example I use it as I would use a stop word list.

>>> stop_words = ["This", "is", "stop", "word", "list"]
>>> stop_words = pl.Series(stop_words, dtype=pl.Utf8)
>>> stop_words = stop_words.reshape([1,len(stop_words)])
>>> print(stop_words)
shape: (1,)
Series: '' [list[str]]
[
        ["This", "is", … "list"]
]

If we want to remove stop words, we would do it as follows (assuming set operations support a fixed set).

>>> sentences.select(pl.col("words").list.set_difference(stop_words))
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `6`,
 right: `2`', /home/runner/work/polars/polars/crates/polars-ops/src/chunked_array/list/sets.rs:176:5

The above does not work because set operations expect both Series to have the same length. This can be worked around by adding a fixed column.

>>> sentences = sentences.with_columns(pl.lit(stop_words).alias("stop_words"))
>>> sentences = sentences.with_columns(pl.col("words").list.set_difference("stop_words").alias("words sw filtered"))
>>> print(sentences)
shape: (5, 4)
┌───────────────────────────────────┬──────────────────────────────────┬──────────────────────────┬───────────────────────────────────┐
│ sentence                          ┆ words                            ┆ stop_words               ┆ words sw filtered                 │
│ ---                               ┆ ---                              ┆ ---                      ┆ ---                               │
│ str                               ┆ list[str]                        ┆ list[str]                ┆ list[str]                         │
╞═══════════════════════════════════╪══════════════════════════════════╪══════════════════════════╪═══════════════════════════════════╡
│ This is my first sentence.        ┆ ["This", "is", … "sentence."]    ┆ ["This", "is", … "list"] ┆ ["sentence.", "first", "my"]      │
│ This cannot be true but we need … ┆ ["This", "cannot", … "sentence"] ┆ ["This", "is", … "list"] ┆ ["sentence", "cannot", … "anothe… │
│ What is this sentence business    ┆ ["What", "is", … "business"]     ┆ ["This", "is", … "list"] ┆ ["What", "business", … "sentence… │
│ These sentences could be tweets … ┆ ["These", "sentences", … "text"] ┆ ["This", "is", … "list"] ┆ ["These", "sentences", … "text"]  │
│ This is my final sentence         ┆ ["This", "is", … "sentence"]     ┆ ["This", "is", … "list"] ┆ ["sentence", "final", "my"]       │
└───────────────────────────────────┴──────────────────────────────────┴──────────────────────────┴───────────────────────────────────┘

However, adding a fixed column is not very elegant and, even worse, it is slow for large datasets.

The time taken by the fixed column addition (pl.lit(stop_words)) increases linearly or faster as the number of words grows.

10M rows and adding a fixed column of 10 words -> 4s
10M rows and adding a fixed column of 100 words -> 60s

The set operation itself (list.set_difference) scales better.

10M rows with a fixed column of 10 words -> 7s
10M rows with a fixed column of 100 words -> 20s

We have datasets of up to 200M rows, and the fixed word list can contain up to 1k words. Sentences are of normal length.
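
For reference, the per-row behaviour being requested can be sketched in plain Python. This is only an illustration of the intended semantics, not Polars code and not how Polars implements it; the helper name `remove_fixed_words` is hypothetical. The fixed word set is built once and reused for every row, which is the property the fixed-column workaround lacks.

```python
from typing import Iterable, List


def remove_fixed_words(rows: Iterable[List[str]], fixed_words: Iterable[str]) -> List[List[str]]:
    """Hypothetical helper: drop every word found in fixed_words from each row."""
    fixed = frozenset(fixed_words)  # built once, shared by all rows
    # dict.fromkeys deduplicates each row (set semantics) while keeping first-seen order.
    return [[w for w in dict.fromkeys(row) if w not in fixed] for row in rows]


rows = [
    "This is my first sentence.".split(" "),
    "This is my final sentence".split(" "),
]
stop_words = ["This", "is", "stop", "word", "list"]

result = remove_fixed_words(rows, stop_words)
print(result)  # [['my', 'first', 'sentence.'], ['my', 'final', 'sentence']]
```

The sketch keeps first-seen order within each row, whereas the Polars output above lists the remaining words in a different order. Only the membership of each resulting list should be compared.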

orlp · 2023-09-04T16:53:40Z

I can't reproduce:

>>> sentences.select(pl.col("words").list.set_difference(stop_words))
shape: (5, 1)
┌───────────────────────────────────┐
│ words                             │
│ ---                               │
│ list[str]                         │
╞═══════════════════════════════════╡
│ ["sentence.", "first", "my"]      │
│ ["sentence", "cannot", … "anothe… │
│ ["What", "business", … "sentence… │
│ ["These", "sentences", … "text"]  │
│ ["sentence", "final", "my"]       │
└───────────────────────────────────┘

Can you post your pl.show_versions()?

mmantyla · 2023-09-04T17:32:23Z

For me this happens in two different environments. The first is an HPC VM that I use for heavy computation:

>>> pl.show_versions()
--------Version info---------
Polars:              0.18.15
Index type:          UInt32
Platform:            Linux-5.4.0-150-generic-x86_64-with-glibc2.29
Python:              3.8.10 (default, May 26 2023, 14:05:08) 
[GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          3.5.3
numpy:               1.23.2
pandas:              1.4.2
pyarrow:             <not installed>
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

It also happens on my home PC:

>>> pl.show_versions()
--------Version info---------
Polars:              0.18.15
Index type:          UInt32
Platform:            Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
Python:              3.8.10 (default, Mar 13 2023, 10:26:41) 
[GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.3.0
matplotlib:          3.7.1
numpy:               1.24.2
pandas:              1.5.3
pyarrow:             <not installed>
pydantic:            1.10.6
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

cmdlineluser · 2023-09-04T18:21:19Z

Looks like it was fixed in #10668 which was part of the 0.19.0 release.

(also looks like #9764 can be closed)

orlp · 2023-09-04T18:45:58Z

Yeah, if you upgrade to the latest version this should just work.

mmantyla added the "enhancement" label (New feature or an improvement of an existing feature) on Sep 4, 2023

orlp closed this as completed Sep 4, 2023
