Set operations with a fixed set #10908

Closed
mmantyla opened this issue Sep 4, 2023 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@mmantyla

mmantyla commented Sep 4, 2023

Problem description

Set operations available for lists can be handy for many NLP tasks. However, they currently assume that both operands are list columns of the same length, with one list per row on each side. I would like to use a single fixed list as the second operand in set operations.

Below we have NLP data.

>>> import polars as pl
>>> sentences = pl.DataFrame(
...     {
...         "sentence": [
...             "This is my first sentence.",
...             "This cannot be true but we need another sentence",
...             "What is this sentence business",
...             "These sentences could be tweets or even paragraphs of text",
...             "This is my final sentence",
...         ],
...     }
... )
>>> # Convert each sentence to a list of words
>>> sentences = sentences.with_columns(pl.col("sentence").str.split(by=" ").alias("words"))
>>> print(sentences)
shape: (5, 2)
┌───────────────────────────────────┬──────────────────────────────────┐
│ sentence                          ┆ words                            │
│ ---                               ┆ ---                              │
│ str                               ┆ list[str]                        │
╞═══════════════════════════════════╪══════════════════════════════════╡
│ This is my first sentence.        ┆ ["This", "is", … "sentence."]    │
│ This cannot be true but we need … ┆ ["This", "cannot", … "sentence"] │
│ What is this sentence business    ┆ ["What", "is", … "business"]     │
│ These sentences could be tweets … ┆ ["These", "sentences", … "text"] │
│ This is my final sentence         ┆ ["This", "is", … "sentence"]     │
└───────────────────────────────────┴──────────────────────────────────┘

Next, we have some other source of NLP data that gives us a word list we would like to compare our sentences with. This external word list could be a stop-word list or list of words with positive / negative sentiment or many other things as well. In the example I use it as I would use a stop word list.

>>> stop_words = ["This", "is", "stop", "word", "list"]
>>> stop_words = pl.Series(stop_words, dtype=pl.Utf8)
>>> stop_words = stop_words.reshape([1, len(stop_words)])
>>> print(stop_words)
shape: (1,)
Series: '' [list[str]]
[
        ["This", "is", … "list"]
]

If we want to remove stop words, we would do it as follows (assuming set operations support a fixed set).

>>> sentences.select(pl.col("words").list.set_difference(stop_words))
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `6`,
 right: `2`', /home/runner/work/polars/polars/crates/polars-ops/src/chunked_array/list/sets.rs:176:5

The above panics because the set operations expect both series to have the same length. A workaround is to add the fixed list as a column, so that it is repeated on every row.

>>> sentences = sentences.with_columns(pl.lit(stop_words).alias("stop_words"))
>>> sentences = sentences.with_columns(pl.col("words").list.set_difference("stop_words").alias("words sw filtered"))
>>> print(sentences)
shape: (5, 4)
┌───────────────────────────────────┬──────────────────────────────────┬──────────────────────────┬───────────────────────────────────┐
│ sentence                          ┆ words                            ┆ stop_words               ┆ words sw filtered                 │
│ ---                               ┆ ---                              ┆ ---                      ┆ ---                               │
│ str                               ┆ list[str]                        ┆ list[str]                ┆ list[str]                         │
╞═══════════════════════════════════╪══════════════════════════════════╪══════════════════════════╪═══════════════════════════════════╡
│ This is my first sentence.        ┆ ["This", "is", … "sentence."]    ┆ ["This", "is", … "list"] ┆ ["sentence.", "first", "my"]      │
│ This cannot be true but we need … ┆ ["This", "cannot", … "sentence"] ┆ ["This", "is", … "list"] ┆ ["sentence", "cannot", … "anothe… │
│ What is this sentence business    ┆ ["What", "is", … "business"]     ┆ ["This", "is", … "list"] ┆ ["What", "business", … "sentence… │
│ These sentences could be tweets … ┆ ["These", "sentences", … "text"] ┆ ["This", "is", … "list"] ┆ ["These", "sentences", … "text"]  │
│ This is my final sentence         ┆ ["This", "is", … "sentence"]     ┆ ["This", "is", … "list"] ┆ ["sentence", "final", "my"]       │
└───────────────────────────────────┴──────────────────────────────────┴──────────────────────────┴───────────────────────────────────┘
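
For reference, the result requested here can be sketched in plain Python. This only illustrates the intended semantics and is not how Polars implements it: the helper name `remove_fixed` is hypothetical, and unlike `list.set_difference` it keeps the original word order and does not deduplicate. The point is that the fixed set is built once and shared by every row rather than copied into a column.

```python
from typing import Iterable, List


def remove_fixed(rows: Iterable[List[str]], fixed: Iterable[str]) -> List[List[str]]:
    """Drop every word found in `fixed` from each row (hypothetical helper)."""
    fixed_set = frozenset(fixed)  # built once, reused for all rows
    return [[word for word in row if word not in fixed_set] for row in rows]


rows = [
    "This is my first sentence.".split(" "),
    "This is my final sentence".split(" "),
]
stop_words = ["This", "is", "stop", "word", "list"]

filtered = remove_fixed(rows, stop_words)
print(filtered)
```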

However, adding a fixed column is not very elegant and, even worse, it is slow for large datasets.

The time taken by the fixed column addition (pl.lit(stop_words)) grows linearly or faster with the number of words.

  • 10M rows and adding a fixed column of 10 words -> 4s
  • 10M rows and adding a fixed column of 100 words -> 60s

The set operation itself (list.set_difference) scales better.

  • 10M rows with a fixed column of 10 words -> 7s
  • 10M rows with a fixed column of 100 words -> 20s

We have datasets of up to 200M rows, and the fixed word list can contain up to 1k words. The sentences are of ordinary length.
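
The timings above can be turned into a rough scaling estimate. The sketch below uses only the four measurements quoted in this post and fits a power law through each pair of points. With two data points per operation this is a crude indication, not a benchmark, and the helper name `scaling_exponent` is made up for this illustration. An exponent above 1 means the cost grows faster than linearly in the number of words.

```python
import math


def scaling_exponent(t_small: float, t_large: float, n_small: int, n_large: int) -> float:
    """Exponent k in t ~ n**k, estimated from two (n, t) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# Timings in seconds for 10M rows, taken from the lists above.
lit_exp = scaling_exponent(4, 60, 10, 100)    # adding the fixed column
diff_exp = scaling_exponent(7, 20, 10, 100)   # list.set_difference

print(f"pl.lit column:       {60 / 4:.1f}x slower, exponent ~{lit_exp:.2f}")
print(f"list.set_difference: {20 / 7:.1f}x slower, exponent ~{diff_exp:.2f}")
```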

mmantyla added the enhancement label Sep 4, 2023
@orlp
Collaborator

orlp commented Sep 4, 2023

I can't reproduce:

>>> sentences.select(pl.col("words").list.set_difference(stop_words))
shape: (5, 1)
┌───────────────────────────────────┐
│ words                             │
│ ---                               │
│ list[str]                         │
╞═══════════════════════════════════╡
│ ["sentence.", "first", "my"]      │
│ ["sentence", "cannot", … "anothe… │
│ ["What", "business", … "sentence… │
│ ["These", "sentences", … "text"]  │
│ ["sentence", "final", "my"]       │
└───────────────────────────────────┘

Can you post your pl.show_versions()?

@mmantyla
Author

mmantyla commented Sep 4, 2023

For me this happens in two different environments. First, the HPC VM that I use for heavy computation:

>>> pl.show_versions()
--------Version info---------
Polars:              0.18.15
Index type:          UInt32
Platform:            Linux-5.4.0-150-generic-x86_64-with-glibc2.29
Python:              3.8.10 (default, May 26 2023, 14:05:08) 
[GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          3.5.3
numpy:               1.23.2
pandas:              1.4.2
pyarrow:             <not installed>
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

It also happens on my home PC:

>>> pl.show_versions()
--------Version info---------
Polars:              0.18.15
Index type:          UInt32
Platform:            Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
Python:              3.8.10 (default, Mar 13 2023, 10:26:41) 
[GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.3.0
matplotlib:          3.7.1
numpy:               1.24.2
pandas:              1.5.3
pyarrow:             <not installed>
pydantic:            1.10.6
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

@cmdlineluser
Contributor

cmdlineluser commented Sep 4, 2023

Looks like it was fixed in #10668 which was part of the 0.19.0 release.

(also looks like #9764 can be closed)

@orlp
Collaborator

orlp commented Sep 4, 2023

Yeah, if you upgrade to the latest version this should just work.
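
To check whether an environment already includes the fix, one option is to compare its version string with 0.19.0, the release named in the comment above. The sketch below is an illustration only: `parse_version` and `has_fixed_set_ops` are hypothetical helpers, and the 0.19.0 threshold is taken from this thread rather than verified independently.

```python
import re
from typing import Tuple

# Release reported in this thread as containing the fix (an assumption here).
FIXED_IN = (0, 19, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '0.18.15' -> (0, 18, 15)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_fixed_set_ops(version: str) -> bool:
    """True if `version` is at or after the release assumed to carry the fix."""
    return parse_version(version) >= FIXED_IN


print(has_fixed_set_ops("0.18.15"))  # version from the report above
print(has_fixed_set_ops("0.19.0"))
```

In a real session the argument would be `pl.__version__` instead of a hard-coded string.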

orlp closed this as completed Sep 4, 2023