Set operations with a fixed set #10908
Labels
enhancement
New feature or an improvement of an existing feature
Comments
mmantyla added the enhancement label on Sep 4, 2023
I can't reproduce:
>>> sentences.select(pl.col("words").list.set_difference(stop_words))
shape: (5, 1)
┌───────────────────────────────────┐
│ words │
│ --- │
│ list[str] │
╞═══════════════════════════════════╡
│ ["sentence.", "first", "my"] │
│ ["sentence", "cannot", … "anothe… │
│ ["What", "business", … "sentence… │
│ ["These", "sentences", … "text"] │
│ ["sentence", "final", "my"] │
└───────────────────────────────────┘
Can you post your
For me, this happens in two different environments. On the HPC VM that I use for heavy computation:
>>> pl.show_versions()
--------Version info---------
Polars: 0.18.15
Index type: UInt32
Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.29
Python: 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0]
----Optional dependencies----
adbc_driver_sqlite: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
matplotlib: 3.5.3
numpy: 1.23.2
pandas: 1.4.2
pyarrow: <not installed>
pydantic: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
It also happens on my home PC:
>>> pl.show_versions()
--------Version info---------
Polars: 0.18.15
Index type: UInt32
Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
Python: 3.8.10 (default, Mar 13 2023, 10:26:41)
[GCC 9.4.0]
----Optional dependencies----
adbc_driver_sqlite: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2023.3.0
matplotlib: 3.7.1
numpy: 1.24.2
pandas: 1.5.3
pyarrow: <not installed>
pydantic: 1.10.6
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
Yeah, if you upgrade to the latest version this should just work.
Problem description
The set operations available for lists can be handy for many NLP tasks. However, they assume that both operands are list columns of the same length. I would like to use a single fixed list as one of the operands.
Below we have NLP data.
Next, we have some other source of NLP data that gives us a word list we would like to compare our sentences with. This external word list could be a stop-word list, a list of words with positive or negative sentiment, or many other things. In the example I use it as a stop-word list.
If we want to remove stop words, we would do it as follows (assuming set operations support a fixed set).
The above does not work because set operations expect both series to have the same length. This can be worked around by adding the fixed list as a column.
However, adding a fixed column is not very elegant and, even worse, it is slow for large datasets.
The time taken by the fixed-column addition (pl.lit(stop_words)) increases linearly or faster as the number of words grows.
The set operation itself (list.set_difference) scales better.
We have datasets of up to 200M rows, and the fixed word list can contain up to 1k words. Sentences are of normal length.