Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python, rust!): Add disable_string_cache #11020

Merged
merged 19 commits into from
Sep 25, 2023
Merged

feat(python, rust!): Add disable_string_cache #11020

merged 19 commits into from
Sep 25, 2023

Conversation

stinodego
Copy link
Member

@stinodego stinodego commented Sep 9, 2023

Closes #10425

Changes

  • Deprecate the boolean argument for enable_string_cache. The function now always enables the string cache.
  • Add a new function disable_string_cache. This disables the string cache.
  • Fix the implementation of the StringCache context manager.
  • On the Rust side, rename IUseStringCache to StringCacheHolder. IUse... seemed like a weird name to me - but maybe I am missing some convention here.
  • Some internal refactoring.

API is now consistent across the Rust and Python side - with the exception that Python offers a context manager instead of a RAII object.

@stinodego stinodego added the deprecation Add a deprecation warning to outdated functionality label Sep 9, 2023
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Sep 9, 2023
@stinodego stinodego force-pushed the string-cache branch 3 times, most recently from f744184 to a7ad283 Compare September 9, 2023 20:07
@stinodego stinodego marked this pull request as ready for review September 9, 2023 20:16
@orlp
Copy link
Collaborator

orlp commented Sep 11, 2023

Why two functions rather than one use_string_cache(True | False?

@stinodego
Copy link
Member Author

Why two functions rather than one use_string_cache(True | False?

Some discussion on that in #8631 and in the linked issue.

I'm open to alternatives. But use_string_cache is confusable with using_string_cache. I like that the pair of functions enable/disable is super clear.

@orlp
Copy link
Collaborator

orlp commented Sep 11, 2023

Alright, I don't have strong opinions about it.

@ritchie46
Copy link
Member

reset_ is being renamed to disable_. Maybe this change is to make the enable/disable symmetry more clear. However, this makes me realize that enable_ here is a misnomer, as it increments the string cache counter.

I'd rather call it something that reflects that action. increment_string_cache_refcount is what we do.

@stinodego
Copy link
Member Author

Basically, a few things can happen:

  • Increment string cache refcount by 1
  • Decrease string cache refcount by 1
  • Set string cache refcount to 0
  • Actually clear the string cache

The refcount is an implementation detail that we shouldn't show to users. So we have three functions:

  • enable_string_cache - Turns on the cache. On the background, increases refcount by 1. But might as well increase it by 10, it doesn't really matter.
  • disable_string_cache - Turns off the cache. Both sets refcount to 0 and clears the cache.
  • Some context manager that enables the cache while its active. On the background, increases refcount by 1 on initialization and decreases by 1 when it's dropped.

This is what I implemented. I don't see a more user-friendly way to do this.

@ritchie46
Copy link
Member

I think that we do need to document and make clear the implementation details as you can get race conditions. If a contextmanager (thread A) incremented the cache and another thread calls disable_ (thread B), the program isn't correct anymore as the string cache was cleared under thread A's nose.

@orlp
Copy link
Collaborator

orlp commented Sep 11, 2023

@ritchie46 That's a good point. I think what we should do to solve that is the following (in pseudocode):

# Initial condition: string cache enabled
global_string_cache_enabled = True
global_refcount = 1

def enable_string_cache():
    if not global_string_cache_enabled:
        global_string_cache_enabled = True
        _start_using_cache()
        
def disable_string_cache():
    if global_string_cache_enabled:
        _stop_using_cache()
        global_string_cache_enabled = False

def _start_using_cache():
    global_refcount += 1

def _stop_using_cache():
    global_refcount -= 1
    if global_refcount == 0:
         _clear_cache() 

Not shown here is that the context managers also use _start_using_cache/_stop_using_cache. This way if anyone is still using the string cache it doesn't get deleted under their nose.

@stinodego
Copy link
Member Author

stinodego commented Sep 12, 2023

The way I see it, the global string cache isn't thread-safe anyway.

If one thread enables the global string cache, it will disrupt any other threads that are trying to construct local categoricals.

So I don't really get the discussion about thread safety, because it's not designed to be thread-safe (it's global...). But maybe I am misunderstanding things here.

We can add a warning somewhere in the docs to document this behavior, of course.

@ritchie46
Copy link
Member

ritchie46 commented Sep 12, 2023

So I don't really get the discussion about thread safety, because it's not designed to be thread-safe (it's global...). But maybe I am misunderstanding things here.

Sure, but enabling the string cache doesn't break a program, disabling does. We use the atomic reference counting to ensure a thread that holds the string cache will have the string cache for that lifetime. I think we must guarantee that, unless you call a very low level function like disable_/reset_, but I think the name of that function should stress that caveat.

We do atomic reference counting to ensure it is thread safe with regard to holding the string cache.

enable_string_cache - Turns on the cache. On the background, increases refcount by 1. But might as well increase it by 10, it doesn't really matter.

It does matter, as then the string cache will never get cleared. With the context manager and on the rust side the StringCacheHolder we ensure the string cache is cleared when everybody dropped their context.

@stinodego
Copy link
Member Author

@orlp Not sure how that extra boolean state would help matters here? Maybe we can discuss on Thursday.

Discussed the design with Ritchie at the office and I updated the PR with a warning that disable_string_cache is not thread safe. I think this can be merged and we can possibly improve thread-safety later with additional tech.

@stinodego stinodego marked this pull request as draft September 14, 2023 09:02
@stinodego stinodego force-pushed the string-cache branch 5 times, most recently from d3a3e22 to 24a08b3 Compare September 25, 2023 12:54
@stinodego stinodego changed the title feat(python): Add disable_string_cache feat(python, rust!): Add disable_string_cache Sep 25, 2023
@github-actions github-actions bot added the breaking rust Change that breaks backwards compatibility for the Rust crate label Sep 25, 2023
@github-actions github-actions bot added the rust Related to Rust Polars label Sep 25, 2023
@stinodego stinodego marked this pull request as ready for review September 25, 2023 14:11
@stinodego
Copy link
Member Author

This PR is now done. @orlp mind reviewing to check if I implemented the pseudocode correctly?

@stinodego stinodego merged commit 843ea92 into main Sep 25, 2023
22 of 25 checks passed
@stinodego stinodego deleted the string-cache branch September 25, 2023 18:02
romanovacca pushed a commit to romanovacca/polars that referenced this pull request Oct 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking rust Change that breaks backwards compatibility for the Rust crate deprecation Add a deprecation warning to outdated functionality enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

py-polars.enable_string_cache() does not simply turn on/off
4 participants