
Refactor/better caching strategy #34

Closed

Conversation

@juanjoDiaz
Collaborator

Suggestion to close #33

Please take a look at the idea @adbar.
I know that it's a big change, but I think it makes the whole library easier to manage and to evolve.

I think that there is still room for improvement, but I wanted to get some feedback first.
I'm happy to finalize this and update the docs, including a migration section for people jumping from the old version to the new one 🙂

@juanjoDiaz juanjoDiaz marked this pull request as draft January 18, 2023 16:39
@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch 2 times, most recently from ee255a9 to 5a2a29e on January 18, 2023 17:05
@adbar
Owner

adbar commented Jan 18, 2023

Hi @juanjoDiaz, thanks for the PR. It looks interesting even if I'll need some time to review it. Breaking up the main script into several parts is a very good idea BTW.

Could you please review code formatting (with black) so that we can run the tests?

@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch from 5a2a29e to 147459b on January 18, 2023 20:13
@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch from 147459b to ff6e82b on January 18, 2023 20:49
@juanjoDiaz
Collaborator Author

All good now. The formatting got broken during the rebase.
It's now rebased, and the formatting and typings are valid.

@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch 2 times, most recently from 0509ed1 to 806c334 on January 19, 2023 12:25
@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch from 806c334 to 5b475f8 on January 19, 2023 12:28
@juanjoDiaz
Collaborator Author

Alright.

This has now become a full revamp.

I've also redone the docs. I think that if you start from there, it will be easier for you to understand the proposal.

@osma
Contributor

osma commented Jan 19, 2023

Commenting as a fellow user of simplemma:

I think these are great improvements in roughly the right direction. Memory usage is a concern for us as well, as we are currently adding language detection based on Simplemma to the Annif REST API. The current behaviour, where Simplemma forgets already loaded models when you request a different set of languages (as explained in #33), is surprising and less than ideal from our perspective.

The idea of using an LRU cache for models seems good, but I think you've misunderstood how the maxsize parameter works when wrapping the _load_dictionary_from_disk function. It's not a limit on total memory usage in bytes, but on the number of cached entries, in this case language models. So a value of 1048576 (the current default) would allow keeping more than a million language models in memory - probably not what you intended! A better default would be something like 3 or 5.
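To illustrate (a minimal sketch; the decorated function below is a stand-in, not the actual code in this PR):

```python
from functools import lru_cache

# maxsize counts cached entries (here: loaded language models), not bytes:
# with maxsize=5, at most five dictionaries stay in memory, and the least
# recently used one is evicted when a sixth is loaded.
@lru_cache(maxsize=5)
def _load_dictionary_from_disk(langcode: str) -> dict:
    # stand-in body; the real function reads the pickled data from disk
    return {"lang": langcode}
```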

Regarding API changes: I didn't look too closely, but it seems to me that this change will increase the amount of boilerplate needed to use simplemma (e.g. having to create class instances) even in simple cases such as Jupyter notebooks for data mangling. Would it be possible to keep at least the core parts of the old API in place (especially the lemmatize function, but perhaps also simple_tokenizer and the language detection functions), perhaps as convenience functions with sane defaults? My view on this is that in an ideal API, simple things should be really simple to do and more complex things (e.g. specifying the number of language models to keep in memory) should be possible.

Sidenote: This PR has grown quite large. I hope it's not a problem for @adbar who has to review it all (I can help, I already did some reviewing above!). If this were my own project, I would prefer smaller, incremental changes.

@codecov-commenter
Copy link

codecov-commenter commented Jan 19, 2023

Codecov Report

Merging #34 (62bd96b) into main (b65a390) will increase coverage by 0.02%.
The diff coverage is 97.36%.

```diff
@@            Coverage Diff             @@
##             main      #34      +/-   ##
==========================================
+ Coverage   95.90%   95.92%   +0.02%
==========================================
  Files           5        9       +4
  Lines         488      515      +27
==========================================
+ Hits          468      494      +26
- Misses         20       21       +1
```

| Impacted Files | Coverage Δ |
| --- | --- |
| `simplemma/dictionary_pickler.py` | 92.10% <92.10%> (ø) |
| `simplemma/utils.py` | 95.65% <95.65%> (ø) |
| `simplemma/dictionary_factory.py` | 97.43% <97.43%> (ø) |
| `simplemma/lemmatizer.py` | 98.84% <98.84%> (ø) |
| `simplemma/__init__.py` | 100.00% <100.00%> (ø) |
| `simplemma/constants.py` | 100.00% <100.00%> (ø) |
| `simplemma/language_detector.py` | 100.00% <100.00%> (ø) |
| `simplemma/tokenizer.py` | 100.00% <100.00%> (ø) |


@juanjoDiaz
Collaborator Author

Hi @osma,

Thanks for the feedback.

> The idea of using an LRU cache for models seems good, but I think you've misunderstood how the maxsize parameter works when wrapping the _load_dictionary_from_disk function. It's not a limit on total memory usage in bytes, but on the number of cached entries, in this case language models. So a value of 1048576 (the current default) would allow keeping more than a million language models in memory - probably not what you intended! A better default would be something like 3 or 5.

You are absolutely right.
I've amended the LRU cache for the dictionaries to default to 5 and fixed the docs.

> Regarding API changes: I didn't look too closely, but it seems to me that this change will increase the amount of boilerplate needed to use simplemma (e.g. having to create class instances) even in simple cases such as Jupyter notebooks for data mangling. Would it be possible to keep at least the core parts of the old API in place (especially the lemmatize function, but perhaps also simple_tokenizer and the language detection functions), perhaps as convenience functions with sane defaults? My view on this is that in an ideal API, simple things should be really simple to do and more complex things (e.g. specifying the number of language models to keep in memory) should be possible.

I would argue that Lemmatizer().lemmatize(text) is not much more complex than lemmatize(text) 😅

If we really wanted to, this could be done easily:

```python
# Option 1: global lemmatizer.
# Caches are shared across calls,
# which makes it impossible to clear them.
_lemmatizer = Lemmatizer()
lemmatize = _lemmatizer.lemmatize

# Option 2: local lemmatizer.
# The cache only lives for the duration of the call,
# which reduces its effectiveness across calls.
def lemmatize(text):
    lemmatizer = Lemmatizer()
    return lemmatizer.lemmatize(text)
```

However, I see little value.
In one of my libraries, json2csv, I did that, and after a few versions I ended up dropping the convenience methods because of lack of usage and the additional maintenance burden.

> Sidenote: This PR has grown quite large. I hope it's not a problem for @adbar who has to review it all (I can help, I already did some reviewing above!). If this were my own project, I would prefer smaller, incremental changes.

I know that the PR has grown quite large.
It might be easier to review it commit by commit.
That said, breaking the monolith into modules and converting to classes is hard to do in a small PR 😅

@adbar
Owner

adbar commented Jan 19, 2023

Hi @juanjoDiaz, thank you for investing more time on this and for adding all the necessary tests!

I agree with @osma that the PR has gotten quite large and that it makes the code incompatible with previous versions. I also agree with the remark on the cache.

The PR should at least keep the existing functionality intact; I'm afraid we'll lose users if already-known functions stop working. I wouldn't transform lemmatize into lemmatizer.lemmatize_token without keeping a working copy of the existing lemmatize(), for example.
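For example, the old entry point could stay as a thin wrapper over the new class (a sketch only; the import path and signatures are assumptions, with lemmatize_token taken from the sentence above):

```python
from simplemma.lemmatizer import Lemmatizer  # path assumed from the PR layout

# module-level instance backing the legacy function
_lemmatizer = Lemmatizer()

def lemmatize(token, lang, greedy=False):
    # existing call sites keep working unchanged
    return _lemmatizer.lemmatize_token(token, lang=lang, greedy=greedy)
```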

I'm definitely willing to integrate part or all of your PR into the repository so it's worth taking some time to assess it. Here is what I understand, from the straightforward changes to the more complex ones:

  1. You split the existing information into more files (e.g. constants.py, utils.py)
  2. You created a Tokenizer class
  3. You revamped the way language data is generated (DictionaryFactory along with validators etc.)
  4. You created classes for lemmatization and language detection (the latter should read LanguageDetector, btw)
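Concretely, the module split visible in the Codecov report above suggests usage along these lines (module paths taken from that report; treating each module as exposing a class of the same name is my assumption, not verified against the PR):

```python
from simplemma.tokenizer import Tokenizer                    # point 2
from simplemma.dictionary_factory import DictionaryFactory   # point 3
from simplemma.lemmatizer import Lemmatizer                  # point 4
from simplemma.language_detector import LanguageDetector     # point 4
```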

Please correct me if I'm wrong: the goal you set out with was (4), in order to address the problem described in #33. In the end, the PR is much more complex than expected.

If I understand correctly, each point above deserves its own pull request:

  • If it's OK for you and if the distinction above makes sense, could you leave this PR open and open a corresponding PR for each step?
  • We would also need to find a way to keep the additions without breaking the existing code; I'm open to suggestions.

What do you think?

@juanjoDiaz
Collaborator Author

Agreed.

Let's do this incrementally and keep the existing methods for backward compatibility at all times.

These are the things that I have already implemented. I can create a new PR every time you merge the previous one:

  • Split into modules & add the dictionary cache with minimal code changes (Refactor/separate logic into modules #38)
  • Split the tests to match the classes (1 test file == 1 class file)
  • Improve the DictionaryCache class to be a DictionaryFactory (more efficient and cleaner)
  • Rename instance methods to better names (no changes to the legacy methods)
  • Pass settings as objects instead of individual arguments, so adding new options doesn't break things (see the sketch after this list)
  • Add the possibility to pass a Tokenizer to the Lemmatizer, so custom tokenizers can be used (also sketched below)
  • Add the possibility to preload some dictionaries and keep them loaded (apart from the LRU cache)
  • Update the Readme
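To make the settings-object and custom-tokenizer items concrete, here's a rough sketch (all names are illustrative, not the final API):

```python
from dataclasses import dataclass

# illustrative settings object: new options become new fields,
# so adding one later doesn't break existing call sites
@dataclass
class LemmatizerSettings:
    greedy: bool = False
    max_cached_dictionaries: int = 5

class Lemmatizer:
    # a custom tokenizer (any object with the expected interface)
    # can be injected; None falls back to the built-in one
    def __init__(self, settings=None, tokenizer=None):
        self.settings = settings or LemmatizerSettings()
        self.tokenizer = tokenizer
```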

I have other improvement ideas around clean code and performance that we can discuss once we get all this out of the way 🙂

@adbar
Owner

adbar commented Jan 20, 2023

Hi @juanjoDiaz, thanks, I think it's much better if we split the changes into chunks. Let's go step by step and discuss the changes along the way. As long as the functions already in use stay available it should be fine to add new classes.

For the record, I would have preferred to see at most one PR for each item above, but your new PR #38 also makes sense by grouping items (1) and (3), correct?

We can leave this PR open for now and work on the other ones first.

@juanjoDiaz
Collaborator Author

PR #38 only covers point 1.
It breaks simplemma into multiple files, adds the LRU cache to the dictionary loading, and converts the methods into classes so they can be configured.

The legacy.py module keeps the old module-level functions, so backwards compatibility is guaranteed 🙂

@juanjoDiaz
Collaborator Author

Btw, my idea is not to merge the PR and release immediately.
I think it makes sense to do all the work first and then release everything together, so we do just one "major" release and minimize changes between releases.

@adbar
Owner

adbar commented Jan 20, 2023

I agree, that's why I just consolidated the changes I made in the last weeks.

I now plan to work with you on the code and release a version 1 at some point, also to make clear by semantic versioning that substantial changes have been made.

@adbar
Owner

adbar commented Jan 31, 2023

@juanjoDiaz I think the approach of coupling issues to pull requests works well. Feel free to elaborate on existing issues or to add new ones if you have other changes in mind.

@juanjoDiaz juanjoDiaz closed this Feb 5, 2023
@juanjoDiaz juanjoDiaz deleted the refactor/better_caching_strategy branch May 23, 2023 18:38