
Refactor/better caching strategy #34

Closed

Conversation

@juanjoDiaz
Collaborator

Suggestion to close #33

Please take a look at the idea @adbar.
I know that it's a big change, but I think it makes the whole library easier to manage and to evolve.

I think that there is still room for improvement, but I wanted to get some feedback first.
I'm happy to finalize this and update the docs, including a migration section for people jumping from the old version to the new one 🙂

@juanjoDiaz juanjoDiaz marked this pull request as draft January 18, 2023 16:39
@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch 2 times, most recently from ee255a9 to 5a2a29e on January 18, 2023 17:05
@adbar
Owner

adbar commented Jan 18, 2023

Hi @juanjoDiaz, thanks for the PR. It looks interesting even if I'll need some time to review it. Breaking up the main script into several parts is a very good idea BTW.

Could you please review code formatting (with black) so that we can run the tests?

@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch from 5a2a29e to 147459b on January 18, 2023 20:13
@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch from 147459b to ff6e82b on January 18, 2023 20:49
@juanjoDiaz
Collaborator Author

All good now. The formatting got broken during the rebase.
It's now rebased, and the formatting and typings are valid.

@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch 2 times, most recently from 0509ed1 to 806c334 on January 19, 2023 12:25
@juanjoDiaz juanjoDiaz force-pushed the refactor/better_caching_strategy branch from 806c334 to 5b475f8 on January 19, 2023 12:28
@juanjoDiaz
Collaborator Author

Alright.

This has now become a full revamp.

I've also redone the docs. I think that if you start from there, it will be easier for you to understand the proposal.

@osma
Contributor

osma commented Jan 19, 2023

Commenting as a fellow user of simplemma:

I think these are great improvements in roughly the right direction. Memory usage is a concern for us as well, as we are currently adding language detection based on Simplemma to the Annif REST API. The current behaviour, where Simplemma forgets already loaded models when you request a different set of languages (as explained in #33), is surprising and less than ideal from our perspective.

The idea of using an LRU cache for models seems good, but I think you've misunderstood how the maxsize parameter works when wrapping the _load_dictionary_from_disk function. It's not a limit on total memory usage in bytes, but on the number of cached entries, in this case language models. So a value of 1048576 (the current default) would allow keeping more than a million language models in memory - probably not what you intended! A better default would be something like 3 or 5.
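To illustrate (a minimal sketch; the decorated function below is a stand-in, not the actual code in this PR):

```python
from functools import lru_cache

# maxsize counts cached entries (here: loaded language models), not bytes:
# with maxsize=5, at most five dictionaries stay in memory, and the least
# recently used one is evicted when a sixth is loaded.
@lru_cache(maxsize=5)
def _load_dictionary_from_disk(langcode: str) -> dict:
    # stand-in body; the real function reads the pickled data from disk
    return {"lang": langcode}
```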

Regarding API changes: I didn't look too closely, but it seems to me that this change will increase the amount of boilerplate needed to use simplemma (e.g. having to create class instances) even in simple cases such as Jupyter notebooks for data mangling. Would it be possible to keep at least the core parts of the old API in place (especially the lemmatize function, but perhaps also simple_tokenizer and the language detection functions), perhaps as convenience functions with sane defaults? My view on this is that in an ideal API, simple things should be really simple to do and more complex things (e.g. specifying the number of language models to keep in memory) should be possible.

Sidenote: This PR has grown quite large. I hope it's not a problem for @adbar who has to review it all (I can help, I already did some reviewing above!). If this were my own project, I would prefer smaller, incremental changes.

@codecov-commenter
Copy link

codecov-commenter commented Jan 19, 2023

Codecov Report

Merging #34 (62bd96b) into main (b65a390) will increase coverage by 0.02%.
The diff coverage is 97.36%.

```diff
@@            Coverage Diff             @@
##             main      #34      +/-   ##
==========================================
+ Coverage   95.90%   95.92%   +0.02%
==========================================
  Files           5        9       +4
  Lines         488      515      +27
==========================================
+ Hits          468      494      +26
- Misses         20       21       +1
```

| Impacted Files | Coverage Δ |
| --- | --- |
| `simplemma/dictionary_pickler.py` | 92.10% <92.10%> (ø) |
| `simplemma/utils.py` | 95.65% <95.65%> (ø) |
| `simplemma/dictionary_factory.py` | 97.43% <97.43%> (ø) |
| `simplemma/lemmatizer.py` | 98.84% <98.84%> (ø) |
| `simplemma/__init__.py` | 100.00% <100.00%> (ø) |
| `simplemma/constants.py` | 100.00% <100.00%> (ø) |
| `simplemma/language_detector.py` | 100.00% <100.00%> (ø) |
| `simplemma/tokenizer.py` | 100.00% <100.00%> (ø) |


@juanjoDiaz
Collaborator Author

Hi @osma,

Thanks for the feedback.

> The idea of using an LRU cache for models seems good, but I think you've misunderstood how the maxsize parameter works when wrapping the _load_dictionary_from_disk function. It's not a limit on total memory usage in bytes, but on the number of cached entries, in this case language models. So a value of 1048576 (the current default) would allow keeping more than a million language models in memory - probably not what you intended! A better default would be something like 3 or 5.

You are absolutely right.
I've amended the LRU cache for the dictionaries to default to 5 and fixed the docs.

> Regarding API changes: I didn't look too closely, but it seems to me that this change will increase the amount of boilerplate needed to use simplemma (e.g. having to create class instances) even in simple cases such as Jupyter notebooks for data mangling. Would it be possible to keep at least the core parts of the old API in place (especially the lemmatize function, but perhaps also simple_tokenizer and the language detection functions), perhaps as convenience functions with sane defaults? My view on this is that in an ideal API, simple things should be really simple to do and more complex things (e.g. specifying the number of language models to keep in memory) should be possible.

I would argue that Lemmatizer().lemmatize(text) is not much more complex than lemmatize(text) 😅

If we really wanted to, this could be done easily:

```python
# Option 1: global lemmatizer.
# Caches are shared across calls,
# which makes it impossible to clear them.
_lemmatizer = Lemmatizer()
lemmatize = _lemmatizer.lemmatize

# Option 2: local lemmatizer.
# The cache only lives for the duration of the call,
# which reduces its effectiveness across calls.
def lemmatize(text):
    lemmatizer = Lemmatizer()
    return lemmatizer.lemmatize(text)
```

However, I see little value.
In one of my libraries, json2csv, I did that, and after a few versions I ended up dropping the convenience methods because of lack of usage and the additional maintenance burden.

> Sidenote: This PR has grown quite large. I hope it's not a problem for @adbar who has to review it all (I can help, I already did some reviewing above!). If this were my own project, I would prefer smaller, incremental changes.

I know that the PR has grown quite large.
It might be easier to review it commit by commit.
That said, breaking the monolith into modules and converting to classes is hard to do in a small PR 😅

@adbar
Owner

adbar commented Jan 19, 2023

Hi @juanjoDiaz, thank you for investing more time on this and for adding all the necessary tests!

I agree with @osma that the PR has gotten quite large and that it makes the code incompatible with previous versions. I also agree with the remark on the cache.

The PR should at least keep the existing functionality intact; I'm afraid we'll lose users if already-known functions stop working. I wouldn't transform lemmatize into lemmatizer.lemmatize_token without keeping a working copy of the existing lemmatize(), for example.
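For example, the old entry point could stay as a thin wrapper over the new class (a sketch only; the import path and signatures are assumptions, with lemmatize_token taken from the sentence above):

```python
from simplemma.lemmatizer import Lemmatizer  # path assumed from the PR layout

# module-level instance backing the legacy function
_lemmatizer = Lemmatizer()

def lemmatize(token, lang, greedy=False):
    # existing call sites keep working unchanged
    return _lemmatizer.lemmatize_token(token, lang=lang, greedy=greedy)
```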

I'm definitely willing to integrate part or all of your PR into the repository so it's worth taking some time to assess it. Here is what I understand, from the straightforward changes to the more complex ones:

  1. You split the existing information into more files (e.g. constants.py, utils.py)
  2. You created a Tokenizer class
  3. You revamped the way language data is generated (DictionaryFactory along with validators etc.)
  4. You created classes for lemmatization and language detection (the latter should read LanguageDetector, btw)
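Concretely, the module split visible in the Codecov report above suggests usage along these lines (module paths taken from that report; treating each module as exposing a class of the same name is my assumption, not verified against the PR):

```python
from simplemma.tokenizer import Tokenizer                    # point 2
from simplemma.dictionary_factory import DictionaryFactory   # point 3
from simplemma.lemmatizer import Lemmatizer                  # point 4
from simplemma.language_detector import LanguageDetector     # point 4
```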

Please correct me if I'm wrong: the goal you set out with was (4), in order to address the problem described in #33. In the end, the PR is much more complex than expected.

If I understand correctly, each point above deserves its own pull request:

  • If it's OK for you and if the distinction above makes sense, could you leave this PR open and open a corresponding PR for each step?
  • We would also need to find a way to keep the additions without breaking the existing code; I'm open to suggestions.

What do you think?

@juanjoDiaz
Collaborator Author

Agreed.

Let's do this incrementally and keep the existing methods for backward compatibility at all times.

These are the things that I have already implemented. I can create a new PR every time you merge the previous one:

  • Split into modules & add the dictionary cache with minimal code changes (Refactor/separate logic into modules #38)
  • Split the tests to match the classes (1 test file == 1 class file)
  • Improve the DictionaryCache class to be a DictionaryFactory (more efficient and cleaner)
  • Rename instance methods to better names (no changes to the legacy methods)
  • Pass settings as objects instead of individual arguments, so adding new options doesn't break things (see the sketch after this list)
  • Add the possibility to pass a Tokenizer to the Lemmatizer, so custom tokenizers can be used (also sketched below)
  • Add the possibility to preload some dictionaries and keep them loaded (apart from the LRU cache)
  • Update the Readme
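To make the settings-object and custom-tokenizer items concrete, here's a rough sketch (all names are illustrative, not the final API):

```python
from dataclasses import dataclass

# illustrative settings object: new options become new fields,
# so adding one later doesn't break existing call sites
@dataclass
class LemmatizerSettings:
    greedy: bool = False
    max_cached_dictionaries: int = 5

class Lemmatizer:
    # a custom tokenizer (any object with the expected interface)
    # can be injected; None falls back to the built-in one
    def __init__(self, settings=None, tokenizer=None):
        self.settings = settings or LemmatizerSettings()
        self.tokenizer = tokenizer
```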

I have other improvement ideas around clean code and performance that we can discuss once we get all this out of the way 🙂

@adbar
Owner

adbar commented Jan 20, 2023

Hi @juanjoDiaz, thanks, I think it's much better if we split the changes into chunks. Let's go step by step and discuss the changes along the way. As long as the functions already in use stay available it should be fine to add new classes.

For the record, I would have preferred to see at most one PR for each item above, but your new PR #38 also makes sense by grouping items (1) and (3), correct?

We can leave this PR open for now and work on the other ones first.

@juanjoDiaz
Collaborator Author

PR #38 only covers point 1.
It breaks simplemma into multiple files, adds the LRU cache to the dictionary loading, and converts the methods into classes so they can be configured.

The legacy.py module keeps the old module-level functions, so backwards compatibility is guaranteed 🙂

@juanjoDiaz
Collaborator Author

Btw, my idea is not to merge the PR and release immediately.
I think it makes sense to do all the work first and then release everything together, so we do just one "major" release and minimize changes between releases.

@adbar
Owner

adbar commented Jan 20, 2023

I agree, that's why I just consolidated the changes I made in the last weeks.

I now plan to work with you on the code and release a version 1 at some point, also to make clear by semantic versioning that substantial changes have been made.

@adbar
Owner

adbar commented Jan 31, 2023

@juanjoDiaz I think the approach of coupling issues to pull requests works well. Feel free to elaborate on existing issues or to add new ones if you have other changes in mind.

@juanjoDiaz juanjoDiaz closed this Feb 5, 2023
@juanjoDiaz juanjoDiaz deleted the refactor/better_caching_strategy branch May 23, 2023 18:38