Refactor/better caching strategy #34
Conversation
[Branch force-pushed from ee255a9 to 5a2a29e]
Hi @juanjoDiaz, thanks for the PR. It looks interesting, even if I'll need some time to review it. Breaking up the main script into several parts is a very good idea, BTW. Could you please fix the code formatting (with black) so that we can run the tests?
[Branch force-pushed from 5a2a29e to 147459b]
[Branch force-pushed from 147459b to ff6e82b]
All good now. The formatting got broken during the rebase.
[Branch force-pushed from 0509ed1 to 806c334]
(commit: … to pass custom Tokenizer to Lemmatizer)
[Branch force-pushed from 806c334 to 5b475f8]
Alright, this has now become a full revamp. I've also redone the docs; I think that if you start from there, it will be easier for you to understand the proposal.
Commenting as a fellow user of simplemma: I think these are great improvements in roughly the right direction. The memory usage is a concern for us as well, as we are currently adding language detection based on Simplemma into the Annif REST API. The current behaviour, where Simplemma forgets already loaded models (as explained in #33) when you request a different set of languages, is surprising and less than ideal from our perspective. The idea of using an LRU cache for models seems good, but I think you've misunderstood the way the

Regarding API changes: I didn't look too closely, but it seems to me that this change will increase the amount of boilerplate needed to use simplemma (e.g. having to create class instances), even in simple cases such as Jupyter notebooks for data mangling. Would it be possible to keep at least the core parts of the old API in place (especially the

Sidenote: This PR has grown quite large. I hope it's not a problem for @adbar, who has to review it all (I can help, I already did some reviewing above!). If this were my own project, I would prefer smaller, incremental changes.
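To make the LRU idea concrete, here is a minimal sketch (not the PR's actual code) of how per-language dictionary loading could be memoized with functools.lru_cache, so that requesting a different set of languages does not discard dictionaries that are already in memory. The _load_dictionary helper, the file layout and the maxsize value are all assumptions for illustration.

```python
from functools import lru_cache
import pickle


@lru_cache(maxsize=8)  # keep the 8 most recently used dictionaries (value is illustrative)
def _load_dictionary(lang: str) -> dict:
    """Hypothetical loader: one pickled lookup table per language code."""
    with open(f"data/{lang}.pkl", "rb") as f:  # path and format are assumptions
        return pickle.load(f)


def lemmatize(token: str, lang: str = "en") -> str:
    """Look the token up in the cached dictionary, falling back to the token itself."""
    dictionary = _load_dictionary(lang)  # loaded once, then served from the cache
    return dictionary.get(token, token)
```

With a layout like this, switching between languages only evicts a dictionary once more than maxsize distinct languages have been requested, instead of dropping everything whenever the requested set changes.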
Codecov Report
@@            Coverage Diff             @@
##             main      #34      +/-   ##
==========================================
+ Coverage   95.90%   95.92%   +0.02%
==========================================
  Files           5        9       +4
  Lines         488      515      +27
==========================================
+ Hits          468      494      +26
- Misses         20       21       +1
Hi @osma, thanks for the feedback.

You are absolutely right. I would argue that, if we really wanted to, this could be done easily:

from simplemma.lemmatizer import Lemmatizer  # import path is an assumption

# Global lemmatizer.
# Caches are global across calls.
# Makes it impossible to clear the cache.
_lemmatizer = Lemmatizer()
lemmatize = _lemmatizer.lemmatize

# Local lemmatizer.
# Cache is only valid during execution.
# Reduces the effectiveness of caches across calls.
def lemmatize(token, lang="en"):  # signature shown for illustration
    lemmatizer = Lemmatizer()
    return lemmatizer.lemmatize(token, lang=lang)

However, I see little value in it.

I know that the PR has grown quite large.
Hi @juanjoDiaz, thank you for investing more time in this and for adding all the necessary tests! I agree with @osma that the PR has gotten quite large and that it makes the code incompatible with previous versions. I also agree with the remark on the cache. The PR should at least keep the existing functionality intact; I'm afraid we'll lose users if already-known functions stop working. I wouldn't transform

I'm definitely willing to integrate part or all of your PR into the repository, so it's worth taking some time to assess it. Here is what I understand, from the straightforward changes to the more complex ones:

Please correct me if I'm wrong: the goal you set out for was (4), in order to address the problem described in #33. In the end, the PR is much more complex than expected. If I understand correctly, each point above would deserve its own pull request:

What do you think?
Agreed. Let's do this incrementally and keep the existing methods for backward compatibility at all times. These are the things that I have already implemented. I can create a new PR every time you merge the previous one:

I have other improvement ideas around clean code and performance that we can discuss once we get all this out of the way 🙂
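One possible shape for "keep the existing methods at all times", sketched under the assumption that the new code exposes a Lemmatizer class: the old module-level function stays as a thin wrapper around it. The import path is an assumption, and the DeprecationWarning is optional, shown only to illustrate how a later phase-out could be signalled.

```python
import warnings

from simplemma.lemmatizer import Lemmatizer  # import path is an assumption

_legacy_lemmatizer = Lemmatizer()  # one shared instance backs the legacy function


def lemmatize(token: str, lang: str = "en") -> str:
    """Legacy entry point kept as a thin wrapper around the class-based API."""
    warnings.warn(
        "Consider using Lemmatizer().lemmatize() instead of the module-level "
        "lemmatize(); the old function keeps working for now.",
        DeprecationWarning,
        stacklevel=2,
    )
    return _legacy_lemmatizer.lemmatize(token, lang=lang)
```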
Hi @juanjoDiaz, thanks, I think it's much better if we split the changes into chunks. Let's go step by step and discuss the changes along the way. As long as the functions already in use stay available, it should be fine to add new classes. For the record, I would have preferred to see at most one PR for each item above, but your new PR #38 also makes sense by grouping items (1) and (3), correct? We can leave this PR open for now and work on the other ones first.
PR #38 only covers point 1. The
Btw, my idea is not to merge the PR and release immediately.
I agree, that's why I just consolidated the changes I made in the last weeks. I now plan to work with you on the code and release a version 1 at some point, also to make it clear through semantic versioning that substantial changes have been made.
@juanjoDiaz I think the approach of coupling issues to pull requests works well. Feel free to elaborate on existing issues or to add new ones if you have other changes in mind.
Suggestion to close #33
Please take a look at the idea, @adbar.
I know that it's a big change, but I think it makes the whole library easier to manage and evolve.
I think that there is still room for improvement, but I wanted to get some feedback.
I'm happy to finalize this and update the docs, including a migration section for people jumping from the old version to the new one 🙂
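Such a migration section could boil down to a short before/after snippet like the one below; both signatures are placeholders to be replaced with whatever the final API looks like.

```python
# Before: functional API (kept working for backward compatibility)
import simplemma
simplemma.lemmatize("Tests", lang="en")

# After: class-based API proposed in this PR (constructor arguments, if any,
# are omitted here and would come from the final design)
from simplemma.lemmatizer import Lemmatizer  # import path is an assumption
lemmatizer = Lemmatizer()
lemmatizer.lemmatize("Tests", lang="en")
```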