Refactor/separate logic into modules #38
Conversation
Force-pushed from 00f3be9 to 2411cdc
Thanks! Since I had experiments going on with the core algorithm I need to tidy things up a bit (language data and readme), but I'm planning to review and eventually accept the PR next week at the latest.
Hi @juanjoDiaz, The following elements are ready to merge from my point of view:
As a side note, your description above is not accurate: I see no DictionaryFactory. The main problem resides in code complexity, an issue which has also been pointed out here by @osma: #34 (comment) I have trouble understanding the benefit of the classes: if users want to write classes around the functions they can do it on their side. As such your code adds more lines and even problematic elements:
I understand that you want to build on those classes, but we appear to have different approaches; so far I've tried to keep the code as simple as possible. What do you think?
You are right. At the moment it's simply a cache.
This is actually the only thing that I need to use this library in my production system, and the thing that Osma or someone else also requested. When using the library to power a REST API, you can get many requests in many different languages. An LRU cache is a very simple approach to mitigate this.
The value of the classes and the value of that specific line is precisely that you can pass your own DictionaryCache instead of forcing the user to use the one hardcoded in the file.
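For illustration, such an LRU-evicting dictionary cache could look roughly like the sketch below. All names and the loading logic here are hypothetical stand-ins, not the actual `DictionaryCache` from the PR:

```python
from collections import OrderedDict

class DictionaryCache:
    """Keep at most `max_size` language dictionaries in memory (LRU eviction)."""

    def __init__(self, max_size: int = 5) -> None:
        self.max_size = max_size
        self._data: "OrderedDict[str, dict]" = OrderedDict()

    def get(self, lang: str) -> dict:
        if lang in self._data:
            self._data.move_to_end(lang)  # mark as most recently used
            return self._data[lang]
        dictionary = self._load(lang)
        self._data[lang] = dictionary
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used
        return dictionary

    def _load(self, lang: str) -> dict:
        # stand-in for loading the packaged dictionary file for `lang`
        return {"lang": lang}

cache = DictionaryCache(max_size=2)
cache.get("en"); cache.get("de"); cache.get("fr")  # "en" gets evicted
```

An instance like this can then be passed into the class constructor, which is the injection point being discussed.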
This is the same thing. It offers configurability to the user.

```python
if dictionaryCache is None:
    dictionaryCache = DictionaryCache()
assert isinstance(dictionaryCache, DictionaryCache)
self.dictionaryCache: DictionaryCache = dictionaryCache
```

This allows the user to provide their own LRU size for the `lemmatize` function, or use the sensible default if not:

```python
self.lemmatize = lru_cache(maxsize=lemmatization_distance_cache_max_size)(
    self._lemmatize
)
```

And likewise for the `levenshtein_dist` function:

```python
self.levenshtein_dist = lru_cache(maxsize=levenshtein_distance_cache_max_size)(
    levenshtein_dist
)
```
When users instantiate the class, they can pass the configuration (DictionaryCache, LRU sizes, etc.) and get all the normal methods configured to their own needs. In the future we could expand it so the user can also provide their own tokenizer, for example. In short, it makes the code extensible and flexible. I can see three options:
I've opted for the former, since the latter quickly becomes very verbose and hard to maintain. Does that make sense?
Thanks for the detailed answer. I understand the general idea and I suggest the following changes in order to move on: you revamped the code and moved the existing functions to the legacy submodule, but what if we do the opposite? We could leave the existing functions untouched and use them in the classes; that way you could experiment further on the classes, for example by adding a cache or additional parameters. The other option would be to offer decorators, but since you've started with classes we can leave it that way.

In the docs we could point users wanting a functional approach to the reference functions, and users interested in the object-oriented approach (or in additional functionality) to the newer ones. A further advantage would be that the already existing functions would keep working the same. I'm still concerned that future functions could modify their behavior, be it only in terms of memory usage.

Concretely speaking, I suggest the following changes:
Does this compromise work for you?
I think that we are making progress 🙂 Your proposal was the first thing that I thought of. Unfortunately, it won't work because the values are hardcoded. It's trivial to instantiate a configurable class by setting the configuration as local variables in the constructor. But it's near impossible to override the behavior of global variables. Imagine that you want one lemmatizer with a smaller LRU cache, or without one because you expect few repetitions, but another with a larger LRU cache for a different use case. With classes you can do that; with global variables it's just not possible. Using decorators is also difficult because of the nested nature of the stack. I'm flexible to other solutions; don't take this as me being obsessed with the class-based approach.
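The per-instance configuration argument can be made concrete with a short sketch (placeholder lemmatization logic and invented parameter names, not the PR's actual code): a module-level `@lru_cache` fixes its size at import time, whereas a class can wrap its own bound method with a per-instance cache:

```python
from functools import lru_cache

class Lemmatizer:
    def __init__(self, cache_max_size: int = 1_048_576) -> None:
        # each instance gets its own cache with its own size
        self.lemmatize = lru_cache(maxsize=cache_max_size)(self._lemmatize)

    def _lemmatize(self, token: str) -> str:
        return token.lower()  # placeholder for the real lemmatization

small = Lemmatizer(cache_max_size=8)      # e.g. few expected repetitions
large = Lemmatizer(cache_max_size=65536)  # e.g. REST API with heavy reuse
small.lemmatize("Word")  # → "word"
```

With module-level globals, by contrast, the two configurations cannot coexist in one process.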
My two cents:
Thanks for your feedback. I definitely agree with you both that the hardcoded values for the LRU caches are a roadblock; letting users set the LRU cache size is a beneficial change. We could replace the existing

@juanjoDiaz Once this concern is addressed I guess you could proceed with the addition of classes in a series of smaller steps, right?

@osma I also think that

If we all agree on this I suggest the following way to move on:

Does that make sense to you?
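For reference, a function-level cache size can be made configurable without classes by exposing a small re-wrapping helper, sketched here with invented names and a toy distance function, with the obvious caveat that the setting then remains global for the whole module rather than per instance:

```python
from functools import lru_cache

def _levenshtein_dist(a: str, b: str) -> int:
    return abs(len(a) - len(b))  # toy stand-in for the real distance function

DEFAULT_CACHE_MAXSIZE = 1_048_576
levenshtein_dist = lru_cache(maxsize=DEFAULT_CACHE_MAXSIZE)(_levenshtein_dist)

def set_levenshtein_cache_maxsize(maxsize: int) -> None:
    """Rebuild the cached wrapper with a user-chosen size (module-wide)."""
    global levenshtein_dist
    levenshtein_dist = lru_cache(maxsize=maxsize)(_levenshtein_dist)

set_levenshtein_cache_maxsize(256)
```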
I honestly think that implementing this in functions and building classes around them is doing things backwards. The LRU cache will simply be impossible to configure with functions.
How would you do it to configure the cache?
And it's not just the LRU for those couple of methods that use it; it's every possible configurable aspect of the library. How do we configure how dictionaries are loaded and cached if we rely on the global variable

Also, at my workplace, we need to use our own tokenizer instead of the

Imagine that in the future we want to enable user-defined rules. Once again, the only way to change it is to override the method as a global. Pretty bad and pretty inflexible.

Imagine that we want to give the user some choice on the sampling mechanism for detecting languages. Once again, the only way to change the

Sorry for the repetition. I just wanted to emphasize that building on hardcoded functions instead of classes limits the extensibility, and thus the potential, of this library.
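The tokenizer point can be sketched the same way: a constructor parameter with a sensible default makes the dependency swappable per instance, with no global override. The names below are hypothetical, not the library's actual API:

```python
from typing import Callable, List

def default_tokenizer(text: str) -> List[str]:
    return text.split()  # stand-in for the library's regex-based tokenizer

class TextProcessor:
    def __init__(
        self, tokenizer: Callable[[str], List[str]] = default_tokenizer
    ) -> None:
        self.tokenizer = tokenizer  # injected dependency

    def tokenize(self, text: str) -> List[str]:
        return self.tokenizer(text)

# a workplace-specific tokenizer can be swapped in per instance
custom = TextProcessor(tokenizer=lambda text: text.lower().split("-"))
```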
We have now performed steps 1 & 2 described in my latest answer. @juanjoDiaz, you made good points about writing classes to make certain functions configurable (cache, tokenizer, etc.). Would that work or are there further obstacles along the way?
Just to clarify: I understand there will still be things to fix or to adapt along the way, for example the

However you should now be able to use some of the existing functions as building blocks inside your (new) class structures, for example with the lemmatization. Please correct me if I'm wrong.

As for the general question whether an object-oriented or a functional approach is best, I tend to prefer the functional approach, but the two are not mutually exclusive; we just need to make sure they're compatible.
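The compromise described above, keeping the reference functions untouched and reusing them inside the new classes, might look like this minimal sketch (toy logic and invented names, purely illustrative):

```python
# existing module-level reference function: left untouched
def lemmatize(token: str, lang: str = "en") -> str:
    return token.lower().rstrip("s")  # toy stand-in for the real algorithm

# new object-oriented layer: delegates instead of reimplementing
class Lemmatizer:
    def __init__(self, lang: str = "en") -> None:
        self.lang = lang

    def lemmatize(self, token: str) -> str:
        # extra features (caching, custom tokenizer, ...) can be layered here
        return lemmatize(token, self.lang)
```

Functional users keep calling `lemmatize()` directly with unchanged behavior, while the class remains the place to add configuration.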
Smaller PR to close #33.
This is a subpart of #34.