
Refactor/separate logic into modules #38

Closed

Conversation

juanjoDiaz
Collaborator

Smaller PR to close #33.

  • Creates the new DictionaryFactory
  • Separates the library into modules
  • Creates classes for the different components
  • Keeps legacy methods for backwards compatibility

This is a subpart of #34

@juanjoDiaz juanjoDiaz force-pushed the refactor/separate_logic_into_modules branch from 00f3be9 to 2411cdc on January 19, 2023 19:22
@adbar
Owner

adbar commented Jan 20, 2023

Thanks! Since I had experiments going on with the core algorithm, I need to tidy things up a bit (language data and readme), but I'm planning to review and eventually accept the PR next week at the latest.

@adbar adbar mentioned this pull request Jan 23, 2023
@adbar
Owner

adbar commented Jan 23, 2023

Hi @juanjoDiaz,

The following elements are ready to merge from my point of view:

  • split the library into modules
  • keep legacy methods

As a side note, your description above is not accurate, I see no DictionaryFactory.

The main problem resides in code complexity; this issue has also been pointed out by @osma: #34 (comment)

I have trouble understanding the benefit of the classes: if users want to write classes around the functions, they can do it on their side. As such, your code adds more lines and even problematic elements:

I understand that you want to build on those classes, but we appear to have different approaches; so far I've tried to keep the code as simple as possible. What do you think?

@juanjoDiaz
Collaborator Author

> As a side note, your description above is not accurate, I see no DictionaryFactory.

You are right: at the moment it's simply a cache.
The conversion to a factory is in a separate commit that I left for a future PR, once I reduce the commits from #34.

> The _load_pickle() function in dictionary cache can end up using a lot of memory without additional benefit IMHO:
> https://github.com/juanjoDiaz/simplemma/blob/2d6f28650c660a964a6f79bb6380b9ae685d54d0/simplemma/dictionaries.py#L46

This is actually the only thing I need in order to use this library in my production system, and the thing that @osma (or someone else) also requested.

When using the library to power a REST API, you can get many requests in many different languages.
The current approach of loading the dictionaries every time makes the library practically unusable for such use cases.

An LRU cache is a very simple way to mitigate this.
I also have another commit ready that extends the DictionaryFactory to preload a set of languages and always keep those in memory.
I have also thought about implementing an alternative dictionary cache that, instead of an LRU, keeps the most frequently requested languages.
The key point is that users can pass their own DictionaryFactory, with its own logic for how dictionaries are loaded and kept in memory according to their needs, instead of being forced into whatever is hardcoded in the library.
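
A minimal sketch of that idea, assuming the factory exposes a single get_dictionary() method (every name here is illustrative, not the actual simplemma API):

    from functools import lru_cache
    from typing import Dict

    class PreloadingDictionaryFactory:
        """Keeps a chosen set of languages permanently in memory and
        loads any other language on demand through an LRU cache."""

        def __init__(self, preload=("en", "de"), cache_size=5):
            self._preloaded: Dict[str, dict] = {
                lang: self._load(lang) for lang in preload
            }
            # The on-demand loader gets its own, caller-sized LRU cache.
            self._load_on_demand = lru_cache(maxsize=cache_size)(self._load)

        def _load(self, lang: str) -> dict:
            # Placeholder: the real implementation would unpickle the
            # language data from disk, as _load_pickle() does.
            return {}

        def get_dictionary(self, lang: str) -> dict:
            if lang in self._preloaded:
                return self._preloaded[lang]
            return self._load_on_demand(lang)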

> The lines using the classes don't seem to bring anything compared to the existing code:
> language detection: https://github.com/juanjoDiaz/simplemma/blob/2d6f28650c660a964a6f79bb6380b9ae685d54d0/simplemma/language_detector.py#L39

The value of the classes, and of that specific line, is precisely that you can pass your own DictionaryCache instead of forcing the user to use the one hardcoded in the file.
The alternative to not using classes is that the user has to pass the DictionaryCache to every method.

> The lines using the classes don't seem to bring anything compared to the existing code:
> language detection: https://github.com/juanjoDiaz/simplemma/blob/2d6f28650c660a964a6f79bb6380b9ae685d54d0/simplemma/language_detector.py#L39
> lemmatizer init, it's a lot of code, there has to be a better way to do it: https://github.com/juanjoDiaz/simplemma/blob/2d6f28650c660a964a6f79bb6380b9ae685d54d0/simplemma/lemmatizer.py#L63

This is the same thing: it offers configurability to the user.
It allows the user to provide a dictionary cache, and sets up a default one if none is given:

        # Fall back to the default cache when none is provided.
        if dictionaryCache is None:
            dictionaryCache = DictionaryCache()
        assert isinstance(dictionaryCache, DictionaryCache)
        self.dictionaryCache: DictionaryCache = dictionaryCache

It allows the user to provide their own LRU size for the lemmatize function, or to use the sensible default:

        self.lemmatize = lru_cache(maxsize=lemmatization_distance_cache_max_size)(
            self._lemmatize
        )

And it allows the user to provide their own LRU size for the levenshtein_dist function, or to use the sensible default:

        self.levenshtein_dist = lru_cache(maxsize=levenshtein_distance_cache_max_size)(
            levenshtein_dist
        )

> I have trouble understanding the benefit of the classes: if users want to write classes around the functions, they can do it on their side.

When users instantiate the class, they can pass the configuration (DictionaryCache, LRU sizes, etc.) and get all the normal methods, configured to their own needs.

In the future we could expand this so that users can also provide their own tokenizer, for example.
Or their own dictionaries, via their own class extending DictionaryCache.
Or their own rules.

In short, it makes the code extensible and flexible.
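
For illustration, instantiation could then look roughly like this (module paths and parameter names mirror this PR's branch and are not a settled API):

    from simplemma.dictionaries import DictionaryCache
    from simplemma.lemmatizer import Lemmatizer

    lemmatizer = Lemmatizer(
        dictionaryCache=DictionaryCache(),
        lemmatization_distance_cache_max_size=1024,
        levenshtein_distance_cache_max_size=0,  # maxsize=0 disables caching
    )
    # lemmatizer.lemmatize(...) then behaves like the module-level
    # function, but with these settings applied.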

I can see three options:

  • Object-oriented approach: a class wraps the methods, configured in a certain way.
  • Pure functional approach: every single method receives every possible parameter as an argument.
  • Using globals: a bad practice in every possible way and asking for trouble down the road 😅

I've opted for the first, since the second quickly becomes very verbose and hard to maintain.

Does that make sense?

@adbar
Owner

adbar commented Jan 23, 2023

Thanks for the detailed answer. I understand the general idea and I suggest making the following changes in order to move on: you revamped the code and moved the existing functions to the legacy submodule, but what if we do the opposite?

We could leave the existing functions untouched and use them in the classes; that way you could experiment further on the classes, for example by adding a cache or additional parameters. The other option would be to offer decorators, but since you've started with classes we can leave it that way.

In the docs we could point users wanting a functional approach to the reference functions, and users interested in the object-oriented approach (or in additional functionality) to the newer ones.

A further advantage would be that the already existing functions would keep working the same. I'm still concerned that future functions could modify their behavior, if only in terms of memory usage.

Concretely speaking I suggest the following changes:

  • rename legacy to core (for the sake of clarity)
  • transfer main functions from dictionaries.py, lemmatizer.py and language_detector.py back to the core
  • further use and develop your classes as wrappers around the functions (e.g. DictionaryCache wrapping update_lang_data(); see the sketch below)
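
A rough sketch of that wrapper, assuming the proposed legacy-to-core rename and that update_lang_data(langs) returns the loaded language data (the class shape is illustrative):

    # Hypothetical import, following the proposed legacy -> core rename:
    from simplemma.core import update_lang_data

    class DictionaryCache:
        def __init__(self):
            self._langs: tuple = ()
            self._data: list = []

        def ensure_loaded(self, langs: tuple) -> None:
            # Delegate loading to the existing core function rather
            # than reimplementing it inside the class.
            if langs != self._langs:
                self._data = update_lang_data(langs)
                self._langs = langs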

Does this compromise work for you?

@juanjoDiaz
Collaborator Author

I think that we are making progress 🙂

Your proposal was the first thing that I thought of. Unfortunately, it won't work because the values are hardcoded.
There is a global array of dictionaries, and update_lang_data updates it every time.
The LRU sizes are also hardcoded and can't be modified.

It's trivial to make a class configurable by setting the options as local variables in the constructor. But it's near impossible to override the behavior of global variables.

Imagine that you want a lemmatizer with a smaller LRU, or without one, because you expect few repetitions, but another one with a larger LRU for a different use case. With classes you can do that. With global variables it's just not possible.

Using decorators is also difficult because of the nested nature of the call stack.
It's impossible to set the LRU cache size of the levenshtein distance calculator, which is used deep inside a stack of method calls.

I'm open to other solutions; don't take this as me being obsessed with the classes approach.
I just would like to make the system very configurable, and global variables are a problem for that.

@osma
Contributor

osma commented Jan 24, 2023

My two cents:

  1. @juanjoDiaz makes good points about global variables being harmful. I think having classes that allow setting options like LRU cache size is the way to go for a more configurable API - but with simple functions available that can be used when all that isn't necessary from an API user point of view.
  2. It feels strange to have a module called legacy in a project that is still in the 0.x version range, where basically all kinds of API changes are allowed according to semver. If this is something that needs to stay in the API (and I think it is), come up with a different name such as functional or simple_api or whatever.
  3. This still feels like quite a big PR that wants to reorganize and also reimplement everything in a single sweep. I think it would be better to have a sequence of smaller PRs that only perform one refactoring at a time. Take splitting the library into modules: ideally there should be no changes to the code itself, just moving it around. In practice that's probably not possible (imports need to be adjusted, for example), but the functionality should remain the same; the functions etc. would just live in different files than before. The diffs are going to be big in this case, so it's safer not to change any code, because otherwise unintended bugs may sneak in, and those are going to be difficult to trace afterwards.

@adbar
Owner

adbar commented Jan 24, 2023

Thanks for your feedback. I definitely agree with you both that the hardcoded values for the LRU caches are a roadblock. Letting users set the LRU cache size is a beneficial change. We could replace the existing lru_cache decorator with a light workaround and also add a transparent way to clear the cache.

@juanjoDiaz Once this concern is addressed I guess you could proceed with the addition of classes in a series of smaller steps, right?

@osma I also think that legacy is not a good name, especially if we keep the core functionality there. However, this is just a naming convention; the biggest problem as of now is indeed the size of the diffs and our impression that too much code is being moved around.

If we all agree on this I suggest the following way to move on:

  1. Perform a light code re-organization (constants etc.)
  2. Remove the fixed LRU size (by finding a way around it, probably a small class that does just that; see the sketch below)
  3. Implement the rest of the classes as wrappers to the core functions
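
A minimal sketch of step 2: a small class that wraps any function in a caller-sized LRU cache and exposes a transparent way to clear it (names are illustrative):

    from functools import lru_cache
    from typing import Callable

    class ConfigurableCache:
        """Wraps a function in an LRU cache whose maximum size is
        chosen by the caller, with an explicit way to clear it."""

        def __init__(self, func: Callable, maxsize: int = 1048576):
            self._cached = lru_cache(maxsize=maxsize)(func)

        def __call__(self, *args):
            return self._cached(*args)

        def clear(self) -> None:
            self._cached.cache_clear()

    # e.g. levenshtein_dist = ConfigurableCache(_levenshtein_dist, maxsize=4096)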

Does that make sense to you?

@juanjoDiaz
Collaborator Author

> Thanks for your feedback. I definitely agree with you both that the hardcoded values for the LRU caches are a roadblock. Letting users set the LRU cache size is a beneficial change. We could replace the existing lru_cache decorator with a light workaround and also add a transparent way to clear the cache.
> @juanjoDiaz Once this concern is addressed I guess you could proceed with the addition of classes in a series of smaller steps, right?

I honestly think that doing this in functions and building classes around them is doing things backwards.
And it's hindering the future of this library from the get-go.

The LRU cache will simply be impossible to configure with functions.
_levenshtein_dist is used deep inside the stack of calls.
For example:

    lemmatize → _return_lemma → _affix_search → _decompose → _greedy_search → _levenshtein_dist

How would you configure the cache?
As said before, there are only two options, and both are incompatible with the current approach:

  • Object-oriented approach: a class wraps the methods, configured in a certain way.
  • Pure functional approach: every single method receives every possible parameter as an argument.

And it's not just the LRU for the couple of methods that use it; it's every possible configurable aspect of the library.

How do we configure how dictionaries are loaded and cached if we rely on the global variable LANG_DATA and the global method _update_lang_data? _update_lang_data is also used inside other methods, so the only way to change it is to override the method as a global. Pretty bad and pretty inflexible.

Also, at my workplace, we need to use our own tokenizer instead of the simple_tokenizer. Once again, the only way to change it is to override the method as a global. Pretty bad and pretty inflexible.
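
For illustration, dependency injection makes this trivial; a sketch with a hypothetical constructor parameter (not the current API):

    from typing import Callable, Iterable, List

    def simple_tokenizer(text: str) -> List[str]:
        # Stand-in for the library's tokenizer.
        return text.split()

    class Lemmatizer:
        def __init__(self, tokenizer: Callable[[str], Iterable[str]] = simple_tokenizer):
            # Any callable with the same signature can be swapped in.
            self._tokenizer = tokenizer

        def tokens(self, text: str) -> List[str]:
            return list(self._tokenizer(text))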

Imagine that in the future we want to enable user-defined rules. Once again, the only way to change it is to override the method as a global. Pretty bad and pretty inflexible.

Imagine that we want to give the user some choice on the sampling mechanism for detecting languages. Once again, the only way to change prepare_text is to override the method as a global. Pretty bad and pretty inflexible.

Sorry for the repetition. I just wanted to emphasize that building on hardcoded functions instead of classes limits the extensibility and thus the potential of this library.

@adbar
Owner

adbar commented Jan 25, 2023

We have now performed steps 1 & 2 described in my latest answer.

@juanjoDiaz, you made good points about writing classes to make certain functions configurable (cache, tokenizer, etc.).
My only condition would be that you strive to offer new functionality: the new classes or functions should come on top of the rest and, if necessary, be documented separately (e.g. for advanced users who wish to customize the software's behavior). If you cannot integrate the existing functions directly, we could break them further apart.

Would that work or are there further obstacles along the way?

@adbar
Owner

adbar commented Jan 25, 2023

Just to clarify: I understand there will still be things to fix or to adapt along the way, for example the update_lang_data() function (#33). I suggest making a list of the remaining issues (as I just did with #44), we can then solve them one by one.

However you should now be able to use some of the existing functions as building blocks inside your (new) class structures, for example with the lemmatization. Please correct me if I'm wrong.

As for the general question of whether an object-oriented or a functional approach is best: I tend to prefer the functional approach, but the two are not mutually exclusive; we just need to make sure they're compatible.

@juanjoDiaz juanjoDiaz closed this Feb 5, 2023
@juanjoDiaz juanjoDiaz deleted the refactor/separate_logic_into_modules branch May 23, 2023 18:40