
Refactor/separate logic into modules #38

Closed

Conversation

juanjoDiaz
Collaborator

Smaller PR to close #33.

  • Creates the new DictionaryFactory
  • Separates the library into modules
  • Creates classes for the different components
  • Keeps legacy methods for backwards compatibility

This is a subpart of #34

@juanjoDiaz juanjoDiaz force-pushed the refactor/separate_logic_into_modules branch from 00f3be9 to 2411cdc on January 19, 2023 19:22
@adbar
Owner

adbar commented Jan 20, 2023

Thanks! Since I had experiments going on with the core algorithm, I need to tidy things up a bit (language data and readme), but I'm planning to review and eventually accept the PR next week at the latest.

@adbar adbar mentioned this pull request Jan 23, 2023
@adbar
Owner

adbar commented Jan 23, 2023

Hi @juanjoDiaz,

The following elements are ready to merge from my point of view:

  • split the library into modules
  • keep legacy methods

As a side note, your description above is not accurate, I see no DictionaryFactory.

The main problem resides in code complexity; this issue has also been pointed out by @osma: #34 (comment)

I have trouble understanding the benefit of the classes: if users want to write classes around the functions, they can do it on their side. As such, your code adds more lines and even problematic elements:

I understand that you want to build on those classes, but we appear to have different approaches; so far I've tried to keep the code as simple as possible. What do you think?

@juanjoDiaz
Collaborator Author

> As a side note, your description above is not accurate, I see no DictionaryFactory.

You are right: at the moment it's simply a cache.
The conversion to a factory is in a separate commit that I left for a future PR, once I reduce the commits from #34.

> The _load_pickle() function in dictionary cache can end up using a lot of memory without additional benefit IMHO:
> https://github.com/juanjoDiaz/simplemma/blob/2d6f28650c660a964a6f79bb6380b9ae685d54d0/simplemma/dictionaries.py#L46

This is actually the only thing I need in order to use this library in my production system, and the thing that @osma (or someone else) also requested.

When using the library to power a REST API, you can get many requests in many different languages.
The current approach of loading the dictionaries every time makes the library practically unusable for such use cases.

An LRU cache is a very simple way to mitigate this.
I also have another commit ready that extends the DictionaryFactory to preload a set of languages and always keep those in memory.
I have also thought about implementing an alternative dictionary cache that, instead of an LRU, keeps the most frequently requested languages.
The key point is that users can pass their own DictionaryFactory, with its own logic for how dictionaries are loaded and kept in memory according to their needs, instead of being forced into whatever is hardcoded in the library.
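
A minimal sketch of that idea, assuming the factory exposes a single get_dictionary() method (every name here is illustrative, not the actual simplemma API):

    from functools import lru_cache
    from typing import Dict

    class PreloadingDictionaryFactory:
        """Keeps a chosen set of languages permanently in memory and
        loads any other language on demand through an LRU cache."""

        def __init__(self, preload=("en", "de"), cache_size=5):
            self._preloaded: Dict[str, dict] = {
                lang: self._load(lang) for lang in preload
            }
            # The on-demand loader gets its own, caller-sized LRU cache.
            self._load_on_demand = lru_cache(maxsize=cache_size)(self._load)

        def _load(self, lang: str) -> dict:
            # Placeholder: the real implementation would unpickle the
            # language data from disk, as _load_pickle() does.
            return {}

        def get_dictionary(self, lang: str) -> dict:
            if lang in self._preloaded:
                return self._preloaded[lang]
            return self._load_on_demand(lang)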

> The lines using the classes don't seem to bring anything compared to the existing code:
> language detection: https://github.com/juanjoDiaz/simplemma/blob/2d6f28650c660a964a6f79bb6380b9ae685d54d0/simplemma/language_detector.py#L39

The value of the classes, and of that specific line, is precisely that you can pass your own DictionaryCache instead of forcing the user to use the one hardcoded in the file.
The alternative to not using classes is that the user has to pass the DictionaryCache to every method.

> The lines using the classes don't seem to bring anything compared to the existing code:
> language detection: https://github.com/juanjoDiaz/simplemma/blob/2d6f28650c660a964a6f79bb6380b9ae685d54d0/simplemma/language_detector.py#L39
> lemmatizer init, it's a lot of code, there has to be a better way to do it: https://github.com/juanjoDiaz/simplemma/blob/2d6f28650c660a964a6f79bb6380b9ae685d54d0/simplemma/lemmatizer.py#L63

This is the same thing: it offers configurability to the user.
It allows the user to provide a dictionary cache, and sets up a default one if none is given:

        # Fall back to the default cache when none is provided.
        if dictionaryCache is None:
            dictionaryCache = DictionaryCache()
        assert isinstance(dictionaryCache, DictionaryCache)
        self.dictionaryCache: DictionaryCache = dictionaryCache

It allows the user to provide their own LRU size for the lemmatize function, or to use the sensible default:

        self.lemmatize = lru_cache(maxsize=lemmatization_distance_cache_max_size)(
            self._lemmatize
        )

And it allows the user to provide their own LRU size for the levenshtein_dist function, or to use the sensible default:

        self.levenshtein_dist = lru_cache(maxsize=levenshtein_distance_cache_max_size)(
            levenshtein_dist
        )

> I have trouble understanding the benefit of the classes: if users want to write classes around the functions, they can do it on their side.

When users instantiate the class, they can pass the configuration (DictionaryCache, LRU sizes, etc.) and get all the normal methods, configured to their own needs.

In the future we could expand this so that users can also provide their own tokenizer, for example.
Or their own dictionaries, via their own class extending DictionaryCache.
Or their own rules.

In short, it makes the code extensible and flexible.
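
For illustration, instantiation could then look roughly like this (module paths and parameter names mirror this PR's branch and are not a settled API):

    from simplemma.dictionaries import DictionaryCache
    from simplemma.lemmatizer import Lemmatizer

    lemmatizer = Lemmatizer(
        dictionaryCache=DictionaryCache(),
        lemmatization_distance_cache_max_size=1024,
        levenshtein_distance_cache_max_size=0,  # maxsize=0 disables caching
    )
    # lemmatizer.lemmatize(...) then behaves like the module-level
    # function, but with these settings applied.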

I can see three options:

  • Object-oriented approach: a class wraps the methods, configured in a certain way.
  • Pure functional approach: every single method receives every possible parameter as an argument.
  • Using globals: a bad practice in every possible way and asking for trouble down the road 😅

I've opted for the first, since the second quickly becomes very verbose and hard to maintain.

Does that make sense?

@adbar
Owner

adbar commented Jan 23, 2023

Thanks for the detailed answer. I understand the general idea and I suggest making the following changes in order to move on: you revamped the code and moved the existing functions to the legacy submodule, but what if we do the opposite?

We could leave the existing functions untouched and use them in the classes; that way you could experiment further on the classes, for example by adding a cache or additional parameters. The other option would be to offer decorators, but since you've started with classes we can leave it that way.

In the docs we could point users wanting a functional approach to the reference functions, and users interested in the object-oriented approach (or in additional functionality) to the newer ones.

A further advantage would be that the already existing functions would keep working the same. I'm still concerned that future functions could modify their behavior, if only in terms of memory usage.

Concretely speaking I suggest the following changes:

  • rename legacy to core (for the sake of clarity)
  • transfer main functions from dictionaries.py, lemmatizer.py and language_detector.py back to the core
  • further use and develop your classes as wrappers around the functions (e.g. DictionaryCache wrapping update_lang_data(); see the sketch below)
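
A rough sketch of that wrapper, assuming the proposed legacy-to-core rename and that update_lang_data(langs) returns the loaded language data (the class shape is illustrative):

    # Hypothetical import, following the proposed legacy -> core rename:
    from simplemma.core import update_lang_data

    class DictionaryCache:
        def __init__(self):
            self._langs: tuple = ()
            self._data: list = []

        def ensure_loaded(self, langs: tuple) -> None:
            # Delegate loading to the existing core function rather
            # than reimplementing it inside the class.
            if langs != self._langs:
                self._data = update_lang_data(langs)
                self._langs = langs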

Does this compromise work for you?

@juanjoDiaz
Collaborator Author

I think that we are making progress 🙂

Your proposal was the first thing that I thought of. Unfortunately, it won't work because the values are hardcoded.
There is a global array of dictionaries, and update_lang_data updates it every time.
The LRU sizes are also hardcoded and can't be modified.

It's trivial to make a class configurable by setting the options as local variables in the constructor. But it's near impossible to override the behavior of global variables.

Imagine that you want a lemmatizer with a smaller LRU, or without one, because you expect few repetitions, but another one with a larger LRU for a different use case. With classes you can do that. With global variables it's just not possible.

Using decorators is also difficult because of the nested nature of the call stack.
It's impossible to set the LRU cache size of the levenshtein distance calculator, which is used deep inside a stack of method calls.

I'm open to other solutions; don't take this as me being obsessed with the classes approach.
I just would like to make the system very configurable, and global variables are a problem for that.

@osma
Contributor

osma commented Jan 24, 2023

My two cents:

  1. @juanjoDiaz makes good points about global variables being harmful. I think having classes that allow setting options like LRU cache size is the way to go for a more configurable API - but with simple functions available that can be used when all that isn't necessary from an API user point of view.
  2. It feels strange to have a module called legacy in a project that is still in the 0.x version range, where basically all kinds of API changes are allowed according to semver. If this is something that needs to stay in the API (and I think it is), come up with a different name such as functional or simple_api or whatever.
  3. This still feels like quite a big PR that wants to reorganize and also reimplement everything in a single sweep. I think it would be better to have a sequence of smaller PRs that only perform one refactoring at a time. Take splitting the library into modules: ideally there should be no changes to the code itself, just moving it around. In practice that's probably not possible (imports need to be adjusted, for example), but the functionality should remain the same; the functions etc. would just live in different files than before. The diffs are going to be big in this case, so it's safer not to change any code, because otherwise unintended bugs may sneak in, and those are going to be difficult to trace afterwards.

@adbar
Owner

adbar commented Jan 24, 2023

Thanks for your feedback. I definitely agree with you both that the hardcoded values for the LRU caches are a roadblock. Letting users set the LRU cache size is a beneficial change. We could replace the existing lru_cache decorator with a light workaround and also add a transparent way to clear the cache.

@juanjoDiaz Once this concern is addressed I guess you could proceed with the addition of classes in a series of smaller steps, right?

@osma I also think that legacy is not a good name, especially if we keep the core functionality there. However, this is just a naming convention; the biggest problem as of now is indeed the size of the diffs and our impression that too much code is being moved around.

If we all agree on this I suggest the following way to move on:

  1. Perform a light code re-organization (constants etc.)
  2. Remove the fixed LRU size (by finding a way around it, probably a small class that does just that; see the sketch below)
  3. Implement the rest of the classes as wrappers to the core functions
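
A minimal sketch of step 2: a small class that wraps any function in a caller-sized LRU cache and exposes a transparent way to clear it (names are illustrative):

    from functools import lru_cache
    from typing import Callable

    class ConfigurableCache:
        """Wraps a function in an LRU cache whose maximum size is
        chosen by the caller, with an explicit way to clear it."""

        def __init__(self, func: Callable, maxsize: int = 1048576):
            self._cached = lru_cache(maxsize=maxsize)(func)

        def __call__(self, *args):
            return self._cached(*args)

        def clear(self) -> None:
            self._cached.cache_clear()

    # e.g. levenshtein_dist = ConfigurableCache(_levenshtein_dist, maxsize=4096)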

Does that make sense to you?

@juanjoDiaz
Collaborator Author

> Thanks for your feedback. I definitely agree with you both that the hardcoded values for the LRU caches are a roadblock. Letting users set the LRU cache size is a beneficial change. We could replace the existing lru_cache decorator with a light workaround and also add a transparent way to clear the cache.
> @juanjoDiaz Once this concern is addressed I guess you could proceed with the addition of classes in a series of smaller steps, right?

I honestly think that doing this in functions and building classes around them is doing things backwards.
And it's hindering the future of this library from the get-go.

The LRU cache will simply be impossible to configure with functions.
_levenshtein_dist is used deep inside the stack of calls.
For example:

    lemmatize → _return_lemma → _affix_search → _decompose → _greedy_search → _levenshtein_dist

How would you configure the cache?
As said before, there are only two options, and both are incompatible with the current approach:

  • Object-oriented approach: a class wraps the methods, configured in a certain way.
  • Pure functional approach: every single method receives every possible parameter as an argument.

And it's not just the LRU for the couple of methods that use it; it's every possible configurable aspect of the library.

How do we configure how dictionaries are loaded and cached if we rely on the global variable LANG_DATA and the global method _update_lang_data? _update_lang_data is also used inside other methods, so the only way to change it is to override the method as a global. Pretty bad and pretty inflexible.

Also, at my workplace, we need to use our own tokenizer instead of the simple_tokenizer. Once again, the only way to change it is to override the method as a global. Pretty bad and pretty inflexible.
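
For illustration, dependency injection makes this trivial; a sketch with a hypothetical constructor parameter (not the current API):

    from typing import Callable, Iterable, List

    def simple_tokenizer(text: str) -> List[str]:
        # Stand-in for the library's tokenizer.
        return text.split()

    class Lemmatizer:
        def __init__(self, tokenizer: Callable[[str], Iterable[str]] = simple_tokenizer):
            # Any callable with the same signature can be swapped in.
            self._tokenizer = tokenizer

        def tokens(self, text: str) -> List[str]:
            return list(self._tokenizer(text))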

Imagine that in the future we want to enable user-defined rules. Once again, the only way to change it is to override the method as a global. Pretty bad and pretty inflexible.

Imagine that we want to give the user some choice on the sampling mechanism for detecting languages. Once again, the only way to change prepare_text is to override the method as a global. Pretty bad and pretty inflexible.

Sorry for the repetition. I just wanted to emphasize that building on hardcoded functions instead of classes limits the extensibility and thus the potential of this library.

@adbar
Owner

adbar commented Jan 25, 2023

We have now performed steps 1 & 2 described in my latest answer.

@juanjoDiaz, you made good points about writing classes to make certain functions configurable (cache, tokenizer, etc.).
My only condition would be that you strive to offer new functionality: the new classes or functions should come on top of the rest and, if necessary, be documented separately (e.g. for advanced users who wish to customize the software's behavior). If you cannot integrate the existing functions directly, we could break them further apart.

Would that work or are there further obstacles along the way?

@adbar
Owner

adbar commented Jan 25, 2023

Just to clarify: I understand there will still be things to fix or to adapt along the way, for example the update_lang_data() function (#33). I suggest making a list of the remaining issues (as I just did with #44), we can then solve them one by one.

However you should now be able to use some of the existing functions as building blocks inside your (new) class structures, for example with the lemmatization. Please correct me if I'm wrong.

As for the general question of whether an object-oriented or a functional approach is best: I tend to prefer the functional approach, but the two are not mutually exclusive; we just need to make sure they're compatible.

@juanjoDiaz juanjoDiaz closed this Feb 5, 2023
@juanjoDiaz juanjoDiaz deleted the refactor/separate_logic_into_modules branch May 23, 2023 18:40