There have been a few cases now where it could be interesting to add some additional processing to an incoming query before it is sent to the tokenizer. It would allow adding custom filters for nonsense queries, experimenting with NLP pre-processing, and it would be needed for the splitting of Japanese queries as proposed in #3158.
This should work in a very similar way to the sanitizers used during import, i.e. the ICU tokenizer allows specifying a list of modules with preprocessing functions that are run in sequence over the incoming query.
Configuration
The YAML configuration should look much the same as for the sanitizers, with the `step` key naming the module to use and any further keys setting its configuration.
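For illustration, a configuration with three steps might look like this (the step names below come from the discussion that follows; the `pattern` option of `clean_by_pattern` is a made-up example of a module-specific key):

```yaml
query-preprocessing:
    - step: clean_by_pattern
      pattern: '[|;]'        # hypothetical module-specific option
    - step: normalize
    - step: split_key_japanese_phrases
```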
This would execute three preprocessing modules: `clean_by_pattern`, `normalize` and `split_key_japanese_phrases`. `normalize` would be the step that runs the normalization rules over the query. This is currently hard-coded in the ICU tokenizer. Conceptually, however, it is a simple preprocessing step too, so we might as well make it explicit. It also means that the user can choose whether a preprocessing step runs on the original input or on the normalized text. This might already be relevant for Japanese key splitting: normalization includes rules that change simplified to traditional Chinese characters. This loses valuable information, because simplified Chinese characters are a clear sign that the input is not Japanese.
Preprocessing modules
The preprocessing modules should go into `nominatim/tokenizer/query_preprocessing`. Most of this should work exactly like the sanitizers; see `base.py`.
Each module needs to export a `create` function that builds the preprocessor:
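A minimal sketch of such a module, assuming the same calling convention as the sanitizers in `base.py` (the `pattern` option is the hypothetical key from the example above, and the field order `Phrase(ptype, text)` is an assumption about `nominatim.api.search.query`):

```python
# Hypothetical module nominatim/tokenizer/query_preprocessing/clean_by_pattern.py
import re
from typing import Callable

from nominatim.api.search.query import Phrase

QueryConfig = dict  # plain alias for now, see below


def create(config: QueryConfig) -> Callable[['QueryInfo'], None]:
    """ Create a preprocessor function from its configuration section.
        Runs once when the chain is set up; the returned closure
        runs once per incoming query.
    """
    pattern = re.compile(config.get('pattern', '[|;]'))

    def _process(info: 'QueryInfo') -> None:
        # Rewrite the phrase list in place.
        info.phrases = [Phrase(p.ptype, pattern.sub(' ', p.text))
                        for p in info.phrases]

    return _process
```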
`QueryConfig` can be an alias for `dict` for the moment. We might want to add additional convenience functions, as in `SanitizerConfig`, later.
`QueryInfo` should have a single field, a `List[Phrase]`, which must be mutable by the preprocessor function. The indirection via a `QueryInfo` class allows us to add more functionality to the preprocessing later without breaking existing code.
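Under those constraints, the class itself is tiny (a sketch, assuming `Phrase` comes from `nominatim.api.search.query`):

```python
from dataclasses import dataclass
from typing import List

from nominatim.api.search.query import Phrase


@dataclass
class QueryInfo:
    """ Mutable container passed through the preprocessor chain.
        More fields can be added later without breaking existing
        preprocessors.
    """
    phrases: List[Phrase]
```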
Loading the preprocessors

It is important that the preprocessor chain is loaded only once and then cached. The `setup` function is the right place to do that. `self.conn.get_cached_value` makes sure that a setup function like `_make_transliterator` is executed only once. The equivalent code for setting up the sanitizer chain is at https://github.com/osm-search/Nominatim/blob/master/nominatim/tokenizer/place_sanitizer.py#L28.

The tricky part is getting the information from the YAML configuration. This needs access to the `Configuration` object, which is not available here. We should add it as a property to the `SearchConnection` class; it can easily be set from `self.config` when the connection is created. Once that is done, something along the lines of `self.conn.config.load_sub_configuration('icu_tokenizer.yaml', config='TOKENIZER_CONFIG')['query-preprocessing']` should do the trick.
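Putting the two together, the setup could look roughly like this (a sketch only: `_make_preprocessors` and the `preprocessors` attribute are names invented for illustration, and it assumes the `config` property on `SearchConnection` described above already exists):

```python
from importlib import import_module


# Sketch of a method on the ICU query analyzer.
async def setup(self) -> None:
    async def _make_preprocessors() -> list:
        # Executed only once; get_cached_value caches the result,
        # just as it does for _make_transliterator.
        rules = self.conn.config.load_sub_configuration(
            'icu_tokenizer.yaml',
            config='TOKENIZER_CONFIG')['query-preprocessing']

        funcs = []
        for rule in rules:
            # Module lookup mirrors the sanitizer loader in
            # nominatim/tokenizer/place_sanitizer.py.
            name = rule['step'].replace('-', '_')
            module = import_module(
                f'nominatim.tokenizer.query_preprocessing.{name}')
            funcs.append(module.create(rule))
        return funcs

    self.preprocessors = await self.conn.get_cached_value(
        'ICUTOK', 'preprocessors', _make_preprocessors)
```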
Using the preprocessors
This is mostly done in PR 3158 already. The only differences are that the list of functions is no longer hardcoded and that the phrases are mutated inside a `QueryInfo` object instead of being returned from the preprocessor function.
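Applied inside the query analyzer, this would look roughly as follows (sketch, using the names from above):

```python
# Run the cached chain over the incoming phrases.
info = QueryInfo(phrases)
for prep in self.preprocessors:
    prep(info)          # each step mutates info.phrases in place
phrases = info.phrases  # tokenization continues with the result
```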