Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should EdgeNGramTokenizer's DEFAULT_MAX_GRAM_SIZE be ONE? #13802

Open
YeonghyeonKO opened this issue Sep 18, 2024 · 2 comments
Open

Should EdgeNGramTokenizer's DEFAULT_MAX_GRAM_SIZE be ONE? #13802

YeonghyeonKO opened this issue Sep 18, 2024 · 2 comments

Comments

@YeonghyeonKO
Copy link

YeonghyeonKO commented Sep 18, 2024

Description

From org.apache.lucene:lucene-analysis-common:9.11.1, the static variable DEFAULT_MAX_GRAM_SIZE of EdgeNGramTokenizer is ONE not TWO.

Logically, the maximum n-gram size must be >= minGramSize and it's not a problem but NOT PRACTICAL.

Since many libraries(git code: Elasticsearch, OpenSearch) use NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE not EdgeNGramTokenizer's.
Will there be any dependency problem in Lucene project as a result of my suggestion?

See the below codes:

public class EdgeNGramTokenizer extends NGramTokenizer {
    public static final int DEFAULT_MAX_GRAM_SIZE = 1;    /* How about changing '1' to '2'? */
    public static final int DEFAULT_MIN_GRAM_SIZE = 1;

    public EdgeNGramTokenizer(int minGram, int maxGram) {
        super(minGram, maxGram, true);
    }

    public EdgeNGramTokenizer(AttributeFactory factory, int minGram, int maxGram) {
        super(factory, minGram, maxGram, true);
    }
}
@jpountz
Copy link
Contributor

jpountz commented Sep 18, 2024

I agree that these defaults are not practical. The only purpose of this tokenizer that I'm familiar with is speeding up prefix queries. This makes me wonder if we should have an even higher max gram size, e.g. 8 or 10.

@YeonghyeonKO
Copy link
Author

YeonghyeonKO commented Sep 20, 2024

In actual Elasticsearch settings, it is common to use values ​​of 8 or 10 for MAX_NGRAM_SIZE as @jpountz said. Of course, people may have different preferences, but I think it is not a good idea to set both max and min values ​​to 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants