remove MorphemeConsumerAttribute #127

Merged 2 commits on May 30, 2024
29 changes: 17 additions & 12 deletions README.md
@@ -63,7 +63,7 @@ If you want to update Sudachi that is included in a plugin you have installed, d

# Analyzer

-An analyzer named "sudachi" is provided.
+An analyzer `sudachi` is provided.
This is equivalent to the following custom analyzer.

```json
@@ -92,6 +92,8 @@ See following sections for the detail of the tokenizer and each filters.

# Tokenizer

+The `sudachi_tokenizer` tokenizer tokenizes input texts using Sudachi.
+
- split_mode: Select splitting mode of Sudachi. (A, B, C) (string, default: C)
- C: Extracts named entities
- Ex) 選挙管理委員会
@@ -168,7 +170,7 @@ dictionary settings

## sudachi\_split

-This filter works like `mode` of kuromoji.
+The `sudachi_split` token filter works like `mode` of kuromoji.

- mode
- "search": Additional segmentation useful for search. (Use C and A mode)
@@ -258,7 +260,7 @@ Which responds with:

## sudachi\_part\_of\_speech

-The sudachi\_part\_of\_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:
+The `sudachi_part_of_speech` token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

The `stoptags` setting is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-sudachi.jar.

@@ -348,7 +350,7 @@ Which responds with:

## sudachi\_ja\_stop

-The sudachi\_ja\_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the stop token filter instead.
+The `sudachi_ja_stop` token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the stop token filter instead.

### PUT sudachi_sample

@@ -426,7 +428,9 @@ Which responds with:

## sudachi\_baseform

-The sudachi\_baseform token filter replaces terms with their SudachiBaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.
+The `sudachi_baseform` token filter replaces terms with their Sudachi dictionary form. This acts as a lemmatizer for verbs and adjectives.
+
+This will be overridden by `sudachi_split`, `sudachi_normalizedform` or `sudachi_readingform` token filters.

### PUT sudachi_sample
```json
@@ -479,9 +483,10 @@ Which responds with:

## sudachi\_normalizedform

-The sudachi\_normalizedform token filter replaces terms with their SudachiNormalizedFormAttribute. This acts as a normalizer for spelling variants.
+The `sudachi_normalizedform` token filter replaces terms with their Sudachi normalized form. This acts as a normalizer for spelling variants.
+This filter lemmatizes verbs and adjectives too. You don't need to use `sudachi_baseform` filter with this filter.

-This filter lemmatizes verbs and adjectives too. You don't need to use sudachi\_baseform filter with this filter.
+This will be overridden by `sudachi_split`, `sudachi_baseform` or `sudachi_readingform` token filters.
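
As an illustration of the overriding note above, the following sketch chains two of these filters in one analyzer. It assumes that "overridden" means the filter applied later in the chain decides the final term, so this analyzer would emit reading forms rather than normalized forms. That reading of the note is an assumption, not something the README states explicitly, and the analyzer name `sudachi_reading_analyzer` is hypothetical.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "sudachi_reading_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": ["sudachi_normalizedform", "sudachi_readingform"]
          }
        }
      }
    }
  }
}
```

Under the same assumption, keeping `sudachi_normalizedform` as the last of these filters in the chain would be the way to get normalized forms as the final terms.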

### PUT sudachi_sample

@@ -535,14 +540,14 @@ Which responds with:

## sudachi\_readingform

-Convert to katakana or romaji reading.
-The sudachi\_readingform token filter replaces the token with its reading form in either katakana or romaji. It accepts the following setting:
+The `sudachi_readingform` token filter replaces the terms with their reading form in either katakana or romaji.

-### use_romaji
+This will be overridden by `sudachi_split`, `sudachi_baseform` or `sudachi_normalizedform` token filters.

-Whether romaji reading form should be output instead of katakana. Defaults to false.
+Accepts the following setting:

-When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:
+- use_romaji
+- Whether romaji reading form should be output instead of katakana. Defaults to false.
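
As a rough illustration of the `use_romaji` setting, a custom filter that outputs romaji might be defined as follows. This is a minimal sketch rather than the README's own sample: the filter name `romaji_readingform` and the analyzer name `sudachi_romaji_analyzer` are hypothetical, and it assumes the plugin is installed with its default dictionary settings.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "romaji_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": true
          }
        },
        "analyzer": {
          "sudachi_romaji_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": ["romaji_readingform"]
          }
        }
      }
    }
  }
}
```

Omitting `use_romaji`, or setting it to false, would leave the output in katakana, per the default described above.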

### PUT sudachi_sample


This file was deleted.

@@ -17,9 +17,7 @@
package com.worksap.nlp.lucene.sudachi.ja

import com.worksap.nlp.lucene.sudachi.ja.attributes.MorphemeAttribute
-import com.worksap.nlp.lucene.sudachi.ja.attributes.MorphemeConsumerAttribute
import com.worksap.nlp.sudachi.Morpheme
-import org.apache.logging.log4j.LogManager
import org.apache.lucene.analysis.TokenFilter
import org.apache.lucene.analysis.TokenStream
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
@@ -39,8 +37,6 @@ abstract class MorphemeFieldFilter(input: TokenStream) : TokenFilter(input) {
@JvmField protected val morphemeAtt = existingAttribute<MorphemeAttribute>()
@JvmField protected val keywordAtt = addAttribute<KeywordAttribute>()
@JvmField protected val termAtt = addAttribute<CharTermAttribute>()
-@JvmField
-protected val consumer = addAttribute<MorphemeConsumerAttribute> { it.currentConsumer = this }

/**
* Override this method to customize returned value. This method will not be called if
@@ -64,16 +60,4 @@ abstract class MorphemeFieldFilter(input: TokenStream) : TokenFilter(input) {

return true
}
-
-override fun reset() {
-super.reset()
-if (!consumer.shouldConsume(this)) {
-logger.warn(
-"an instance of ${javaClass.name} is a no-op, it is not a filter which produces terms in one of your filter chains")
-}
-}
-
-companion object {
-private val logger = LogManager.getLogger(MorphemeFieldFilter::class.java)
-}
}

@@ -86,7 +86,6 @@ public int offset() {
private final PositionIncrementAttribute posIncAtt;
private final PositionLengthAttribute posLengthAtt;
private final MorphemeAttribute morphemeAtt;
-private final MorphemeConsumerAttribute consumerAttribute;
private ListIterator<Morpheme> aUnitIterator;
private final OovChars oovChars = new OovChars();

@@ -102,8 +101,6 @@ public SudachiSplitFilter(TokenStream input, Mode mode, Tokenizer.SplitMode spli
posIncAtt = addAttribute(PositionIncrementAttribute.class);
posLengthAtt = addAttribute(PositionLengthAttribute.class);
morphemeAtt = addAttribute(MorphemeAttribute.class);
-consumerAttribute = addAttribute(MorphemeConsumerAttribute.class);
-consumerAttribute.setCurrentConsumer(this);
}

@Override

@@ -17,7 +17,6 @@
package com.worksap.nlp.lucene.sudachi.ja

import com.worksap.nlp.lucene.sudachi.ja.attributes.MorphemeAttribute
-import com.worksap.nlp.lucene.sudachi.ja.attributes.MorphemeConsumerAttribute
import com.worksap.nlp.lucene.sudachi.ja.attributes.SudachiAttribute
import com.worksap.nlp.lucene.sudachi.ja.attributes.SudachiAttributeFactory
import org.apache.lucene.analysis.Tokenizer
@@ -37,7 +36,6 @@ class SudachiTokenizer(
private val offsetAtt = addAttribute<OffsetAttribute>()
private val posIncAtt = addAttribute<PositionIncrementAttribute>()
private val posLenAtt = addAttribute<PositionLengthAttribute>()
-private val consumer = addAttribute<MorphemeConsumerAttribute> { it.currentConsumer = this }

init {
addAttribute<SudachiAttribute> { it.dictionary = tokenizer.dictionary }

This file was deleted.

@@ -1,5 +1,5 @@
/*
-* Copyright (c) 2023 Works Applications Co., Ltd.
+* Copyright (c) 2023-2024 Works Applications Co., Ltd.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -25,7 +25,6 @@ class SudachiAttributeFactory(private val parent: AttributeFactory) : AttributeF
override fun createAttributeInstance(attClass: Class<out Attribute>?): AttributeImpl {
return when (attClass) {
MorphemeAttribute::class.java -> MorphemeAttributeImpl()
-MorphemeConsumerAttribute::class.java -> MorphemeConsumerAttributeImpl()
SudachiAttribute::class.java -> SudachiAttributeImpl()
else -> parent.createAttributeInstance(attClass)
}