-
Notifications
You must be signed in to change notification settings - Fork 247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CODEC-308: change NYSIIS encoding to not remove the first character i… #189
Draft
Ben-Waters
wants to merge
2
commits into
apache:master
Choose a base branch
from
Ben-Waters:CODEC-308-NYSIIS-ASH
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Ben-Waters
Thank you for your PR.
Who is right? This or Fuzzy?
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@garydgregory
Based on this other implementation it would be just A.
According to the algorithm, the final character should be removed if it is an 'S' so I would think it should be removed.
The current commons-codec implementation is removing it as well as the final 'A' but is ignoring the part about "The first character of the NYSIIS code is the first character of the name."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't used fuzzy before but looking at their implementation I don't see where they remove the trailing S or A.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had not used it before today either. I was looking for something easy to compare this PR's behavior.
Let's see if anybody else has feedback.
Do you know of any other software that is easily testable for this use-case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty frustrating but it seems like the results are inconsistent depending on where you go.
Looking at this link the Java version returns AS but the python returns A. For my use-case, either one of those would be fine but it seems hard to get a consistent answer. Commons-codec is currently returning nothing though and I haven't seen another do that yet.
I don't know of a good official implementation to run it against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think that returning something like "A" is better than "". Let's see if anyone else suggests anything. @kinow ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a C++ implementation nor a good source from Wikipedia. So I checked another implementation,
phonics
in R. It has a nysiis algorithm implementation that allows for "modified" key. For "A" it gives "" too, but the modified version gives "AS".Their docs (PDF) documents the basic algorithm, but links to this PDF that explains the whole package a lot better: James P. Howard, II, Phonetic Spelling Algorithm Implementations for R
Searching more about those papers after the text above, I found this issue that describes the same thing I just said 😬 : CODEC-235
So my think we should document that the current version in Commons Codec is the one from the first paper, and then maybe add the other implementation as a separate class/method and let users to pick which one they want to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question that comes up for me is: Is our current implementation "plain" NYSIIS, or, is any non-standard or any behavior from the "modified" L&A 1977 algorithm present in our current implementation? If our current code is "plain", then I could see us creating a
ModifiedNysiis
class, if not, then we are in a bit of a pickle.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the results seem to indicate we implement the plain NYSIIS, someone has indeed to verify it reading our code and the algorithm. A good point, +1!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the steps from wikipedia, the core of the issue is that in step 4, we are setting the pointer to the first character instead of the 2nd character. The 9th step on wikipedia is a bit ambiguous though as to whether that step includes the first character or not. If it does, then you would get a blank string like commons-codec currently gives. If not, then you would get 'A'.
Unfortunately I don't have access to the original source book for the algorithm to try to clarify. It looks like berkeley law has it but I don't have an account for that.