-
-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change to make compound ampersand words easier to match #463
Conversation
oh right, integration tests. |
settings.js
Outdated
"ampersand_splitter": { | ||
"type": "pattern_replace", | ||
"pattern": "&", | ||
"replacement": " and " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order these are executed is:
- "char_filter" rules are run on the strings before they are split into words
- "tokenizer" runs next, splitting on delimiter pattern (whitespace etc)
- token "filter" rules are then run on individual tokens
so that means you can get away with simply replacing '&' with ' & '.
you shouldn't need to use the English form "and" here, the benefit of this is you don't need to make any changes to 'synonyms/punctuation/ampersand.txt' because it'll 'just work' 🧙
A pattern of "&" will expand something like "Johnson & Johnson" to "Johnson & Johnson" (double spaces), this isn't probably a big deal but I think you can probably tighten up the pattern a bit by only matching an ampersand sandwiched between two characters, such as:
[a-zA-Z]&[a-zA-Z]
[^\s]&[^\s]
something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably prototype a fix for the "Mc Williams" issue using a similar technique
"pattern": "(mc)\s+([^\s])"
"replace": "$1$2"
one thing to be aware of with this is that it's running before lowercase, so it's case-sensitive whereas the token filters which run later are case-insensitive.
this would have the effect of removing any spaces immediately following the characters "mc".
270512b
to
a2acae5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR!
Code looks clean and ready to merge.
Probably worth doing a full planet build to check for any unexpected side effects before merging.
Kicked off a build for this, will update with some results. Personally, my biggest potential concern is that there will be lots of new short token matches, which could impact either query precision or response time. But we'll only know for sure after the build :) |
Okay so, I reviewed our test results ran against this build the other day, and as far as I can tell it's just noise from having compared two builds run on slightly different days/data. @blackmad do you have any good examples of pelias compare links that show cases where we aren't doing well with ampersands currently? would be great to try to find a test case in open data. I'm going to do a little examining to see how many records this would affect (I suspect very few), and unless that investigation suggests that there might realistically be a performance impact to adding more smaller tokens to the index, this should be good to go. |
on dev, results in nyc: but when I switch it to "h and m" I lose all those results - this change should theoretically fix this? |
These are meant for testing changes like those in pelias/schema#463 regarding `&` characters in queries. They make heavy use of `boundary.gid` for precise queries that examine behavior changes in matching. These aren't neccisarily acceptance tests we want to keep long term. They're just for exploration.
Okay, I've set up some simple, exploratory acceptance tests in pelias/acceptance-tests#534 for this. It looks like this PR causes a mix of improvements and regressions (baseline on left, this branch on the right): Improvements
Regressions
|
follow-up: try adding regex pattern for ampersand at end to fix query autocomplete normalization |
I think this is the correct change to make it so "A&P Deli" can be matched by any of these queries:
"A&P" "A & P" and "A and P"
I alias "and" to "und" because otherwise it seems like this change wouldn't work for german venues.
But then again I am still not great at schema changes. Working on building an index with this locally now. Unittests & manual testing seem to tell me this change works.