
Change to make compound ampersand words easier to match #463

Closed
wants to merge 2 commits

Conversation


@blackmad blackmad commented Sep 3, 2020

I think this is the correct change to make it so "A&P Deli" can be matched by any of these queries:
"A&P", "A & P", and "A and P".

I alias "and" to "und" because otherwise it seems like this change wouldn't work for German venues.

But then again, I am still not great at schema changes. Working on building an index with this locally now. Unit tests & manual testing seem to tell me this change works.
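As a rough illustration of the intent (this is a Python stand-in, not the actual Elasticsearch analysis chain), a `pattern_replace` char filter that maps `&` to ` and `, followed by whitespace tokenization, makes all three query forms produce the same token stream:

```python
import re

def analyze(query):
    """Emulate the proposed char_filter ('&' -> ' and ') followed by a
    whitespace tokenizer. Simplified stand-in for illustration only."""
    replaced = re.sub(r"&", " and ", query)
    # str.split() with no argument collapses runs of whitespace,
    # much like a whitespace tokenizer ignores extra spaces.
    return replaced.split()

print(analyze("A&P"))      # ['A', 'and', 'P']
print(analyze("A & P"))    # ['A', 'and', 'P']
print(analyze("A and P"))  # ['A', 'and', 'P']
```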

@blackmad blackmad requested a review from missinglink September 3, 2020 15:21

blackmad commented Sep 3, 2020

oh right, integration tests.

settings.js (Outdated)

    "ampersand_splitter": {
      "type": "pattern_replace",
      "pattern": "&",
      "replacement": " and "
@missinglink (Member) commented Sep 3, 2020

The order in which these are executed is:

  1. "char_filter" rules run on the string before it is split into words
  2. the "tokenizer" runs next, splitting on the delimiter pattern (whitespace etc.)
  3. token "filter" rules then run on the individual tokens

So that means you can get away with simply replacing '&' with ' & '.

You shouldn't need to use the English form "and" here; the benefit of this is that you don't need to make any changes to 'synonyms/punctuation/ampersand.txt' because it'll 'just work' 🧙

A pattern of "&" will expand something like "Johnson & Johnson" to "Johnson  &  Johnson" (double spaces). This probably isn't a big deal, but I think you can tighten up the pattern a bit by only matching an ampersand sandwiched between two characters, such as:

[a-zA-Z]&[a-zA-Z]

[^\s]&[^\s]

something like that.
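To make the double-space point concrete, here is a small Python emulation (Elasticsearch's `pattern_replace` actually uses Java regex, where group references are written `$1` rather than Python's `\1`, but the matching behavior is the same for these patterns):

```python
import re

text = "Johnson & Johnson"

# Naive pattern: every '&' gets padded, even one already surrounded
# by spaces, producing double spaces.
naive = re.sub(r"&", " & ", text)
print(repr(naive))  # 'Johnson  &  Johnson'

# Tightened pattern: only pad an ampersand sandwiched between two
# non-space characters, so already-spaced input is left alone.
tight = re.sub(r"([^\s])&([^\s])", r"\1 & \2", text)
print(repr(tight))  # 'Johnson & Johnson'

# Compact forms still get split as intended.
print(repr(re.sub(r"([^\s])&([^\s])", r"\1 & \2", "A&P")))  # 'A & P'
```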

@missinglink (Member) commented Sep 3, 2020

You can probably prototype a fix for the "Mc Williams" issue using a similar technique:

"pattern": "(mc)\s+([^\s])"
"replacement": "$1$2"

One thing to be aware of with this is that it runs before lowercasing, so it's case-sensitive, whereas the token filters which run later are case-insensitive.

This would have the effect of removing any whitespace immediately following the characters "mc".
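The case-sensitivity caveat can be seen in a quick Python emulation of the suggested pattern (again, Elasticsearch itself uses Java regex with `$1$2` as the replacement; `\1\2` is the Python equivalent):

```python
import re

# Suggested char_filter pattern: 'mc' followed by whitespace,
# followed by one non-space character.
pattern = r"(mc)\s+([^\s])"

print(re.sub(pattern, r"\1\2", "mc williams"))  # 'mcwilliams'

# The char_filter runs before the lowercase token filter, so the
# match is case-sensitive: 'Mc Williams' is left untouched.
print(re.sub(pattern, r"\1\2", "Mc Williams"))  # 'Mc Williams'

# A case-insensitive match would be needed to catch both forms.
print(re.sub(pattern, r"\1\2", "Mc Williams", flags=re.IGNORECASE))  # 'McWilliams'
```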

@blackmad force-pushed the compound-ampersand-fix branch 2 times, most recently from 270512b to a2acae5 on September 3, 2020 at 20:56
@missinglink (Member) left a comment

Nice PR!
Code looks clean and ready to merge.

Probably worth doing a full planet build to check for any unexpected side effects before merging.

@orangejulius (Member) commented

Kicked off a build for this, will update with some results.

Personally, my biggest potential concern is that there will be lots of new short token matches, which could impact either query precision or response time. But we'll only know for sure after the build :)

@orangejulius (Member) commented

Okay so, I reviewed our test results run against this build the other day, and as far as I can tell the differences are just noise from having compared two builds run on slightly different days/data.

@blackmad do you have any good examples of pelias compare links that show cases where we aren't doing well with ampersands currently? It would be great to try to find a test case in open data.

I'm going to do some investigating to see how many records this would affect (I suspect very few), and unless that investigation suggests there might realistically be a performance impact from adding more short tokens to the index, this should be good to go.


orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Sep 24, 2020
These are meant for testing changes like those in
pelias/schema#463 regarding `&` characters in
queries.

They make heavy use of `boundary.gid` for precise queries that examine
behavior changes in matching.

These aren't necessarily acceptance tests we want to keep long term.
They're just for exploration.
@orangejulius (Member) commented

Okay, I've set up some simple, exploratory acceptance tests in pelias/acceptance-tests#534 for this.

It looks like this PR causes a mix of improvements and regressions (baseline on left, this branch on the right):
[Screenshot: side-by-side comparison of acceptance-test results, 2020-09-24]

Improvements

  • H & M in /v1/search
  • H & M in /v1/autocomplete

Regressions

  • H&M in /v1/autocomplete

@blackmad (Contributor, Author) commented

Follow-up: try adding a regex pattern for an ampersand at the end of the query, to fix autocomplete normalization.

@blackmad blackmad closed this May 24, 2024
3 participants