
Change to make compound ampersand words easier to match #463

Closed
wants to merge 2 commits

Conversation


@blackmad blackmad commented Sep 3, 2020

I think this is the correct change to make it so "A&P Deli" can be matched by any of these queries:
"A&P", "A & P", and "A and P".

I alias "and" to "und" because otherwise it seems like this change wouldn't work for German venues.

But then again, I am still not great at schema changes. Working on building an index with this locally now. Unit tests & manual testing seem to tell me this change works.
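As a rough illustration of the intent (this is a Python stand-in, not the actual Elasticsearch analysis chain), a `pattern_replace` char filter that maps `&` to ` and `, followed by whitespace tokenization, makes all three query forms produce the same token stream:

```python
import re

def analyze(query):
    """Emulate the proposed char_filter ('&' -> ' and ') followed by a
    whitespace tokenizer. Simplified stand-in for illustration only."""
    replaced = re.sub(r"&", " and ", query)
    # str.split() with no argument collapses runs of whitespace,
    # much like a whitespace tokenizer ignores extra spaces.
    return replaced.split()

print(analyze("A&P"))      # ['A', 'and', 'P']
print(analyze("A & P"))    # ['A', 'and', 'P']
print(analyze("A and P"))  # ['A', 'and', 'P']
```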

@blackmad blackmad requested a review from missinglink September 3, 2020 15:21

blackmad commented Sep 3, 2020

oh right, integration tests.

settings.js (Outdated)

    "ampersand_splitter": {
      "type": "pattern_replace",
      "pattern": "&",
      "replacement": " and "
@missinglink (Member) commented Sep 3, 2020

The order in which these are executed is:

  1. "char_filter" rules run on the string before it is split into words
  2. the "tokenizer" runs next, splitting on the delimiter pattern (whitespace etc.)
  3. token "filter" rules then run on the individual tokens

So that means you can get away with simply replacing '&' with ' & '.

You shouldn't need to use the English form "and" here; the benefit of this is that you don't need to make any changes to 'synonyms/punctuation/ampersand.txt' because it'll 'just work' 🧙

A pattern of "&" will expand something like "Johnson & Johnson" to "Johnson  &  Johnson" (double spaces). This probably isn't a big deal, but I think you can tighten up the pattern a bit by only matching an ampersand sandwiched between two characters, such as:

[a-zA-Z]&[a-zA-Z]

[^\s]&[^\s]

something like that.
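To make the double-space point concrete, here is a small Python emulation (Elasticsearch's `pattern_replace` actually uses Java regex, where group references are written `$1` rather than Python's `\1`, but the matching behavior is the same for these patterns):

```python
import re

text = "Johnson & Johnson"

# Naive pattern: every '&' gets padded, even one already surrounded
# by spaces, producing double spaces.
naive = re.sub(r"&", " & ", text)
print(repr(naive))  # 'Johnson  &  Johnson'

# Tightened pattern: only pad an ampersand sandwiched between two
# non-space characters, so already-spaced input is left alone.
tight = re.sub(r"([^\s])&([^\s])", r"\1 & \2", text)
print(repr(tight))  # 'Johnson & Johnson'

# Compact forms still get split as intended.
print(repr(re.sub(r"([^\s])&([^\s])", r"\1 & \2", "A&P")))  # 'A & P'
```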

@missinglink (Member) commented Sep 3, 2020

You can probably prototype a fix for the "Mc Williams" issue using a similar technique:

"pattern": "(mc)\s+([^\s])"
"replacement": "$1$2"

One thing to be aware of with this is that it runs before lowercasing, so it's case-sensitive, whereas the token filters which run later are case-insensitive.

This would have the effect of removing any whitespace immediately following the characters "mc".
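The case-sensitivity caveat can be seen in a quick Python emulation of the suggested pattern (again, Elasticsearch itself uses Java regex with `$1$2` as the replacement; `\1\2` is the Python equivalent):

```python
import re

# Suggested char_filter pattern: 'mc' followed by whitespace,
# followed by one non-space character.
pattern = r"(mc)\s+([^\s])"

print(re.sub(pattern, r"\1\2", "mc williams"))  # 'mcwilliams'

# The char_filter runs before the lowercase token filter, so the
# match is case-sensitive: 'Mc Williams' is left untouched.
print(re.sub(pattern, r"\1\2", "Mc Williams"))  # 'Mc Williams'

# A case-insensitive match would be needed to catch both forms.
print(re.sub(pattern, r"\1\2", "Mc Williams", flags=re.IGNORECASE))  # 'McWilliams'
```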

@blackmad force-pushed the compound-ampersand-fix branch 2 times, most recently from 270512b to a2acae5 on September 3, 2020 at 20:56
@missinglink (Member) left a comment

Nice PR!
Code looks clean and ready to merge.

Probably worth doing a full planet build to check for any unexpected side effects before merging.

@orangejulius (Member) commented

Kicked off a build for this, will update with some results.

Personally, my biggest potential concern is that there will be lots of new short token matches, which could impact either query precision or response time. But we'll only know for sure after the build :)

@orangejulius (Member) commented

Okay so, I reviewed our test results run against this build the other day, and as far as I can tell the differences are just noise from having compared two builds run on slightly different days/data.

@blackmad do you have any good examples of pelias compare links that show cases where we aren't doing well with ampersands currently? It would be great to try to find a test case in open data.

I'm going to do some investigating to see how many records this would affect (I suspect very few), and unless that investigation suggests there might realistically be a performance impact from adding more short tokens to the index, this should be good to go.


orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Sep 24, 2020
These are meant for testing changes like those in
pelias/schema#463 regarding `&` characters in
queries.

They make heavy use of `boundary.gid` for precise queries that examine
behavior changes in matching.

These aren't necessarily acceptance tests we want to keep long term.
They're just for exploration.
@orangejulius (Member) commented

Okay, I've set up some simple, exploratory acceptance tests in pelias/acceptance-tests#534 for this.

It looks like this PR causes a mix of improvements and regressions (baseline on left, this branch on the right):
[Screenshot: side-by-side comparison of acceptance-test results, 2020-09-24]

Improvements

  • H & M in /v1/search
  • H & M in /v1/autocomplete

Regressions

  • H&M in /v1/autocomplete

@blackmad (Contributor, Author) commented

Follow-up: try adding a regex pattern for an ampersand at the end of the query, to fix autocomplete normalization.

@blackmad blackmad closed this May 24, 2024
3 participants