improve relevancy with exactish, tokenized, unstemmed and stemmed field flavors where appropriate #1032

ndushay · 2023-10-11T17:03:50Z

SearchWorks takes this approach for English-language fields:

        title_245a_exact_search^1000
        title_245a_unstem_search^500
        title_245a_search^75

We could do the same for our natural-language English fields, such as:

title
collection title
author (NOT the stemmed flavor, but yes, case-folded) - searching on author should get results: sul-dlss/argo#4231
topic
tags (NOT stemmed, but yes, case folded!)
apo title? (ask Andrew)
(anything with rights? ask Andrew)
ask Andrew if any others

We need to

create the appropriate dynamic fields (and fieldTypes) in schema.xml for argo
index to the appropriate fields (~~use copy fields~~): ~~schema.xml for copy fields~~, dor_indexing_app

wait for new fields to populate
change solrconfig.xml to use new fields (or would it be Argo catalog_controller or blacklight_config_helper???)
stop indexing to old fields, if there are any
remove old fields from index, if there are any
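
The first step in the list above might look roughly like this in schema.xml. It is only a sketch: the fieldType names (`text_exactish`, `text_unstemmed`, `text_stemmed`), the dynamic field suffixes, and the tokenizer and filter choices are all assumptions made for illustration, not decisions taken in this issue. The ICU factories also assume Solr's ICU analysis module is available on the classpath.

```xml
<!-- Hypothetical names throughout; tokenizer/filter choices are placeholders. -->

<!-- Exact-ish: the whole value is one token, folded for case and diacritics. -->
<fieldType name="text_exactish" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- KeywordTokenizer keeps leading/trailing whitespace, so trim it -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Tokenized and case-folded, but not stemmed (e.g. author, tags). -->
<fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Tokenized, case-folded, and stemmed for English. -->
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<!-- One dynamic field per flavor; suffixes are chosen so the patterns do not overlap. -->
<dynamicField name="*_exactish"  type="text_exactish"  indexed="true" stored="false" multiValued="true"/>
<dynamicField name="*_unstemmed" type="text_unstemmed" indexed="true" stored="false" multiValued="true"/>
<dynamicField name="*_stemmed"   type="text_stemmed"   indexed="true" stored="false" multiValued="true"/>
```

With fields like these, the indexer would write the same source value to each flavor a field needs (for example, only the exact-ish and unstemmed flavors for author).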


ndushay · 2023-10-18T00:52:24Z

Use a more aggressive English stemmer (Snowball Porter, like SearchWorks)?

exact-ish matching
(precision: exactish, then non-stemmed, then stemmed)
text_ws may not be used anywhere???
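
The precision ordering noted above (exact-ish, then unstemmed, then stemmed) is usually expressed as descending boosts in the edismax `qf` parameter. A sketch of how that might look in a solrconfig.xml request handler, reusing the SearchWorks boost values quoted in the issue description. The handler name and the `title_*` field names are hypothetical, and suitable boosts for Argo would need testing. The same `qf` value could instead be set from the Blacklight configuration.

```xml
<!-- Hypothetical handler and field names; boosts copied from the SearchWorks example. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Most precise flavor gets the largest boost -->
    <str name="qf">
      title_exactish^1000
      title_unstemmed^500
      title_stemmed^75
    </str>
  </lst>
</requestHandler>
```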

ndushay · 2023-10-18T22:07:47Z

Questions:

which tokenizer?

remotely reasonable candidates

ICUTokenizer - deals with multilingual text
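KeywordTokenizer - treats the entire field value as a single token
StandardTokenizer - whitespace and punctuation as delimiters, following Unicode word-boundary rules (UAX #29); some exceptions
ClassicTokenizer - whitespace and punctuation as delimiters; some exceptions (e.g. keeps internet domain names and email addresses as single tokens)
LetterTokenizer - tokens are strings of contiguous letters (non-letters are delimiters)
WhitespaceTokenizer - splits on whitespace only (any punctuation will be included in the tokens)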

which stemmer?

EnglishMinimalStemFilter - only removes English plurals
EnglishPossessiveFilter - only removes 's
KStem - less aggressive than Porter
Porter - English only; rule-based (not dictionary-based); 4x faster than Snowball
SnowballPorter - not as accurate as a table-based stemmer, but faster

which sort field type / filters

there is an "ICUCollationField" type.
CollationKeyFilter for language sensitivity

which case folding?

ICUFoldingFilter for case folding for more scripts?
- This filter is a better substitute for the combined behavior of the ASCII Folding Filter, Lower Case Filter, and ICU Normalizer 2 Filter.
ASCIIFoldingFilter for normalizing to ASCII, for better recall across scripts?
LowerCaseFilter - not sure how it deals with non-ASCII characters

what other filters?

WordDelimiterGraphFilter (at index time, must be followed by FlattenGraphFilter) ... and maybe RemoveDuplicates??
ClassicFilter - strips periods from acronyms, and 's (possessives)
HyphenatedWords - combines words separated by a hyphen and whitespace (eg cut and paste text from something formatted)
Trim - leading and trailing whitespace removed; only needed if tokenization leaves whitespace
RemoveDuplicates - only removes identical terms at the same position, so it belongs after filters that can emit several tokens at one position
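
To make the ordering constraints in this list concrete, here is a sketch of a fieldType that uses WordDelimiterGraphFilter. The fieldType name and the specific filter options are assumptions for illustration only. FlattenGraphFilter appears only in the index-time analyzer, since that is where the token graph has to be flattened.

```xml
<!-- Hypothetical fieldType name; option values are placeholders, not recommendations. -->
<fieldType name="text_wdg_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Whitespace tokenizer leaves punctuation in place for the word delimiter filter -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" preserveOriginal="1"/>
    <!-- Needed at index time after a graph-producing filter -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Drops identical terms at the same position, if earlier filters produced any -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" preserveOriginal="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```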

NOT:

stopwords
synonyms
shingles and n-grams, such as CommonGrams, EdgeNGrams
phonetic matching: BeiderMorse (Jewish names), DaitchMokotoffSoundex, DoubleMetaphone
payloads
boosts
token types

ndushay · 2023-10-18T22:15:26Z

FieldTypes:

https://solr.apache.org/guide/8_11/field-types-included-with-solr.html

BoolField
DatePointField, DateRangeField
ICUCollationField, CollationField
IntPointField
SortableTextField
TextField
UUIDField

Deprecated Field Types

"All Trie* numeric and date field types have been deprecated in favor of *Point field types. Point field types are better at range queries (speed, memory, disk), however simple field:value queries underperform relative to Trie. Either accept this, or continue to use Trie fields. This shortcoming may be addressed in a future release." - https://solr.apache.org/guide/8_11/field-types-included-with-solr.html

Trie types are deprecated! https://solr.apache.org/guide/8_11/field-types-included-with-solr.html#deprecated-field-types

ndushay · 2023-10-18T22:47:52Z

https://solr.apache.org/guide/8_11/field-properties-by-use-case.html

ndushay · 2023-11-01T18:15:43Z

~~UUID field type~~ don't need it

~~Sort fields: ICUCollation field type? "SortableTextField"? "TextField"?~~ Argo only sorts results by druid or by relevance.

~~docValues for faceting, sorting, highlighting; NOT for searching.~~

Trie fields are deprecated

ndushay · 2023-12-12T16:33:19Z

closing this in favor of existing tickets; the new fields have been set up in schema.xml

ndushay changed the title ~~improve relevancy with exactish, unstemmed and stemmed field flavors where appropriate~~ improve relevancy with exactish, tokenized, unstemmed and stemmed field flavors where appropriate Oct 11, 2023

ndushay mentioned this issue Oct 12, 2023

consider some text analysis changes for better relevancy ranking of results #1036

Closed

ndushay self-assigned this Oct 18, 2023

ndushay added the argo discovery label Nov 3, 2023
