This repository has been archived by the owner on May 28, 2024. It is now read-only.

improve relevancy with exactish, tokenized, unstemmed and stemmed field flavors where appropriate #1032

Open
1 of 2 tasks
ndushay opened this issue Oct 11, 2023 · 6 comments
ndushay commented Oct 11, 2023

SearchWorks takes this approach for English-language fields:

        title_245a_exact_search^1000
        title_245a_unstem_search^500
        title_245a_search^75        

We want the same for our natural-language English fields.

We need to:

  • create the appropriate dynamic fields (and fieldTypes) in schema.xml for Argo
  • index to the appropriate fields (using copy fields): schema.xml for copy fields, dor_indexing_app
  1. wait for the new fields to populate
  2. change solrconfig.xml to use the new fields (or would this be in Argo's catalog_controller or blacklight_config_helper?)
  3. stop indexing to the old fields, if there are any
  4. remove the old fields from the index, if there are any
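
A rough sketch of what the first two tasks might look like in schema.xml. The suffixes follow the SearchWorks field names quoted above. The fieldType names (text_exactish, text_unstemmed, text_stemmed) and the source field title_source are hypothetical placeholders, not names from Argo's schema. Note that `*_search` also matches the other two suffixes; Solr should prefer the longest matching pattern, but that is worth verifying.

```xml
<!-- schema.xml fragment (sketch only).
     text_exactish, text_unstemmed and text_stemmed are hypothetical fieldType
     names that would have to be defined elsewhere in the schema. -->
<dynamicField name="*_exact_search"  type="text_exactish"  indexed="true" stored="false" multiValued="true"/>
<dynamicField name="*_unstem_search" type="text_unstemmed" indexed="true" stored="false" multiValued="true"/>
<dynamicField name="*_search"        type="text_stemmed"   indexed="true" stored="false" multiValued="true"/>

<!-- title_source is a hypothetical, already-defined source field.
     Each copyField feeds the same text through a different analysis chain. -->
<copyField source="title_source" dest="title_exact_search"/>
<copyField source="title_source" dest="title_unstem_search"/>
<copyField source="title_source" dest="title_search"/>
```

With copy fields, the indexing application only has to write the source field; the three flavors are populated by Solr at index time.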
ndushay changed the title from "improve relevancy with exactish, unstemmed and stemmed field flavors where appropriate" to "improve relevancy with exactish, tokenized, unstemmed and stemmed field flavors where appropriate" on Oct 11, 2023

ndushay commented Oct 18, 2023

  1. use a more aggressive English stemmer (Snowball Porter, like SearchWorks)?
  2. exact-ish matching
    (precision order: exactish, then unstemmed, then stemmed)
  3. text_ws may not be used anywhere?
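
To make the precision ordering concrete, here is a minimal sketch of how the three flavors could be weighted in an edismax qf, assuming the setting lives in solrconfig.xml (it may instead belong in Argo's Blacklight configuration). The field names and boosts are the SearchWorks ones quoted earlier in this issue; the handler name /select is an assumption.

```xml
<!-- solrconfig.xml fragment (sketch only); the /select handler name is an assumption -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- highest boost for the exactish match, then unstemmed, then stemmed -->
    <str name="qf">
      title_245a_exact_search^1000
      title_245a_unstem_search^500
      title_245a_search^75
    </str>
  </lst>
</requestHandler>
```

A document matching in the exactish field should then outscore one that only matches after stemming.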

ndushay self-assigned this on Oct 18, 2023

ndushay commented Oct 18, 2023

Questions:

which tokenizer?

remotely reasonable candidates

  • ICUTokenizer - deals with multilingual text
  • KeywordTokenizer - treats the entire input as a single token
  • Standard - whitespace and punctuation delimiters; some exceptions
  • Classic - whitespace and punctuation as delimiters; some exceptions
  • LetterTokenizer - tokens are strings of contiguous letters (delimiters non-letters)
  • WhitespaceTokenizer (any punctuation will be included in the tokens)

which stemmer?

  • EnglishMinimalStemFilter - only removes English plurals
  • EnglishPossessiveFilter - only removes 's
  • KStem - less aggressive than Porter
  • Porter - English only; rule-based (not dictionary-based); 4x faster than Snowball
  • SnowballPorter - not as accurate as a table-based stemmer, but faster

which sort field type / filters

  • there is an "ICUCollationField" type.
  • CollationKeyFilter for language sensitivity

which case folding?

  • ICUFoldingFilter - for case folding across more scripts?
  • ASCIIFoldingFilter - for normalizing to ASCII for better recall across scripts?
  • LowerCaseFilter - not sure how it deals with non-ASCII

what other filters?

  • WordDelimiterGraphFilter (must be followed by FlattenGraphFilter at index time) ... and maybe RemoveDuplicates?
  • ClassicFilter - strips periods from acronyms, and 's
  • HyphenatedWords - combines words separated by a hyphen and whitespace (e.g. cut-and-paste text from something formatted)
  • Trim - leading and trailing whitespace removed; only needed if tokenization leaves whitespace
  • RemoveDuplicates - only removes same terms at the same position - so after filters that create multiple words

NOT:

  • stopwords
  • synonyms
  • shingles, such as CommonGrams, EdgeNGrams
  • phonetic matching: BeiderMorse (Jewish names), DaitchMokotoffSoundex, DoubleMetaphone
  • payloads
  • boosts
  • token types
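
One possible combination of the candidates above, as a sketch only; the questions are still open and this is not the chain that was chosen. The fieldType names are hypothetical. It uses StandardTokenizer and ASCIIFoldingFilter because they ship with core Solr; swapping in ICUTokenizer / ICUFoldingFilter would need the analysis-extras contrib on the classpath.

```xml
<!-- schema.xml fragment (sketch only); fieldType names are hypothetical -->

<!-- exactish: whole value as one token, trimmed, lowercased, folded to ASCII -->
<fieldType name="text_exactish" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- unstemmed: tokenized and case/accent folded, but no stemming -->
<fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- stemmed: same as unstemmed, plus possessive removal and Snowball English stemming -->
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

The three types differ only in tokenizer and stemming, so a query term is folded the same way whichever field it hits. No stopword or synonym filters are included, in line with the NOT list.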


ndushay commented Oct 18, 2023

FieldTypes:

https://solr.apache.org/guide/8_11/field-types-included-with-solr.html

  • BoolField
  • DatePointField, DateRangeField
  • ICUCollationField, CollationField
  • IntPointField
  • SortableTextField
  • TextField
  • UUIDField

Deprecated Field Types

"All Trie* numeric and date field types have been deprecated in favor of *Point field types. Point field types are better at range queries (speed, memory, disk), however simple field:value queries underperform relative to Trie. Either accept this, or continue to use Trie fields. This shortcoming may be addressed in a future release." - https://solr.apache.org/guide/8_11/field-types-included-with-solr.html


ndushay commented Oct 18, 2023


ndushay commented Nov 1, 2023

UUID field type: we don't need it.

Sort fields: ICUCollationField type? "SortableTextField"? "TextField"? Argo only sorts results by druid or by relevance.

docValues are for faceting, sorting, and highlighting; NOT for searching.

Trie fields are deprecated.
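
A sketch of how these notes could translate into schema.xml, assuming a plain string sort on druid is enough (no collation needed). The fieldType names follow Solr's stock configset conventions; the field name druid_sort is hypothetical.

```xml
<!-- schema.xml fragment (sketch only) -->

<!-- Point types replace the deprecated Trie types -->
<fieldType name="pint"  class="solr.IntPointField"  docValues="true"/>
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>

<!-- untokenized string type, suitable for sorting and faceting -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>

<!-- druid_sort is a hypothetical field name; it is sort-only, so it is
     neither indexed for search nor stored -->
<field name="druid_sort" type="string" indexed="false" stored="false" docValues="true"/>
```

Search fields (the TextField flavors) stay indexed without docValues, which matches the note that docValues are not for searching.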


ndushay commented Dec 12, 2023

closing this in favor of existing tickets; the new fields have been set up in schema.xml
