📓 Summary
Log messages commonly contain identifiers and IDs that are inherently (pseudo)random and contain characters that would be interpreted as punctuation in prose, such as UUIDs or addresses. In many cases the analyzer splits these into several small, diverse tokens that overwhelm the distance metric of the categorization aggregation. To reduce the likelihood of that, we can filter out tokens that look like strings consisting only of hexadecimal digits before categorization and limit the number of tokens compared.
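To illustrate the effect, here is a minimal Python sketch of the idea, not the actual analyzer. The crude tokenizer, the minimum hex length of 4, the default limit of 100, and the function names are all assumptions made for this example.

```python
import re

# Assumption: a token "looks hexadecimal" if it consists only of hex digits
# and is at least 4 characters long. Short words such as "be" survive, but
# words like "dead" or "face" would still be dropped.
HEX_TOKEN = re.compile(r"[0-9a-fA-F]{4,}")


def tokenize(message):
    """Crude stand-in for the analyzer: split on any non-alphanumeric character."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", message) if token]


def categorization_tokens(message, max_tokens=100):
    """Drop hex-looking tokens, then keep at most max_tokens tokens."""
    kept = [t for t in tokenize(message) if not HEX_TOKEN.fullmatch(t)]
    return kept[:max_tokens]


message = "Connection 3f2b8c1e-9a4d-4e7b-8c2a-1d5e6f7a8b9c closed by peer"
print(tokenize(message))               # the UUID alone contributes five tokens
print(categorization_tokens(message))  # only the prose tokens remain
```

Without the filter, the five UUID fragments outnumber the four prose tokens, so two messages of the same kind with different UUIDs can look dissimilar to a token-based distance metric.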
✔️ Acceptance criteria
- `categorization_analyzer` is configured such that the `char_filter` ignores hexadecimal tokens.
- `categorization_analyzer` is configured such that a `limit` filter sets a reasonable maximum token count.
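One possible shape for such a configuration is sketched below as a Python dictionary that serializes to the JSON form Elasticsearch expects. It uses a `pattern_replace` char filter and a `limit` token filter. The regular expression, the minimum length of 4, the `ml_standard` tokenizer, the token limit of 100, and the helper name are illustrative assumptions, not values settled in this issue.

```python
import json

# Assumption: strip runs of 4 or more hex digits bounded by word boundaries.
# The pattern is hypothetical and would need tuning against real log data.
HEX_PATTERN = r"\b[0-9a-fA-F]{4,}\b"


def build_categorization_analyzer(max_token_count=100):
    """Build a candidate categorization_analyzer definition (hypothetical helper)."""
    return {
        "char_filter": [
            # Removes hex-looking substrings before tokenization.
            {"type": "pattern_replace", "pattern": HEX_PATTERN, "replacement": ""}
        ],
        "tokenizer": "ml_standard",
        "filter": [
            # Caps the number of tokens considered per message.
            {"type": "limit", "max_token_count": max_token_count}
        ],
    }


print(json.dumps(build_categorization_analyzer(), indent=2))
```

Because the char filter runs before tokenization, separators such as the hyphens of a UUID are left behind for the tokenizer to handle. Any default char filters or token filters that should still apply would likely need to be restated alongside these.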