
[PROPOSAL] Better support for LTR latency issues #50

Open
JohannesDaniel opened this issue Oct 17, 2024 · 1 comment

@JohannesDaniel
Collaborator

JohannesDaniel commented Oct 17, 2024

Why do we need better support?

Latency is difficult to deal with, as such issues depend heavily on environments and contexts. For LTR this is even more complex, because the needs of data scientists (precise data, model quality) must be balanced against the needs of backend engineers (latency, stability).

Latency issues therefore need to be tackled and supported on multiple levels:

  • Technical features and solutions
  • Conceptual considerations and negotiations in the documentation
  • Direct consulting (out of scope for this RFC)

Suggestions to improve support

Profiling model latency

Currently, OpenSearch users can profile an sltr query and its subqueries, but cannot see how much the model execution itself contributed to the overall latency. Adding this option would provide better insight into how much the model(s) contribute to latency compared to the subqueries.
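For reference, profiling today can be enabled on a search request containing an sltr query roughly as follows; the response then breaks down timings per (sub)query, but not per model. This is a minimal sketch of the request body — the `keywords` parameter and the model name `my_model` are hypothetical examples:

```python
# Sketch: build an OpenSearch search body with profiling enabled for an
# sltr query. Feature-set parameters and model name are hypothetical.
def build_profiled_sltr_request(query_text, model_name):
    """Return a search request body that profiles an sltr query."""
    return {
        "profile": True,  # ask OpenSearch for per-query timing breakdowns
        "query": {
            "sltr": {
                "params": {"keywords": query_text},  # fed into feature templates
                "model": model_name,
            }
        },
    }

body = build_profiled_sltr_request("rambo", "my_model")
```

The proposal would extend the `profile` section of the response so that, alongside the subquery timings, the time spent evaluating the model itself is reported.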

Options to decouple production traffic and feature logging

Feature logging seems to have a significant impact on latency, as reported in several issues:
o19s/elasticsearch-learning-to-rank#354
o19s/elasticsearch-learning-to-rank#398
#30

In the documentation we recommend logging over large feature sets and note that the exact feature values at a certain time are needed to properly correlate them with user behavior. However, this passage should discuss the trade-offs between data needs and stability needs in more detail. Whether feature values must be logged (and stored) for every single request depends heavily on contextual aspects (e.g., the proportion of short-head queries and how repetitive queries are). Furthermore, logging feature values on a schedule (e.g., once a day) via a well-throttled batch job might still provide sufficient data while having far less impact on search performance.

Such a batch job could even be provided as a feature in combination with other OpenSearch enhancements such as https://github.com/opensearch-project/user-behavior-insights.
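The scheduled approach could be sketched as a throttled loop over yesterday's queries. This is only an illustration of the throttling idea, not a proposed implementation: the `execute` callable (which would run the actual feature-logging search, e.g. via the LTR logging search extension) and the rate limit are assumptions supplied by the caller:

```python
import time


def log_features_batch(queries, execute, max_rps=2.0, sleep=time.sleep):
    """Run feature-logging searches at a throttled rate.

    queries -- iterable of query strings (e.g. yesterday's top queries)
    execute -- hypothetical callable: runs the feature-logging search for
               one query and returns the logged feature values
    max_rps -- maximum requests per second sent to the cluster, so the
               batch job does not compete with production traffic
    sleep   -- injectable for testing; defaults to time.sleep
    """
    interval = 1.0 / max_rps
    results = {}
    for q in queries:
        results[q] = execute(q)
        sleep(interval)  # throttle between requests
    return results
```

Because the job runs off the request path, latency-sensitive production searches no longer pay the feature-logging cost, at the price of coarser-grained (e.g. daily) feature snapshots.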

Technical challenges

  • Is it possible to modify / enhance the profile response object in a plugin?
  • A solution that simplifies logging of feature values needs to be integrated with user-behavior tracking and storage.

Next steps

  • Implementation of profiling model latency
  • Adjusting the documentation to discuss the trade-offs
  • Implementation of batch job
@JohannesDaniel JohannesDaniel changed the title [PROPOSAL] Better support for LTR performance issues [PROPOSAL] Better support for LTR latency issues Oct 17, 2024
@andrross
Member

Would emitting OpenTelemetry metrics be a possible path here? We do provide a hook for plugins to emit OTEL data via TelemetryAwarePlugin.
