
[PROPOSAL] Better support for LTR latency issues #50

Open
JohannesDaniel opened this issue Oct 17, 2024 · 1 comment

@JohannesDaniel
Collaborator

JohannesDaniel commented Oct 17, 2024

Why do we need better support?

Latency is difficult to deal with, as such issues depend heavily on environments and contexts. For LTR this is even more complex, because the needs of data scientists (precise data, model quality) must be balanced against the needs of backend engineers (latency, stability).

Latency issues therefore need to be tackled and supported on multiple levels:

  • Technical features and solutions
  • Conceptual considerations and negotiations in the documentation
  • Direct consulting (out of scope for this RFC)

Suggestions to improve support

Profiling model latency

Currently, OpenSearch users can profile an sltr query and its subqueries, but cannot see how much the model execution itself contributed to the overall latency. Adding this option would provide better insight into how much the model(s) contribute to latency compared to the subqueries.
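For reference, profiling today can be enabled on a search request containing an sltr query roughly as follows; the response then breaks down timings per (sub)query, but not per model. This is a minimal sketch of the request body — the `keywords` parameter and the model name `my_model` are hypothetical examples:

```python
# Sketch: build an OpenSearch search body with profiling enabled for an
# sltr query. Feature-set parameters and model name are hypothetical.
def build_profiled_sltr_request(query_text, model_name):
    """Return a search request body that profiles an sltr query."""
    return {
        "profile": True,  # ask OpenSearch for per-query timing breakdowns
        "query": {
            "sltr": {
                "params": {"keywords": query_text},  # fed into feature templates
                "model": model_name,
            }
        },
    }

body = build_profiled_sltr_request("rambo", "my_model")
```

The proposal would extend the `profile` section of the response so that, alongside the subquery timings, the time spent evaluating the model itself is reported.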

Options to decouple production traffic and feature logging

Feature logging seems to have a significant impact on latency, as reported in several issues:
o19s/elasticsearch-learning-to-rank#354
o19s/elasticsearch-learning-to-rank#398
#30

In the documentation we recommend logging over large feature sets and note that the exact feature values at a certain time are needed to properly correlate them with user behavior. However, this passage should discuss the trade-offs between data needs and stability needs in more detail. Whether feature values must be logged (and stored) for every single request depends heavily on contextual aspects (e.g., the proportion of short-head queries and how repetitive queries are). Furthermore, logging feature values on a schedule (e.g., once a day) via a well-throttled batch job might still provide sufficient data while having far less impact on search performance.

Such a batch job could even be provided as a feature in combination with other OpenSearch enhancements such as https://github.com/opensearch-project/user-behavior-insights.
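The scheduled approach could be sketched as a throttled loop over yesterday's queries. This is only an illustration of the throttling idea, not a proposed implementation: the `execute` callable (which would run the actual feature-logging search, e.g. via the LTR logging search extension) and the rate limit are assumptions supplied by the caller:

```python
import time


def log_features_batch(queries, execute, max_rps=2.0, sleep=time.sleep):
    """Run feature-logging searches at a throttled rate.

    queries -- iterable of query strings (e.g. yesterday's top queries)
    execute -- hypothetical callable: runs the feature-logging search for
               one query and returns the logged feature values
    max_rps -- maximum requests per second sent to the cluster, so the
               batch job does not compete with production traffic
    sleep   -- injectable for testing; defaults to time.sleep
    """
    interval = 1.0 / max_rps
    results = {}
    for q in queries:
        results[q] = execute(q)
        sleep(interval)  # throttle between requests
    return results
```

Because the job runs off the request path, latency-sensitive production searches no longer pay the feature-logging cost, at the price of coarser-grained (e.g. daily) feature snapshots.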

Technical challenges

  • Is it possible to modify / enhance the profile response object in a plugin?
  • A solution that simplifies logging of feature values needs to be integrated with user-behavior tracking and storage.

Next steps

  • Implementation of profiling model latency
  • Adjusting the documentation to discuss the trade-offs
  • Implementation of batch job
@JohannesDaniel JohannesDaniel changed the title [PROPOSAL] Better support for LTR performance issues [PROPOSAL] Better support for LTR latency issues Oct 17, 2024
@andrross
Member

Would emitting OpenTelemetry metrics be a possible path here? We do provide a hook for plugins to emit OTEL data via TelemetryAwarePlugin.
