[Obs AI Assistant] Add uuid to knowledge base entries to avoid overwriting accidentally #191043
Conversation
I've not looked through the code so maybe you took this into account, but we also have the documents that we pre-load into the knowledge base. Those should not have dynamically generated uuids, but predetermined IDs.
```diff
@@ -79,9 +79,10 @@ export type ConversationUpdateRequest = ConversationRequestBase & {

 export interface KnowledgeBaseEntry {
   '@timestamp': string;
-  id: string;
+  id: string; // unique ID
+  doc_id?: string; // human readable ID generated by the LLM and used by the LLM to lookup and update existing entries. TODO: rename `doc_id` to `lookup_id`
```
`id` is globally unique, `doc_id` is only unique per user. Multiple entries can be assigned the same `doc_id` if they are created for different users.
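To make that uniqueness model concrete, here is an illustrative pair of entries (all values are invented for this sketch, not taken from the PR):

```ts
// Illustrative only: two entries created for different users can share a
// doc_id, but each must have a globally unique id.
const userAEntry = {
  '@timestamp': '2024-08-22T10:00:00.000Z',
  id: '4f9c2c1a-0b1e-4c4e-9a63-6a1c2d3e4f5a', // globally unique in Elasticsearch
  doc_id: 'favourite_color', // unique per user only
};

const userBEntry = {
  '@timestamp': '2024-08-22T11:00:00.000Z',
  id: '7d2e5b3c-8f4a-4d6b-b2c1-9e8f7a6b5c4d', // different id...
  doc_id: 'favourite_color', // ...same doc_id, because it belongs to another user
};
```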
```diff
-      doc_id?: string;
+      id?: string;
```
`doc_id` can be used by the LLM to look up entries. I see no reason to expand that concept to instructions. Instructions can still have predetermined IDs; they do not have to be UUIDs. See the Lens docs for an example of this.
```diff
@@ -42,7 +42,7 @@ const chatCompleteBaseRt = t.type({
   ]),
   instructions: t.array(
     t.intersection([
-      t.partial({ doc_id: t.string }),
+      t.partial({ id: t.string }),
```
It's still possible to overwrite existing instructions by specifying the `id`.
```ts
keyword: {
  type: 'keyword',
  ignore_above: 256,
},
```
Adding a nested `keyword` field in order to be able to sort on it. Using a nested `keyword` field is recommended over `fielddata` as it is more performant (it should have been used for `doc_id` as well).
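For reference, this is the usual keyword multi-field pattern: a minimal sketch, assuming the parent field is named `title` (the field name is illustrative, not taken from the PR):

```ts
// Sketch of a text field with a nested `keyword` multi-field for sorting.
// Sorting then targets `title.keyword` instead of enabling fielddata on `title`.
const mappings = {
  properties: {
    title: {
      type: 'text', // analyzed, used for full-text search
      fields: {
        keyword: {
          type: 'keyword', // not analyzed, usable for sorting and aggregations
          ignore_above: 256, // values longer than 256 chars are skipped here
        },
      },
    },
  },
};
```

A sort clause would then look like `sort: [{ 'title.keyword': 'asc' }]`.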
```ts
this.dependencies.logger.debug(
  `Adding ${operations.length} operations to queue. Queue size now: ${this._queue.length})`
);
this._queue.push(...operations);
```
Afaict we had a bug here before: by calling `this._queue.push` conditionally we were not adding operations to the queue when `isModelReady=true`. This meant that anything imported after the model had been set up was being dropped 😱

In general I hope we can get rid of the queue, or separate the queuing logic from the knowledge base. Having the queue embedded makes it more complex to work with the KB than it needs to be.
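A minimal sketch of the bug pattern being described, assuming a simplified queue class (the class shape and names are illustrative, not the actual Kibana code):

```ts
type Operation = { doc: string };

class KnowledgeBaseQueue {
  private queue: Operation[] = [];
  private isModelReady = false;

  // Buggy version: only enqueues while the model is not ready, so anything
  // arriving after setup never reaches the queue and is silently dropped.
  addBuggy(operations: Operation[]) {
    if (!this.isModelReady) {
      this.queue.push(...operations);
    }
  }

  // Fixed version: always enqueue, then drain immediately if the model is ready.
  add(operations: Operation[]) {
    this.queue.push(...operations);
    if (this.isModelReady) {
      this.flush();
    }
  }

  markModelReady() {
    this.isModelReady = true;
    this.flush();
  }

  private flush() {
    while (this.queue.length > 0) {
      const op = this.queue.shift()!;
      // ...index `op` into the knowledge base here
      void op;
    }
  }
}
```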
@dgieselaar Perhaps see this comment #191043 (comment)

Good catch! Fixed in b3f7d3a
```
If the prompt is a statement that should be stored in the knowledge base:
- The document contains information that directly contradicts the user's prompt or previous statements, indicating that it may need to be updated or corrected.
- The document contains outdated user preferences or information that the user indicates they want corrected or replaced.
```
Note: I added this in order for the LLM to include knowledge base entries that contradict the prompt. An example is a knowledge base entry that says "The user's favourite color is red" while the prompt says "My favourite color is blue".

Before adding these lines the LLM would not deem such a document relevant; now it does. The reason we want to include contradictory entries is to let the LLM update/overwrite them. It can only do that if it knows their `doc_id`.

My only worry would be if this leads the LLM to include irrelevant documents in other scenarios.
Flaky Test Runner Stats: 🎉 All tests passed! (kibana-flaky-test-suite-runner#6840) ✅ x-pack/test/observability_ai_assistant_functional/enterprise/config.ts: 25/25 tests passed.
```ts
signal
) => {
  // The LLM should be able to update an existing entry by providing the same doc_id.
  // If no existing entry is found, we generate a uuid.
  const id = await client.getUuidFromDocId(docId);
```
The LLM will (blindly) suggest a `doc_id`, without any information about existing entries. With the `doc_id` we can retrieve the `_id`. It does work but I don't like it very much because the LLM does not consistently produce the same `doc_id`s even when it should.

A better approach might be to get rid of `doc_id` entirely. We already provide the LLM with relevant entries via recall. By improving the recall to also include contradicting entries (which I've done in this PR) the LLM should be able to get the `_id` for the existing entry and use that in order to update it.
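For illustration, the `doc_id` to `_id` lookup could be a term query scoped to the active user. A hypothetical sketch (the index name, field names, and the helper itself are assumptions, not the actual implementation):

```ts
import { Client } from '@elastic/elasticsearch';

async function getUuidFromDocId(
  esClient: Client,
  docId: string,
  userName: string
): Promise<string | undefined> {
  const response = await esClient.search({
    index: '.kibana-observability-ai-assistant-kb', // assumed index name
    size: 1,
    query: {
      bool: {
        filter: [
          { term: { doc_id: docId } }, // the lookup id suggested by the LLM
          { term: { 'user.name': userName } }, // scope to the active user
        ],
      },
    },
  });
  // Return the existing entry's _id if found; the caller generates a uuid otherwise.
  return response.hits.hits[0]?._id;
}
```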
💛 Build succeeded, but was flaky
💚 Build Succeeded
Closes #184069

The Problem

The LLM decides the identifier (both `_id` and `doc_id`) for knowledge base entries. The `_id` must be globally unique in Elasticsearch, but the LLM can easily pick the same id for different users, thereby overwriting one user's learning with another user's.

Solution

The LLM should not pick the `_id`. With this PR a UUID is generated for new entries. The LLM can supply a "lookup_id" (stored as `doc_id` for backwards compatibility) so that if the entry already exists for the currently active user, the LLM will overwrite it.

Another problem was that we conflated the lookup id (aka `doc_id`) with a human readable title. This meant that when users gave entries titles, they could accidentally overwrite other users' entries with the same title.

To solve this, entries now have a dedicated `title` field. For backwards compatibility we fall back to using `doc_id` as the title if no `title` is given.
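Putting the description together, the create/update path could look roughly like this: a hedged sketch using the `uuid` package, where the helper name and field set are illustrative rather than the PR's exact code:

```ts
import { v4 as uuidv4 } from 'uuid';

interface EntryInput {
  doc_id?: string; // "lookup_id" suggested by the LLM, stored as doc_id for backwards compatibility
  title?: string;
  text: string;
}

function toStoredEntry(input: EntryInput, existingId: string | undefined) {
  return {
    // Reuse the existing entry's _id when the per-user lookup matched;
    // otherwise mint a fresh UUID so different users can never collide on ids.
    id: existingId ?? uuidv4(),
    doc_id: input.doc_id,
    // Backwards compatibility: fall back to doc_id as the title when none is given.
    title: input.title ?? input.doc_id,
    text: input.text,
  };
}
```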