Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Obs AI Assistant] Add uuid to knowledge base entries to avoid overwriting accidentally #191043

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

sorenlouv
Copy link
Member

@sorenlouv sorenlouv commented Aug 22, 2024

Closes #184069

The Problem
The LLM decides the identifier (both _id and doc_id) for knowledge base entries. The _id must be globally unique in Elasticsearch but the LLM can easily pick the same id for different users thereby overwriting one users learning with another users learning.

Solution
The LLM should not pick the _id. With this PR a UUID is generated for new entries. The LLM can supply a "lookup_id" (stored as doc_id for backwards compatibility) so that if the entry already exists for the currently active user, the LLM will overwrite it.

Another problem was that we conflated lookup id (aka doc_id) with a human readable title. This meant that when users gave entries titles, they would accidentally overwrite other users entries with the same title.
To solve this, entries now have a dedicated title field. For backwards-compat we fall back to using doc_id as title if no title is given.

@sorenlouv sorenlouv requested a review from a team as a code owner August 22, 2024 04:22
@obltmachine
Copy link

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@sorenlouv sorenlouv force-pushed the add-uuid-to-kb-entries-to-avoid-overwriting branch from 614ee57 to e627b85 Compare August 22, 2024 04:26
@sorenlouv sorenlouv added release_note:fix v8.16.0 and removed ci:project-deploy-observability Create an Observability project labels Aug 22, 2024
@botelastic botelastic bot added the ci:project-deploy-observability Create an Observability project label Aug 22, 2024
@sorenlouv sorenlouv force-pushed the add-uuid-to-kb-entries-to-avoid-overwriting branch from 0cc07e8 to a9ed9e9 Compare August 27, 2024 18:41
@dgieselaar
Copy link
Member

I've not looked through the code so maybe you took this into account, but we also have the documents that we pre-load into the knowledge base. Those should not have dynamically generated uuids, but predetermined IDs.

@@ -79,9 +79,10 @@ export type ConversationUpdateRequest = ConversationRequestBase & {

export interface KnowledgeBaseEntry {
'@timestamp': string;
id: string;
id: string; // unique ID
doc_id?: string; // human readable ID generated by the LLM and used by the LLM to lookup and update existing entries. TODO: rename `doc_id` to `lookup_id`
Copy link
Member Author

@sorenlouv sorenlouv Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id is globally unique, doc_id is only unique per user. Multiple entries can be assigned the same doc_id if they are created for different users.

Comment on lines -102 to +103
doc_id?: string;
id?: string;
Copy link
Member Author

@sorenlouv sorenlouv Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc_id can be used by the LLM to lookup entries. I see no reason to expand that concept to instructions. instructions can still have pre-determined id's - they do not have to be UUIDs. See the lens docs for an example of this

@sorenlouv sorenlouv force-pushed the add-uuid-to-kb-entries-to-avoid-overwriting branch from a9ed9e9 to 14854d2 Compare August 27, 2024 20:44
@@ -42,7 +42,7 @@ const chatCompleteBaseRt = t.type({
]),
instructions: t.array(
t.intersection([
t.partial({ doc_id: t.string }),
t.partial({ id: t.string }),
Copy link
Member Author

@sorenlouv sorenlouv Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still possible to overwrite existing instructions by specifying the id

Comment on lines +38 to +41
keyword: {
type: 'keyword',
ignore_above: 256,
},
Copy link
Member Author

@sorenlouv sorenlouv Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding nested keyword in order to be able to sort on it. Using nested keyword is recommended over fielddata as it is more performant (should have been used for doc_id as well).

this.dependencies.logger.debug(
`Adding ${operations.length} operations to queue. Queue size now: ${this._queue.length})`
);
this._queue.push(...operations);
Copy link
Member Author

@sorenlouv sorenlouv Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afaict we had a bug here before: By calling this._queue.push conditionally we were not adding operations to the queue when isModelReady=true. This meant that anything imported after the model had been setup was being dropped 😱

In general I hope we can get rid of the queue, or separate the queuing logic from the knowledge base. Having the queue embedded makes it more complex to work with the KB than it needs to be.

@sorenlouv
Copy link
Member Author

I've not looked through the code so maybe you took this into account, but we also have the documents that we pre-load into the knowledge base. Those should not have dynamically generated uuids, but predetermined IDs.

@dgieselaar Perhaps see this comment #191043 (comment)

@sorenlouv
Copy link
Member Author

sorenlouv commented Aug 28, 2024

Tested and appears to be working correctly. One thing I noticed is often when I switch users to add for eg, a favorite color, it no longer overwrites the existing entry but it appears with the entries nested and the type changes from "assistant" to "system".

initially adding with elastic user: Screenshot 2024-08-28 at 11 12 45 AM

updating with test2 user: Screenshot 2024-08-28 at 11 14 54 AM

Screenshot 2024-08-28 at 11 18 20 AM It functions correctly retrieving the right entry when asked per user.

God catch! Fixed in b3f7d3a

Comment on lines +59 to +61
If the prompt is a statement that should be stored in the knowledge base:
- The document contains information that directly contradicts the user's prompt or previous statements, indicating that it may need to be updated or corrected.
- The document contains outdated user preferences or information that the user indicate they want corrected or replaced.
Copy link
Member Author

@sorenlouv sorenlouv Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

I added this in order for the LLM to include knowledge base entries that contradict the prompt. An example is a knowledge base entry that says "The user's favourite color is red" and the prompt says "My favourite color is blue".
Before adding these lines the LLM would not deem such a document relevant - now it does. The reason we want to include contradictory entries is to let the LLM update/overwrite them. It can only do that if it knows their doc_id.
My only worry would be if this leads the LLM to include irrelevant documents in other scenarios.

@kibanamachine
Copy link
Contributor

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#6840

[✅] x-pack/test/observability_ai_assistant_functional/enterprise/config.ts: 25/25 tests passed.

see run history

signal
) => {
// The LLM should be able to update an existing entry by providing the same doc_id
// if no existing entry is found, we generate a uuid
const id = await client.getUuidFromDocId(docId);
Copy link
Member Author

@sorenlouv sorenlouv Aug 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LLM will (blindly) suggest a doc_id, without any information about existing entries. With the doc_id we can retrieve the _id. It does work but I don't like it very much because the LLM does not consistently produce the same doc_id's even when it should.

A better approach might be to get rid of doc_id entirely. We already provide the LLM with relevant entries via recall. By improving the recall to also include contradicting entries (which I've done in this PR) the LLM should be able to get the _id for the existing entry and use that in order to update it.

@kibana-ci
Copy link
Collaborator

kibana-ci commented Sep 2, 2024

💛 Build succeeded, but was flaky

  • Buildkite Build
  • Commit: 91c84d2
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-191043-91c84d2c14ba

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #6 / When rendering PolicySettingsLayout and user has Edit permissions should allow updates to be made

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
observabilityAIAssistant 284 286 +2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
observabilityAiAssistantManagement 91.5KB 91.9KB +479.0B
Unknown metric groups

API count

id before after diff
observabilityAIAssistant 286 288 +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

…entries-to-avoid-overwriting

# Conflicts:
#	x-pack/plugins/observability_solution/observability_ai_assistant/server/service/client/index.ts
#	x-pack/plugins/observability_solution/observability_ai_assistant/server/service/index.ts
#	x-pack/plugins/observability_solution/observability_ai_assistant/server/service/knowledge_base_service/index.ts
@sorenlouv sorenlouv added v8.17.0 backport:skip This commit does not require backporting and removed v8.16.0 backport:skip This commit does not require backporting labels Oct 30, 2024
…entries-to-avoid-overwriting

# Conflicts:
#	x-pack/plugins/observability_solution/observability_ai_assistant/server/service/client/index.ts
#	x-pack/plugins/observability_solution/observability_ai_assistant/server/service/types.ts
#	x-pack/plugins/observability_solution/observability_ai_assistant/server/service/util/get_system_message_from_instructions.ts
@sorenlouv sorenlouv added backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) and removed backport:skip This commit does not require backporting v8.17.0 labels Oct 31, 2024
@elasticmachine
Copy link
Contributor

elasticmachine commented Oct 31, 2024

💚 Build Succeeded

  • Buildkite Build
  • Commit: 0aaf657
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-191043-0aaf657de211

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
observabilityAIAssistant 294 296 +2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
observabilityAiAssistantManagement 92.8KB 93.3KB +479.0B
Unknown metric groups

API count

id before after diff
observabilityAIAssistant 296 298 +2

History

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) ci:project-deploy-observability Create an Observability project release_note:fix Team:Obs AI Assistant
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Obs AI Assistant] Knowledge base entries with the same name overwrites each other without warning
7 participants