Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Index tags #10441

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

[core] Index tags #10441

wants to merge 17 commits into from

Conversation

tdraier
Copy link
Contributor

@tdraier tdraier commented Jan 31, 2025

fixes: https://github.com/dust-tt/tasks/issues/2047
fixes: https://github.com/dust-tt/tasks/issues/2046

Description

Update schema for tags in ES. Adds a new tags field with keyword, text/standard and text/edge_analyzer .
Update tags when upserting table or document.
Add a backfill for existing docs/tables.
Adds endpoint to query tags

Tests

run simple query to list tags values :

curl -u${ELASTICSEARCH_USERNAME}:${ELASTICSEARCH_PASSWORD} -H "Content-type: application/json" -XGET "http://localhost:9200/core.data_sources_nodes_3/_search?pretty" -d '{
    "query": {
        "bool": {
            "must": [
                {
                    "prefix": {
                        "tags.edge": "t"
                    }
                },
                {
                    "term": {
                        "data_source_id": "748f425e08b7c01c39dbd627b32a73bf7fa5f83b2499bb997b05f2551a7991e6"
                    }
                }
            ]
        }
    },
    "aggs": {
        "unique_tags": {
            "terms": {
                "field": "tags.keyword",
                "size": 10
            }
        }
    },
    "size": 0
}'

For the same query with core endpoint :

curl -XPOST http://localhost:3001/tags/search --header 'content-type: application/json' -d '{
  "query": "t",
  "query_type": "prefix",
  "data_sources": ["748f425e08b7c01c39dbd627b32a73bf7fa5f83b2499bb997b05f2551a7991e6"]
}'

Risk

Deploy Plan

run migration :

./admin/elasticsearch_run_command_from_file.sh src/search_stores/migrations/20250131_add_tags.http

deploy core

run backfills

cargo run --bin elasticsearch_backfill_document_tags_index -- --index-version 3 --query-type document
cargo run --bin elasticsearch_backfill_document_tags_index -- --index-version 3 --query-type table

@tdraier tdraier changed the title Thomas/tags endpoint [core] Index tags Jan 31, 2025
@tdraier tdraier requested a review from philipperolet January 31, 2025 15:23
Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋
We have an invariant that data_sources_nodes in ES and in core are in sync
I think we should keep this invariant, so we should move tags inside nodes. It will then, for free, be indexed by the regular process
IMO would make all the code a lot simpler too, remove some duplication, etc.

@tdraier
Copy link
Contributor Author

tdraier commented Feb 4, 2025

👋 We have an invariant that data_sources_nodes in ES and in core are in sync I think we should keep this invariant, so we should move tags inside nodes. It will then, for free, be indexed by the regular process IMO would make all the code a lot simpler too, remove some duplication, etc.

I agree but I remember we had a discussion a month ago about putting tags in node and we decided not to (not sure why exactly) ?

@tdraier tdraier requested a review from philipperolet February 5, 2025 08:06
@philipperolet
Copy link
Contributor

👋 We have an invariant that data_sources_nodes in ES and in core are in sync I think we should keep this invariant, so we should move tags inside nodes. It will then, for free, be indexed by the regular process IMO would make all the code a lot simpler too, remove some duplication, etc.

I agree but I remember we had a discussion a month ago about putting tags in node and we decided not to (not sure why exactly) ?

IMO because at the time it was not needed; Tags as filter just recently came around in go/product

Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 sorry for the delay, thanks for moving tags to nodes :)
left a few other coms

@@ -45,6 +45,19 @@
"provider_visibility": {
"type": "keyword",
"index": false
},
"tags": {
"type": "text",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we'd need to understand if that's the right way of indexing
Will you never sort by tags, will you sometimes need exact tag match, will you sometimes filter rather than rank, etc.
Is there doc somewhere on the type of query that should be supported long term?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there's exact match (so "keyword") and prefix; for prefix I think keyword is fine too although not sure
I see in the design doc there are questions about fuzzy search. Was this sorted out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to @flvndvd keywords cannot handle prefix directly, we need ngram (so edge) for that. Question about fuzzy search is not clear to me, I don't see the point of fuzzy searching on tags, but I've added the mapping in case we need it.

&self,
query: Option<String>,
query_type: Option<TagsQueryType>,
data_sources: Option<Vec<String>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not reuse the datasourceviewfilter?
providing proper DSV filters is how we ensure permissionning in core, so I would stick to it if possible

@@ -0,0 +1,231 @@
use bb8::Pool;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scripts in bin/ are meant to be frequently reused; temporary backfill scripts should go in migrations

does this one really belong here in your opinion?

if it does, could it become part of backfill_nodes? it'd be better if there are not too many different scripts there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's temporary, i'll move it to migrations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants