[Improvement] Have predictable document ids for lineage and idempotency #13

Open
baitsguy opened this issue Sep 7, 2023 · 0 comments
Labels
enhancement New feature or request

baitsguy commented Sep 7, 2023

We currently use random UUIDs to generate unique ids for documents and elements during ingestion. This reliably lets us ingest documents into the search store without id collisions, but it makes it hard to track the lineage of a user-facing raw document. Additionally, each run of the workflow treats the entire input dataset as new entries and doesn't allow for updates or skips.

We need to add the ability to attach a user-provided identifier (and, optionally, more metadata) to a document, visible at each stage of the workflow. This will let users identify documents and let us build idempotency into the system. The identifier can come from an explicit manifest file, or from an existing attribute of the document (e.g. its S3 path).
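
To illustrate how a stable identifier enables updates and skips, here is a minimal sketch. It is not the project's API: `Document`, `InMemoryStore`, and `upsert` are hypothetical names, and the in-memory dict merely stands in for the search store.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical document type; doc_id is the user-provided identifier,
    # for example an S3 path taken from a manifest or from the source location.
    doc_id: str
    content: bytes
    metadata: dict = field(default_factory=dict)


class InMemoryStore:
    """Stand-in for the search store, keyed by document id (assumption)."""

    def __init__(self) -> None:
        self._docs: dict[str, Document] = {}

    def upsert(self, doc: Document) -> str:
        """Write the document only if it is new or its content changed."""
        existing = self._docs.get(doc.doc_id)
        if existing is not None and existing.content == doc.content:
            return "skipped"  # same id, same content: re-running is a no-op
        self._docs[doc.doc_id] = doc
        return "updated" if existing is not None else "created"

    def __len__(self) -> int:
        return len(self._docs)


store = InMemoryStore()
first = store.upsert(Document("s3://bucket/report.pdf", b"v1"))
second = store.upsert(Document("s3://bucket/report.pdf", b"v1"))
third = store.upsert(Document("s3://bucket/report.pdf", b"v2"))
```

Because the id is the same on every run, a rerun over unchanged input writes nothing, and a changed document replaces its earlier version instead of creating a duplicate entry.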

Additionally, for cases where we need to generate ids (i.e. the user doesn't provide one), we can consider using a document content hash for idempotency.
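
As a rough sketch of the fallback, assuming SHA-256 over the raw bytes is an acceptable content hash: `document_id` and `element_id` below are hypothetical helpers, and deriving element ids from the parent id plus a position via `uuid.uuid5` is one possible scheme, not something decided in this issue.

```python
import hashlib
import uuid
from typing import Optional


def document_id(content: bytes, user_id: Optional[str] = None) -> str:
    """Prefer the user-provided id; otherwise derive one from the content."""
    if user_id:
        return user_id
    # Same bytes always produce the same id, so reruns are idempotent.
    return hashlib.sha256(content).hexdigest()


def element_id(parent_id: str, index: int) -> str:
    """Deterministic id for the index-th element of a document (assumed scheme)."""
    # uuid5 is a name-based UUID: the same namespace and name give the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{parent_id}#{index}"))


doc_a = document_id(b"hello world")
doc_b = document_id(b"hello world")
doc_c = document_id(b"hello world!")
named = document_id(b"hello world", user_id="s3://bucket/hello.txt")
```

One caveat with a pure content hash: two distinct files with identical bytes collapse to one id, and any edit to a document produces a new id rather than an update, so lineage across revisions still needs a user-provided identifier.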

bsowell added the enhancement label on Sep 15, 2023
eric-anderson added a commit that referenced this issue Feb 2, 2024
* Run sort all via 'compose run' not 'compose up' the latter restarts all the containers and will
  reset env vars if they aren't set consistently.

* Document what to do on MacOS if opensearch isn't starting

* Give a pointer to the command to get version info when people are reaching out for help
eric-anderson added a commit that referenced this issue Feb 2, 2024
* Create sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Add files via upload

* Update my branch (#14)

* Update README.md

* Switch to using profiles for compose.yaml. (#6)

This change avoids the "orphaned containers" warning message we got with the old approach.
It leaves the complexity of running the commands unchanged.

Also fix a missing command in the clean up step (was missing the down before reset)

* Add support for crawling arbitrary websites (#7)

* Add support for downloading arbitrary websites via http

also add newlines between containers to make file more readable

* Add documentation on how to crawl an arbitrary website.

* Review fixes

* Add build stamping to our opensearch container. (#8)

* Add support for specifying the version for the containers. (#9)

Initial configuration defaults to the stable version which won't exist until we mark it, so
for now, people will need to use either:

VERSION=latest docker compose up
or
VERSION=latest_rc docker compose up

* Add .gitignore. Remove file that should have been ignored. (#10)

* Add --pull=always to make sure people get up-to-date images. (#11)

* Add --pull=always to make sure people get up-to-date images.

* Fix typos.

* Minor readme improvements found during final testing (#13)

* Run sort all via 'compose run' not 'compose up' the latter restarts all the containers and will
  reset env vars if they aren't set consistently.

* Document what to do on MacOS if opensearch isn't starting

* Give a pointer to the command to get version info when people are reaching out for help

---------

Co-authored-by: Eric Anderson <[email protected]>

* Update my branch (#15)

* Update README.md

* Switch to using profiles for compose.yaml. (#6)

This change avoids the "orphaned containers" warning message we got with the old approach.
It leaves the complexity of running the commands unchanged.

Also fix a missing command in the clean up step (was missing the down before reset)

* Add support for crawling arbitrary websites (#7)

* Add support for downloading arbitrary websites via http

also add newlines between containers to make file more readable

* Add documentation on how to crawl an arbitrary website.

* Review fixes

* Add build stamping to our opensearch container. (#8)

* Add support for specifying the version for the containers. (#9)

Initial configuration defaults to the stable version which won't exist until we mark it, so
for now, people will need to use either:

VERSION=latest docker compose up
or
VERSION=latest_rc docker compose up

* Add .gitignore. Remove file that should have been ignored. (#10)

* Add --pull=always to make sure people get up-to-date images. (#11)

* Add --pull=always to make sure people get up-to-date images.

* Fix typos.

* Minor readme improvements found during final testing (#13)

* Run sort all via 'compose run' not 'compose up' the latter restarts all the containers and will
  reset env vars if they aren't set consistently.

* Document what to do on MacOS if opensearch isn't starting

* Give a pointer to the command to get version info when people are reaching out for help

---------

Co-authored-by: Eric Anderson <[email protected]>

* Update README.md

* Update sycamore-local-development-example.md

* Update README.md

* Update README.md

* Update my branch (#17)

* Update README.md

* Switch to using profiles for compose.yaml. (#6)

This change avoids the "orphaned containers" warning message we got with the old approach.
It leaves the complexity of running the commands unchanged.

Also fix a missing command in the clean up step (was missing the down before reset)

* Add support for crawling arbitrary websites (#7)

* Add support for downloading arbitrary websites via http

also add newlines between containers to make file more readable

* Add documentation on how to crawl an arbitrary website.

* Review fixes

* Add build stamping to our opensearch container. (#8)

* Add support for specifying the version for the containers. (#9)

Initial configuration defaults to the stable version which won't exist until we mark it, so
for now, people will need to use either:

VERSION=latest docker compose up
or
VERSION=latest_rc docker compose up

* Add .gitignore. Remove file that should have been ignored. (#10)

* Add --pull=always to make sure people get up-to-date images. (#11)

* Add --pull=always to make sure people get up-to-date images.

* Fix typos.

* Minor readme improvements found during final testing (#13)

* Run sort all via 'compose run' not 'compose up' the latter restarts all the containers and will
  reset env vars if they aren't set consistently.

* Document what to do on MacOS if opensearch isn't starting

* Give a pointer to the command to get version info when people are reaching out for help

* Update README.md (#16)

Update README.md to provide an overview of the steps and remove a typo.

* Update README.md

Small changes

* Update README.md

Removed space

---------

Co-authored-by: Eric Anderson <[email protected]>

* Partial review by Eric. Up to the last failure.

* Update sycamore-local-development-example.md

* Update sycamore_local_dev_example.ipynb

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore_local_dev_example.ipynb

* Minor fixes:

* Fix /tmp path
* Fix font path on linux
* Fix typo

* Fix script up to step 3k.

The bug was that the variable names were used inconsistently, and as a result the initial
partitioning was dropped in the remainder of the processing.

Also add a bunch more documentation on what the output should be.

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update sycamore-local-development-example.md

* Update the remaining steps.

* Fixup how we adjust the script to match with earlier changes.
* Renumber to 5[a-e] so that all the steps have unique numbers.
* Fix typo

* Update sycamore-local-development-example.md

* ispell

* Update sycamore-local-development-example.md

* Delete sycamore_local_dev_example.ipynb

* Add files via upload

* Delete sycamore_local_dev_example.ipynb

* Add files via upload

* Cleanup step 5 instructions

* Cleanup step 5 instructions more

---------

Co-authored-by: Eric Anderson <[email protected]>
HenryL27 added a commit that referenced this issue Mar 20, 2024
* tweak opensearch endpoints and search calls to use reranking

Signed-off-by: HenryL27 <[email protected]>

* add feedback functionality to ui (#13)

Signed-off-by: HenryL27 <[email protected]>

* Filters + feedback + API updates to support OS 2.12

* Added missing dependencies.

* Update memory APIs

* Update references to new conversation/interaction ids

* Revert interactions loading to older search style API

* Load citation text from older interactions

* Show ntsb filters

* Store entire conversation with each feedback entry. Store query used in each interaction.

* Cleanup + disable delete conversation

* ability to disable filters. don't simplify answer if no documents

* Fix filtering

* Merge branches. Fix bugs

* Change back port to 9200

* Ability to rewrite questions

* Add/remove filters

* Ability to edit autogenerated filter values

* Ability to run json opensearch queries

* Refactor OS query box, allows you to submit adhoc OS query

* Some error handling, some defaults

* Fix length

* Merge 2.12 changes

* PR feedback

---------

Signed-off-by: HenryL27 <[email protected]>
Co-authored-by: HenryL27 <[email protected]>
Co-authored-by: Alex Meyer <[email protected]>