Non-deterministic failure for test_pdf_to_opensearch #20

Open · bsowell opened this issue Sep 13, 2023 · 6 comments
Labels: bug (Something isn't working)

Comments

bsowell (Contributor) commented Sep 13, 2023

Describe the bug
I have seen some non-deterministic failures in the test_pdf_to_opensearch test. The issue appears to be related to how guidance is returning results. In some cases, the returned map does not contain an answer key:

E                         File "/home/runner/work/sycamore/sycamore/sycamore/execution/transforms/entity/entity_extractor.py", line 41, in extract_entity
E                           document.properties.update({f"{self._entity_to_extract}": entities["answer"]})
E                                                                                     ~~~~~~~~^^^^^^^^^^
E                         File "/home/runner/.cache/pypoetry/virtualenvs/sycamore-QdKr-uPZ-py3.11/lib/python3.11/site-packages/guidance/_program.py", line 470, in __getitem__
E                           return self._variables[key]
E                                  ~~~~~~~~~~~~~~~^^^^^
E                       KeyError: 'answer'

From https://github.com/aryn-ai/sycamore/actions/runs/6176295926/job/16765129778#step:10:312

This looks to be non-deterministic: the very next run of the exact same workflow succeeded.

The code is failing from here:

document.properties.update({f"{self._entity_to_extract}": entities["answer"]})

We will likely need to add some additional logging to get to the bottom of this.
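Something like the following would at least surface what guidance actually returned instead of dying with a bare KeyError. This is just a sketch, not a proposed final fix; the helper name `set_extracted_entity` is made up, and it only relies on `entities["answer"]` raising KeyError, which the traceback confirms:

```python
import logging

logger = logging.getLogger(__name__)

def set_extracted_entity(document, entity_to_extract: str, entities) -> None:
    """Copy the extracted entity onto the document, logging the raw
    guidance output if the expected 'answer' key is missing."""
    try:
        answer = entities["answer"]
    except KeyError:
        # Surface what guidance actually returned so intermittent
        # failures like this one are debuggable from CI logs.
        logger.error("No 'answer' key for entity %r; guidance returned: %r",
                     entity_to_extract, entities)
        return
    document.properties.update({entity_to_extract: answer})
```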

bsowell added the bug label Sep 15, 2023
bsowell (Contributor, Author) commented Sep 25, 2023

Still happening: https://github.com/aryn-ai/sycamore/actions/runs/6293759018/job/17084907000. It is infrequent enough that I suspect it may depend on the result returned by OpenAI.

bsowell (Contributor, Author) commented Sep 26, 2023

We traced this to throttling from OpenAI. It looks like we are already doing basic retries, but we should see if there is anything better we can do, and at the very least surface the error more clearly.

ChillOrb commented Oct 1, 2023

Hello @bsowell,

I've been investigating the non-deterministic failures you've mentioned. I noticed that test_pdf_to_opensearch does not pass llm_kwargs here, and OpenAI is ultimately set here. I believe you're referring to the retry mechanism here?

So, my queries are:

  1. Would it be beneficial to add a similar mechanism to _generate_using_guidance?

To enhance our retry mechanism, I'm considering the approach in this guide.

I would appreciate your thoughts on this. Please let me know if I am heading in the right direction.

ChillOrb commented Oct 1, 2023

Also, I think this is related to #36.

bsowell (Contributor, Author) commented Oct 1, 2023

Hey @ChillOrb! Thanks so much for taking a look.

So guidance, which is another library for interacting with LLMs, does have some retries enabled by default (e.g. see the default parameter here). Even with that, we still sometimes see timeouts. I do think eventually we want to move off of guidance and just use openai directly -- we aren't getting too much value, and I think we will want to customize the behavior, for example by using some of the techniques you linked. One place to start might be to see if we are doing the right thing in terms of retries for the non-guidance based access.
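For reference, the cookbook-style technique linked above boils down to something like this. It is only a sketch under assumptions: the pre-1.0 `openai` client and the `tenacity` library, neither of which this repo necessarily uses on this code path:

```python
import openai
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_random_exponential)

# Retry only on throttling/timeout errors, with jittered exponential backoff.
@retry(
    retry=retry_if_exception_type((openai.error.RateLimitError,
                                   openai.error.Timeout)),
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(6),
)
def completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)
```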

I do think this is slightly different from #36. That issue is about OpenSearch: when writing to OpenSearch we can sometimes overwhelm the cluster and need to add mechanisms to back off. Here we are hitting issues accessing OpenAI. It's kind of unfortunate that everything in this project starts with Open :).

ChillOrb commented Oct 9, 2023

Hey @bsowell,

Thanks for clarifying.

The code here looks fine to me. Still, a few things I think could be addressed; I would love to get your opinion:
1. Deciding on the number of max retries (currently 3).
2. Using exponential backoff, which I think will be useful since this error is likely to become more frequent as one scales up.
3. Would it also be a good idea to check for content length here?
4. What are your views on creating a custom retry/backoff mechanism as mentioned here, calculating delay and retries based on the headers here? (See the sketch after this list.)
5. A custom retry can also help in surfacing the error more clearly.
6. Is there a need for batching requests in parallel? Ref: here.

Let me know if this is something that needs to be worked on right now.
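To make point 4 concrete, here is one possible shape for a header-aware retry loop. It is only a sketch under assumptions: that the error object exposes the response headers (as `openai.error.OpenAIError` does in the pre-1.0 client) and that a `Retry-After` header is worth honoring when present; `call_with_header_backoff` is a made-up name:

```python
import time
import openai

def call_with_header_backoff(make_request, max_retries: int = 5):
    """Retry a throttled request, honoring the server's Retry-After header
    when present and falling back to exponential backoff otherwise."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except openai.error.RateLimitError as e:
            headers = getattr(e, "headers", None) or {}
            delay = float(headers.get("Retry-After", 2 ** attempt))
            time.sleep(delay)
    # Surfaces the failure clearly (point 5) instead of a KeyError far downstream.
    raise RuntimeError(f"OpenAI request still rate-limited after {max_retries} retries")
```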

eric-anderson pushed a commit that referenced this issue Feb 2, 2024
Add example questions under sort-all
HenryL27 pushed a commit that referenced this issue Mar 20, 2024
Need to use logging to get error messages out so they show up.

Improve docker build speed by caching the npm stuff. I was just getting failures to install without this; npm install would hang indefinitely.