Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce errors in workers #3962

Merged
merged 4 commits into from
Feb 13, 2025
Merged

Reduce errors in workers #3962

merged 4 commits into from
Feb 13, 2025

Conversation

pablonyx
Copy link
Contributor

@pablonyx pablonyx commented Feb 11, 2025

Description

This is a general pass at reducing the number of unnecessary / spammy error logs in our backgrounds pods. Each exception has a link to a relevant example in Grafana + the relevant change

Primary Worker

Tenant ID exceptions

Fixed miscellaneous tenant ID exceptions

  • Updated and added alerting in Alembic check to identify schemas without proper tables + deleted tenant-less schemas

Monitoring worker

Entity exception

  • Updated monitoring logic to avoid logging this as an error for sync record types with no related entity

Heavy Worker

Dupe external email issue
Error ended up being that we have concurrent tasks creating the same external user email

  • Just log error and return false in the task to minimize logspam

Some pretty excessive Slack spam https://g-dfa4b7be0c.grafana-workspace.us-east-2.amazonaws.com/goto/Jnc1ujFHR?orgId=1)

  • Remove logspam

Light Worker

View in Grafana

Exception: Failed to update doc chunk 304652f0-2e46-5705-94ef-6a22ea003c55 (doc_id=https://www.1955apache.com/blog/jl-audio-kick-panels). Details: {"pathId":"/document/v1/default/danswer_chunk_cohere_embed_english_v3_0/docid/304652f0-2e46-5705-94ef-6a22ea003c55","message":"Rejecting execution due to overload: 100 requests already enqueued"}
  • Retries for document chunk updates

How Has This Been Tested?

  • Locally- various background flows

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Copy link

vercel bot commented Feb 11, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Feb 13, 2025 11:17pm

return basic_retry_wrapper(
make_slack_api_rate_limited(_make_slack_api_call_logged(call))
)(**kwargs)
return basic_retry_wrapper(make_slack_api_rate_limited(call))(**kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to completely remove the logging? or should we truncate or shorten it? I can see us wanting to have debug logging here occasionally for troubleshooting purposes, but it definitely is not helpful to be at the current level of spam by default.

# Then we upsert the document's external permissions in postgres
try:
# Add the users to the DB if they don't exist
batch_add_ext_perm_user_if_not_exists(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more appropriate for the function itself to be more resilient to the type of error that is occurring here rather than the caller being responsible. It should be entirely possible for this function to "just work" and it seems better if we write it as such.

Couple of ideas for batch_add_ext_perm_user_if_not_exists:

  • Catch the IntegrityError if the batch commit fails and go one by one while deliberately ignoring the failure we know can happen.
  • Use Postgres ON CONFLICT DO NOTHING ... could work?

@pablonyx pablonyx enabled auto-merge February 13, 2025 23:23
@pablonyx pablonyx added this pull request to the merge queue Feb 13, 2025
github-merge-queue bot pushed a commit that referenced this pull request Feb 13, 2025
* reduce errors in workers

* quick nit

* update

* nit
@pablonyx pablonyx removed this pull request from the merge queue due to a manual request Feb 13, 2025
@pablonyx pablonyx merged commit eff433b into main Feb 13, 2025
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants