Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add more settings and tunability to snapshot table and dictionary #28062

Merged
merged 2 commits into from
Jan 30, 2025

Conversation

tkaemming
Copy link
Contributor

Problem

If max_memory_usage is set on the cluster client settings, the INSERT INTO statement in the populate step can fail due to the large aggregation needed to ensure we have the latest value for each override in the overrides table.

Changes

optimize_aggregation_in_order reduces the memory (which we don't have enough of) at the cost of some time (which we have plenty of.)

This also allows providing a max_memory_usage parameter to the dictionary as an additional safeguard against excessive memory usage on the cluster during job execution.

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Already covered by existing test, and ensured optimize_aggregation_in_order improved the peak memory usage of the relevant query in production

@tkaemming tkaemming marked this pull request as ready for review January 29, 2025 22:29
@tkaemming tkaemming requested a review from a team January 29, 2025 22:29
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This PR adds memory optimization settings to the person overrides squashing process to prevent out-of-memory errors during large data operations.

  • Added optimize_aggregation_in_order=1 in dags/person_overrides.py to reduce memory usage during snapshot table population
  • Added max_memory_usage parameter to dictionary creation for controlling memory consumption
  • Added max_execution_time parameter to dictionary creation for timeout control
  • Improved memory efficiency by trading execution speed for lower memory usage during aggregation steps

1 file(s) reviewed, no comment(s)
Edit PR Review Bot Settings | Greptile

@tkaemming tkaemming merged commit fdee4c8 into master Jan 30, 2025
95 checks passed
@tkaemming tkaemming deleted the more-squash-settings branch January 30, 2025 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants