Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

services/horizon/internal/db2/history: Insert and query rows from history lookup tables with one query #5415

Merged
merged 15 commits into from
Aug 23, 2024

Conversation

tamirms
Copy link
Contributor

@tamirms tamirms commented Aug 7, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

fix #4864

Currently, the loaders for the history lookup tables attempt to get or create history ids by issuing up to 3 SQL statements. First, we query for any existing rows. Then we insert the rows that are not present. Finally, we query the rows that were just inserted.

In #4864 we discussed that if we included a RETURNING clause to the insert statements we could obtain all the rows both present and inserted in a single query.

However, @urvisavla pointed out:

The RETURNING clause in an INSERT statement only returns the newly inserted rows. To return all rows, including existing ones, we can force an update using the ON CONFLICT UPDATE SET... clause. However this additional update will likely have a negative impact on performance.

So that means the RETURNING clause could reduce the number of SQL statements from 3 to 2 but we would still need to do a SQL statement to find all the pre-existing rows.

But, it turns out we can use a UNION clause to combine the INSERT ... RETURNING * statement with another query which just selects all the pre-existing rows. This technique is described in more detail here:

https://hakibenita.com/postgresql-get-or-create

Known limitations

Unfortunately when deploying this PR to staging I did not observe any significant improvements to the "Ingestion Loaders Duration" metric. Also, with this change we lose insight into the number of rows which were inserted vs queried. Nevertheless, I do think it's still worth including this change as it reduces code complexity

@tamirms tamirms marked this pull request as ready for review August 12, 2024 18:42
@tamirms tamirms requested a review from a team August 12, 2024 18:43
@tamirms
Copy link
Contributor Author

tamirms commented Aug 14, 2024

@urvisavla can you take another look? I believe I've addressed all the feedback

@tamirms
Copy link
Contributor Author

tamirms commented Aug 14, 2024

Actually it turns out there's a possible race condition in the get or create query where the query can return 0 rows if there are 2 transactions attempting to insert the same data concurrently. See https://stackoverflow.com/a/15950324 for more details. I will need to update this PR

@tamirms
Copy link
Contributor Author

tamirms commented Aug 14, 2024

@urvisavla now it's ready for review again. I fixed the race condition but unfortunately we cannot do both insertion and querying within a single sql statement

Copy link
Contributor

@urvisavla urvisavla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're essentially removing the initial select, updating the insert to return the inserted rows, and then adding a select to fetch only the remaining rows, correct?

Also, I'm assuming there's no need to check the performance of the change on staging since we're no longer combining the queries, right?

Changes look good to me. Left a couple of comments for consideration. Also the description needs update.

@tamirms
Copy link
Contributor Author

tamirms commented Aug 15, 2024

@urvisavla

So we're essentially removing the initial select, updating the insert to return the inserted rows, and then adding a select to fetch only the remaining rows, correct?

yes, that's correct

Also, I'm assuming there's no need to check the performance of the change on staging since we're no longer combining the queries, right?

I'm still checking the performance of staging to make sure there are no regressions

I also updated the PR to reduce code duplication between all 4 loaders. PTAL, thanks!

@tamirms tamirms merged commit 2349c8f into stellar:master Aug 23, 2024
23 checks passed
@tamirms tamirms deleted the lookuptable-query branch August 23, 2024 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify history.CreateAccounts() to avoid select statement
2 participants