Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce checksum failures #274

Merged
merged 1 commit into from
Mar 10, 2024
Merged

reduce checksum failures #274

merged 1 commit into from
Mar 10, 2024

Conversation

morgo
Copy link
Collaborator

@morgo morgo commented Mar 7, 2024

A Pull Request should be associated with an Issue.

This is another attempt to fix #248

There seems to be a case where the checksum is failing, and inspection of the rows show that it's not garbled data but a last_viewed_at DATETIME(3) column which an update was missed for.

All the cases affected appear to be resume cases (typically it's gone to resume because it has high lock wait timeouts while trying to get a lock). Which leads me to believe it's:

  1. The flushing under lock mechanism. This PR changes it to be the same as used in 8.0 cutover
  2. The key above watermark optimization is incorrectly ignoring rows. I've added protection to make sure there's no race on resume where it could return incorrectly
  3. The binlog checkpoint is ahead of itself. We're pretending work has safely been applied, but it hasn't.

Note: (2) is more likely since (1) occurs under both resume and non-resume cases. I don't think the cause is (3).

There is also a little bit more cleanup to remove 5.7 support.

@morgo morgo force-pushed the mtocker-test-checksum-failure branch 2 times, most recently from 215a93a to 5fbc94e Compare March 7, 2024 16:35
@morgo morgo force-pushed the mtocker-test-checksum-failure branch from 5fbc94e to 81bc1db Compare March 7, 2024 16:42
@morgo morgo requested review from joycse06 and kolbe and removed request for joycse06 March 7, 2024 18:40
// With the lock held, flush one more time under the lock tables.
// Because we know canal is up to date this now guarantees
// we have everything in the new table.
if err := c.feed.Flush(ctx); err != nil {
if err := c.feed.FlushUnderLock(ctx, serverLock); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, This is also a important change to reduce the chances the below check AllChangesFlushed() failing due to parallelism, but I guess not the reason the migrations under consideration failed, as they failed in the checksum phase.

Copy link
Collaborator

@prudhvi prudhvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining the changes.
Removing the races also cleanup up code and makes it more explicit and readable.

@morgo morgo merged commit 17b92c3 into main Mar 10, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

checksum appears to show inconsistencies on high modification rates
2 participants