Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve inline verifier mismatch errors #361

Merged
merged 9 commits into from
Sep 11, 2024

Conversation

driv3r
Copy link
Contributor

@driv3r driv3r commented Aug 29, 2024

No description provided.

@driv3r driv3r force-pushed the improve-inline-verifier-mismatch-errors branch from 154e234 to 2dd8c90 Compare September 3, 2024 12:40
Copy link
Contributor Author

@driv3r driv3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New message format:

From

[paginationKeys: 7 (source: boo, target: aaa) ]

To

[PKs: 7 (type: content difference, source: boo, target: aaa, column: name) ]

In case of missing rows we skip the source/target

MismatchRowMissingOnSource mismatchType = "row missing on source"
MismatchRowMissingOnTarget mismatchType = "row missing on target"
MismatchContentDifference mismatchType = "content difference"
MismatchChecksumDifference mismatchType = "rows checksum difference"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added checksum difference next to content one, to differentiate between whole row checksum comparison and column data one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't content & checksum difference the same thing? especially now that we are reporting missing row & column separately. If all the columns are there, and content is different checksum will be different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a difference, because we check both checksum of a whole row + compare each rows columns separately, so the checksum difference is for when the SQL query generating checksum for whole row doesn't match one in source, while the content difference is when we compare each single column of a row and encounter and issue there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 If i understand this right, then maybe we should rename these to MismatchFieldDifference and MismatchRowChecksumDifference?

type InlineVerifierMismatches struct {
Pk uint64
SourceChecksum string
TargetChecksum string
MismatchColumn string
MismatchType mismatchType
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flatten the structure a bit, as we always want to have the type in

if mismatch.TargetChecksum != "" {
messageBuf.WriteString(", target: ")
messageBuf.WriteString(mismatch.TargetChecksum)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source and target checksum are optional, as we don't include them in case rows are missing (source or target)

mismatchSet[paginationKey] = InlineVerifierMismatches{
Pk: paginationKey,
MismatchType: MismatchChecksumDifference,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I couldn't find any reason why !exist was after the bytes comparison, it's counter intuitive as we should first check if rows are there
  • That's why I've split up check for !exists from !bytes.Equal(sourceHash, targetHash) so we could properly define the mismatch type

if !bytes.Equal(sourceHash, targetHash) || !exists {
for paginationKey, _ := range source {
_, exists := target[paginationKey]
if !exists {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if rows exist in both source & target, then they are already compared above, no need to check it again

MismatchColumn: colName,
}
break // no need to compare other columns
} else if !bytes.Equal(sourceData, targetData) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • again, split !exists check from bytes comparison
  • for bytes comparison, generate checksums for columns data (avoid storing data which might be PII)

break
for colName := range sourceDecompressedColumns {
_, exists := targetDecompressedColumns[colName]
if !exists {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, there's no point of comparing the data for the second time, it should be the same as in the loop above, check for column existence only

}
compressedMismatch := compareDecompressedData(sourceData, targetData)
for paginationKey, mismatch := range compressedMismatch {
mismatches[paginationKey] = mismatch
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this, even if there's a hash comparison already there, we will get more details on the column involved

for tableName, _ := range mismatches[schemaName] {
sortedTables = append(sortedTables, tableName)
}
sort.Strings(sortedTables)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort tables for predictability

@driv3r driv3r self-assigned this Sep 10, 2024
@driv3r driv3r marked this pull request as ready for review September 10, 2024 11:06
@driv3r driv3r force-pushed the improve-inline-verifier-mismatch-errors branch from 0593a70 to 53c5c66 Compare September 10, 2024 14:05
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proper unit tests! 🎉

@driv3r driv3r merged commit d0a5d63 into main Sep 11, 2024
9 of 10 checks passed
@driv3r driv3r deleted the improve-inline-verifier-mismatch-errors branch September 11, 2024 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants