Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: DiffMatchPatch Surrogate Pairs Support #581

Closed
wants to merge 6 commits into from
Closed
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 22 additions & 23 deletions External/diffmatchpatch/DiffMatchPatchCFUtilities.c
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,9 @@ CFIndex diff_commonPrefix(CFStringRef text1, CFStringRef text2) {
char2 = CFStringGetCharacterFromInlineBuffer(&text2_inlineBuffer, i);

if (char1 != char2) {
if ( CFStringIsSurrogateLowCharacter(char1) || CFStringIsSurrogateHighCharacter(char1) ) {
i = MAX(i - 1, 0);
}
return i;
}
}
Expand Down Expand Up @@ -142,7 +145,11 @@ CFIndex diff_commonSuffix(CFStringRef text1, CFStringRef text2) {
char2 = CFStringGetCharacterFromInlineBuffer(&text2_inlineBuffer, (text2_length - i));

if (char1 != char2) {
return i - 1;
if ( CFStringIsSurrogateLowCharacter(char1) || CFStringIsSurrogateHighCharacter(char1) ) {
return MIN(i - 2, 0);
}

return i - 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you see google/diff-match-patch#69 (comment) where new bugs appeared by using this approach?

}
}
return n;
Expand Down Expand Up @@ -648,27 +655,19 @@ CFIndex diff_cleanupSemanticScore(CFStringRef one, CFStringRef two) {
// 'whitespace'. Since this function's purpose is largely cosmetic,
// the choice has been made to use each language's native features
// rather than force total conformity.
UniChar char1 =
CFStringGetCharacterAtIndex(one, (CFStringGetLength(one) - 1));
UniChar char2 =
CFStringGetCharacterAtIndex(two, 0);
Boolean nonAlphaNumeric1 =
!CFCharacterSetIsCharacterMember(alphaNumericSet, char1);
Boolean nonAlphaNumeric2 =
!CFCharacterSetIsCharacterMember(alphaNumericSet, char2);
Boolean whitespace1 =
nonAlphaNumeric1 && CFCharacterSetIsCharacterMember(whiteSpaceSet, char1);
Boolean whitespace2 =
nonAlphaNumeric2 && CFCharacterSetIsCharacterMember(whiteSpaceSet, char2);
Boolean lineBreak1 =
whitespace1 && CFCharacterSetIsCharacterMember(controlSet, char1);
Boolean lineBreak2 =
whitespace2 && CFCharacterSetIsCharacterMember(controlSet, char2);
Boolean blankLine1 =
lineBreak1 && diff_regExMatch(one, &blankLineEndRegEx);
Boolean blankLine2 =
lineBreak2 && diff_regExMatch(two, &blankLineStartRegEx);

UniChar char1 = CFStringGetCharacterAtIndex(one, (CFStringGetLength(one) - 1));
UniChar char2 = CFStringGetCharacterAtIndex(two, 0);
Boolean char1IsSurrogate = CFStringIsSurrogateLowCharacter(char1) || CFStringIsSurrogateHighCharacter(char1);
Boolean char2IsSurrogate = CFStringIsSurrogateLowCharacter(char2) || CFStringIsSurrogateHighCharacter(char1);
Boolean nonAlphaNumeric1 = !CFCharacterSetIsCharacterMember(alphaNumericSet, char1);
Boolean nonAlphaNumeric2 = !CFCharacterSetIsCharacterMember(alphaNumericSet, char2);
Boolean whitespace1 = nonAlphaNumeric1 && CFCharacterSetIsCharacterMember(whiteSpaceSet, char1);
Boolean whitespace2 = nonAlphaNumeric2 && CFCharacterSetIsCharacterMember(whiteSpaceSet, char2);
Boolean lineBreak1 = whitespace1 && CFCharacterSetIsCharacterMember(controlSet, char1);
Boolean lineBreak2 = whitespace2 && CFCharacterSetIsCharacterMember(controlSet, char2);
Boolean blankLine1 = lineBreak1 && diff_regExMatch(one, &blankLineEndRegEx);
Boolean blankLine2 = lineBreak2 && diff_regExMatch(two, &blankLineStartRegEx);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the whitespace changes? It's making it hard to scan and see if any actual changes occurred here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only two new lines there...

  Boolean char1IsSurrogate = CFStringIsSurrogateLowCharacter(char1) || CFStringIsSurrogateHighCharacter(char1);
   Boolean char2IsSurrogate = CFStringIsSurrogateLowCharacter(char2) || CFStringIsSurrogateHighCharacter(char1);

I've relocated all the things because it was next to impossible to understand what was going on. Lemme see if i can revert that... relatively easily!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmsnell there you go! The changes themselves are very low footprint. And the beauty of it? if there are no surrogate pairs anywhere... this will run the same way it ever did.

if (blankLine1 || blankLine2) {
// Five points for blank lines.
return 5;
Expand All @@ -681,7 +680,7 @@ CFIndex diff_cleanupSemanticScore(CFStringRef one, CFStringRef two) {
} else if (whitespace1 || whitespace2) {
// Two points for whitespace.
return 2;
} else if (nonAlphaNumeric1 || nonAlphaNumeric2) {
} else if (nonAlphaNumeric1 && !char1IsSurrogate || nonAlphaNumeric2 && !char1IsSurrogate) {
// One point for non-alphanumeric.
return 1;
}
Expand Down