Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: DiffMatchPatch Surrogate Pairs Support #581

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 14 additions & 3 deletions External/diffmatchpatch/DiffMatchPatchCFUtilities.c
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,9 @@ CFIndex diff_commonPrefix(CFStringRef text1, CFStringRef text2) {
char2 = CFStringGetCharacterFromInlineBuffer(&text2_inlineBuffer, i);

if (char1 != char2) {
if ( CFStringIsSurrogateLowCharacter(char1) || CFStringIsSurrogateHighCharacter(char1) ) {
i = MAX(i - 1, 0);
}
return i;
}
}
Expand Down Expand Up @@ -142,7 +145,11 @@ CFIndex diff_commonSuffix(CFStringRef text1, CFStringRef text2) {
char2 = CFStringGetCharacterFromInlineBuffer(&text2_inlineBuffer, (text2_length - i));

if (char1 != char2) {
return i - 1;
if ( CFStringIsSurrogateLowCharacter(char1) || CFStringIsSurrogateHighCharacter(char1) ) {
return MIN(i - 2, 0);
}

return i - 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you see google/diff-match-patch#69 (comment) where new bugs appeared by using this approach?

}
}
return n;
Expand Down Expand Up @@ -652,6 +659,10 @@ CFIndex diff_cleanupSemanticScore(CFStringRef one, CFStringRef two) {
CFStringGetCharacterAtIndex(one, (CFStringGetLength(one) - 1));
UniChar char2 =
CFStringGetCharacterAtIndex(two, 0);
Boolean char1IsSurrogate =
CFStringIsSurrogateLowCharacter(char1) || CFStringIsSurrogateHighCharacter(char1);
Boolean char2IsSurrogate =
CFStringIsSurrogateLowCharacter(char2) || CFStringIsSurrogateHighCharacter(char1);
Boolean nonAlphaNumeric1 =
!CFCharacterSetIsCharacterMember(alphaNumericSet, char1);
Boolean nonAlphaNumeric2 =
Expand All @@ -668,7 +679,7 @@ CFIndex diff_cleanupSemanticScore(CFStringRef one, CFStringRef two) {
lineBreak1 && diff_regExMatch(one, &blankLineEndRegEx);
Boolean blankLine2 =
lineBreak2 && diff_regExMatch(two, &blankLineStartRegEx);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the whitespace changes? It's making it hard to scan and see if any actual changes occurred here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only two new lines there...

  Boolean char1IsSurrogate = CFStringIsSurrogateLowCharacter(char1) || CFStringIsSurrogateHighCharacter(char1);
   Boolean char2IsSurrogate = CFStringIsSurrogateLowCharacter(char2) || CFStringIsSurrogateHighCharacter(char1);

I've relocated all the things because it was next to impossible to understand what was going on. Lemme see if i can revert that... relatively easily!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmsnell there you go! The changes themselves are very low footprint. And the beauty of it? if there are no surrogate pairs anywhere... this will run the same way it ever did.

if (blankLine1 || blankLine2) {
// Five points for blank lines.
return 5;
Expand All @@ -681,7 +692,7 @@ CFIndex diff_cleanupSemanticScore(CFStringRef one, CFStringRef two) {
} else if (whitespace1 || whitespace2) {
// Two points for whitespace.
return 2;
} else if (nonAlphaNumeric1 || nonAlphaNumeric2) {
} else if ((nonAlphaNumeric1 && !char1IsSurrogate) || (nonAlphaNumeric2 && !char2IsSurrogate)) {
// One point for non-alphanumeric.
return 1;
}
Expand Down
39 changes: 39 additions & 0 deletions SimperiumTests/DiffMatchPatchTest.m
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,17 @@ - (void)testDiffCommonPrefixTest {
XCTAssertEqual((NSUInteger)4, [dmp diff_commonPrefixOfFirstString:@"1234" andSecondString:@"1234xyz"], @"Common suffix whole case failed.");
}

- (void)testDiffCommonPrefixDoesntSplitSurrogatePairs {
DiffMatchPatch *dmp = [DiffMatchPatch new];

NSString *pristine = @"☺️🖖🏿";
NSString *edited = @"☺️😃🖖🏿";
NSString *common = @"☺️";

NSUInteger prefix = [dmp diff_commonPrefixOfFirstString:pristine andSecondString:edited];
XCTAssertEqual(prefix, common.length, @"Common Prefix should match the common emoji's length");
}

- (void)testDiffCommonSuffixTest {
DiffMatchPatch *dmp = [DiffMatchPatch new];

Expand All @@ -62,6 +73,19 @@ - (void)testDiffCommonSuffixTest {
XCTAssertEqual((NSUInteger)4, [dmp diff_commonSuffixOfFirstString:@"1234" andSecondString:@"xyz1234"], @"Detect any common suffix. Whole case.");
}

- (void)testDiffCommonSuffixDoesntSplitSurrogatePairs {
DiffMatchPatch *dmp = [DiffMatchPatch new];

NSString *pristine = @"☺️🖖🏿";
NSString *edited = @"☺️😃🖖🏿";
NSString *expected = @"🖖🏿";

NSUInteger suffix = [dmp diff_commonSuffixOfFirstString:pristine andSecondString:edited];
NSString *common = [pristine substringFromIndex:pristine.length - suffix];

XCTAssertEqualObjects(expected, common, @"Common Suffix should match the last emoji");
}

- (void)testDiffCommonOverlapTest {
DiffMatchPatch *dmp = [DiffMatchPatch new];

Expand Down Expand Up @@ -388,6 +412,21 @@ - (void)testDiffCleanupSemanticLosslessTest {
XCTAssertEqualObjects(expectedResult, diffs, @"Sentence boundaries.");
}

- (void)testDiffCleanupSemanticLosslessDoesNotTamperWithSurrogatePairsTest {
NSArray *expected = @[
[Diff diffWithOperation:DIFF_EQUAL andText:@"☺️"],
[Diff diffWithOperation:DIFF_INSERT andText:@"😃"],
[Diff diffWithOperation:DIFF_EQUAL andText:@"🖖🏿"]
];

NSMutableArray *diffs = [expected mutableCopy];

DiffMatchPatch *dmp = [DiffMatchPatch new];
[dmp diff_cleanupSemanticLossless:diffs];

XCTAssertEqualObjects(diffs, expected, @"The result should match the input!");
}

- (void)testDiffCleanupSemanticTest {
DiffMatchPatch *dmp = [DiffMatchPatch new];
NSMutableArray *expectedResult = nil;
Expand Down