fix: catastrophic backtracking in Core.AggressivelyFixLt #440
When provided with a large HTML document (over a million characters), the `Core.AggressivelyFixLt` regex results in catastrophic backtracking and `$html = null` being returned. TL;DR: HTMLPurifier gives you back a `null`
document.

I tried many times to produce a regex that did not suffer from catastrophic backtracking, but I think it ultimately comes back to the argument of why you should not use regex to parse HTML. The only solutions I could come up with were to either:

- Set `pcre.backtrack_limit` to a higher value
- Disable `Core.AggressivelyFixLt`, but that's sub-optimal given the approach seems to work on documents of a reasonable size
- Check for a `null` return value from `preg_replace_callback` and return `$html` (disable armor logic if a regex error occurs)

The solution in this PR uses a small algorithm that employs only standard string manipulation functions, so it works incredibly fast. The algorithm searches for HTML comments and allows a callback to be run on them.
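For readers who want a feel for the approach, here is a minimal sketch of a comment scanner built on `strpos` and `substr`. It is not the code in this PR: the function name `replaceHtmlComments` is hypothetical, and passing the callback a regex-style matches array (whole comment, body, terminator) is an assumption made so that callbacks written for `preg_replace_callback` could be reused unchanged.

```php
<?php
/**
 * Hypothetical helper (illustration only, not the PR's actual code).
 * Finds each HTML comment with strpos() and replaces it with the
 * callback's return value. The callback receives a preg-style array:
 * [0] whole comment, [1] comment body, [2] terminator ('-->' or ''
 * when the comment is never closed).
 */
function replaceHtmlComments(string $html, callable $callback): string
{
    $out = '';
    $offset = 0;
    $length = strlen($html);

    while ($offset < $length) {
        $start = strpos($html, '<!--', $offset);
        if ($start === false) {
            break; // no more comments
        }
        // Copy the text before the comment unchanged.
        $out .= substr($html, $offset, $start - $offset);

        $bodyStart = $start + 4;
        $end = strpos($html, '-->', $bodyStart);
        if ($end === false) {
            // Unterminated comment: the rest of the document is its body.
            $body = (string) substr($html, $bodyStart);
            $terminator = '';
            $offset = $length;
        } else {
            $body = substr($html, $bodyStart, $end - $bodyStart);
            $terminator = '-->';
            $offset = $end + 3;
        }
        $out .= $callback(['<!--' . $body . $terminator, $body, $terminator]);
    }

    // Append whatever follows the last comment.
    return $out . (string) substr($html, $offset);
}

// Usage: escape "&" and "<" inside comments with placeholder tokens
// (the tokens here are illustrative).
$armored = replaceHtmlComments(
    'a <!-- x < y & z --> b',
    function (array $m): string {
        return '<!--' . strtr($m[1], ['&' => '#amp#', '<' => '#lt#']) . $m[2];
    }
);
// $armored === 'a <!-- x #lt# y #amp# z --> b'
```

Because each `strpos` call resumes from the current offset and never revisits earlier input, the scan does no backtracking and is unaffected by `pcre.backtrack_limit`.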

I've not changed the signatures of the `callbackUndoCommentSubst` and `callbackArmorCommentEntities` functions because they're `public` and might be used by other libraries.