fix: catastrophic backtracking in Core.AggressivelyFixLt #440

Open
wants to merge 1 commit into
base: master

bytestream
Contributor

When given a large HTML document (over a million characters), the Core.AggressivelyFixLt regex hits catastrophic backtracking and $html = null is returned. TL;DR: HTMLPurifier gives you back a null document.

I tried many times to write a regex that does not suffer from catastrophic backtracking, but I think it ultimately comes back to the argument for why you should not use regex to parse HTML. The only alternatives I could come up with were to either:

  • Increase pcre.backtrack_limit to a higher value
  • Disable Core.AggressivelyFixLt, which is sub-optimal given that the approach seems to work on documents of a reasonable size
  • Handle the null return value from preg_replace_callback and return the original $html (that is, skip the armor logic if a regex error occurs)

The solution in this PR uses a small algorithm built only on standard string manipulation functions, so it is very fast and is not subject to the backtrack limit. The algorithm searches for HTML comments and allows a callback to be run on each of them.
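
For readers who want to see the shape of such a scan, here is a minimal sketch of the idea. It is written in Python rather than PHP and is not the code from this PR: the names `replace_comments` and `armor`, the callback signature, and the treatment of an unterminated comment as running to the end of the input are all assumptions made for illustration.

```python
from typing import Callable

COMMENT_OPEN = "<!--"
COMMENT_CLOSE = "-->"


def replace_comments(html: str, callback: Callable[[str, str], str]) -> str:
    """Run `callback` on every HTML comment using plain substring search.

    Hypothetical helper, not the PR's implementation. The callback receives
    (body, terminator), where terminator is "-->" or "" for an unterminated
    comment, and returns the replacement for the whole comment.
    """
    out = []
    pos = 0
    while True:
        start = html.find(COMMENT_OPEN, pos)
        if start == -1:
            # No more comments: copy the rest of the document unchanged.
            out.append(html[pos:])
            break
        out.append(html[pos:start])
        body_start = start + len(COMMENT_OPEN)
        end = html.find(COMMENT_CLOSE, body_start)
        if end == -1:
            # Assumption: an unterminated comment extends to end of input.
            out.append(callback(html[body_start:], ""))
            break
        out.append(callback(html[body_start:end], COMMENT_CLOSE))
        pos = end + len(COMMENT_CLOSE)
    return "".join(out)


def armor(body: str, terminator: str) -> str:
    """Example callback: escape '<' inside the comment body only."""
    return COMMENT_OPEN + body.replace("<", "&lt;") + terminator


print(replace_comments("a<!--x<y-->b<c", armor))  # a<!--x&lt;y-->b<c
```

Each search resumes where the previous one stopped, so the scan moves forward through the input once and has no backtracking state to exhaust, whatever the document size.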

I've not messed with the signatures of the callbackUndoCommentSubst and callbackArmorCommentEntities functions because they're public and might be used by other libraries.
