-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize sanitizing www-form-urlencoded input #58
base: main
Are you sure you want to change the base?
Conversation
Using gsub with block syntax is slow. This implementation uses lookup tables to cache the mapping of encoded/decoded characters since these are known subsets / ranges, and handle replacement in a way that doesn't have any overhead for yielding. Benchmarks for random payloads containing 20% urlencoded characters show a considerable improvement: Before: 0.000900 0.000565 0.003071 0.033079 0.305175 3.015223 After: 0.000836 0.000488 0.001445 0.011248 0.110888 1.107596 Tested data payload sizes are 10b, 100b, 1kb, 10kb, 100kb, 1mb.
I don't understand the correctness or the performance properties of this PR to be comfortable signing off on it. |
This matches the implementation from ruby stdlib, and returns uppercase characters for hex.
This keeps the benefit of avoiding sprintf on common codepoints without ballooning the lookup table too much.
Hi @whitequark, On correctness—I just reverted one change back to the This is mainly changing I added a test for benchmarking: For a 100kb www-form-urlencoded payload, I'm measuring master branch at about 15ms per request, and these revisions at about 5ms. I'm not yet convinced of this change since I just wrote it. I only want to share it to get more eyes on it & decide if it's pursuing further. I won't be sad if you decide to close it. 😄 |
FWIW we have an endpoint that processes a massive form-encoded payload and just this middleware can take anywhere from 1.5s to 11s(!!) to process the request body. So we'd love this (or any performance-focused PR) to land. I can take a closer look later this week, but I'd be curious what it will take to get this merged & released @whitequark |
Please try running this PR in your environment, and if it works well, I'll merge it. |
Using gsub with block syntax is slow. This implementation uses lookup tables to
cache the mapping of encoded/decoded characters since these are known subsets /
ranges, and handle replacement in a way that doesn't have any overhead for
yielding.
Benchmarks for random payloads containing 20% urlencoded characters show a
considerable improvement:
Before:
After:
Tested data payload sizes are 10b, 100b, 1kb, 10kb, 100kb, 1mb.