Study to compare t-Digest and REQ sketch #416
We compared t-digest and ReqSketch in a paper (at KDD'21) with Graham and people from Splunk. I think t-digest was not updated much since then, so I hope the points below are still true. Here's an upshot based on the paper (see Section 6 in the paper):
I'd say that values of […] The choice of the uniform distribution may not be the best for a comparison. I'd suggest also including the log-uniform distribution (uniform on the log scale) that we used in the paper, or a normal distribution.
@PavelVesely although slightly off topic, do you understand why REQ k=12 and k=10 end up retaining about the same number of items and having about the same rank error? Do you think it might be a bug in the implementation?
My guess is that for such a small k, rounding would yield sketches of pretty much the same size. Specifically, since k is the initial section size and section size decreases by a factor of […] Apologies for the belated answer! (You can let me know by email if you'd like me to look at something, but I'm not watching the dev mailing list closely.)
@AlexanderSaydakov The plots are nice, although it's not surprising that t-digest is much better on the uniform distribution. Can you try another distribution that is more skewed?
Sorry for the delay. I plan to do this eventually, but I have been busy with more urgent tasks.
Thanks for the update! In the KDD paper, we tried the log-uniform distribution, that is, choosing a random […] (We actually tried some variations of the log-uniform distribution, like additionally choosing a random sign, or squaring […]
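For anyone who wants to try this input in the comparison, here is a minimal sketch of a log-uniform generator in Python. The exponent range (`min_exp`, `max_exp`), the base 10, and the `random_sign` flag are assumptions made for illustration only; they are not the parameters used in the paper, and the function name is hypothetical.

```python
import math
import random


def log_uniform(rng, min_exp=-10.0, max_exp=10.0, random_sign=False):
    """Draw one value whose base-10 logarithm is uniform on [min_exp, max_exp].

    The range and the base are illustrative assumptions, not the paper's setup.
    """
    exponent = rng.uniform(min_exp, max_exp)  # uniform on the log scale
    value = 10.0 ** exponent                  # map back to the linear scale
    if random_sign and rng.random() < 0.5:    # optional variation: random sign
        value = -value
    return value


rng = random.Random(42)  # fixed seed so that runs are repeatable
sample = [log_uniform(rng) for _ in range(10_000)]

# Most of the mass sits close to zero on the linear scale, with a long right tail,
# which makes this input far more skewed than the plain uniform distribution.
print(min(sample), max(sample))
```

The resulting values can be fed to both sketches in place of the uniform input.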
@AlexanderSaydakov @leerho Btw, are there any differences between your implementation of t-digest and the "official" implementation? I mean primarily algorithmic differences that would influence the accuracy on some datasets.
We implemented merge slightly differently since the reference implementation modifies the input sketches. My testing shows that our version does a bit better in terms of accuracy after merge. Other than that I believe our implementation is equivalent to the reference implementation.
Compare the performance of t-Digest with the closest competitor in the library, REQ sketch.
REQ sketch is the closest competitor because it prioritizes either high-rank accuracy (HRA mode) or low-rank accuracy (LRA mode), unlike the other quantile sketches (KLL, classic), which have the same rank error at every rank.
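To make the distinction concrete, here is a rough Python illustration of how the tolerated rank error scales under the two kinds of guarantee. It is an idealized model, not code from the library: `eps` is a hypothetical accuracy parameter, constants and failure probabilities are ignored, and the function names are made up for this example.

```python
def additive_tolerance(n, eps):
    """Uniform guarantee (KLL, classic): the same error budget at every rank."""
    return eps * n


def hra_tolerance(rank, n, eps):
    """Relative guarantee favoring high ranks: budget shrinks toward the top."""
    return eps * (n - rank)


def lra_tolerance(rank, n, eps):
    """Relative guarantee favoring low ranks: budget shrinks toward the bottom."""
    return eps * rank


n, eps = 1_000_000, 0.01  # illustrative stream length and accuracy parameter
for q in (0.01, 0.5, 0.99, 0.999):
    rank = int(q * n)
    print(q, additive_tolerance(n, eps), hra_tolerance(rank, n, eps),
          lra_tolerance(rank, n, eps))
```

Under this model the uniform budget stays at `eps * n` everywhere, while the HRA budget becomes much tighter near the maximum, which is the regime where t-digest is also designed to be accurate.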
There are a few obvious differences: