Garnet crash on load data #766
Garnet 1.0.37, 1.0.39:
The same error occurs on an EPYC 9634 (Zen4), Debian 11 (ldd (Debian GLIBC 2.31-13+deb11u8) 2.31), kernel 5.15.143-1-pve, ZFS:
But there is no error on a Ryzen 9 7900X (Zen4), Arch Linux (ldd (GNU libc) 2.40), kernel 6.11.5-arch1-1, ext4 on LUKS, with the same run under podman (conmon) but rootless:
But the scan took 677 minutes. This is extremely slow!
Can you share exactly what data was loaded into Redis before dumping it to the RESP file? The repro might depend on the number of keys, the size of the values, etc. that are being dumped.
Also, what is the result of ...? Was a save/bgsave invoked, either by the client or by the dump-loading tool? Try re-running with ... The error indicates that a read was performed, and it is not clear why loading a dump would cause a read. Any idea what the actual operations being done were, and why a read would have been performed?
Try loading the same data directly using Resp.benchmark and see if scan is still slow. It is hard to say what the cause of this slowdown might be with the given information; the hash table could be too small, leading to lots of collisions, for example.
Make sure you did not try to store more than this number of distinct objects.
Unfortunately, I cannot provide the data itself, as it is private. But here is the result of the command:
Also, I tried loading different datasets (61M, 30M, and 3M); the failure occurs when 2M-2.5M have been processed.
No, saving was not performed; the scan was performed only on data in memory.
No special reads were performed during loading. I will run more tests next week.
Data was loaded (on the EPYC server) with the following parameters:
But scan is still slow on the EPYC with 336 cores (and the same on the Ryzen 9 7900X with 16 cores). For comparison, a scan of the same data on KeyDB:
1.0.44 has the same slow scan:
And the scan speed drops once ~440-443 MiB have already been scanned (on different Garnet versions).
I didn't understand how to load data with Resp.benchmark.
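A minimal sketch (not from the thread) of how the scan slowdown could be measured from the client side, assuming the redis-py package and a server reachable on the default local port; it reports the cumulative key rate at intervals so the point where throughput drops becomes visible:

```python
# Sketch only: measure client-visible SCAN throughput over time.
# Assumptions: redis-py is installed and the server listens on 127.0.0.1:6379.
import time

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

cursor = 0
seen = 0
start = time.time()
last_report = start

while True:
    # COUNT is only a hint; the server may return more or fewer keys per call.
    cursor, keys = r.scan(cursor=cursor, count=1000)
    seen += len(keys)

    now = time.time()
    if now - last_report >= 10:  # print progress roughly every 10 seconds
        print(f"{seen} keys scanned, {seen / (now - start):.0f} keys/s overall")
        last_report = now

    if cursor == 0:  # SCAN is complete when the cursor returns to 0
        break

print(f"done: {seen} keys in {time.time() - start:.1f} s")
```

Running the same loop against KeyDB or Redis loaded with the same data would give the comparison figures quoted above.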
Is the issue with the "garnet crash" still there, or is that no longer happening? Scan is a separate issue; could you please provide a repro, including data (generated data is fine since you cannot share the real data), for us to diagnose that further?
@badrishc With the above configuration, Garnet no longer crashes. OK, I will create a separate issue about the slow scan.
@rnz Is there still a load-related issue here that should be looked into?
@TalZaccai |
In version 1.0.51, I get a more informative message when loading data fails with AOF enabled:
And now Garnet does not crash:
I tried different AOF settings:
...
Do you get the error even after setting the AOF page size to be sufficiently large? The error with a small AOF page size is expected, because the key-value pair needs to fit on one AOF page; we cannot break up records to span multiple AOF pages. If there is still a problem even after setting the AOF page size to be larger than the inserted records, then there might be an issue. In that case, we would also need some repro logic and synthetic data that causes the crash on loading, so that we can diagnose this further.
Yes.
I'm trying to determine the key that causes this error. It would be easier if the error message displayed the key on which the error occurred.
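One way to locate such a key from the client side (a sketch, not part of the thread) is to SCAN all keys and keep the largest string values. This assumes the redis-py client, string-typed keys (so STRLEN applies), and a server on the default local port:

```python
# Sketch only: find the largest string values via SCAN + STRLEN.
# Assumptions: redis-py client, all keys hold string values, server on 127.0.0.1:6379.
import heapq

import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

top = []  # min-heap of (size, key); keeps the 10 largest values seen so far
for key in r.scan_iter(count=1000):
    size = r.strlen(key)  # byte length of the string value
    if len(top) < 10:
        heapq.heappush(top, (size, key))
    else:
        heapq.heappushpop(top, (size, key))

for size, key in sorted(top, reverse=True):
    print(f"{size:>12}  {key}")
```

`redis-cli --bigkeys` takes a similar approach from the command line, provided the target server supports the commands it issues.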
I think the object being inserted might be very large (e.g. from an HSET command), and therefore crosses the AOF page size. We should close this issue unless this hypothesis is proven otherwise. I tried various scenarios and cannot make it happen either.
Still present in 1.0.54:
Can you somehow prepare a repro (data + command) that we can look at? We are unable to make progress on this without more information, apologies.
@badrishc |
I found the problem key: it has a very big value, about 342 MB. I think it was inserted by a program that was not working correctly and generated incorrect data, base64-encoded with many repeated sequences.
Yes, the server definitely shouldn't interrupt the import process by resetting the client session (redis-cli); instead it should write the error, the operation, and, of course, the key identifier to the server log. For what it's worth, Redis and KeyDB had no problem importing this key.
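Since the real data is private, a synthetic stand-in along these lines could serve as the requested repro. This is a sketch under assumptions: the key and value shapes and the file name are invented, and only the rough key count (2-2.5M) and the ~342 MB base64 value with repetitive content come from the thread. It writes RESP-encoded SET commands to a file suitable for `redis-cli --pipe`:

```python
# Sketch only: generate a synthetic RESP file with one oversized value.
# Key/value shapes and the file name are invented for illustration.
import base64
import os


def resp_set(key: bytes, value: bytes) -> bytes:
    """Encode `SET key value` as a RESP array of bulk strings."""
    parts = [b"SET", key, value]
    out = [b"*%d\r\n" % len(parts)]
    for p in parts:
        out.append(b"$%d\r\n%s\r\n" % (len(p), p))
    return b"".join(out)


with open("db0-synthetic.resp", "wb") as f:
    # a few million small keys, roughly the range where failures were observed
    for i in range(2_500_000):
        f.write(resp_set(b"key:%d" % i, b"value-%d" % i))
    # one ~340 MB base64 value built from a repeated 1 KiB block,
    # mimicking the "many repeated sequences" described above
    big = base64.b64encode(os.urandom(1024) * 250_000)
    f.write(resp_set(b"bigkey", big))
```

Loading the file with `redis-cli --pipe < db0-synthetic.resp` against a server with AOF enabled and a small AOF page size would be expected to exercise the same oversized-record path, if the hypothesis above is correct.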
We have to let the client know about this error, because otherwise they would not realize that a key-value pair was skipped, which can lead to data loss. A good client will see this error and flag it or fix it.
If you increase the AOF page size to 512 MB, you will not run into the error.
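As a back-of-the-envelope check (assuming, as is common for log-structured stores, that the page size is a power of two and that one record must fit entirely within a single page), the smallest page that can hold the ~342 MB value is indeed 512 MB:

```python
# Assumption: AOF page sizes are powers of two and a record (value plus
# some header overhead) must fit on a single page.
value_size = 342 * 1024 * 1024  # the ~342 MB problem value

page_size = 1
while page_size < value_size:
    page_size *= 2  # next power of two

print(page_size // (1024 * 1024), "MiB")  # -> 512 MiB
```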
Actually, it is enough to check the number of keys with a scan; i.e. you probably shouldn't break the session on purpose, especially when the data is uploaded in pipe mode. Garnet has transactions for strict processing. Or it would make sense to add the ability to configure the error-handling mode.
Yep, I checked 1.0.54 with that setting: the data loaded OK.
But a scan of this data slows down after about 443 MiB:
The server does not even know you are doing data loading; it is only receiving SET commands. We can't fail a SET command and let the session continue normally, because there is no safe error response for SET in the protocol. We could return NIL, but clients may not expect this return value, which could cause issues for the client, such as a crash, which is worse than losing the session gracefully.
The client may also be setting the same value multiple times; if a SET fails, they cannot detect this by checking the number of keys.
If you see a bug in scan and you have a self-contained repro, please open a separate issue for it.
Describe the bug
Here
redis-cli --pipe < db0.resp
Steps to reproduce the bug
redis-cli --pipe < db0.resp
Expected behavior
The data from the RESP file is fully loaded into Garnet via redis-cli.
Screenshots
No response
Release version
v1.0.35
IDE
No response
OS version
Debian 11
Additional context
FYI: Dragonfly loads the same RESP file successfully (in the same environment).