Garnet crash on load data #766
Garnet 1.0.37, 1.0.39:
The same error occurs on an EPYC 9634 (Zen4), Debian 11 (ldd (Debian GLIBC 2.31-13+deb11u8) 2.31), kernel 5.15.143-1-pve, ZFS:
But there is no error on a Ryzen 9 7900X (Zen4), Arch Linux (ldd (GNU libc) 2.40), kernel 6.11.5-arch1-1, ext4 on LUKS, with the same run under podman (conmon) but rootless:
But the scan took 677 minutes. This is extremely slow!
Can you share exactly what data was loaded into Redis before dumping it to the RESP file? The repro might depend on the number of keys, the size of the values, etc. that are being dumped.
Also, what is the result of ...? Was a save/bgsave invoked, either by the client or by the dump-loading tool? Try re-running with ... The error indicates that a read was performed, and it is not clear why loading a dump would cause a read. Any idea what the actual operations being done were, and why a read would have been performed?
Try loading the same data directly using Resp.benchmark and see if scan is still slow. It is hard to say what the cause of this slowdown might be with the given information; the hash table could be too small, leading to lots of collisions, for example.
Make sure you did not try to store more than this number of distinct objects.
Unfortunately, I cannot provide the data itself, as it is private. But here is the result of the command:
Also, I tried loading different datasets (61M, 30M, and 3M); the failure occurs when 2M-2.5M have been processed.
No, saving was not performed; the scan was performed only on data in memory.
No special reads were performed during loading. I will run more tests next week.
Data was loaded (on the EPYC server) with the following parameters:
But scan is still slow on the EPYC with 336 cores (and the same on the Ryzen 9 7900X with 16 cores). For comparison, a scan of the same data on KeyDB:
1.0.44 has the same slow scan:
And the scan speed drops once ~440-443 MiB have already been scanned (on different Garnet versions).
I didn't understand how to load data with Resp.benchmark.
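A minimal sketch (not from the thread) of how the scan slowdown could be measured from the client side, assuming the redis-py package and a server reachable on the default local port; it reports the cumulative key rate at intervals so the point where throughput drops becomes visible:

```python
# Sketch only: measure client-visible SCAN throughput over time.
# Assumptions: redis-py is installed and the server listens on 127.0.0.1:6379.
import time

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

cursor = 0
seen = 0
start = time.time()
last_report = start

while True:
    # COUNT is only a hint; the server may return more or fewer keys per call.
    cursor, keys = r.scan(cursor=cursor, count=1000)
    seen += len(keys)

    now = time.time()
    if now - last_report >= 10:  # print progress roughly every 10 seconds
        print(f"{seen} keys scanned, {seen / (now - start):.0f} keys/s overall")
        last_report = now

    if cursor == 0:  # SCAN is complete when the cursor returns to 0
        break

print(f"done: {seen} keys in {time.time() - start:.1f} s")
```

Running the same loop against KeyDB or Redis loaded with the same data would give the comparison figures quoted above.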
Is the issue with the "garnet crash" still there, or is that no longer happening? Scan is a separate issue; could you please provide a repro, including data (generated data is fine since you cannot share the real data), for us to diagnose that further?
@badrishc With the above configuration, Garnet no longer crashes. OK, I will create a separate issue about the slow scan.
@rnz Is there still a load-related issue here that should be looked into?
@TalZaccai |
In version 1.0.51, I get a more informative message when loading data fails with AOF enabled:
And now Garnet does not crash:
I tried different AOF settings:
...
Do you get the error even after setting the AOF page size to be sufficiently large? The error with a small AOF page size is expected, because the key-value pair needs to fit on one AOF page; we cannot break up records to span multiple AOF pages. If there is still a problem even after setting the AOF page size to be larger than the inserted records, then there might be an issue. In that case, we would also need some repro logic and synthetic data that causes the crash on loading, so that we can diagnose this further.
Yes.
I'm trying to determine the key that causes this error. It would be easier if the error message displayed the key on which the error occurred.
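One way to locate such a key from the client side (a sketch, not part of the thread) is to SCAN all keys and keep the largest string values. This assumes the redis-py client, string-typed keys (so STRLEN applies), and a server on the default local port:

```python
# Sketch only: find the largest string values via SCAN + STRLEN.
# Assumptions: redis-py client, all keys hold string values, server on 127.0.0.1:6379.
import heapq

import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

top = []  # min-heap of (size, key); keeps the 10 largest values seen so far
for key in r.scan_iter(count=1000):
    size = r.strlen(key)  # byte length of the string value
    if len(top) < 10:
        heapq.heappush(top, (size, key))
    else:
        heapq.heappushpop(top, (size, key))

for size, key in sorted(top, reverse=True):
    print(f"{size:>12}  {key}")
```

`redis-cli --bigkeys` takes a similar approach from the command line, provided the target server supports the commands it issues.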
I think the object being inserted might be very large (e.g. from an HSET command), and therefore crosses the AOF page size. We should close this issue unless this hypothesis is proven otherwise. I tried various scenarios and cannot make it happen either.
Still present in 1.0.54:
Can you somehow prepare a repro (data + command) that we can look at? We are unable to make progress on this without more information, apologies.
@badrishc |
I found the problem key: it has a very big value, about 342 MB. I think it was inserted by a program that was not working correctly and generated incorrect data, base64-encoded with many repeated sequences.
Yes, the server definitely shouldn't interrupt the import process by resetting the client session (redis-cli); instead it should write the error, the operation, and, of course, the key identifier to the server log. For what it's worth, Redis and KeyDB had no problem importing this key.
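Since the real data is private, a synthetic stand-in along these lines could serve as the requested repro. This is a sketch under assumptions: the key and value shapes and the file name are invented, and only the rough key count (2-2.5M) and the ~342 MB base64 value with repetitive content come from the thread. It writes RESP-encoded SET commands to a file suitable for `redis-cli --pipe`:

```python
# Sketch only: generate a synthetic RESP file with one oversized value.
# Key/value shapes and the file name are invented for illustration.
import base64
import os


def resp_set(key: bytes, value: bytes) -> bytes:
    """Encode `SET key value` as a RESP array of bulk strings."""
    parts = [b"SET", key, value]
    out = [b"*%d\r\n" % len(parts)]
    for p in parts:
        out.append(b"$%d\r\n%s\r\n" % (len(p), p))
    return b"".join(out)


with open("db0-synthetic.resp", "wb") as f:
    # a few million small keys, roughly the range where failures were observed
    for i in range(2_500_000):
        f.write(resp_set(b"key:%d" % i, b"value-%d" % i))
    # one ~340 MB base64 value built from a repeated 1 KiB block,
    # mimicking the "many repeated sequences" described above
    big = base64.b64encode(os.urandom(1024) * 250_000)
    f.write(resp_set(b"bigkey", big))
```

Loading the file with `redis-cli --pipe < db0-synthetic.resp` against a server with AOF enabled and a small AOF page size would be expected to exercise the same oversized-record path, if the hypothesis above is correct.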
We have to let the client know about this error, because otherwise they would not realize that a key-value pair was skipped, which can lead to data loss. A good client will see this error and flag it or fix it.
If you increase the AOF page size to 512 MB, you will not run into the error.
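As a back-of-the-envelope check (assuming, as is common for log-structured stores, that the page size is a power of two and that one record must fit entirely within a single page), the smallest page that can hold the ~342 MB value is indeed 512 MB:

```python
# Assumption: AOF page sizes are powers of two and a record (value plus
# some header overhead) must fit on a single page.
value_size = 342 * 1024 * 1024  # the ~342 MB problem value

page_size = 1
while page_size < value_size:
    page_size *= 2  # next power of two

print(page_size // (1024 * 1024), "MiB")  # -> 512 MiB
```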
Actually, it is enough to check the number of keys with a scan; i.e. you probably shouldn't break the session on purpose, especially when the data is uploaded in pipe mode. Garnet has transactions for strict processing. Or it would make sense to add the ability to configure the error-handling mode.
Yep, I checked 1.0.54 with that setting: the data loaded OK.
But a scan of this data slows down after about 443 MiB:
The server does not even know you are doing data loading; it is only receiving SET commands. We can't fail a SET command and let the session continue normally, because there is no safe error response for SET in the protocol. We could return NIL, but clients may not expect this return value, which could cause issues for the client, such as a crash, which is worse than losing the session gracefully.
The client may also be setting the same value multiple times; if a SET fails, they cannot detect this by checking the number of keys.
If you see a bug in scan and you have a self-contained repro, please open a separate issue for it.
Describe the bug
Here
redis-cli --pipe < db0.resp
Steps to reproduce the bug
redis-cli --pipe < db0.resp
Expected behavior
The data from the RESP file is fully loaded into Garnet via redis-cli.
Screenshots
No response
Release version
v1.0.35
IDE
No response
OS version
Debian 11
Additional context
FYI: Dragonfly loads the same RESP file successfully (in the same environment).