
Adaptive Radix Tree: high-performance memtable #273

Open · wants to merge 19 commits into base: 6.4.tikv
Conversation

@Little-Wallace Little-Wallace commented Apr 4, 2022

Background

See more algorithm details in https://db.in.tum.de/~leis/papers/ART.pdf.
The Adaptive Radix Tree (ART) is a kind of trie that saves memory while still keeping high performance.
However, the classic algorithm does not explain how to make it work for concurrent reads and writes.
Here I port a high-performance memtable based on this paper. It only supports single-threaded writes, which means we must set allow_concurrent_memtable_write to false. But ART is about 8 times faster than the skiplist, so even a single write thread can sustain a large throughput.

Performance comparison

LD_PRELOAD=/opt/homebrew/lib/libgflags.dylib ./db_bench --db=./data --disable_wal=true --enable_pipelined_write=true --key_size=35 --value_size=100 --write_buffer_size=4096000000 --benchmarks=fillrandom --batch_size=512 --num=1000000 --threads=1 --compression_type=none --allow_concurrent_memtable_write=false

fillrandom : 1.446 micros/op 691349 ops/sec; 89.0 MB/s

LD_PRELOAD=/opt/homebrew/lib/libgflags.dylib ./db_bench --db=./data --disable_wal=true --enable_pipelined_write=true --memtablerep=art --key_size=35 --value_size=100 --write_buffer_size=4096000000 --benchmarks=fillrandom --batch_size=512 --num=1000000 --threads=1 --compression_type=none --allow_concurrent_memtable_write=false

fillrandom : 0.283 micros/op 3527534 ops/sec; 454.2 MB/s

TODO

  • Support snapshot isolation (do not replace the value in place).
  • Port optimistic locking in the future to support multi-threaded writes.
  • Node256 allocates 2048 bytes per node, which is too large. Maybe we could create a compressed Node256 that uses less memory and owns a sub-allocator for the nodes of its subtree.
  • Apply a similar idea to a CompressedNode16.
  • Merge Node and InnerNode to save memory, because a leaf node does not need to store prefix and prefix_len.
class CompressedNode256 : public InnerNode {
 public:
  Node* find_child(uint8_t c) const override {
    // Load a 16-bit index instead of an 8-byte pointer and resolve it
    // through the subtree's own arena.
    uint16_t index = children_index[c].load(std::memory_order_relaxed);
    return arena.get_node(index);
  }

 private:
  NodeArena arena;
  std::atomic<uint16_t> children_index[256];
};

Signed-off-by: Little-Wallace <[email protected]>

Little-Wallace commented Apr 5, 2022

Memory Usage

I tested the AdaptiveRadixTree and the InlineSkipList with an 8-byte key + 16-byte value and an 8-byte key + 136-byte value (including an 8-byte sequence number in the value).

For value = 16 bytes:
AdaptiveRadixTree takes up 13.1MB while InlineSkipList takes up 6.6MB.

For value = 136 bytes:
AdaptiveRadixTree takes up 36.5MB while InlineSkipList takes up 30.4MB.

I think the extra memory cost is worth it.

  // Measure ART memory usage: N entries, each an 8-byte key + 16-byte value.
  const int N = 200000;
  Arena arena;
  AdaptiveRadixTree list(&arena);
  for (int i = 0; i < N; i++) {
    Key key = i;
    char* buf = arena.AllocateAligned(sizeof(Key) + 16);
    const char* d = Encode(key);
    memcpy(buf, d, sizeof(Key));
    list.Insert(buf, sizeof(Key), buf);
  }
  printf("cost memory: %lu\n", arena.ApproximateMemoryUsage());

  // Same workload against InlineSkipList for comparison.
  ConcurrentArena arena;
  TestComparator cmp;
  InlineSkipList<TestComparator> list(cmp, &arena);
  for (int i = 0; i < N; i++) {
    Key key = i;
    char* buf = list.AllocateKey(sizeof(Key) + 16);
    memcpy(buf, &key, sizeof(Key));
    list.Insert(buf);
  }
  printf("cost memory: %lu\n", arena.ApproximateMemoryUsage());

@Little-Wallace Little-Wallace changed the title [WIP] Adaptive Radix Tree: a high-performance memtable [WIP] Adaptive Radix Tree: high-performance memtable Apr 6, 2022
@Little-Wallace Little-Wallace changed the title [WIP] Adaptive Radix Tree: high-performance memtable Adaptive Radix Tree: high-performance memtable Apr 6, 2022

WenyXu commented May 23, 2022

Excellent job; I'm interested in your work. Can I work with you on this? I recently worked with ART, and I ran some micro-benchmarks against B-trees which showed that range-scan performance is poor (roughly similar to the result in the paper). To improve range-scan performance, the first idea that comes to my mind is to add a doubly linked list between the parent nodes of the leaves; I'm going to do more research on this idea. For synchronization, we may look at this paper.

** sequential set **
artTree:    set-seq        1,000,000 ops in 102ms, 9,780,250/sec, 102 ns/op, 86.9 MB, 91 bytes/op
google:     set-seq        1,000,000 ops in 219ms, 4,557,655/sec, 219 ns/op, 54.2 MB, 56 bytes/op
tidwall:    set-seq        1,000,000 ops in 154ms, 6,483,031/sec, 154 ns/op, 38.8 MB, 40 bytes/op
tidwall(G): set-seq        1,000,000 ops in 129ms, 7,740,272/sec, 129 ns/op, 23.6 MB, 24 bytes/op
tidwall:    set-seq-hint   1,000,000 ops in 81ms, 12,298,654/sec, 81 ns/op, 38.8 MB, 40 bytes/op
tidwall(G): set-seq-hint   1,000,000 ops in 61ms, 16,473,638/sec, 60 ns/op, 23.6 MB, 24 bytes/op
tidwall:    load-seq       1,000,000 ops in 42ms, 23,674,685/sec, 42 ns/op, 38.8 MB, 40 bytes/op
tidwall(G): load-seq       1,000,000 ops in 34ms, 29,754,119/sec, 33 ns/op, 23.6 MB, 24 bytes/op
go-arr:     append         1,000,000 ops in 24ms, 40,864,488/sec, 24 ns/op, 26.5 MB, 27 bytes/op

** sequential get **
artTree:    get-seq        1,000,000 ops in 20ms, 49,690,984/sec, 20 ns/op
google:     get-seq        1,000,000 ops in 207ms, 4,831,567/sec, 206 ns/op
tidwall:    get-seq        1,000,000 ops in 151ms, 6,629,358/sec, 150 ns/op
tidwall(G): get-seq        1,000,000 ops in 117ms, 8,519,272/sec, 117 ns/op
tidwall:    get-seq-hint   1,000,000 ops in 68ms, 14,612,574/sec, 68 ns/op
tidwall(G): get-seq-hint   1,000,000 ops in 38ms, 26,476,360/sec, 37 ns/op

** random set **
artTree:    set-rand       1,000,000 ops in 146ms, 6,865,334/sec, 145 ns/op, 86.9 MB, 91 bytes/op
google:     set-rand       1,000,000 ops in 1435ms, 696,714/sec, 1435 ns/op, 44.9 MB, 47 bytes/op
tidwall:    set-rand       1,000,000 ops in 938ms, 1,066,533/sec, 937 ns/op, 44.9 MB, 47 bytes/op
tidwall(G): set-rand       1,000,000 ops in 709ms, 1,409,824/sec, 709 ns/op, 32.9 MB, 34 bytes/op
tidwall:    set-rand-hint  1,000,000 ops in 931ms, 1,073,607/sec, 931 ns/op, 44.9 MB, 47 bytes/op
tidwall(G): set-rand-hint  1,000,000 ops in 628ms, 1,592,353/sec, 628 ns/op, 32.9 MB, 34 bytes/op
tidwall:    set-after-copy 1,000,000 ops in 1015ms, 984,766/sec, 1015 ns/op, 344 bytes, 0 bytes/op
tidwall(G): set-after-copy 1,000,000 ops in 596ms, 1,678,383/sec, 595 ns/op, 344 bytes, 0 bytes/op
tidwall:    load-rand      1,000,000 ops in 907ms, 1,102,022/sec, 907 ns/op, 44.9 MB, 47 bytes/op
tidwall(G): load-rand      1,000,000 ops in 582ms, 1,718,565/sec, 581 ns/op, 32.9 MB, 34 bytes/op

** random get **
artTree:    get-rand       1,000,000 ops in 266ms, 3,763,685/sec, 265 ns/op
google:     get-rand       1,000,000 ops in 1908ms, 524,022/sec, 1908 ns/op
tidwall:    get-rand       1,000,000 ops in 1145ms, 873,394/sec, 1144 ns/op
tidwall(G): get-rand       1,000,000 ops in 599ms, 1,668,960/sec, 599 ns/op
tidwall:    get-rand-hint  1,000,000 ops in 1416ms, 706,306/sec, 1415 ns/op
tidwall(G): get-rand-hint  1,000,000 ops in 719ms, 1,391,549/sec, 718 ns/op

** range **
artTree:    traverse      1,000,000 ops in 11ms, 88,034,949/sec, 11 ns/op
artTree:    iter          1,000,000 ops in 30ms, 33,882,559/sec, 29 ns/op
google:     ascend        1,000,000 ops in 6ms, 180,193,719/sec, 5 ns/op
tidwall:    ascend        1,000,000 ops in 5ms, 208,342,361/sec, 4 ns/op
tidwall(G): iter          1,000,000 ops in 6ms, 153,921,148/sec, 6 ns/op
tidwall(G): scan          1,000,000 ops in 4ms, 222,692,350/sec, 4 ns/op
tidwall(G): walk          1,000,000 ops in 2ms, 401,949,454/sec, 2 ns/op
go-arr:     for-loop      1,000,000 ops in 2ms, 614,266,461/sec, 1 ns/op

see also casbin-mesh/neo#10
