This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Persistent Trace with RocksDB #124

Merged: 5 commits merged into main from rocksdb2 on Oct 28, 2022

Conversation

@gz (Contributor) commented Jul 23, 2022

Opened a PR for now so I can dump information somewhere, but this isn't ready to merge. If we are fine with the current overheads I measured in the OrdZSet experiment (see below), then my next step will be to build a persistent trace by porting this code from the persistent OrdZSet into something that implements the Trace functionality instead.

Notes on using RocksDB

  • RocksDB seek calls always start from the beginning, whereas our Cursor seek
    does not

    • The fix is just to maintain the last key in the Cursor struct
    • Not sure if this has any downsides for RocksDB seek performance;
      it probably doesn't matter (RocksDB keeps an index to the right block)
  • Comparison of complex keys (see the first sketch after this list)

    • By default RocksDB uses bytewise comparison (of the serialized binary data)
    • That won't work if we have complex Rust keys
    • We can set a custom comparator function in RocksDB that calls into the
      Rust Ord implementation
    • The downside is that we need to deserialize the key every time it is called
    • We might want to measure the impact of this, and if it turns out to be
      significant we can avoid it in certain well-specified but common(?) cases,
      e.g., if the key is a usize then the serialized bytes (using bincode with
      big-endian encoding) preserve lexicographic ordering
  • Gets

    • get_pinned is preferable over get, as it avoids a Vec
      allocation/deallocation on every lookup (also shown in the first sketch
      after this list)
  • Merging

    • One way is to take the files of one (the bigger) batch, use
      ingest_external_file to create a new "batch", and then use db.merge() to
      merge in the other (smaller) half
    • The alternative is to just do the merge in Rust and write out a new DB
    • Not sure which one is faster or better...
  • One database per DS/Trace, or one DB with a ColumnFamily per DS/Trace?

    • Both can give a unique key-space
    • We can share a Cache in both cases (I think)
    • ColumnFamily approach (see the open_cf sketch after this list):
      • The API is odd: it requires naming all ColumnFamilies on DB::open and
        fails if that isn't done
      • There is a separate call you need to use to get a Vec of all CFs
      • You then supply that to DB::open and hope no CF got created in the
        meantime?
      • This seems strange: what about liveness if lots of CFs are created, etc.?
      • I guess the workaround is to open the DB once in the program and access
        it through a singleton
      • I haven't seen anything on whether this feature is meant for tens,
        hundreds, or thousands of CFs
    • Many-databases approach:
      • Makes me think CFs are not what we want; maybe we need one DB per
        DS/Trace?
      • One advantage is that once we've written it, we can open the DB as
        read-only
      • Not sure if this has more overhead in the end (as opposed to the
        ColumnFamily approach) in perf, memory, or disk consumption
      • There are atomic operations across ColumnFamilies but not across
        different databases
  • ThreadingMode: there are two ways; we currently use MultiThreaded

    • SingleThreaded apparently works multi-threaded too, so it's a bit
      confusing, but...
    • SingleThreaded takes a &mut self to create a ColumnFamily whereas
      MultiThreaded takes a &self and uses the DB's internal RwLock
    • Since we have a singleton RocksDB instance in the code, I just used
      MultiThreaded
    • According to the docs, MultiThreaded should be used if you want
      multi-threaded access to a single ColumnFamily
    • We likely don't need this, so we can investigate whether there is a perf
      benefit in going from MultiThreaded to SingleThreaded
    • Aside: when porting LevelDB to NrOS we found we got contended on the
      LevelDB rwlock; replacing it with a better variant helped scale reads...
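
A minimal sketch of the custom-comparator and get_pinned points above, assuming the rocksdb 0.18 and bincode 2.0.0-rc.2 crates from this PR's Cargo.toml (with bincode's derive feature). The Key type, comparator name, and DB path are made up for illustration, and set_comparator's exact signature differs between rust-rocksdb versions:

```rust
use std::cmp::Ordering;
use rocksdb::{Options, DB};

// Hypothetical complex key; its `Ord` impl defines the order the trace needs.
#[derive(bincode::Encode, bincode::Decode, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    name: String,
    weight: i64,
}

// RocksDB hands the comparator raw bytes, so both sides must be deserialized
// before the Rust `Ord` impl can run -- the per-call cost discussed above.
fn rust_ord_compare(a: &[u8], b: &[u8]) -> Ordering {
    let cfg = bincode::config::standard();
    let (ka, _): (Key, usize) = bincode::decode_from_slice(a, cfg).expect("valid key bytes");
    let (kb, _): (Key, usize) = bincode::decode_from_slice(b, cfg).expect("valid key bytes");
    ka.cmp(&kb)
}

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_comparator("rust-ord", rust_ord_compare);

    let db = DB::open(&opts, "/tmp/comparator-sketch").expect("open DB");
    let k = bincode::encode_to_vec(
        Key { name: "a".into(), weight: 1 },
        bincode::config::standard(),
    )
    .expect("encode key");
    db.put(&k, b"some value").expect("put");

    // `get_pinned` returns a slice pinned inside RocksDB, avoiding the Vec
    // allocation/deallocation that `get` performs on every lookup.
    if let Some(v) = db.get_pinned(&k).expect("get") {
        assert_eq!(&*v, b"some value");
    }

    // For plain integer keys no custom comparator is needed: big-endian bytes
    // already sort correctly under the default bytewise comparator.
    assert!(1u64.to_be_bytes() < 2u64.to_be_bytes());
}
```

And the ColumnFamily "open dance" described above, as a sketch: enumerate the existing CFs with one call, then pass them all back to open (DB::list_cf fails if the DB doesn't exist yet, hence the unwrap_or_default):

```rust
use rocksdb::{Options, DB};

fn open_with_all_cfs(path: &str) -> DB {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // Separate call to enumerate the existing column families...
    let cfs = DB::list_cf(&opts, path).unwrap_or_default();
    // ...which we then hand back to open, hoping no CF appeared in between.
    DB::open_cf(&opts, path, cfs).expect("open DB with all CFs")
}
```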

Notes on RocksDB Performance

  • Benchmarks: just a simple benchmark, an OrdZSet with u64 keys at two
    different sizes, 4K and 16M keys
  • Run using cargo criterion (requires cargo install cargo-criterion)
  • Results are put in target/criterion
  • Uploaded results here: https://gz.github.io/database-stream-processor/criterion-2022-07-22/reports/index.html
  • Some changes I made from the defaults:
    • Added a 2 GiB block cache (default is none or 8 MiB?)
    • Disabled compression
    • Made sure to use a RAM disk (e.g., tmpfs)
    • Ran into No space left on device during benchmarking because tmpfs was
      too small; increased it to 12 GiB with sudo mount -o remount,size=12G /tmp
    • Ran into some open-file limits, so I set
      global_opts.set_max_open_files(9000); and ulimit -n 9000 in bash,
      which resolved the problem
    • More things to explore:
  • Observations:
    • The persistent OrdZSetBuilder is 343x slower for 4K entries (2 ms vs. 6 µs)
    • OrdZSetBuilder for 16M entries: 32x slower (4.2 s vs. 129 ms)
    • Iterating through an OrdZSet with 4K entries is 125x slower; iterating
      through 16M entries is 51x slower
    • Uniform random seek in 4K entries: 8x slower
    • Uniform random seek in 16M entries: 29x faster in the persistent version (!)
      • Not sure why -- maybe the OrdZSet seek() is O(n) and RocksDB uses its
        index? (OrdZSet actually uses exponential search, O(log n), whereas
        RocksDB uses an index to find the right block, so O(1))
    • Zipf random seek in 4K entries: 14x slower
    • Zipf random seek in 16M entries: 5x faster (!)
      • Same hypothesis as above

@gz marked this pull request as draft July 23, 2022 00:51
@Kixiron (Contributor) commented Jul 23, 2022

Regarding serializing and comparing data, my vote is for rkyv; it should have minimal overhead.

@ryzhyk (Collaborator) commented Jul 25, 2022

Very cool, thanks heaps!

If we are fine with the current overheads I measured with the OrdZSet experiment

I don't think we need to achieve optimal performance to merge this. Let's start with something that works and iterate. Besides, I've no idea what performance level we should strive for -- something we need to figure out.

RocksDB seek calls always start from the beginning, whereas our Cursor seek
does not

The fix is just to maintain the last Key in the Cursor struct

Alternatively, we can require that one should call seek on a Cursor with monotonically increasing keys (I think this is the case for all current uses), in which case the two semantics become equivalent.
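
A sketch of that invariant with hypothetical types (the real Cursor trait has many more methods): the cursor remembers the last key it sought, which both satisfies RocksDB's seek-from-the-start semantics and lets us debug-assert monotonicity:

```rust
/// Hypothetical persistent cursor: if callers only ever seek with
/// monotonically non-decreasing keys, RocksDB's "seek from the start" and
/// DBSP's "seek from the current position" return the same results.
struct PersistentCursor<K: Ord + Clone> {
    last_seek: Option<K>,
    // ... the RocksDB iterator handle would live here ...
}

impl<K: Ord + Clone> PersistentCursor<K> {
    fn seek(&mut self, key: &K) {
        if let Some(prev) = &self.last_seek {
            debug_assert!(prev <= key, "seek keys must be monotonically increasing");
        }
        self.last_seek = Some(key.clone());
        // ... forward to the underlying RocksDB iterator's seek here ...
    }
}
```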

Merging:

Assuming we switch to one RocksDB instance per trace, I'm not sure we need the merge operation on persistent traces at all (other than the merging performed by RocksDB in the background).

ThreadingMode: there are two ways, we currently use MultiThreaded

We only have single-threaded accesses now (since we shard data across multiple threads). We recently realized that non-equi-join operators may require sharing the same batch or trace across all workers, so this may change.

Made sure to use a RAM disk (e.g., tmpfs)

Fair enough, so we measure pure RocksDB + OS overheads. I wonder how much slower it will run on top of an SSD.

The persistent OrdZSetBuilder is 343x slower for 4k entries (2 ms vs 6us)
OrdZSetBuilder for 16M entries: it's 32x slower 4.2s vs. 129ms

I don't expect we will construct persistent Z-sets using builders. Most likely, batches will be constructed in memory as our normal "light" OrdZSets and will get converted to persistent representation when added to a persistent trace.

Iterating through an OrdZSet with 4K entries is 125x slower; iterating through 16M entries is 51x slower

Likewise, these overheads probably don't matter too much, as iterating over an entire persistent Z-set is not a common operation.

Uniform random seek in 4K entries: 8x slower

Seek is much more important than iteration, so it's great that this overhead is lower.

Uniform random seek 16M entries 29x faster in the persistent version (!)

Not sure why, maybe the OrdZSet seek is O(n), RocksDB will use its index?

OrdZSet uses binary search, which I guess can be expensive compared to hash-based indexing for random seeks. But in practice we seek for monotonically increasing keys, which should be less expensive using binary search (less distance to search for).

@ryzhyk (Collaborator) commented Jul 25, 2022

Ideally, our persistent trace design will avoid the overheads of serialization and persistence unless we are actually running out of RAM. Small traces may end up living in memory forever (checkpointing aside).

@gz (Contributor, Author) commented Jul 25, 2022

Thanks for the comments!

Some other things that came up (aside from turning this code into a trace instead of an OrdZSet):

  • Using rkyv seems like a great choice
  • Try to get a sense of how much overhead comes from serialization vs. from
    RocksDB itself
  • Do a benchmark with an actual (NVMe) disk and see if there is "an
    additional tax"
  • Try to figure out where the OrdZSet builder overhead is coming from (we
    discussed that we do end up needing this operation because it's the same as
    adding a batch to a trace)

@codecov bot commented Aug 25, 2022

Codecov Report

Merging #124 (d9be91a) into main (fec12bb) will decrease coverage by 0.23%.
The diff coverage is 79.30%.


@@            Coverage Diff             @@
##             main     #124      +/-   ##
==========================================
- Coverage   85.62%   85.39%   -0.24%     
==========================================
  Files         129      129              
  Lines       23702    24005     +303     
==========================================
+ Hits        20295    20499     +204     
- Misses       3407     3506      +99     
| Impacted Files | Coverage | Δ |
|---|---|---|
| src/algebra/present.rs | 0.00% <0.00%> | (ø) |
| src/operator/communication/shard.rs | 100.00% <ø> | (ø) |
| src/operator/recursive.rs | 100.00% <ø> | (ø) |
| ...me_series/radix_tree/partitioned_tree_aggregate.rs | 92.57% <ø> | (ø) |
| .../operator/time_series/radix_tree/tree_aggregate.rs | 99.21% <ø> | (ø) |
| src/operator/trace.rs | 84.15% <ø> | (ø) |
| src/operator/upsert.rs | 95.00% <ø> | (ø) |
| src/trace/cursor/mod.rs | 42.85% <0.00%> | (-17.15%) ⬇️ |
| src/trace/mod.rs | 46.50% <ø> | (ø) |
| src/nexmark/queries/q14.rs | 73.17% <16.00%> | (-25.17%) ⬇️ |
| ... and 18 more | | |

@gz force-pushed the rocksdb2 branch 4 times, most recently from b689142 to a78cc7f on September 24, 2022
@gz (Contributor, Author) commented Sep 27, 2022

I opened PRs at

and

which solve some issues with serialization that we need for supporting the nexmark benchmarks.

@gz force-pushed the rocksdb2 branch 4 times, most recently from cb8c28d to 1ac37bc on October 5, 2022
@gz marked this pull request as ready for review October 5, 2022 07:23
@gz changed the title from "WIP on persistence" to "Persistent Trace with RocksDB" on Oct 5, 2022
@gz (Contributor, Author) commented Oct 5, 2022

We now pass most tests -- a few are ignored with the persistent feature due to missing implementations of map_batches and len (but @ryzhyk wants to change that interface anyway) -- so it's probably a good time to start the review process so we can eventually merge this initial state to master.

@ryzhyk (Collaborator) commented Oct 5, 2022

This is super cool, thanks! Will start reviewing this as soon as I'm finished with my current PR (should be today).

Re the merging process: I haven't seen all the code yet, but I was thinking I'll start by creating a PR that moves all standard type bounds to a trait. This will simplify trait bounds everywhere and create a single location where we need to specify encode/decode bounds. Does this make sense?

@Kixiron (Contributor) left a comment

Why'd you use bincode instead of rkyv? rkyv is significantly faster and can even allow us to do zero-copy data manipulation, instead of having to serialize and deserialize at every boundary point between rocksdb and everything else.

Cargo.toml (outdated), comment on lines 49 to 57:
rocksdb = { version = "0.18", default-features = false, features = ["multi-threaded-cf"] }
bincode = { version = "2.0.0-rc.2" }
uuid = { version = "1.1.2", features = ["v4"] }
@Kixiron (Contributor):

Shouldn't these be optional and activated with the persistence feature?

@gz (Contributor, Author) replied:

Right now the persistence feature flag only "swaps" the Spine struct with the PersistentTrace struct so it gets used as part of all the other dbsp code:
https://github.com/vmware/database-stream-processor/pull/124/files#diff-f8789d5760a694bd6491c0ed6d25099ee9e3dcaed42f69d2f492e779b10c44b4R21

Even without the persistence feature, the persistent code still compiles, and the persistent tests that check for Spine equivalence still run (which I figured might be nice during development to catch problems early). But if we want to change that so that no persistent code is compiled in without the feature flag, we can make them optional.

Maybe I should at least rename the feature to better describe what it does.
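
For reference, the swap amounts to a cfg-gated type alias along these lines (a sketch with a hypothetical alias name and type parameters; the linked diff has the real one):

```rust
// Without the feature, operators instantiate the in-memory Spine; with it,
// the same alias resolves to the RocksDB-backed trace.
#[cfg(not(feature = "persistence"))]
pub type ValTrace<K, V, T, R> = Spine<OrdValBatch<K, V, T, R>>;

#[cfg(feature = "persistence")]
pub type ValTrace<K, V, T, R> = PersistentTrace<K, V, T, R>;
```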

@gz (Contributor, Author) commented Oct 5, 2022

Why'd you use bincode instead of rkyv? rkyv is significantly faster and can even allow us to do zero-copy data manipulation, instead of having to serialize and deserialize at every boundary point between rocksdb and everything else.

I tried to switch to rkyv once while writing this (for like 20 minutes, but I gave up because I didn't understand the docs quickly enough ;)).
That being said, I'm still happy to switch to it and it shouldn't be too hard, but I felt it's better to get something working first, then benchmark, see what the performance problems are, and improve from there (maybe it turns out rkyv is much better, or maybe we need something else).
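
For what it's worth, the zero-copy pitch would look roughly like this with the rkyv 0.7 API of the time (a hedged sketch, not code from this PR; the derive puts Ord on the archived type so a comparator can run directly over the stored bytes without deserializing):

```rust
use std::cmp::Ordering;
use rkyv::{Archive, Deserialize, Serialize};

// The archived (on-disk) form derives Ord, so comparisons need no decoding.
#[derive(Archive, Serialize, Deserialize)]
#[archive_attr(derive(PartialEq, Eq, PartialOrd, Ord))]
struct Key {
    name: String,
    weight: i64,
}

// A comparator over serialized bytes that never deserializes.
fn zero_copy_compare(a: &[u8], b: &[u8]) -> Ordering {
    // Safety: both buffers must hold archived `Key`s produced by rkyv.
    let ka = unsafe { rkyv::archived_root::<Key>(a) };
    let kb = unsafe { rkyv::archived_root::<Key>(b) };
    ka.cmp(kb)
}

fn main() {
    let a = rkyv::to_bytes::<_, 256>(&Key { name: "a".into(), weight: 1 }).unwrap();
    let b = rkyv::to_bytes::<_, 256>(&Key { name: "b".into(), weight: 2 }).unwrap();
    assert_eq!(zero_copy_compare(&a, &b), Ordering::Less);
}
```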

@Kixiron (Contributor) left a comment

My initial pass is that everything looks pretty good. However, I don't really like how infectious the persistent trace is. There are definitely cases where we want in-memory traces and the ability to store more native Rust types (e.g. &'static str) within traces, and the fact that persistence precludes that regardless of whether or not it's enabled isn't great. I definitely think that a dedicated trace type, plus the ability to specify what kind of trace an operator will use, is the way to go in terms of flexibility and modularity.

@ryzhyk (Collaborator) commented Oct 6, 2022

My initial pass is that everything looks pretty good. However, I don't really like how infectious the persistent trace is. There are definitely cases where we want in-memory traces and the ability to store more native Rust types (e.g. &'static str) within traces, and the fact that persistence precludes that regardless of whether or not it's enabled isn't great. I definitely think that a dedicated trace type, plus the ability to specify what kind of trace an operator will use, is the way to go in terms of flexibility and modularity.

We certainly don't want to rely on a compile-time switch to enable/disable persistence. The long-term plan is to tune the performance of persistent traces so that the cost of persistence only needs to be paid when needed (i.e., stuff doesn't fit in memory or needs to be checkpointed). Then the persistent trace will become the default one, but the programmer will still be able to use Spine wherever the choice is exposed by the API. But for now the feature flag enables us to work on persistence support while keeping it in sync with the rest of the code base, so I don't think it should prevent us from landing the PR, with the understanding that persistence support is not "production-ready" yet.

@gz (Contributor, Author) commented Oct 6, 2022

the ability to store more native rust types (e.g. &'static str) within traces

Support for (at least) &'static str would be nice. One way -- maybe a bit hacky -- would be to walk the symbol table of the binary on deserialization and see if something matches (and otherwise Box::leak it and store the reference in some global table for the future)...

@Kixiron (Contributor) commented Oct 6, 2022

I meant types that aren't compatible with the database in general, not just &'static str.

@ryzhyk (Collaborator) left a comment

My comments so far. Haven't read cursor.rs yet.

@@ -132,6 +132,24 @@ pub trait CursorDebug<'s, K: Clone, V: Clone, T: Clone, R: Clone>: Cursor<'s, K,
        }
        out
    }

    fn val_to_vec(&mut self) -> Vec<(V, Vec<(T, R)>)> {
@ryzhyk (Collaborator):

All these allocations make me nervous, and there is no easy way to at least reduce them using with_capacity. We try to minimize allocations throughout the code base, and malloc still shows up a lot in profiling, so this is going to introduce significant overhead -- unless of course it's dominated by other rocksdb costs :)

I don't have the big picture yet, and maybe this is ok for getting something working initially, but still want to bookmark this issue.

@gz (Contributor, Author) replied:

Yes, I think we can optimize this by reusing the allocations/vectors from the batch itself, which we get on insert.

let mut found_v = false;
let mut found_t = false;
for (existing_v, ref mut existing_tw) in vals.iter_mut() {
    if existing_v == &v {
@ryzhyk (Collaborator):

Can you use trace::layers::advance() to lookup the value?

for (existing_v, ref mut existing_tw) in vals.iter_mut() {
    if existing_v == &v {
        for (t, w) in &tws {
            for (existing_t, ref mut existing_w) in existing_tw.iter_mut() {
@ryzhyk (Collaborator):

Likewise, this could use exponential search.
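
For reference, the galloping lookup being suggested looks roughly like this (a generic sketch, not DBSP's actual trace::layers code): find the first index whose element is >= the target in O(log i) comparisons, where i is the answer's position, which beats a linear scan when matches cluster near the front:

```rust
/// Exponential (galloping) search in a sorted slice: returns the first index
/// whose element is >= `target` (or `slice.len()` if there is none).
fn exponential_search<T: Ord>(slice: &[T], target: &T) -> usize {
    if slice.is_empty() || slice[0] >= *target {
        return 0;
    }
    // Gallop: double the probe distance until we overshoot the target.
    let mut bound = 1;
    while bound < slice.len() && slice[bound] < *target {
        bound *= 2;
    }
    // The answer now lies in (bound/2, bound]; binary-search that window.
    let lo = bound / 2;
    let hi = slice.len().min(bound + 1);
    lo + slice[lo..hi].partition_point(|x| x < target)
}

fn main() {
    let times = [1u64, 3, 7, 9, 12];
    assert_eq!(exponential_search(&times, &7), 2);
    assert_eq!(exponential_search(&times, &8), 3);
    assert_eq!(exponential_search(&times, &0), 0);
}
```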

///
/// # TODO
/// Probably lots of efficiency improvements to be had here: We're sorting
/// several times when we probably can be smarter etc. -- not clear it matters
@ryzhyk (Collaborator):

It looks like you could just replace Values with Spine, which already minimizes excessive sorting. Most of the logic in this function would then be replaced with Spine::insert and Spine::recede_to.

@ryzhyk (Collaborator) commented Oct 9, 2022

Finished reading the code. I'll work on the following TODOs to merge this:

  • simplify trait bounds (see discussion in this thread).
  • map_batches exposes the internal structure of Spine and should not be part of the public API. I only use it in one place, and it can be easily avoided.
  • recede_to isn't something we want to do on a persistent trace. I have some ideas -- let's discuss.

Did I miss anything?

The main high-level issue (which we already discussed offline) is that storing all values for a key as a blob is really expensive and non-scalable. This is in contrast to Spine that keeps subsets of values in multiple batches and combines them on the fly. I don't know if there's a way to simulate this with rocksdb.

@gz (Contributor, Author) commented Oct 9, 2022

Finished reading the code. I'll work on the following TODOs to merge this:

* [ ]  simplify trait bounds (see discussion in this thread).

* [ ]  `map_batches` exposes the internal structure of `Spine` and should not be part of the public API.  I only use it in one place, and it can be easily avoided.

* [ ]  `recede_to` isn't something we want to do on a persistent trace. I have some ideas -- let's discuss.

Did I miss anything?

That seems to be it, yes.

The main high-level issue (which we already discussed offline) is that storing all values for a key as a blob is really expensive and non-scalable. This is in contrast to Spine that keeps subsets of values in multiple batches and combines them on the fly. I don't know if there's a way to simulate this with rocksdb.

Yes, I think it's good to separate performance from design/correctness, e.g., let's make the first PR about having a design/impl that works (but isn't optimized) and make sure it's tested as well as we can. Then we can improve performance in upcoming PRs?

@ryzhyk (Collaborator) commented Oct 9, 2022

Yes, I think it's good to separate performance from design/correctness, e.g., let's make the first PR about having a design/impl that works (but isn't optimized) and make sure it's tested as well as we can. Then we can improve performance in upcoming PRs?

Definitely.

@ryzhyk mentioned this pull request Oct 10, 2022
@gz (Contributor, Author) commented Oct 11, 2022

Here are some numbers from running nexmark :) (note that q0-2, q14, and q22 don't use Spine/PersistentTrace, so they run the same code):

| query # | DRAM tput (K/s) | RocksDB tput (K/s) | % of DRAM tput |
|--------:|----------------:|-------------------:|---------------:|
| 0 | 654.227 | 656.815 | 100.40% |
| 1 | 627.916 | 621.968 | 99.05% |
| 2 | 672.933 | 688.594 | 102.33% |
| 3 | 660.12 | 10.573 | 1.60% |
| 4 | 544.041 | 2.679 | 0.49% |
| 5 | 676.982 | 2.626 | 0.39% |
| 6 | 525.654 | 2.786 | 0.53% |
| 7 | 576.421 | 4.035 | 0.70% |
| 8 | 690.686 | 7.919 | 1.15% |
| 9 | 221.277 | 4.058 | 1.83% |
| 12 | 652.094 | 16.687 | 2.56% |
| 13 | 521.136 | 10.541 | 2.02% |
| 14 | 649.36 | 648.154 | 99.81% |
| 15 | 480.039 | 0.048 | 0.01% |
| 16 | 247.848 | 0.041 | 0.02% |
| 17 | 420.035 | 0.263 | 0.06% |
| 18 | 414.711 | 18.108 | 4.37% |
| 19 | 349.02 | 17.944 | 5.14% |
| 20 | 412.072 | 9.912 | 2.41% |
| 21 | 649.994 | 667.311 | 102.66% |
| 22 | 668.739 | 640.301 | 95.75% |

@gz force-pushed the rocksdb2 branch 6 times, most recently from d06944d to 4cd35a9 on October 12, 2022
@ryzhyk mentioned this pull request Oct 14, 2022
@gz (Contributor, Author) commented Oct 25, 2022

Just for reference, I was looking into a memory leak when we use the rocksdb APIs set_comparator and set_compaction_filter -- they allocate some state (for callback metadata) which supposedly leaks.
I checked the code in the rust-rocksdb crate and in rocksdb itself, and it looks correct.

For set_compaction_filter (set_comparator is similar): there is a destructor that is supposed to get called for the allocations (some callback state), which will deallocate things again.

A few ideas why this might be happening:

  • Most likely: a false positive, meaning RocksDB's laziness just hasn't
    executed the callback yet (i.e., it would have at some point in the future)
    -- I looked at and experimented with API calls (explicitly closing the DB,
    compact_range_cf, etc.) but didn't get the desired result of the destructor
    being called
  • Maybe the C++ rocksdb code was compiled wrong (or leak detection doesn't
    work in cross-language settings (C++ and Rust)) and so the free path wasn't
    detected -- I ran the tests with ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer-14 CXX="clang++" CC="clang" CXXFLAGS="-fsanitize=address" RUST_BACKTRACE=0 ASAN_OPTIONS="detect_stack_use_after_return=1,detect_leaks=1" RUSTFLAGS="-Z sanitizer=address" cargo test --features "with-serde with-csv with-nexmark" --target x86_64-unknown-linux-gnu -Z build-std (note the CXXFLAGS; I checked that they do get picked up by the rocksdb compilation)
  • It's an actual leak, which would be bad...

@gz force-pushed the rocksdb2 branch 3 times, most recently from 7d4ba41 to b15d1d5 on October 25, 2022
@ryzhyk (Collaborator) commented Oct 27, 2022

@Kixiron, did @gz address your concerns? If so, can you approve?

@ryzhyk merged commit 2e746ac into main on Oct 28, 2022
@ryzhyk deleted the rocksdb2 branch on October 28, 2022