os/bluestore: [RFC] merge multiple TransContext into a batch to improve performance #34
base: wip-bluestore-kv-finisher
Conversation
The kv_sync_thread is a bottleneck; making it do less work improves performance on fast devices. Signed-off-by: Jianpeng Ma <[email protected]> Signed-off-by: Igor Fedotov <[email protected]> Signed-off-by: Sage Weil <[email protected]>
No reason to push them onto *another* Finisher thread. Signed-off-by: Sage Weil <[email protected]>
This means that the completion thread could in theory get backed up. However, it resolves hard-to-hit deadlock cases where the completion is blocked by pg->lock and the lock holder is blocked on a throttle. Signed-off-by: Sage Weil <[email protected]>
…cific dir This hack only works for 1 osd, but it lets you put it on a specific directory/device. Signed-off-by: Sage Weil <[email protected]>
We aren't using these anymore. We may want to shard the finalizer thread, but that can work a bit differently. Signed-off-by: Sage Weil <[email protected]>
Signed-off-by: Igor Fedotov <[email protected]>
yet another way to merge rocksdb transactions into a batch Signed-off-by: Igor Fedotov <[email protected]>
Force-pushed from 65330a1 to bedcbcd
This looks pretty straightforward! If we go down this path, though, we'd want to patch upstream RocksDB properly to implement the transaction append, and make txc0 reserve enough memory up front based on what we expect to see for the whole list, etc. Alternatively, RocksDB could provide a submit_transaction call that takes a vector of WriteBatches and does the same thing internally without the memcpy. (That would probably be more efficient, FWIW.) Or we could batch explicitly in BlueStore. (Maybe this is what Adam's approach did? I don't remember.)
The downside to that approach is that prepare_transaction might be slow (especially if there is a read/modify/write), and that could back up the kv sync thread. I don't think there is a way around that without doing a memcpy like your approach does, which probably means that's the way to go.
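To make that suggestion concrete, here is a rough sketch of such an entry point. This is purely hypothetical: `SubmitBatches` does not exist in RocksDB, and the name and signature are invented for illustration only.

```cpp
#include <vector>

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/write_batch.h>

// HYPOTHETICAL -- RocksDB provides no such call. The idea: accept several
// WriteBatches and apply them as one atomic write, so any concatenation
// (or none at all) happens inside the library instead of via a caller-side
// memcpy into a single merged batch.
rocksdb::Status SubmitBatches(rocksdb::DB* db,
                              const rocksdb::WriteOptions& opts,
                              const std::vector<rocksdb::WriteBatch*>& batches);
```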
93f760c57c Merge pull request #40 from ivancich/wip-change-client-rec-init
824d92dd3d Merge pull request #38 from ivancich/wip-improve-next-request-return
941d1bef54 Change initialization of IndIntruHeapData to C++'s value-initialization to better future-proof the code. Since at the moment they are scalars, they'll be zero-initialized. However, if they ever become something more complex, their default constructors will be called.
19153d979f Merge pull request #39 from ivancich/wip-delta-rho-plugin
a94c4e086c Allow the calculations of rho and delta to be handled by a "tracker" specified via template parameter (i.e., by static polymorphism). The tracker follows a simple interface consisting of three functions and one static function.
856a26c466 Clarify code surrounding the return value of do_next_request.
b632cfda4f Merge pull request #37 from ivancich/wip-fix-uninit-data
e6df585153 The Coverity scan published on ceph-devel on 2017-09-21 revealed some uninitialized data in a constructor. This fixes that.
165a02542d Merge pull request #34 from TaewoongKim/anticipate
72e4df95cf Make anticipation_timeout configurable with config file
2f06d632d5 Add anticipation duration that keeps from resetting tag values to the current time
git-subtree-dir: src/dmclock
git-subtree-split: 93f760c57c75b9eb88382bcba29fcac3ce365e7f
This is a somewhat hacky approach that merges multiple TransContexts into a single batch so they can be submitted in a single step. The code relies on RocksDB WriteBatch internals to merge multiple batches into one; a sketch of that kind of merge follows below. Unfortunately, the more elegant solution, collecting the operations out of each RocksDB WriteBatch and building a new batch from scratch, turned out not to be beneficial from a performance point of view, since it overburdens kv_sync_thread (see the code at https://github.com/ifed01/ceph/tree/wip-bluestore-large-batch-prod and the contrasting sketch after the results).
This PR is really a question of whether we want to proceed this way, plus a summary of the numbers obtained.
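For illustration, a minimal sketch of this kind of internals-based merge, assuming RocksDB's private `WriteBatchInternal::Append` helper. This is not the PR's actual code; the header is not part of the public API, and the helper's signature varies across RocksDB versions.

```cpp
#include <vector>

#include <rocksdb/write_batch.h>
#include "db/write_batch_internal.h"  // private header; builds only in-tree

// Append the encoded contents of every per-TransContext batch onto the
// first one; the caller then submits the merged batch with one Write().
rocksdb::WriteBatch* merge_batches(std::vector<rocksdb::WriteBatch*>& batches) {
  rocksdb::WriteBatch* dst = batches.front();
  for (size_t i = 1; i < batches.size(); ++i) {
    // Internally this is essentially a memcpy of the encoded records plus
    // a record-count fix-up -- cheap, but tied to the batch wire format.
    rocksdb::WriteBatchInternal::Append(dst, batches[i]);
  }
  return dst;
}
```

Reserving enough capacity in the first batch up front (as suggested in the review comment above) would avoid reallocations during the appends.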
The results below are from the FIO BlueStore plugin, for both highly concurrent 4K appends and overwrites, using two NVMe drives (min_alloc_size = 4K as well). Each case was run three times.
| Configuration | Workload | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- | --- |
| Original code (kv_finisher thread PR + intrusive lists for TransContexts) | Append | 370 MB/s | 367 MB/s | 358 MB/s |
| Original code | Overwrite | 329 MB/s | 333 MB/s | 329 MB/s |
| This PR on top of the original | Append | 419 MB/s | 413 MB/s | 420 MB/s |
| This PR on top of the original | Overwrite | 338 MB/s | 338 MB/s | 345 MB/s |
| Failed approach (collect KV operations out of RocksDB and build the resulting batch before submit) | Append | 368 MB/s | 375 MB/s | 373 MB/s |
| Failed approach | Overwrite | 270 MB/s | 272 MB/s | 278 MB/s |
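For contrast, the failed approach corresponds roughly to replaying each batch through RocksDB's public `WriteBatch::Handler` interface and rebuilding a fresh batch from the decoded records. A minimal sketch (default column family only; `Replayer` and `rebuild_into` are illustrative names, not code from the branch linked above):

```cpp
#include <rocksdb/write_batch.h>

// Replays the records of a source batch into a destination batch via the
// public Handler callbacks. Every record is decoded and then re-encoded,
// and all of that work lands on kv_sync_thread -- consistent with the
// overwrite regression in the numbers above.
class Replayer : public rocksdb::WriteBatch::Handler {
 public:
  explicit Replayer(rocksdb::WriteBatch* dst) : dst_(dst) {}
  void Put(const rocksdb::Slice& key, const rocksdb::Slice& value) override {
    dst_->Put(key, value);
  }
  void Delete(const rocksdb::Slice& key) override { dst_->Delete(key); }
  void Merge(const rocksdb::Slice& key, const rocksdb::Slice& value) override {
    dst_->Merge(key, value);
  }
 private:
  rocksdb::WriteBatch* dst_;
};

rocksdb::Status rebuild_into(rocksdb::WriteBatch* dst,
                             const rocksdb::WriteBatch& src) {
  Replayer handler(dst);
  return src.Iterate(&handler);
}
```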