
Performance regression in 0.18 (at least compared to 0.12) #357

Closed
peixian opened this issue Feb 22, 2024 · 13 comments

@peixian

peixian commented Feb 22, 2024

Adding this in case other people run into it. I'm still trying to figure out what the root cause is, but it seems like configuring the pool size affects runtime latency on the newest version.

On 0.12.0, we were running with 300 connections to each Scylla node underneath. On 0.18.1, I tried it with 2 connections per node, as well as 300 connections per node in pool_size.

0.12.0, 300 connections:

Operating System: Linux
CPU Information: Intel(R) Xeon(R) CPU @ 2.80GHz
Number of Available Cores: 4
Available memory: 7.77 GB
Elixir 1.16.1
Erlang 26
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 28 s

Benchmarking lines by game sharded 100 - 0.12.0 - 300 ...
Benchmarking lines single - 0.12.0 - 300 ...
Benchmarking lines specific - 0.12.0 - 300 ...
Benchmarking user single - 0.12.0 - 300 ...
Calculating statistics...
Formatting results...

Name                                                             ips        average  deviation         median         99th %
SELECT * FROM L WHERE id='123' - 0.18.1 - 300                    2.45 K        0.41 ms    ±38.85%        0.40 ms        0.70 ms
SELECT * FROM L LIMIT 1 - 0.18.1 - 300                           1.66 K        0.60 ms    ±36.74%        0.56 ms        1.34 ms
SELECT * FROM U LIMIT 1 - 0.18.1 - 300                           0.67 K        1.49 ms    ±27.70%        1.41 ms        2.43 ms
SELECT * FROM materialized_view LIMIT 1000 - 0.18.1 - 300        0.36 K        2.77 ms    ±28.29%        2.68 ms        4.69 ms

Comparison:
SELECT * FROM L WHERE id='123' - 0.18.1 - 300                    2.45 K
SELECT * FROM L LIMIT 1 - 0.18.1 - 300                           1.66 K - 1.48x slower +0.194 ms
SELECT * FROM U LIMIT 1 - 0.18.1 - 300                           0.67 K - 3.66x slower +1.08 ms
SELECT * FROM materialized_view LIMIT 1000 - 0.18.1 - 300        0.36 K - 6.78x slower +2.36 ms

0.18.1, 300 connections:

Name                                                                ips        average  deviation         median         99th %
SELECT * FROM L WHERE id='123' - 0.18.1 - 300                    1.18 K        0.85 ms    ±35.47%        0.80 ms        1.63 ms
SELECT * FROM L LIMIT 1 - 0.18.1 - 300                           1.01 K        0.99 ms    ±32.83%        0.95 ms        1.73 ms
SELECT * FROM U LIMIT 1 - 0.18.1 - 300                           0.53 K        1.90 ms    ±29.20%        1.80 ms        4.61 ms
SELECT * FROM materialized_view LIMIT 1000 - 0.18.1 - 300        0.30 K        3.35 ms    ±16.94%        3.27 ms        5.68 ms

Comparison:
SELECT * FROM L WHERE id='123' - 0.18.1 - 300                    1.18 K
SELECT * FROM L LIMIT 1 - 0.18.1 - 300                           1.01 K - 1.17x slower +0.147 ms
SELECT * FROM U LIMIT 1 - 0.18.1 - 300                           0.53 K - 2.25x slower +1.05 ms
SELECT * FROM materialized_view LIMIT 1000 - 0.18.1 - 300        0.30 K - 3.97x slower +2.51 ms

0.18.1, 5 connections:

Name                                                              ips        average  deviation         median         99th %
SELECT * FROM L WHERE id='123' - 0.18.1 - 5                    1.39 K        0.72 ms    ±47.16%        0.68 ms        1.34 ms
SELECT * FROM L LIMIT 1 - 0.18.1 - 5                           1.17 K        0.85 ms    ±43.76%        0.81 ms        1.41 ms
SELECT * FROM U LIMIT 1 - 0.18.1 - 5                           0.62 K        1.60 ms    ±29.08%        1.54 ms        2.42 ms
SELECT * FROM materialized_view LIMIT 1000 - 0.18.1 - 5        0.35 K        2.82 ms    ±16.92%        2.79 ms        4.01 ms

Comparison:
SELECT * FROM L WHERE id='123' - 0.18.1 - 5                    1.39 K
SELECT * FROM L LIMIT 1 - 0.18.1 - 5                           1.17 K - 1.19x slower +0.135 ms
SELECT * FROM U LIMIT 1 - 0.18.1 - 5                           0.62 K - 2.22x slower +0.88 ms
SELECT * FROM materialized_view LIMIT 1000 - 0.18.1 - 5        0.35 K - 3.92x slower +2.10 ms

I also reordered the benchmarks and ran them multiple times in random order to make sure this wasn't some caching effect.

Configuration:

[
  nodes: [<NODES>],
  autodiscovery: false,
  pool_size: 300, # or 2 in the case of tests.
]
@peixian changed the title from "0.18.1 is slower than 0.12.0" to "0.18.1 appears to be slower than 0.12.0" on Feb 22, 2024
@peixian
Author

peixian commented Feb 23, 2024

Looking into this a little bit more, I originally thought it was https://github.com/whatyouhide/xandra/blob/main/lib/xandra/cluster/connection_pool.ex#L14-L20, which would explain the query time scaling with pool size, but that doesn't seem to be the case. Even setting the pool size to 1 results in ~0.7 ms for the simple SELECT case.

I pulled some perf graphs (right-click, download, and open in a new tab for zooming in; the SVG is interactive), but I suspect something else is happening.

Xandra 0.12.0:
[flame graph attachment: xandra-0.12.0.svg]

Xandra 0.18.1:
[flame graph attachment: xandra-0.18.1.svg]

@whatyouhide changed the title from "0.18.1 appears to be slower than 0.12.0" to "Performance regression in 0.18 (at least compared to 0.12)" on Feb 26, 2024
@whatyouhide
Owner

Ah, this is super interesting, thank you @peixian! Fantastic report and investigation. 💟

I’m no expert in flame graphs, but from what I can tell a lot of the time in 0.18 is spent just taking the next (random) ID out of the set of free stream IDs. Basically, this code:

    {stream_id, data} =
      get_and_update_in(data.free_stream_ids, fn ids ->
        id = Enum.at(ids, 0)
        {id, MapSet.delete(ids, id)}
      end)

I don't think we can get this to go much faster with the current data structures (a set of free stream IDs), but we could definitely store the IDs as a list and just take the head of the list every time, which should be significantly cheaper to do. Do you want to give that a try and measure perf for that?
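
For concreteness, here is a minimal, hypothetical sketch of the list-based bookkeeping (the module and function names below are made up for illustration; the real change would live in the connection's state, not a separate module):

    # Hypothetical sketch, not Xandra's actual implementation: keep the free stream
    # IDs as a list so checking one out is a constant-time head pop rather than
    # Enum.at/2 + MapSet.delete/2 on a set.
    defmodule FreeStreamIDs do
      # Build the initial pool of free IDs, e.g. FreeStreamIDs.new(1..100).
      def new(range), do: Enum.to_list(range)

      # Checking an ID out takes the head of the list in O(1).
      def checkout([id | rest]), do: {id, rest}

      # Checking an ID back in prepends it in O(1).
      def checkin(free_ids, id), do: [id | free_ids]
    end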

@harunzengin
Contributor

harunzengin commented Feb 26, 2024

@whatyouhide @peixian, I also noticed the performance problem while fetching random IDs. I solved it using Enum.random/1 on a range in 3cedab5, which should run in constant time. It is part of #355

    # Draw random IDs until one is neither timed out nor currently in flight.
    defp random_free_stream_id(in_flight_requests, timed_out_ids) do
      random_id = Enum.random(1..@max_cassandra_stream_id)

      if MapSet.member?(timed_out_ids, random_id) or Map.has_key?(in_flight_requests, random_id) do
        random_free_stream_id(in_flight_requests, timed_out_ids)
      else
        random_id
      end
    end

This is what I'm referring to: https://elixirforum.com/t/complexity-of-enum-functions/10493/3

@whatyouhide
Owner

Getting random IDs until you hit a free one works for now, but it's going to perform worse the fewer free IDs you have. Later on, let's go with the list, but let's do one thing at a time to avoid cluttering too much 🙃

@peixian
Author

peixian commented Feb 29, 2024

@harunzengin I just benchmarked your fix; we're a lot closer now.

0.12.0 - 300 connections. First chunk is non-prepared statements, second chunk is prepared statements.

Name                                                                ips        average  deviation         median         99th %
SELECT * FROM L WHERE id='123' - 0.12.0 - 300                    2.33 K        0.43 ms    ±65.27%        0.39 ms        1.22 ms
SELECT * FROM L LIMIT 1 - 0.12.0 - 300                           1.37 K        0.73 ms    ±72.25%        0.60 ms        3.32 ms
SELECT * FROM U LIMIT 1 - 0.12.0 - 300                           0.70 K        1.42 ms    ±57.77%        1.25 ms        6.53 ms
SELECT * FROM materialized_view LIMIT 1000 - 0.12.0 - 300        0.41 K        2.47 ms    ±50.71%        2.33 ms        7.24 ms

Name                                                                           ips        average  deviation         median         99th %
SELECT * FROM L WHERE id='123' - prepared - 0.12.0 - 300                    2.74 K        0.37 ms    ±69.25%        0.33 ms        1.55 ms
SELECT * FROM L LIMIT 1 - prepared - 0.12.0 - 300                           1.88 K        0.53 ms    ±60.38%        0.46 ms        1.71 ms
SELECT * FROM U LIMIT 1 - prepared - 0.12.0 - 300                           0.77 K        1.29 ms    ±62.66%        1.12 ms        4.91 ms
SELECT * FROM materialized_view LIMIT 1000 - prepared - 0.12.0 - 300        0.38 K        2.63 ms    ±70.90%        2.26 ms       14.13 ms

0.18.1 - 300 connections

Name                                                                      ips        average  deviation         median         99th %
SELECT * FROM L WHERE id='123' - 0.18.1-patch - 300                   1318.97        0.76 ms    ±68.82%        0.68 ms        2.46 ms
SELECT * FROM U LIMIT 1 - 0.18.1-patch - 300                           641.62        1.56 ms    ±47.71%        1.43 ms        4.18 ms
SELECT * FROM L LIMIT 1 - 0.18.1-patch - 300                           572.80        1.75 ms    ±95.20%        1.18 ms        9.39 ms
SELECT * FROM materialized_view LIMIT 1000 - 0.18.1-patch - 300        338.01        2.96 ms    ±50.44%        2.72 ms        9.02 ms

Name                                                                                 ips        average  deviation         median         99th %
SELECT * FROM L WHERE id='123' - prepared - 0.18.1-patch - 300                    1.56 K        0.64 ms    ±38.78%        0.61 ms        1.44 ms
SELECT * FROM L LIMIT 1 - prepared - 0.18.1-patch - 300                           1.19 K        0.84 ms    ±40.93%        0.78 ms        2.59 ms
SELECT * FROM U LIMIT 1 - prepared - 0.18.1-patch - 300                           0.67 K        1.49 ms    ±44.91%        1.36 ms        3.78 ms
SELECT * FROM materialized_view LIMIT 1000 - prepared - 0.18.1-patch - 300        0.26 K        3.88 ms   ±118.73%        2.72 ms       31.30 ms

0.18.1 - 5 connections

Name                                                                    ips        average  deviation         median         99th %
SELECT * FROM L WHERE id='123' - 0.18.1-patch - 5                    1.70 K        0.59 ms    ±43.41%        0.55 ms        1.65 ms
SELECT * FROM L LIMIT 1 - 0.18.1-patch - 5                           1.37 K        0.73 ms    ±89.78%        0.59 ms        4.11 ms
SELECT * FROM U LIMIT 1 - 0.18.1-patch - 5                           0.64 K        1.57 ms    ±48.50%        1.41 ms        5.41 ms
SELECT * FROM materialized_view LIMIT 1000 - 0.18.1-patch - 5        0.33 K        3.04 ms    ±81.88%        2.55 ms       21.62 ms

Name                                                                               ips        average  deviation         median         99th %
SELECT * FROM L WHERE id='123' - prepared - 0.18.1-patch - 5                    1.74 K        0.57 ms    ±45.96%        0.52 ms        1.98 ms
SELECT * FROM L LIMIT 1 - prepared - 0.18.1-patch - 5                           1.52 K        0.66 ms    ±68.98%        0.57 ms        3.38 ms
SELECT * FROM U LIMIT 1 - prepared - 0.18.1-patch - 5                           0.52 K        1.92 ms    ±83.37%        1.45 ms       10.02 ms
SELECT * FROM materialized_view LIMIT 1000 - prepared - 0.18.1-patch - 5        0.30 K        3.34 ms    ±83.52%        2.64 ms       20.77 ms

However, large pool sizes still incur a pretty significant slowdown, along with noticeably lower throughput (1.7k rps drops to 1.3k as the pool size goes from 5 to 300). I'm testing some fixes in connection selection; I suspect the remaining cost is somewhere in there.

@whatyouhide
Owner

@peixian I don't know how you're testing this, but one important thing to keep in mind is this: the change in 0.18 was done to dramatically increase concurrent performance on a single connection.

Before, one connection could only perform requests serially (though yes, each individual request was cheaper, so serial throughput was higher).

Now, a single connection can make multiple concurrent requests. So, an ideal comparison would be something like this:

  • On 0.12, open say 50 connections and spawn 50 tasks that make 10k requests each.
  • On 0.18, do the same (50 connections) and see the difference.
  • On 0.18, do the same but with say 10 open connections, to see if the reduction in conns can match the throughput of the more conns in 0.12.
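
As a hedged sketch of what such a comparison could look like (the module, `conn`, and the statement below are illustrative placeholders; `conn` is assumed to be a connection or pool started elsewhere):

    # Hypothetical harness: spawn `tasks` concurrent callers, each issuing
    # `requests_per_task` queries, and report aggregate throughput.
    defmodule ConcurrencyCheck do
      def run(conn, tasks \\ 50, requests_per_task \\ 10_000) do
        {micros, :ok} =
          :timer.tc(fn ->
            1..tasks
            |> Task.async_stream(
              fn _ ->
                for _ <- 1..requests_per_task do
                  Xandra.execute!(conn, "SELECT * FROM L LIMIT 1")
                end
              end,
              max_concurrency: tasks,
              timeout: :infinity
            )
            |> Stream.run()
          end)

        total = tasks * requests_per_task
        IO.puts("#{total} requests in #{micros / 1_000_000}s (~#{round(total * 1_000_000 / micros)} rps)")
      end
    end

Running this against 0.12 with 50 connections, and against 0.18 with both 50 and 10 connections, would give the comparison described above.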

@peixian
Author

peixian commented Mar 3, 2024

@whatyouhide Ah yeah, moving to multiplexed connections is the primary reason we're upgrading to 0.18.1: we're hitting cluster-level connection limits on Scylla.

We won't need 300 connections, but I suspect we'll still need a significant number (~50-100), so figuring out the scaling cost of each additional connection will likely be important for us later on.

I'm benching this with Benchee, running different SELECTs for 5 seconds at maximum throughput. At least for us, individual query latency is less important than the raw throughput a node can handle.

I can measure individual query performance as well once I figure out a bit more about the pool scaling issue. I'm also waiting on Scylla to get back to me on whether they have info on stream_ids not returning, for #356

@whatyouhide
Owner

I'm benching this with Benchee, running different SELECTs for 5 seconds at maximum throughput

The question is whether you're benching this with parallel: N in Benchee (N > 1). If you don't bench in parallel, the pool size won't even matter, because you'll check out a conn to run the SELECT and return it to the pool before the next one starts.
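
For example, a minimal Benchee job along those lines might look like this (hedged sketch: `conn` and the statement are placeholders for the setup used in the reports above):

    # Run the same SELECT from many parallel processes so that several connections
    # in the pool are actually checked out at the same time.
    Benchee.run(
      %{
        "SELECT * FROM L LIMIT 1" => fn -> Xandra.execute!(conn, "SELECT * FROM L LIMIT 1") end
      },
      warmup: 2,
      time: 5,
      parallel: 50
    )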

@whatyouhide
Owner

@peixian ping?

@peixian
Author

peixian commented Mar 28, 2024

@whatyouhide We're actively testing the latest changes now. On our production workload, average query time has gone from 0.4 ms to 0.6 ms, which matches what the benchmarks showed. I think this is a slight regression, but the latest changes seem promising. Later this week I'm going to try dialing down the max concurrent requests per connection and see whether that changes anything.

@kennethito
Contributor

I have another facet of the performance regressions in 0.18. Let me know if this should be a separate issue.

The performance of large result sets has worsened significantly. One of the queries I'm looking at is about 100% slower, going from 400 ms to 800 ms.

I believe this is because the new connection code uses active-mode sockets and by default receives roughly a single packet of data per message (byte_size/1 says about 1400 bytes). For each of these tiny messages, we append the data to the buffer and try to parse a frame from it.

I experimented a bit, and while I don't know if this is the right way to solve it, setting the buffer option in the transport options to some large number fixes the performance degradation.

For instance

    Transport.setopts(data.transport, active: :once, buffer: 50000)

Thoughts?
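
If Xandra's :transport_options start option forwards these socket options (hedged: the node address and buffer value below are purely illustrative), a user-side workaround could look like:

    # Hypothetical workaround sketch: ask for a larger user-level socket buffer so
    # each active-mode message carries more than ~1400 bytes of frame data.
    {:ok, conn} =
      Xandra.start_link(
        nodes: ["scylla-host:9042"],
        transport_options: [buffer: 50_000]
      )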

@peixian
Author

peixian commented Apr 22, 2024

I'm closing this; in our experiments, the performance regression has disappeared with a combination of @harunzengin's and @kennethito's changes.

@peixian closed this as completed on Apr 22, 2024
@whatyouhide
Owner

That's metal news 🤘
