Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix IPv6 support in Xandra.Cluster #328

Merged

Conversation

jacquelineIO
Copy link
Contributor

@jacquelineIO jacquelineIO commented Aug 31, 2023

  1. Fix node validation errors
  2. use rpc_address when available

Xandra will now use tuples for IP addresses instead of strings, which are valid Erlang types. And the tuples are able to run through node validations.

History of why we started working on these changes:

Started this work on 0.16 of Xandra with this PR. Even with these changes we still encountered problems with deploying and Karl opened up an issue with Xandra.

New changes were released for Xandra and reapplying the old PR changes to v0.17 of Xandra.

We are using this lib and the host structs are returning ipv6 addresses for our clusters. The validate_node function is failing on ipv6 addresses

"The pool supervisor will run the validations again on the ipv6 values returned in https://github.com/lexhide/xandra/blob/v0.16.0/lib/xandra/cluster.ex#L793 and in the logs we see that our hosts are returned with both ipv4 and ipv6 addresses. The ipv6 addresses are being used and fail to validate." - @britto

invalid value for list element at position 0: invalid node: \"xxxx:xxxx:xxxx:xxxx:0:xxx:xxxx:xxxx:xxxxx\"", value: ["xxxx:xxxx:xxxx:xxxx:0:xxx:xxxx:xxxx:xxxxx"]}, [{NimbleOptions, :validate!, 2, [file: 'lib/nimble_options.ex', line: 343]}
     (xandra 0.16.0) lib/xandra/cluster.ex:750: Xandra.Cluster.start_pool/2
     (xandra 0.16.0) lib/xandra/cluster.ex:798: anonymous fn/4 in Xandra.Cluster.maybe_start_pools/1
     (elixir 1.12.3) lib/enum.ex:4286: Enumerable.List.reduce/3
     (elixir 1.12.3) lib/enum.ex:2431: Enum.reduce_while/3
     (xandra 0.16.0) lib/xandra/cluster.ex:726: Xandra.Cluster.handle_info/2
     (stdlib 3.17.2) gen_server.erl:695: :gen_server.try_dispatch/4
     (stdlib 3.17.2) gen_server.erl:771: :gen_server.handle_msg/6

These changes are tagged v0.17.x-community.1.

@relistan
Copy link
Contributor

relistan commented Sep 1, 2023

We don't just need to validate the IPv6 addresses. As shown here, we need to actually be able to use the RPC addresses because the addresses shown in peer are not routable.

@@ -345,6 +345,15 @@ defmodule Xandra.Cluster do
*Available since v0.18.0*.
"""
],
use_rpc_address_for_peer_address: [
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the cases when we'd want to use peer instead of rpc_address?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

peer was the original implementation. I don't know why you would use peer instead of rpc_address and not much information when I google this, so going to phone a friend. @relistan do you know or think this should just solely use rpc_address and no longer use peer? (I think that is what @whatyouhide is asking).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's what I’m saying. @jacquelineIO I also recommend doing just what any other C* driver maintained by Datastax is doing. I always reference the Python driver. I don't have much time to look into this until maybe Sunday, so if you want you can just take a look there and use what they're using 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it is using rpc_address with a fallback to peer if rpc_address isn't available

So I can remove this flag and have a preference for using rpc_address, which the code is basically already set up to do

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's do that so that we don't have to put this decision on users 🙃 Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, catching up here. That sounds right to me, then.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacquelineIO any news on this and timeline on when I can expect these changes to go in? I'm planning the 0.18.0 release, that's why I ask 🙃

queried_peer_to_host(peer_attrs, use_rpc_address)
end

defp queried_peer_to_host(%{"rpc_address" => _} = peer_attrs, true = use_rpc_address) do
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is it that we get rpc_address as a string? Xandra should return :inet.ip_address/0 for IPs, with a 4-element tuple for IPv4 and 8-element tuple for IPv6.

Copy link
Contributor Author

@jacquelineIO jacquelineIO Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the peers table is queried, it was using the string.

For example this was one of the logs (when address is ipv6 this was another issue because the validation for ipv6 strings fails in xandra)

[] 21:35:58.910 [info] Xandra: [:connected]: %{address: '10.4.0.12', connection_name: nil, port: 11172, protocol_module: Xandra.Protocol.V4, supported_options: %{"COMPRESSION" => ["snappy", "lz4"], "CQL_VERSION" => ["3.4.5"], "PROTOCOL_VERSIONS" => ["3/v3", "4/v4", "5/v5", "6/v6-beta"]}}

Now the log looks like

[] 16:52:41.083 [info] Xandra Cluster: [:change_event]: %{cluster_name: {:local, FastSummariesConsumer.Store}, cluster_pid: #PID<0.5488.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 12},

There is still another place where strings are being used for the original host for the control connection if I'm not mistaken but it broke a lot of tests trying to change that out. That might be good for a different PR to fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the other location where string is in use https://github.com/Shimmur/xandra/blob/c563d737b99bcdec195a38408db490fec046a5e8/lib/xandra/cluster/pool.ex#L469 and I think might be good to convert this to a tuple as well.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the peers table is queried, it was using the string.

I’m confused. If you look at the schema for the system.peers table in C* itself, rpc_address is of type inet, so how can the result of the query to that table include rpc_address as a string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also adding the validate_node options. There was only one previously and it took a string https://github.com/lexhide/xandra/blob/main/lib/xandra/options_validators.ex#L51
if it were not a string, it would have failed...

Copy link
Contributor

@relistan relistan Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about in this particular case, but we were seeing IPv6 addresses getting returned as strings. This was then causing DNS lookups which failed. @jacquelineIO can confirm if this is the reason or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whatyouhide you are right about the query returning the inet format.
It looks like there are some back n forth translations happening, and maybe it shouldn't be.
https://github.com/lexhide/xandra/blob/96303f07bae820ff693d85f2e1585cb343118e00/lib/xandra/cluster/pool.ex#L464-L466 part of the code is formatting the :node as a string.
e.g.

host = %Xandra.Cluster.Host{address: {10, 4, 0, 12}, port: 1234}

Xandra.Cluster.Host.format_address(host)
"10.4.0.12:1234"

so conn_options has a string and the id is a tuple. https://github.com/lexhide/xandra/blob/96303f07bae820ff693d85f2e1585cb343118e00/lib/xandra/cluster/pool.ex#L469-L471

When validate_node is run on the node address it is fine because it was a string.

So now I'm confused too. I do not completely follow the flow here...

I assumed this was pulling a string and wanted to make sure it was only using a tuple. BUT then also validate node started failing (if my memory is correct) so I had to update that too. So there might be multiple (at least one more) places in the code that a connection is happening, the first one beingstart_pool...

I got the desired result but the original connection is still a string. I this it makes sense to not use format_address for code (maybe only for logging things).

(I also feel like I'm talking in circles)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacquelineIO okay yeah this code I have in Xandra is a little bit messy. I'll do another pass and fix this up later on then, so that we can standardize on having a single way of representing addresses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whatyouhide that went in already 8340b4d

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah cool awesome! Will review this soon then

@whatyouhide whatyouhide changed the title ipv6 validation and ipv4 usage on v0.17 Fix IPv6 support in Xandra.Cluster Sep 2, 2023
@relistan
Copy link
Contributor

relistan commented Sep 6, 2023

@whatyouhide those test failures seem to have occurred with ScyllaDB. I am suspicious about that. Is there a way to get it to re-run these without committing?

@whatyouhide
Copy link
Owner

@relistan yep I can re-run the CI suite any time, if you can't (top-right on GH Actions) because you don't have commit access to the repo then just ping me and I'll do it for ya 🙃

@relistan
Copy link
Contributor

relistan commented Sep 7, 2023

I can't trigger a rebuild. But it looks like @jacquelineIO needs to run mix format so that will trigger a build when she pushes.

@jacquelineIO
Copy link
Contributor Author

So these test runs feel sporadic, I was running the formatter and it was down to 1 build failure (not counting the linter) now it's up to 4. And previously I had none before taking out the option to use rpc_address. So I'm not sure what to do here.

@whatyouhide
Copy link
Owner

We have some flaky tests (check the issue tracker), so I would suggest to not worry about tests too much for now, let's get the changes in and when we're ready I can take a look at the failures and merge even if they're there, no worries 🙃

@mattias01
Copy link
Contributor

The change to use rpc_address instead of peer is a fix we need to be able to use a newer Xandra version, which forces the use of auto discovery. I've created a fork where I change just that, but I saw this PR so I will not create a PR for it myself.
Thanks for your work.

@whatyouhide whatyouhide merged commit e04ff97 into whatyouhide:main Sep 12, 2023
2 of 6 checks passed
@whatyouhide
Copy link
Owner

Thanks @jacquelineIO! 💟

@britto britto deleted the jacqueline/ipv6-handling-changes-on-v17 branch December 29, 2023 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants