
Enable QDR port 5671 to listen on both IPv4 and IPv6 #599

Merged · 4 commits · Jun 10, 2024

Conversation

@chaturvedi-kna (Contributor) commented May 26, 2024

Summary

This PR updates the Qpid Dispatch Router (QDR) configuration to allow port 5671 to listen on all local addresses, both IPv4 and IPv6. Previously, the port was configured to listen only on IPv4 addresses. With this change, QDR can support single-stack IPv6 clusters as well as dual-stack configurations.

Changes

  • Modified the Ansible task in component_qdr.yml to configure QDR to listen on all interfaces (both IPv4 and IPv6) on port 5671.
  • Updated the listener configuration to use host: '', which is the standard way to specify all interfaces for both IPv4 and IPv6 (a sketch of the resulting entry follows this list).
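
A minimal sketch of the resulting edge listener entry (the port value comes from the PR title; the rest mirrors the diff in this PR):

edgeListeners:
- expose: true
  host: ''      # empty host: bind on all local addresses, IPv4 and IPv6 (was 0.0.0.0, IPv4 only)
  port: 5671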

Testing

  • Dual-stack OCP 4.12.6: Verified that QDR is listening on both IPv4 and IPv6 addresses on port 5671.
  • Single-stack IPv6 OCP 4.14.0: Confirmed that QDR is correctly listening on IPv6 addresses.
  • Connected OpenStack clusters to verify connectivity and functionality.

Motivation and Context

This change is necessary to ensure compatibility with IPv6-only Kubernetes clusters, which are becoming more common as organizations prepare for the future of networking. Supporting both IPv4 and IPv6 ensures broader compatibility and future-proofs the service telemetry operator.

Related Issues

  • None


@csibbitt (Collaborator) left a comment
LGTM. I haven't tested it, but CI seems to be passing.

@csibbitt requested a review from vkmc May 27, 2024 16:28
@chaturvedi-kna (Contributor, Author)

Does the CI have an IPv6-only environment test case?

@chaturvedi-kna (Contributor, Author)

It looks like the change I made has an impact on the Smart Gateways (SGs) for IPv6-only clusters.
I am getting the error below:
cookie.go:209: net/http: invalid cookie.domain "fd01:0:0:1::896"; dropping domain attribute

Can someone provide insight into which component may be causing this problem? I am guessing that somewhere in the SG it is also configured to use IPv4 only for the AMQP connection.

@csibbitt (Collaborator)

> Does the CI have an IPv6-only environment test case?

No

@csibbitt (Collaborator)

> It looks like the change I made has an impact on the Smart Gateways (SGs) for IPv6-only clusters. I am getting the error below: cookie.go:209: net/http: invalid cookie.domain "fd01:0:0:1::896"; dropping domain attribute
>
> Can someone provide insight into which component may be causing this problem? I am guessing that somewhere in the SG it is also configured to use IPv4 only for the AMQP connection.

The SG connects to AMQP from the sg-bridge component, which needs a small fix here: infrawatch/sg-bridge#37, but this seems like a different problem.

Since it says cookie.go, it must be from sg-core (it's written in Go, while the bridge is in C). I'm guessing it's probably in the Prometheus exporter code, but I can't put my finger on it yet. Can you provide any additional logging context or a stack trace for that error?

@chaturvedi-kna (Contributor, Author)

> The SG connects to AMQP from the sg-bridge component, which needs a small fix here: infrawatch/sg-bridge#37, but this seems like a different problem.
>
> Since it says cookie.go, it must be from sg-core (it's written in Go, while the bridge is in C). I'm guessing it's probably in the Prometheus exporter code, but I can't put my finger on it yet. Can you provide any additional logging context or a stack trace for that error?

Logging context:
I got the following logs from the Smart Gateway pods that were created.
The IPv6 address that shows up here is the IP assigned to the pod itself (its own IP).

2024/05/24 13:42:57 provider.go:129: Defaulting client-id to system:serviceaccount:service-telemetry:smart-gateway
2024/05/24 13:42:57 provider.go:134: Defaulting client-secret to service account token /var/run/secrets/kubernetes.io/serviceaccount/token
2024/05/24 13:42:57 provider.go:358: Delegation of authentication and authorization to OpenShift is enabled for bearer tokens and client certificates.
2024/05/24 13:42:57 oauthproxy.go:210: mapping path "/" => upstream "http://localhost:8081/"
2024/05/24 13:42:57 oauthproxy.go:237: OAuthProxy configured for Client ID: system:serviceaccount:service-telemetry:smart-gateway
2024/05/24 13:42:57 oauthproxy.go:247: Cookie settings: name:_oauth_proxy secure(https):true httponly:true expiry:168h0m0s domain: samesite: refresh:disabled
2024/05/24 13:42:57 http.go:64: HTTP: listening on 127.0.0.1:4180
I0524 13:42:57.094070 1 dynamic_serving_content.go:132] "Starting controller" name="serving::/etc/tls/private/tls.crt::/etc/tls/private/tls.key"
2024/05/24 13:42:57 http.go:110: HTTPS: listening on [::]:8083
2024/05/28 11:51:31 cookie.go:209: net/http: invalid Cookie.Domain "fd01:0:0:3::883"; dropping domain attribute
2024/05/28 11:52:01 cookie.go:209: net/http: invalid Cookie.Domain "fd01:0:0:3::883"; dropping domain attribute
2024/05/28 11:52:31 cookie.go:209: net/http: invalid Cookie.Domain "fd01:0:0:3::883"; dropping domain attribute

@vkmc (Collaborator) left a comment
Looks good, thanks for your contribution. As already discussed, further changes are required in other STF components for this to work, and we don't have a CI w/IPv6.

@csibbitt (Collaborator)

> Logging context: I got the following logs from the Smart Gateway pods that were created. The IPv6 address that shows up here is the IP assigned to the pod itself (its own IP).
> ...
> 2024/05/24 13:42:57 oauthproxy.go:247: Cookie settings: name:_oauth_proxy secure(https):true httponly:true expiry:168h0m0s domain: samesite: refresh:disabled
> 2024/05/24 13:42:57 http.go:64: HTTP: listening on 127.0.0.1:4180
> I0524 13:42:57.094070 1 dynamic_serving_content.go:132] "Starting controller" name="serving::/etc/tls/private/tls.crt::/etc/tls/private/tls.key"
> 2024/05/24 13:42:57 http.go:110: HTTPS: listening on [::]:8083
> 2024/05/28 11:51:31 cookie.go:209: net/http: invalid Cookie.Domain "fd01:0:0:3::883"; dropping domain attribute

Thanks, this makes more sense now! The error is coming from the oauth-proxy container, not from the SG itself. It looks like this may be a problem in Go's cookie handling, though the ultimate impact is unclear without some additional testing. Here is another report of IPv6-incompatible behavior in Go's cookie.go: golang/go#65521

@@ -227,7 +227,7 @@
 prefix: ceilometer
 edgeListeners:
 - expose: true
-host: 0.0.0.0
+host: ''
A Member left a comment:

Just a note:

While the documentation [1] says that an empty string makes QDR listen on all local interfaces and [::] makes QDR listen on all IPv6 addresses, my manual testing with those two values was not successful:

Case 1:

2024-05-29 19:04:10.899935 +0000 CONN_MGR (info) Configured Listener: '':5999 proto=any, role=normal
2024-05-29 19:04:10.909703 +0000 SERVER (notice) Operational, 4 Threads Running (process ID 1)
2024-05-29 19:04:10.909981 +0000 SERVER (notice) Process VmSize 114.67 MiB (125.40 GiB available memory)
2024-05-29 19:04:10.910211 +0000 SERVER (error) Listener error on '':5999: Name or service not known - listen on '':5999 (proton:io)
2024-05-29 19:04:10.910241 +0000 SERVER (critical) Shutting down, required listener failed '':5999

Case 2:

2024-05-29 19:11:11.596032 +0000 CONN_MGR (info) Configured Listener: [::]:5999 proto=any, role=normal
2024-05-29 19:11:11.601813 +0000 SERVER (notice) Operational, 4 Threads Running (process ID 1)
2024-05-29 19:11:11.601937 +0000 SERVER (notice) Process VmSize 114.67 MiB (125.40 GiB available memory)
2024-05-29 19:11:11.602230 +0000 SERVER (error) Listener error on [::]:5999: Name or service not known - listen on [::]:5999 (proton:io)
2024-05-29 19:11:11.602280 +0000 SERVER (critical) Shutting down, required listener failed [::]:5999

The only way I could force QDR to listen on an IPv6 address was to specify the address explicitly (see the sketch below).

[1] https://qpid.apache.org/releases/qpid-dispatch-1.17.0/man/qdrouterd.conf.html#_listener
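
For illustration, forcing an explicit address would look roughly like this in the Interconnect edge listener; the IPv6 address below is a placeholder, not a value from this PR or its testing:

edgeListeners:
- expose: true
  host: 'fd01:0:0:1::10'   # hypothetical explicit IPv6 address; QDR binds to it directly rather than to a wildcard
  port: 5671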

A Collaborator left a comment:

This test is not official since my local build environment was giving me trouble today, but:

  • I edited my Interconnect object to say host: '' like in this patch
  • OCP returns it to me with empty double-quotes instead
$ oc get Interconnect -o yaml | grep host
      host: ""
  • The operator renders the config file with no host field at all
$ oc rsh pod/default-interconnect-574d468757-wshtn grep -C 4 "port: 5671" /opt/interconnect/etc/qdrouterd.conf
}

listener {
   role: edge
   port: 5671
   saslMechanisms: PLAIN
   authenticatePeer: true
   sslProfile: openstack
}

  • QDR is listening on both protocols
$ oc rsh pod/default-interconnect-574d468757-wshtn netstat -anp | grep :5671
tcp        0      0 0.0.0.0:5671            0.0.0.0:*               LISTEN      1/qdrouterd         
tcp6       0      0 :::5671                 :::*                    LISTEN      1/qdrouterd 

Note that this was done on top of upstream builds and by manually editing the Interconnect object, not from actually running this PR code. I'll need to figure out what's wrong with my cluster; I think I'm just using it for too many things at once.

A Collaborator left a comment:

I also tried this with no host field in the Interconnect object at all, and the results were the same. Maybe that's the preferred way to write it, to avoid any future doubt about the problems Martin has shown (see the sketch below).
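
A minimal sketch of what omitting the field would look like, assuming the same edgeListeners structure as this PR's diff:

edgeListeners:
- expose: true
  # no host field at all: the operator renders the qdrouterd listener without a host entry,
  # and qdrouterd then binds to both the IPv4 and IPv6 wildcard addresses
  port: 5671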

@@ -227,7 +227,7 @@
 prefix: ceilometer
 edgeListeners:
 - expose: true
-host: 0.0.0.0
+host: ''
A Collaborator left a comment:

Suggested change: remove the host: '' line (i.e. omit the host field entirely).

Zuul CI:

This change depends on a change that failed to merge.

Change infrawatch/sg-bridge#37 is needed.

@mrunge (Member) commented May 31, 2024

Can we please get some insight into why infrawatch/sg-bridge#37 does not pass CI, since both of these changes are needed for IPv6 to work (plus potentially more)?

@elfiesmelfie (Collaborator)

I am updating this PR to sync with master and include #603, so that it doesn't get stuck in the gate.

Zuul CI:

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/3774ffdb897742a4943a19c165568454

✔️ stf-crc-ocp_412-local_build SUCCESS in 37m 33s
✔️ stf-crc-ocp_414-local_build SUCCESS in 41m 33s
stf-crc-ocp_412-local_build-index_deploy RETRY_LIMIT in 20m 23s
✔️ stf-crc-ocp_414-local_build-index_deploy SUCCESS in 44m 11s

@chaturvedi-kna (Contributor, Author) commented May 31, 2024

recheck

Zuul CI:

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/531a6b4354344aaf9c2972c45fb94388

stf-crc-ocp_412-local_build FAILURE in 24m 22s
✔️ stf-crc-ocp_414-local_build SUCCESS in 44m 17s
stf-crc-ocp_412-local_build-index_deploy RETRY_LIMIT in 21m 01s
✔️ stf-crc-ocp_414-local_build-index_deploy SUCCESS in 51m 55s

@vkmc (Collaborator) commented Jun 3, 2024

recheck

@csibbitt enabled auto-merge (squash) June 5, 2024 21:05
@csibbitt (Collaborator) commented Jun 5, 2024

recheck

Zuul CI:

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/1fcf852d1d8a4bbf89d3917a4a1d8eb6

stf-crc-ocp_412-local_build FAILURE in 28m 43s
stf-crc-ocp_414-local_build RETRY_LIMIT in 9m 07s
stf-crc-ocp_412-local_build-index_deploy FAILURE in 29m 03s
stf-crc-ocp_414-local_build-index_deploy FAILURE in 27m 25s

@elfiesmelfie (Collaborator)

The Zuul jobs are blocked by this change: infrawatch/sg-core#135, which will be resolved shortly.

@csibbitt disabled auto-merge June 10, 2024 17:03
@csibbitt merged commit b46f215 into infrawatch:master Jun 10, 2024
1 check failed