Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update libnetwork to fix port binding issue #428

Merged
merged 1 commit into from
Apr 24, 2023

Conversation

lmbarros
Copy link
Contributor

@lmbarros lmbarros commented Apr 20, 2023

This updates balena-libnetwork to a version that should fix some port binding issues that may happen after balenaEngine or device crashes. Specifically, this balena-libnetwork version cherry-picks this unmerged upstream patch (with minor changes to make it compatible with recent Moby versions).

I cannot comment on the precise details, but this patch essentially changes the order of initialization of some network-related components in order to avoid getting into a inconsistent state.

Fixes #272 (at least shall fix some of its occurrences)

Testing

Tested for regressions: Engine unit tests and integration tests passing. Tried it in a meta-balena branch; all tests passed. Also did some manual testing on a Pi 3.

Testing for effectiveness is another story. We don't have a reliable way to reproduce the issue, so I created a version of the Engine meant to crash at a point that triggers the issue. Now, I cannot tell for sure this is reproducing exactly the same case we are seeing in practice, but to me the symptoms look close enough to give a good confidence this is a step in the right direction.

I'll describe in details what I did to reproduce the issue and test the patch because this might be a good future reference should other similar issues appear (or this one re-appear).

First, based on this analysis we see that the issue happens when the Engine crashes at a more or less specific point. I tried to locate such point; not sure I found it exactly, but I found something -- and then added some code that allows us to force a crash right there:

diff --git a/daemon/container_operations.go b/daemon/container_operations.go
index af69e0474b..029d8176a6 100644
--- a/daemon/container_operations.go
+++ b/daemon/container_operations.go
@@ -825,6 +825,14 @@ func (daemon *Daemon) connectToNetwork(container *container.Container, idOrName
               return err
       }

+       ///////////////////
+       _, err = os.Stat("/mnt/data/crash-the-engine.please")
+       if err == nil {
+               os.Remove("/mnt/data/crash-the-engine.please")
+               panic("Geronimoooooo!")
+       }
+       //////////////////
+
       if !container.Managed {
               // add container name/alias to DNS
               if err := daemon.ActivateContainerServiceBinding(container.Name); err != nil {

For the test itself, I prepared two Engine versions: one containing the patch we are testing (balena-engine-patched), another containing the "crash code" above (balena-engine-crashable). I copied both to the data partition of a Pi 3, so that I can symlink /usr/bin/balena-engine to either of them as needed. And then:

  1. First let's reproduce the issue. Initial state: running balena-engine-crashable (but not forcing a crash yet!), user service (container) running, all nice and fine.
  2. Run ps aux | grep proxy, check the PIDs. In my case, 2216 and 2226.
  3. touch /mnt/data/crash-the-engine.please
  4. Restart the service.
  5. Engine crashes, service doesn't restart.
  6. But we get our stale balena-engine-proxy processes holding the ports. Check with lsof -nP -iTCP -sTCP:LISTEN and ps aux | grep proxy. Notice these are new processes (PIDs 2984 and 2993 in my case) created while bringing up the service again, before the forced crash.
  7. reboot
  8. We get balena-engine-proxy processes even before we try to start the service (IIUC, they are created as the Engine initializes the network subsystem; it's basically trying to restore the pre-reboot state.)
  9. As the device tries to start the service, we get the error we were looking for: "Failed to allocate and map port 80-80: Bind for 0.0.0.0:80 failed: port is already allocated". Service remains in the "Installed" state.
  10. Now let's test the patch. Redo steps 1-6.
  11. Replace the Engine with the patched version: mount -o remount,rw /, cd /usr/bin/, ln -nfs /mnt/data/balena-engine-patched balena-engine.
  12. reboot
  13. The service starts normally, no port binding issues at all!

So, looks like the patch helped, Q.E.D. 🙂

Side note: If we reboot again between steps 9 and 10 , the service starts successfully. In this case, we apparently don't create balena-engine-proxy processes before attempting to start the service. I don't know why this happens -- why does this second reboot (apparently) makes the internal state consistent again?

This new version has a patch cherry-picked from here:
moby/libnetwork#1805

This patch is meant to avoid cases in which libnetwork internal state
gets inconsistent in case of crashes.

Signed-off-by: Leandro Motta Barros <[email protected]>
Change-type: patch
@lmbarros lmbarros force-pushed the lmb/libnetwork-sandbox-cleanup branch from 0217c4f to 56aa633 Compare April 20, 2023 20:43
@lmbarros lmbarros changed the title [WIP] Update libnetwork, bind port fix Update libnetwork to fix port binding issue Apr 20, 2023
@lmbarros lmbarros marked this pull request as ready for review April 20, 2023 20:51
@lmbarros lmbarros marked this pull request as draft April 20, 2023 21:19
@lmbarros lmbarros self-assigned this Apr 20, 2023
@lmbarros lmbarros marked this pull request as ready for review April 20, 2023 22:11
@klutchell
Copy link
Contributor

Very clever test case reproduction, looks good to me!

@lmbarros lmbarros merged commit 232c9c1 into master Apr 24, 2023
@lmbarros lmbarros deleted the lmb/libnetwork-sandbox-cleanup branch April 24, 2023 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port already in use, because proxy keeps binding to the wrong container IP
3 participants