Update libnetwork to fix port binding issue #428
Merged
This updates balena-libnetwork to a version that should fix some port binding issues that may happen after balenaEngine or device crashes. Specifically, this balena-libnetwork version cherry-picks this unmerged upstream patch (with minor changes to make it compatible with recent Moby versions).
I cannot comment on the precise details, but this patch essentially changes the order of initialization of some network-related components in order to avoid getting into an inconsistent state.
Fixes #272 (or at least should fix some of its occurrences)
Testing
Tested for regressions: Engine unit tests and integration tests passing. Tried it in a meta-balena branch; all tests passed. Also did some manual testing on a Pi 3.
Testing for effectiveness is another story. We don't have a reliable way to reproduce the issue, so I created a version of the Engine meant to crash at a point that triggers the issue. I cannot tell for sure that this reproduces exactly the case we are seeing in practice, but the symptoms look close enough to give me good confidence that this is a step in the right direction.
I'll describe in detail what I did to reproduce the issue and test the patch, because this might be a good future reference should similar issues appear (or this one reappear).
First, based on this analysis we see that the issue happens when the Engine crashes at a more or less specific point. I tried to locate that point; I'm not sure I found it exactly, but I found something, and then added some code that allows us to force a crash right there:
For the test itself, I prepared two Engine versions: one containing the patch we are testing (`balena-engine-patched`), another containing the "crash code" above (`balena-engine-crashable`). I copied both to the data partition of a Pi 3, so that I can symlink `/usr/bin/balena-engine` to either of them as needed. And then:
- `balena-engine-crashable` (but not forcing a crash yet!), user service (container) running, all nice and fine.
- `ps aux | grep proxy`, check the PIDs. In my case, 2216 and 2226.
- `touch /mnt/data/crash-the-engine.please`
- `balena-engine-proxy` processes holding the ports. Check with `lsof -nP -iTCP -sTCP:LISTEN` and `ps aux | grep proxy`. Notice these are new processes (PIDs 2984 and 2993 in my case) created while bringing up the service again, before the forced crash.
- `reboot`
- `balena-engine-proxy` processes even before we try to start the service (IIUC, they are created as the Engine initializes the network subsystem; it's basically trying to restore the pre-reboot state).
- `mount -o remount,rw /`, `cd /usr/bin/`, `ln -nfs /mnt/data/balena-engine-patched balena-engine`.
- `reboot`
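For future reproductions, the symlink switch used in the steps above could be wrapped in a small helper. This is only a sketch: `use_engine`, `ENGINE_DIR` and `ENGINE_LINK` are hypothetical names introduced here for illustration (they are not part of the Engine or balenaOS), and the default paths are simply the ones mentioned above.

```shell
#!/bin/sh
# Sketch only. ENGINE_DIR and ENGINE_LINK default to the paths used in the
# steps above; both variables and the use_engine name are hypothetical.
ENGINE_DIR="${ENGINE_DIR:-/mnt/data}"
ENGINE_LINK="${ENGINE_LINK:-/usr/bin/balena-engine}"

# use_engine VARIANT
# Point ENGINE_LINK at $ENGINE_DIR/balena-engine-VARIANT
# (for example "patched" or "crashable").
use_engine() {
    target="$ENGINE_DIR/balena-engine-$1"
    if [ ! -x "$target" ]; then
        echo "no executable at $target" >&2
        return 1
    fi
    # -n: replace the link itself rather than following it,
    # -f: overwrite an existing link, -s: create a symbolic link.
    ln -nfs "$target" "$ENGINE_LINK"
}
```

With the file sourced, `mount -o remount,rw / && use_engine patched && reboot` would correspond to the switch to the patched Engine described above. The helper does not remount the root filesystem itself, so that step stays explicit.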
So, looks like the patch helped, Q.E.D. 🙂
Side note: If we reboot again between steps 9 and 10, the service starts successfully. In this case, we apparently don't create `balena-engine-proxy` processes before attempting to start the service. I don't know why this happens: why does this second reboot (apparently) make the internal state consistent again?
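The manual PID comparison done with `ps aux | grep proxy` in the steps above could also be scripted, which may help when checking cases like the extra reboot in the side note. The following is a minimal sketch, not something that was used in the tests described here: `proxy_pids` and `new_pids` are hypothetical helper names, and `proxy_pids` assumes `pgrep` is available on the device (if it is not, `ps aux | grep proxy` remains the fallback).

```shell
#!/bin/sh
# Sketch only; both function names are hypothetical.

# proxy_pids: print the PIDs of running balena-engine-proxy processes,
# one per line, in ascending order. Assumes pgrep is installed.
proxy_pids() {
    pgrep -f balena-engine-proxy | sort -n
}

# new_pids BEFORE AFTER
# Print every PID listed in AFTER that is not listed in BEFORE.
# Both arguments are whitespace-separated lists of PIDs.
new_pids() {
    # Collapse newlines/tabs to single spaces and pad with spaces so that
    # each PID can be matched as a whole word.
    before=" $(echo $1) "
    for pid in $2; do
        case "$before" in
            *" $pid "*) ;;          # already present before: skip
            *) echo "$pid" ;;       # not seen before: report it
        esac
    done
}
```

A possible usage is to run `before=$(proxy_pids)` prior to forcing the crash, `after=$(proxy_pids)` afterwards, and then `new_pids "$before" "$after"`. With the PIDs reported above (2216 and 2226 before, 2984 and 2993 after), this would print 2984 and 2993, confirming that the proxies holding the ports are new processes.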