refactor: Replace 'screen' use with Docker. WIP #54

Draft · figuernd wants to merge 5 commits into main from nathan.figueroa/config_consolidation

Conversation

@figuernd (Contributor) commented Dec 26, 2024

@rgov I want your opinion on this WIP refactor before I finish it.

Today we have accumulated a lot of management layers:

  • ROS launch files
  • phyto-arm Python script
  • screen within phyto-arm
  • Docker now wrapping everything
  • tmux for monitoring Docker containers has become standard in the field
  • Shell scripts for starting/stopping Docker and tmux

If we commit to always using Docker, we can reduce this to:

  • ROS launch files
  • phyto-arm Python script
  • Docker

So: remove all the tmux and screen use, since Docker persists sessions anyway; use phyto-arm to launch ROS directly within containers; and simplify config by having phyto-arm handle mounting of devices/volumes/ports from config automatically. That would eliminate the gotchas we see in the field where startup fails because the Docker config does not match the PA config.

This also removes the need to launch arms separately: which processes get launched is all set up in the config YAML. Operators will basically only have to touch the YAML going forward.
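
For illustration, a consolidated config could look roughly like the sketch below; the key names are placeholders rather than a final schema.

```yaml
# Illustrative sketch only; exact keys TBD
launch_args:
  classifier: true
arms:
  arm_ifcb: true        # hypothetical arm names, enabled per mission
  arm_chanos: false
devices:
  - /dev/ttyUSB0        # e.g. a serial device to pass through
volumes:
  - /mnt/data:/data     # host path : container path
tcp_ports:
  web_node: 8098
```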

Thoughts? Not working yet as-is, but I think there's enough there for you to get the approach.

@figuernd requested a review from rgov on December 26, 2024 21:19
@rgov marked this pull request as draft on December 26, 2024 21:25
@rgov (Member) commented Dec 26, 2024

I think it's reasonable, with a few notes:

  1. Why is there an outer tmux in the field today? I think it's just because the IFCBacquire container requires the run.sh script, since there's no systemd unit file?

  2. I prefer the nodes in the containers to have a fixed view of their filesystem. I.e., I'd rather the node just assume it uses /data, and then to run the container we use -v $DATA_DIR:/data. I don't like mapping a variable -v $DATA_DIR:$DATA_DIR and then telling the node to use $DATA_DIR. (See the sketch after this list.)

  3. What is the plan for running this on a development system? Are we constantly going to have to rebuild the container images while coding? Or mount the local Python code into the container?

  4. I'd like to be able to use podman or systemd-nspawn in the future. (I've completely migrated away from Docker on my Mac for at least a year now.) I don't like that Docker is a redundant service management and logging system. I'd rather logs go to journald and that we configure that appropriately.

  5. What's the transition plan for ROS 2 which doesn't use launch files but lets us write Python?
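
For concreteness, the contrast in point 2 looks roughly like this (image name and paths are placeholders):

```sh
# Fixed view: the node always reads/writes /data, regardless of the host layout
docker run --rm -v "$DATA_DIR":/data phyto-arm roslaunch ...

# Variable view: the host path leaks into both the container and the node config
docker run --rm -v "$DATA_DIR":"$DATA_DIR" -e DATA_DIR="$DATA_DIR" phyto-arm roslaunch ...
```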

@figuernd (Contributor, Author):

Thanks for the quick response.

  • Why is there an outer tmux in the field today? I think just because the IFCBacquire container requires the run.sh script because there's no systemd unit file?

I don't really know. It already seemed popular when I first joined the lab, and I think it's just a trusted tool for a team that wants to be able to quickly pop into a session to check status, but I never understood why, given that screen is already used by the phyto-arm script. I think it is dispensable so long as we give them a console they can reattach to for monitoring.

  • I prefer the nodes in the containers to have a fixed view of their filesystem. I.e., I'd rather the node just assumes to use /data and then to run the container we use -v $DATA_DIR:/data. I don't like mapping a variable -v $DATA_DIR:$DATA_DIR and then telling the node to use $DATA_DIR.

I'm trying to keep the phyto-arm script agnostic to the specifics of what the nodes and arguments are; one less place to wire up new nodes/arguments that way. I could require people to specify docker-style volumes in the YAML, e.g. /host/path:/data

  • What is the plan for running this on a development system? Are we constantly going to have to rebuild the container images while coding? Or mount the local Python code into the container?

Good question. Today I usually use development containers, which makes testing faster; .devcontainer/devcontainer.json facilitates this. To enable launching from within the container, we could add a switch to phyto-arm that has it skip building the container and run the ROS launch commands in the current context instead?

  • I'd like to be able to use podman or systemd-nspawn in the future. (I've completely migrated away from Docker on my Mac for at least a year now.) I don't like that Docker is a redundant service management and logging system. I'd rather logs go to journald and that we configure that appropriately.

I think I understand what you're saying, though I'm not sure that is in scope for this refactor?

  • What's the transition plan for ROS 2 which doesn't use launch files but lets us write Python?

I have put very little thought into a ROS 2 transition plan for PhytO-ARM. Maybe it should be a discussion at the upcoming workshop.

@rgov (Member) commented Dec 28, 2024

Actually, your "reduced" list of layers is what we have now. The container passes DONT_SCREEN=1 to the phyto_arm script, to bypass the extra layer.

I guess on revisiting I'm like -0.5 on this. I don't feel strongly about the design here and I support the idea to reduce complexity and make management easier. But let me explain why I designed it this way:

As you know, the canonical ROS1 node management system is roslaunch, which runs roscore (rosmaster and rosout), interprets the godawful XML file, and then launches the nodes described therein.

The launch XML file can be parameterized, e.g., <group if="$(arg classifier)" />, with these parameters passed as command line arguments (roslaunch ... classifier:=true). But this is not really a good experience for our users, and it's confusing that there's also the parameter server which is another, orthogonal set of configuration options, just consumed by nodes instead of roslaunch.
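
For reference, the mechanism looks roughly like this; the node and package names here are only illustrative.

```xml
<!-- main.launch (illustrative); invoked as: roslaunch phyto_arm main.launch classifier:=true -->
<launch>
  <arg name="classifier" default="false" />
  <group if="$(arg classifier)">
    <node name="classifier" pkg="phyto_arm" type="classifier_node.py" />
  </group>
</launch>
```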

So I came up with the phyto_arm script hack, which is to take the parameter server config file and shove some keys into it that are turned into roslaunch arguments. Tada, it's all configured in one place, and the user just passes the config file. The original version was just 50 lines. But unfortunately there's still a lot of complexity, due to roslaunch just being a bad, limited design.
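
A minimal sketch of that idea (not the actual phyto_arm code; the launch_args key and the launch file names are made up):

```python
#!/usr/bin/env python3
# Sketch of the "config file -> roslaunch arguments" hack described above.
import subprocess
import sys

import yaml

def main(config_path):
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Keys under a hypothetical "launch_args" section become roslaunch arguments;
    # the rest of the file stays as-is for the parameter server (rosparam load).
    args = [f"{key}:={value}" for key, value in config.get("launch_args", {}).items()]

    subprocess.run(["roslaunch", "phyto_arm", "main.launch", *args], check=True)

if __name__ == "__main__":
    main(sys.argv[1])
```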

The other advantage of having the outer phyto_arm script is that we can control environment variables -- namely ROS_LOG_DIR and the Python virtual environment -- and monitor execution of the children so we can tell if something crashes and dispatch a notification.

Finally, the screen layer was added so that you could control the system like a system service, with start and stop. This was expedient but maybe not the best solution, compared to a proper systemd unit file.

Docker came around kind of independently to simplify deployment, and since it provides an alternative service management model (container start/stop), there was no need for screen; thus the container passes DONT_SCREEN=1 through to the phyto_arm script, which bypasses that additional layer.

tmux appears to be there because of the IFCBacquire run.sh script, probably. We can remove that layer by turning it into a proper system service.

  • I'd like to be able to use podman or systemd-nspawn in the future. (I've completely migrated away from Docker on my Mac for at least a year now.) I don't like that Docker is a redundant service management and logging system. I'd rather logs go to journald and that we configure that appropriately.

I think I understand what you're saying, though I'm not sure that is in scope for this refactor?

I'm just saying I don't want to create more of an obstacle to migrating away from Docker by making Docker's service management more central to the project. We already run all our containers with --rm so they don't persist, for example.

@mbrosnahan (Collaborator):

My two cents here... The extra tmux layer was helpful initially, not only because it provided nohup service but also because it captured standard output from IFCBacquire and P-A. To my knowledge we aren't capturing this info properly in a log file anywhere, so tmux panes are often a good/easy way to see where nodes are crashing and to monitor the current status of a deployment.
An example of the latter:
This spring, teams were in the field coordinating their sampling from a pump while P-A paused on retrieval of the CHANOS sensor (step profiling behavior). Our initial plan had been to just monitor P-A and CHANOS depth via Foxglove, but this was often very laggy. Field teams had much better success watching the terminal output from arm_chanos when connecting locally via WiFi.
I can see how creating human-readable log files with these outputs could be a better solution. Tmux is nice, but learning how to create and navigate panes is a pain. It's definitely easier to review log files or 'tail -f' to get live feedback from a particular arm's systems. THAT SAID, there's still a need for the user to know to look for these, where they are, how many logs are available for a particular config, etc. A nice thing about the tmux solution that Nathan implemented last season is that all high-level arm components were spawned in parallel panes of a single tmux session, effectively communicating to operators everything the system is doing on its initial launch. This has promise, even if in reality it was only really fleshed out and complete for two winch barge systems (Alma with CHANOS profiling and Cosmic with ESP profiling).

@figuernd (Contributor, Author):

Taking a step back and trying to think about it from a user story POV, I want to be able to:

  • Set up a deployment by just cloning the repo, editing the config file, and hitting run. Anything beyond this inevitably requires developer intervention.
  • Supervise deployments over SSH with a single, universal command; e.g. phyto-arm attach. Logs may carry the same info as standard out/err but finding and parsing logs inevitably requires developer intervention.

That is essentially my goal with this PR, and I'm flexible on the approach. What makes the most sense to me is to go back to using the phyto-arm script the way it was originally intended. I don't think an operator in the field has actually used that script directly in over a year, because it's buried within the container now. I'm proposing we use it to manage the containers rather than vice versa.
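
As a rough sketch of the workflow I have in mind (the repo URL, paths, and subcommand names below are placeholders):

```sh
# Hypothetical operator workflow
git clone https://github.com/<org>/PhytO-ARM.git && cd PhytO-ARM
cp configs/example.yaml configs/deployment.yaml    # edit for this deployment
./phyto-arm start configs/deployment.yaml          # start everything described in the config
./phyto-arm attach                                 # later, over SSH: watch live output
```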

@rgov (Member) commented Dec 31, 2024

I agree with that philosophy. There should be some command that makes it easy for operators.

The implementation details of subprocess orchestration and monitoring are separate from that concern, and they work decently well as-is.

@mbrosnahan (Collaborator):

"Anything beyond this inevitably requires developer intervention."

To the extent this is true, we are falling short of the goal for the overall PhytO-ARM effort. Very, very few teams who might adopt P-A will have anyone on staff who would call themselves a developer. That shouldn't mean there isn't a path for them to adapt and repurpose this code independently.

I think config-only interaction is our goal for "day-to-day" / operational use. In these cases we want the process of standing up a replacement host IFCB to be as streamlined as possible.

@figuernd (Contributor, Author):

Okay, we all seem agreed on the goal. In terms of implementation, if we continue with the current approach using phyto-arm as a hidden process orchestrator running within the container, we could create a new script that becomes the new user interface for parsing and passing config, and hide the current phyto-arm script somewhere to avoid confusion. Would that satisfy your concerns, @rgov?

@figuernd force-pushed the nathan.figueroa/config_consolidation branch from b2a13c9 to 8530ef9 on January 1, 2025 20:15
@figuernd (Contributor, Author) commented Jan 1, 2025

I've changed this to use a new pa CLI which wraps all container management, and reverted phyto-arm so that it continues to manage within the container as before. @rgov I'm testing now, but if you're okay with this approach I would like to get it merged before Monday so that we can use the simplified UI during the workshop.
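
Something along these lines (illustrative; the exact subcommands in the PR may differ):

```sh
# Rough shape of the pa CLI; subcommand names and flags are still in flux
pa start configs/deployment.yaml   # read the config and start containers accordingly
pa status                          # show what is running
pa attach arm_chanos               # follow the live output of one launch
pa stop                            # stop everything
```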

@rgov (Member) commented Jan 1, 2025

What are "processes" in this context? They seem to be containers? What is going on in the different containers?

@rgov (Member) commented Jan 1, 2025

I think a pa or pactl tool for operator control of the overall system is a good idea.

I am reluctant to add a 300 line script with new Python package dependencies that essentially implements an alternative docker compose.

I am pondering the benefits of running multiple containers, with the same container image, just with different launch files. As far as I know you can just run roslaunch a.launch b.launch c.launch so it could all be done in one container. The primary benefit I see is that we don't have the same stop-everything-on-error behavior, but this is also achievable in other ways.

At least if we go ahead with this we should define the interface clearly so that if we change out the implementation later, operators aren't disrupted.

I am not up to speed on the "multi-arm" concept.

@figuernd (Contributor, Author) commented Jan 2, 2025

What are "processes" in this context? They seem to be containers? What is going on in the different containers?

A ROS launch file. In this case, ROS core plus the associated "main" launch file, and then any enabled arms. There is probably a better name than "processes". "Launches"?

I am reluctant to add a 300 line script with new Python package dependencies that essentially implements an alternative docker compose.

This is not an alternative docker compose. pa shields the user from any need to know about the underlying container implementation. Their view of the world is only what they put in the configuration YAML, which I've attempted to abstract such that it does not leak any knowledge of Docker. Likewise, Docker compose does not support PhytO-ARM configuration files.

I am pondering the benefits of running multiple containers, with the same container image, just with different launch files. As far as I know you can just run roslaunch a.launch b.launch c.launch so it could all be done in one container. The primary benefit I see is that we don't have the same stop-everything-on-error behavior, but this is also achievable in other ways.

It can be done with one container. There are tradeoffs, the obvious being the isolation that containerization brings, with easy kill and reboot as you noted. Another is the ease of attaching to the STDOUT specific to the "process" you're interested in.

Today we actually do multiple ROS launches within a single container, and it seems to work okay. One thing this requires, though, is peering into the container to determine what is running, and trying to manage container internals from the outside. Multiple containers, on the other hand, allow pa to remain a little more agnostic to the internal state of each container, since its only task is to read configuration and then start/attach/stop containers accordingly.

I am not up to speed on the "multi-arm" concept.

"arm" is essentially just an abstraction for a scientific payload and its behavior, since IFCB is no longer the only winched instrument in use. We segment them into different ROS launches because which arms get used is highly mission dependent.

Comment on lines +13 to +15
tcp_ports:
  bridge_node: 9090
  web_node: 8098
A review comment from a Member:

It would be nice to find a more elegant way of expressing this. Here, it is attached to the "process" structure, which is pretty far from the actual node configuration some 250 lines later (and called just web).
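
One possible (purely illustrative) alternative would be to keep the port next to the node it configures:

```yaml
# Illustrative only; not the schema in this PR
web:
  tcp_port: 8098
bridge:
  tcp_port: 9090
```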
