Implement supervision of processes and services via catch_unwind #7

Open
andrewjstone opened this issue Dec 5, 2016 · 8 comments

Comments

@andrewjstone
Owner

andrewjstone commented Dec 5, 2016

https://doc.rust-lang.org/std/panic/fn.catch_unwind.html

@matklad

matklad commented May 2, 2017

Just some random idea, feel free to ignore if it doesn't fit the project's intended use case!

catch_unwind does not provide the level of fault isolation that BEAM provides. For example, an actor could eat all the CPU with loop {}, segfault in unsafe code or in a call to an external library, or abort on OOM. It looks like the only way to provide strong isolation in Rust is to spawn a separate OS-level process. It'd be cool to be able to spawn important actors, like supervisors, in a separate process!

@andrewjstone
Owner Author

I think that this is out of scope for rabble. External supervisors such as supervisord, upstart, and systemd can already restart the process in case of OOM or a segfault. For an infinite loop, I'm not sure how it could even be detected. Isn't this the halting problem? Unfortunately, I think the best way to prevent that is to simply tell users: "Don't do that". For CPU-heavy tasks, users can already spawn workers in a separate thread. Lightweight processes are really only useful for I/O-bound work, so any loops should immediately be suspect.

The goal I had in mind for catch_unwind was only to catch Rust panics, such as those from bad unwrap() calls, inside a process or thread so that it could be restarted without restarting the whole OS process. As you have pointed out, this isn't ideal, but I think it's the best that can reasonably be achieved with what Rust provides today.
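For readers following along, here is a minimal sketch of what catching an unwrap() panic with std::panic::catch_unwind looks like. The handle and handle_guarded functions are hypothetical and are not part of rabble. Note that this only works when the binary is built with the default unwinding panic strategy; with panic = "abort" nothing is caught.

```rust
use std::panic;

/// Hypothetical message handler (not from rabble): panics via `unwrap()`
/// when the message is not a valid integer.
fn handle(msg: &str) -> i32 {
    msg.parse::<i32>().unwrap()
}

/// Runs the handler and turns a panic into an `Err` carrying the panic
/// message, instead of letting it unwind through the caller.
fn handle_guarded(msg: &str) -> Result<i32, String> {
    panic::catch_unwind(|| handle(msg)).map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}
```

The default panic hook still prints the panic message to stderr; catch_unwind only stops the unwinding.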

@andrewjstone
Owner Author

Whoops. Looks like I closed this by accident.

@andrewjstone
Owner Author

andrewjstone commented May 26, 2017

@z1mvader mentioned he was willing to help with implementing supervision in rabble. I think it's virtually impossible to provide full supervision due to the lack of isolation provided by Rust. However, it is easy to catch panics and restart any lightweight process or service that crashes due to bad unwrap() calls. Since the guarantees provided by Rust are much weaker than Erlang's, and actors don't have a hierarchy in rabble, a simple top-level config is probably enough for supervision: if any actor crashes more than 3 times in X seconds, then let the panic crash the program. Alternatively, we could just always restart the process.
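As an illustration of the "more than 3 crashes in X seconds" rule, a top-level policy could be sketched roughly like this. RestartPolicy and should_restart are hypothetical names, not an existing rabble API, and the sliding-window interpretation of "X seconds" is an assumption.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical top-level restart policy (not part of rabble): tolerate up to
/// `max_restarts` crashes inside a sliding `window`, then give up.
struct RestartPolicy {
    max_restarts: usize,
    window: Duration,
    crashes: VecDeque<Instant>,
}

impl RestartPolicy {
    fn new(max_restarts: usize, window: Duration) -> Self {
        RestartPolicy { max_restarts, window, crashes: VecDeque::new() }
    }

    /// Records a crash at `now`. Returns `true` if the actor should be
    /// restarted, or `false` if the panic should be allowed to crash the program.
    fn should_restart(&mut self, now: Instant) -> bool {
        // Forget crashes that have fallen out of the window.
        while let Some(&oldest) = self.crashes.front() {
            if now.duration_since(oldest) > self.window {
                self.crashes.pop_front();
            } else {
                break;
            }
        }
        self.crashes.push_back(now);
        self.crashes.len() <= self.max_restarts
    }
}
```

With RestartPolicy::new(3, Duration::from_secs(10)), the first three crashes within ten seconds are restarted and the fourth is not. Passing the timestamp in explicitly keeps the policy easy to test.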

I think this can be implemented simply, at least for lightweight processes, by wrapping the executor run loop in panic::catch_unwind.

You should be able to do the same thing for thread-based services by wrapping their loop as well: https://github.com/andrewjstone/rabble/blob/master/src/service/service.rs#L51
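A rough sketch of wrapping a thread-based service loop might look like the following. This is not rabble's Service code: service_loop and run_supervised are hypothetical, and a plain std::sync::mpsc channel stands in for whatever the real service reads from.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical service loop (not rabble's): sums incoming numbers until the
/// channel closes, and panics on a negative one.
fn service_loop(rx: &Receiver<i64>) -> i64 {
    let mut total = 0;
    for n in rx.iter() {
        if n < 0 {
            panic!("negative input: {}", n);
        }
        total += n;
    }
    total
}

/// Runs the loop on its own thread and restarts it whenever it panics.
/// Returns the total from the run that finished cleanly and the restart count.
fn run_supervised(inputs: Vec<i64>) -> (i64, usize) {
    let (tx, rx) = mpsc::channel();
    for n in inputs {
        tx.send(n).unwrap();
    }
    drop(tx); // closing the channel lets the loop terminate

    let handle = thread::spawn(move || {
        let mut restarts = 0usize;
        loop {
            // AssertUnwindSafe: we accept that the loop's state is discarded on panic.
            match panic::catch_unwind(AssertUnwindSafe(|| service_loop(&rx))) {
                Ok(total) => return (total, restarts),
                Err(_) => restarts += 1, // restart with fresh state
            }
        }
    });
    handle.join().unwrap()
}
```

In this sketch the message that triggered the panic is consumed and lost, and the restarted loop begins with fresh state. Whether that is acceptable would depend on the service.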

@andrewjstone
Owner Author

Hmm, maybe it's not that straightforward. I think in order to figure out which process is crashing in the executor, you'd need to wrap this line: https://github.com/andrewjstone/rabble/blob/master/src/executor/executor.rs#L156

You'd need to save the state of the process prior to running it, and then restore it after a crash. I'm not sure what kind of perf penalty this would have, in addition to the penalty of calling catch_unwind for every message handled. Any perf overhead would have to be minimized to make this worthwhile.
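To make the per-message wrapping concrete, here is a sketch under heavy assumptions: the Process trait, Executor struct, Delivery enum, and string pids below are simplified stand-ins, not rabble's actual types. The point is only that wrapping each handler call, rather than the whole run loop, tells the executor which pid panicked.

```rust
use std::collections::HashMap;
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical, simplified stand-in for a lightweight process.
trait Process {
    fn handle(&mut self, msg: i32) -> i32;
}

/// Example process: adds 100 / msg to a running total (panics when msg is 0).
struct Divider {
    total: i32,
}

impl Process for Divider {
    fn handle(&mut self, msg: i32) -> i32 {
        self.total += 100 / msg;
        self.total
    }
}

#[derive(Debug, PartialEq)]
enum Delivery {
    Handled(i32),
    Crashed(String), // pid of the process whose handler panicked
    NoSuchProcess,
}

struct Executor {
    processes: HashMap<String, Box<dyn Process>>,
}

impl Executor {
    /// Delivers one message, catching a panic from that single handler call
    /// so the crashing pid is known.
    fn deliver(&mut self, pid: &str, msg: i32) -> Delivery {
        let process = match self.processes.get_mut(pid) {
            Some(p) => p,
            None => return Delivery::NoSuchProcess,
        };
        match panic::catch_unwind(AssertUnwindSafe(|| process.handle(msg))) {
            Ok(reply) => Delivery::Handled(reply),
            Err(_) => Delivery::Crashed(pid.to_string()),
        }
    }
}
```

This sketch does not address the state problem: after Crashed, the process is still in the map with whatever state it had when it panicked.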

@ghost

ghost commented May 26, 2017

OK, I'll start working on this as soon as possible. Should I post here if I have any questions?

@andrewjstone
Owner Author

Yeah, post any issues you have here so that we have a full history for the community. Thanks @z1mvader

@andrewjstone
Owner Author

Note: after taking a further look, I'm not sure how doable this is with the code as is. Processes and service handlers are not constrained to be Clone, and even if they were, we wouldn't want to call clone before handling each message, because that creates a copy of internal state which could be very large. However, we still need a way to recover state in case a panic occurs. We can't just pick up where we left off, as the internal state of the process or service handler may be inconsistent. I think we want to just re-run the init functions for both processes and service handlers so that they reinitialize their state. Alternatively, we could add a new restart function for each process or service handler to implement, but I think this just increases the API surface of the traits for no appreciable gain.
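A minimal sketch of the "re-run init after a panic" idea, assuming a simplified trait. The Process trait, Counter type, and handle_or_reinit helper below are illustrative only, and the init and handle signatures do not match rabble's.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical, simplified trait; rabble's real signatures differ.
trait Process {
    /// (Re)initializes internal state.
    fn init(&mut self);
    fn handle(&mut self, msg: i32);
}

/// Example process that mutates its state before it panics.
struct Counter {
    seen: Vec<i32>,
}

impl Process for Counter {
    fn init(&mut self) {
        self.seen.clear();
    }

    fn handle(&mut self, msg: i32) {
        self.seen.push(msg); // state is already modified at this point...
        if msg < 0 {
            panic!("negative message"); // ...when the handler panics
        }
    }
}

/// Handles one message. If the handler panics, the possibly inconsistent
/// state is thrown away by re-running `init`; no clone is taken up front.
/// Returns `true` if the message was handled without a panic.
fn handle_or_reinit<P: Process>(process: &mut P, msg: i32) -> bool {
    let ok = panic::catch_unwind(AssertUnwindSafe(|| process.handle(msg))).is_ok();
    if !ok {
        process.init();
    }
    ok
}
```

AssertUnwindSafe is used here precisely because the state is reset after a panic rather than reused. If init itself panics, that panic propagates to the caller.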
