Implement supervision of processes and services via `catch_unwind` #7

andrewjstone · 2016-12-05T22:12:27Z

https://doc.rust-lang.org/std/panic/fn.catch_unwind.html

matklad · 2017-05-02T10:05:18Z

Just some random idea, feel free to ignore if it doesn't fit the project's intended use case!

`catch_unwind` does not provide the level of fault isolation that BEAM provides. For example, an actor could eat all the CPU with `loop {}`, or it could segfault in unsafe code or when calling an external library. Or it could abort on OOM. It looks like the only way to provide strong isolation in Rust is to spawn a separate OS-level process. It'd be cool to be able to spawn important actors like supervisors in a separate process!

andrewjstone · 2017-05-02T16:27:56Z

I think that this is out of scope for rabble. External supervisors such as supervisord, upstart, systemd, etc. can already restart the process in case of OOM or a segfault. For an infinite loop, I'm not sure how it could even be detected. Isn't this the halting problem? Unfortunately, I think the best way to prevent that is to simply tell users: "Don't do that". For CPU-heavy tasks, users can already spawn workers in a separate thread. Lightweight processes are really only useful for IO-bound work. Therefore any loops should immediately be suspect.

The goal I had in mind for `catch_unwind` was only to catch Rust panics, such as those from failed unwraps, inside a process or thread so that it could be restarted without restarting the whole OS process. As you have pointed out, this isn't ideal, but I think it's the best that can be reasonably achieved with what Rust provides today.
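For reference, a minimal sketch of that idea: a panic from a failed `unwrap()` is caught and turned into an `Err` instead of taking down the OS process. The names `handle` and `handle_guarded` are hypothetical and not part of rabble, and this only works when the binary is built with the default `panic = "unwind"` strategy.

```rust
use std::panic;

/// Hypothetical message handler: panics when the message is not a number.
fn handle(msg: &str) -> i64 {
    msg.parse::<i64>().unwrap()
}

/// Runs the handler for one message and converts a panic into an `Err`
/// carrying the panic message.
fn handle_guarded(msg: &str) -> Result<i64, String> {
    panic::catch_unwind(|| handle(msg)).map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}
```

Calling `handle_guarded("42")` returns `Ok(42)`, while `handle_guarded("oops")` returns an `Err` and the caller keeps running. The default panic hook still prints the panic message to stderr.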

andrewjstone · 2017-05-26T22:19:30Z

Whoops. Looks like I closed this by accident.

andrewjstone · 2017-05-26T22:56:27Z

@z1mvader mentioned he was willing to help with implementing supervision in rabble. I think it's virtually impossible to provide full supervision due to the lack of isolation provided by Rust. However, it is easy to catch panics and restart any lightweight process or service that crashes due to a bad `unwrap()` call. Since the guarantees provided by Rust are much weaker than Erlang's, and actors don't have a hierarchy in rabble, a simple top-level config is probably enough for supervision. If any actor crashes more than 3 times in X seconds, then let the panic crash the program. Alternatively, we could just always restart the process.
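A rough sketch of what such a top-level policy could look like, assuming a sliding time window. `RestartPolicy` and its fields are hypothetical names for illustration and do not exist in rabble.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical top-level restart policy; not part of rabble's API.
struct RestartPolicy {
    max_crashes: usize,
    window: Duration,
    crashes: VecDeque<Instant>,
}

impl RestartPolicy {
    fn new(max_crashes: usize, window: Duration) -> Self {
        RestartPolicy { max_crashes, window, crashes: VecDeque::new() }
    }

    /// Records a crash at `now`. Returns `true` if the actor may be
    /// restarted, or `false` if it has crashed more than `max_crashes`
    /// times within `window` and the panic should be allowed to propagate.
    fn record_crash(&mut self, now: Instant) -> bool {
        // Forget crashes that have fallen out of the window.
        while let Some(&oldest) = self.crashes.front() {
            if now.duration_since(oldest) > self.window {
                self.crashes.pop_front();
            } else {
                break;
            }
        }
        self.crashes.push_back(now);
        self.crashes.len() <= self.max_crashes
    }
}
```

Passing the timestamp in as an argument, rather than calling `Instant::now()` inside, keeps the policy easy to test. The "always restart" alternative amounts to ignoring the return value.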

I think this can be implemented simply, at least for lightweight processes, by wrapping the executor run loop in `panic::catch_unwind`.

You should be able to do the same thing for thread-based services by wrapping their loop as well: https://github.com/andrewjstone/rabble/blob/master/src/service/service.rs#L51

andrewjstone · 2017-05-26T23:01:54Z

Hmm, maybe it's not that straightforward. I think that in order to figure out which process is crashing in the executor, you'd need to wrap this line: https://github.com/andrewjstone/rabble/blob/master/src/executor/executor.rs#L156

You'd need to save the state of the process prior to running it, and then restore it after a crash. I'm not sure what kind of perf penalty this would have, in addition to the penalty of calling `catch_unwind` for every message handled. Any perf overhead would have to be minimized to make this worthwhile.

ghost · 2017-05-26T23:15:17Z

OK, I'll start working on this as soon as possible. Should I post here if I have any questions?

andrewjstone · 2017-05-26T23:19:10Z

Yeah, post any issues you have here so that we have a full history for the community. Thanks @z1mvader

andrewjstone · 2017-05-26T23:28:05Z

Note: after taking a further look, I'm not sure how doable this is with the code as is. Processes and service handlers are not constrained to be `Clone`, and even if they were, we wouldn't want to clone before handling each message, because that creates a copy of internal state which could be very large. However, we still need a way to recover state in case a panic occurs. We can't just pick up where we left off, as the internal state of the process or service handler may be inconsistent. I think we want to just re-run the init functions for both processes and service handlers so that they reinitialize their state. Alternatively, we could add a new restart function for each process or service handler to implement, but I think this just increases the API surface of the traits for no appreciable gain.
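To make the "re-run init after a panic" option concrete, here is a minimal sketch. The `Process` trait below is a simplified, hypothetical stand-in (rabble's real traits have different signatures), and `Averager` and `deliver` are made up for illustration.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for a rabble process; the real trait differs.
trait Process {
    /// Resets internal state to its initial value.
    fn init(&mut self);
    /// Handles one message; may panic.
    fn handle(&mut self, msg: u32);
}

struct Averager {
    sum: u32,
    count: u32,
}

impl Process for Averager {
    fn init(&mut self) {
        self.sum = 0;
        self.count = 0;
    }

    fn handle(&mut self, msg: u32) {
        self.count += 1;
        // Panics on a zero message, after `count` was already bumped,
        // which leaves `sum` and `count` inconsistent with each other.
        self.sum += 100u32.checked_div(msg).unwrap();
    }
}

/// Delivers one message. If the handler panics, the possibly inconsistent
/// state is discarded by re-running `init`. Returns `false` on a restart.
fn deliver<P: Process>(process: &mut P, msg: u32) -> bool {
    // `&mut P` is not `UnwindSafe`; asserting it is acceptable here only
    // because the state is reinitialized before it is used again.
    let result = panic::catch_unwind(AssertUnwindSafe(|| process.handle(msg)));
    if result.is_err() {
        process.init();
        return false;
    }
    true
}
```

No clone of the state is taken before each message, so the only per-message cost in this sketch is the `catch_unwind` call itself. The trade-off is that everything the process accumulated before the panic is lost.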

andrewjstone closed this as completed May 2, 2017

andrewjstone reopened this May 26, 2017

andrewjstone mentioned this issue May 26, 2017

Supervisors in Rabble? #25

Closed
