`erpc:call` behavior is different on local and remote nodes #8641

juhlig · 2024-07-03T15:15:11Z

Describe the bug
erpc:call usually runs the given function in a spawned process. However, if the given Node is the local node and if the given Timeout is infinity (implicit in erpc:call/2 and /4), an optimization is used that instead uses erlang:apply:

otp/lib/kernel/src/erpc.erl

Lines 255 to 260 in d05de4c

    
           call(N, M, F, A, infinity) when node() =:= N,  %% Optimize local call 
        
                                           is_atom(M), 
        
                                           is_atom(F), 
        
                                           is_list(A) -> 
        
               try 
        
                   {return, Return} = execute_call(M,F,A),

otp/lib/kernel/src/erpc.erl

Lines 1267 to 1270 in d05de4c

    
           -doc false. 
        
           execute_call(M,F,A) -> 
        
               {return, apply(M, F, A)}.

Using apply means that the given function is executed in the context of the process calling erpc:call, which can have a row of unintended consequences:

The function could accidentially corrupt the state or interfere with the workings of the calling process, eg by modifying the process dictionary or private/protected ets tables, closing files or sockets, stealing messages, changing the trap_exit flag, linking/monitoring (or unlinking/demonitoring) other processes, etc. All of this can not happen if the function executes in a separate process.
The function could cause memory leaks by opening but not closing files or sockets, creating ets tables, etc. If the function executes in a separate process, it can create quite a mess which would all be neatly cleaned up when it finishes. When executed in the calling process, the mess remains.
When the calling process exits (eg via a link, explicit exit/kill, etc) while executing the function, the function is basically interrupted at whatever it is doing. When executed in a separate process, it runs through to the end, no matter what happens to the calling process.

To Reproduce
Exemplified by creating an private ets table. The distinction between executing the function in the calling process (1> and 2>) vs a separate one (3> and 4>) is forced by the timeout.

1> erpc:call(node(), fun() -> ets:new(foo, [named_table, private]) end, infinity).
foo
2> ets:tab2list(foo).
[]
3> erpc:call(node(), fun() -> ets:new(bar, [named_table, private]) end, 1000).
bar
4> ets:tab2list(bar).
** exception error: bad argument
     in function  ets:match_object/2
        called as ets:match_object(bar,'_')
        *** argument 1: the table identifier does not refer to an existing ETS table
     in call from ets:tab2list/1 (ets.erl, line 2266)

As can be seen at 2>, the ets table created in the call at 1> still exists, and is also accessible by the calling process. Conversely at 4>, the ets table created in the call at 3> is gone.

Exemplifying the interruption of the function when the calling process crashes. A function sets a timer to kill itself after 1s, then uses erpc:call to execute a function which waits 2s before sending done back to the shell.

1> Self = self().                                                                                        
<0.81.0>
2> spawn(fun() -> timer:kill_after(1000, self()), erpc:call(node(), fun() -> timer:sleep(2000), Self ! done end, infinity) end).
<0.108.0>
3> receive done -> ok after 5000 -> error end.                                                                                  
error
4> spawn(fun() -> timer:kill_after(1000, self()), erpc:call(node(), fun() -> timer:sleep(2000), Self ! done end, 5000) end).
<0.104.0>
5> receive done -> done after 5000 -> error end.
done

As can be seen at 3>, the done message never arrives when the calling process exits in the wait period of the function that was started at 2>. Conversely, at 5> we get the done message even though the calling process exited.

Expected behavior
erpc call should behave the same, no matter the node or timeout. Specifically, it should always execute the given function in a separate process.

Affected versions
OTP 27, but probably going back all the way to OTP 23 when erpc was introduced.

Additional context
The optimization seems pointless to me. When the function is executed at the local node, it already is fast, no need to try to shave off a few microseconds. Also, it only takes place when the given Timeout is infinity, which indicates that the caller is prepared to wait for however long it takes, not in a hurry.

The text was updated successfully, but these errors were encountered:

mikpe · 2024-07-03T16:40:07Z

The documentation explicitly warns that no assumptions can be made about the process in which the function is called. There is no bug here.

(We've been bitten by this ourselves, but that was our fault.)

essen · 2024-07-03T16:56:10Z

Running the function in the context of the calling process is quite unexpected for RPC. If there is one expectation one should be able to have it's that the code will execute in a separate process than the caller. As it stands it is unfortunately dangerous to use erpc if Node =:= node().

dgud · 2024-07-03T17:28:45Z

Mnesia uses this, i.e. it doesn't need to check that the table is on a local node when reading the ets table for example.
You can always use (e)rpc and know that you don't get the performance penalty for spawning a process.

essen · 2024-07-03T18:25:45Z

It makes sense to want to fast-track if you know nothing can go wrong. But not otherwise. If in the executed function I open a file and have a crash after that, on the local node the file will remain open until the calling process exits. If I'm not aware of the subtleties of erpc, and we weren't until today, I'll be bitten by this. If I want to avoid this problem I'm not sure what I should do besides using a very large timeout.

The fast-track is there for a good reason but it should be requested explicitly and the default should be safe not fast. The current default is surprising and error-prone.

max-au · 2024-07-03T18:54:31Z

Both parties are somewhat right:

If mnesia needs this specific optimisation, it can have a basic wrapper over erpc:call, for local node and infinity timeout.
It is unusual to have infinity timeout for any RPC, and probably code that expects isolation should not be using infinity anyway.

Generally, I'm inclined to agree with @juhlig and apply the suggested change, together with documentation. However it might be more complex in terms of compatibility, because it changes existing contract (existing users expect to "be bitten by that").

mikpe · 2024-07-03T18:57:15Z

I don't think there is a one-size-fits-all definition of "safe" here.

The "obvious" approach of always using a fresh process to call the target function and then terminate is not safe. I've seen rpcs cause failures exactly due to that. It all depends on what the target function does.

So the options then would be to (1) add options to (e)rpc to provide additional guarantees about the target execution environment, or (2) you establish that in your target function (or a wrapper for it).

(1) is doable but takes time to become widely available, (2) you can do today.

essen · 2024-07-03T19:13:47Z

There's no blocker for us, we just wanted to bring this up as it was surprising (even if it is documented). We can set a large enough integer timeout explicitly and avoid this issue.

It is unusual to have infinity timeout for any RPC

It is both unusual... and the default! Unfortunately.

juhlig · 2024-07-04T07:12:10Z

I have no strong feelings about this matter, as @essen said, we can work around it one way or another.

So, just a few IMOs and musings regarding the comments.

The documentation explicitly warns that no assumptions can be made about the process in which the function is called. There is no bug here.

(We've been bitten by this ourselves, but that was our fault.)

I know, however, I would not have taken the meaning of this paragraph as far as "... it may even be the calling process itself". The fact that you (and probably others) needed to be bitten to realize this shows that I'm not alone in this regard.

Mnesia uses this, i.e. it doesn't need to check that the table is on a local node when reading the ets table for example.
You can always use (e)rpc and know that you don't get the performance penalty for spawning a process.

So Mnesia does make a very strong assumption about the process in which the function is called, right? This is pretty dangerous IMO, the circumstances in which the calling process is the process executing the function are undocumented.

However it might be more complex in terms of compatibility, because it changes existing contract (existing users expect to "be bitten by that").

I wouldn't say that there is a contract. Mnesia relies on a little-known internal fact here, which actually violates what the documentation urges, ie to not make any assumptions regarding the executing process.
I also wouldn't say that users expect to be bitten by that. Some users were bitten by that, now they know.

The "obvious" approach of always using a fresh process to call the target function and then terminate is not safe.

It depends on what safeties you expect. A safety I would expect is that it is not possible that the function may accidentially corrupt or leak stuff into the calling process if the stars are right (ie, local node and finity timeout). This safety can be achieved by always using a fresh process.
It is ok if the function itself fails, or causes failures by whatever it explicitly does. But the calling process should not be part of this.

It is both unusual... and the default! Unfortunately.

Indeed. Skimming the results of a quick grep, I found only 2 instances of (e)rpc:call in OTP that do provide a timeout, everything else uses the /2//4 function which use the default of infinity.

rickard-green · 2024-07-04T10:02:42Z

The documentation explicitly warns that no assumptions can be made about the process in which the function is called. There is no bug here.
(We've been bitten by this ourselves, but that was our fault.)

I know, however, I would not have taken the meaning of this paragraph as far as "... it may even be the calling process itself". The fact that you (and probably others) needed to be bitten to realize this shows that I'm not alone in this regard.

The following note is present in the erpc:call() documentation. That is, it is quite explicit that this can be the case:

You cannot make any assumptions about the process that will perform the apply(). It may be the calling process itself, a server, or a freshly spawned process.

This has also been the case for the legacy rpc:call() since dawn of time.

juhlig · 2024-07-04T10:57:07Z

The following note is present in the erpc:call() documentation. That is, it is quite explicit that this can be the case:

You cannot make any assumptions about the process that will perform the apply(). It may be the calling process itself, a server, or a freshly spawned process.

Ok, I stand corrected.

Maria-12648430 · 2024-07-07T13:48:32Z

This has also been the case for the legacy rpc:call() since dawn of time.

This does't mean that it is a good thing 😐
Especially the possible pollution of the calling process sounds a bit dangerous to me.

Maria-12648430 · 2024-07-09T05:17:26Z

Another option which would not break any existing code (like the linked PR #8642 could) would be to provide a way to force the call to be always executed in a separate process, @mikpe already pointed this out. This could be done by way of a new call-like function (naming would be hard here), or by allowing an alternative to the infinity timeout value here (say, undefined) which would behave like infinity but not go the apply route (documenting this would be somewhat awkward I guess).

mikpe · 2024-07-09T09:07:29Z

Cleaner than messing with "magic" timeout values would be to add an Options parameter, where you could pass e.g. {always_spawn, true} to get that behaviour without breaking existing code. With that in place it would also be possible to change the default since code could pass {always_spawn, false} to restore the current behaviour.

Maria-12648430 · 2024-07-09T09:46:31Z

Of course. But that would put two semantics on call/3,4,5, where the last arg could be either a timeout or options. Personally, I don't like that much. OTOH, I don't like "magic" (as you put it) much, either 😅

Maria-12648430 · 2024-07-09T09:55:27Z

Specifically...

call/2 would remain unchanged
call/3 could be either call(Node, Fun, Timeout) or call(Node, Fun, Options)
call/4 could be either call(Node, M, F, A) or call(Node, Fun, Timeout, Options)
call/5 could be either call(Node, M, F, A, Timeout) or call(Node, M, F, A, Options)
call/6 would have to be introduced to be call(Node, M, F, A, Timeout, Options)

mikpe · 2024-07-09T11:33:49Z

You don't have to change call/[3-5], it's enough to add call/6. Having said that, I could live with passing either Timeout or Options as the last parameter in call/[3-5] as their types are disjoint. Or you add call_opt/5 where Timeout is passed inside Options.

But these are minor details that could be ironed out during review. It is however clear that there is a way to introduce new behaviour in erpc without breaking existing code.

juhlig · 2024-07-09T14:51:33Z

Ok, sounds good. I'll change the linked PR accordingly.

I'll go the call_opt(Node, Fun, Options)/call(Node, M, F, A, Options) route, the other is practically undocumentable with the new system 🙄

Maria-12648430 · 2024-07-10T10:31:24Z

You cannot make any assumptions about the process that will perform the apply(). It may be the calling process itself, a server, or a freshly spawned process.

@rickard-green, what does "a server" mean here? Is this some legacy in the documentation from the old rpc module? I understand that in there gen_servers were involved, but in erpc it is only either a spawned or the calling process, no servers anywhere (AFAICS).

mikpe · 2024-07-10T11:39:33Z

I'm not Rickard, but there is or used to be a rex server process on each node to handle incoming RPCs.

Maria-12648430 · 2024-07-10T13:24:25Z

I'm not Rickard

Good enough 😁 j/k

but there is or used to be a rex server process on each node to handle incoming RPCs.

There still is, this seems to be the server involved in the old rpc. However, this doesn't seem to be used in erpc, where the reference to "a server" is.

juhlig · 2024-07-10T15:10:46Z

#8642 has been updated according to recent suggestions.

juhlig added the bug Issue is reported as a bug label Jul 3, 2024

juhlig linked a pull request Jul 3, 2024 that will close this issue

Introduce erpc:call_opt/3,5 #8642

Open

IngelaAndin added the team:VM Assigned to OTP team VM label Jul 8, 2024

jhogberg assigned rickard-green Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`erpc:call` behavior is different on local and remote nodes #8641

`erpc:call` behavior is different on local and remote nodes #8641

juhlig commented Jul 3, 2024 •

edited

Loading

mikpe commented Jul 3, 2024

essen commented Jul 3, 2024

dgud commented Jul 3, 2024

essen commented Jul 3, 2024

max-au commented Jul 3, 2024

mikpe commented Jul 3, 2024

essen commented Jul 3, 2024

juhlig commented Jul 4, 2024 •

edited

Loading

rickard-green commented Jul 4, 2024

juhlig commented Jul 4, 2024

Maria-12648430 commented Jul 7, 2024

Maria-12648430 commented Jul 9, 2024 •

edited

Loading

mikpe commented Jul 9, 2024

Maria-12648430 commented Jul 9, 2024

Maria-12648430 commented Jul 9, 2024

mikpe commented Jul 9, 2024 •

edited

Loading

juhlig commented Jul 9, 2024 •

edited

Loading

Maria-12648430 commented Jul 10, 2024

mikpe commented Jul 10, 2024

Maria-12648430 commented Jul 10, 2024

juhlig commented Jul 10, 2024

erpc:call behavior is different on local and remote nodes #8641

erpc:call behavior is different on local and remote nodes #8641

Comments

juhlig commented Jul 3, 2024 • edited Loading

mikpe commented Jul 3, 2024

essen commented Jul 3, 2024

dgud commented Jul 3, 2024

essen commented Jul 3, 2024

max-au commented Jul 3, 2024

mikpe commented Jul 3, 2024

essen commented Jul 3, 2024

juhlig commented Jul 4, 2024 • edited Loading

rickard-green commented Jul 4, 2024

juhlig commented Jul 4, 2024

Maria-12648430 commented Jul 7, 2024

Maria-12648430 commented Jul 9, 2024 • edited Loading

mikpe commented Jul 9, 2024

Maria-12648430 commented Jul 9, 2024

Maria-12648430 commented Jul 9, 2024

mikpe commented Jul 9, 2024 • edited Loading

juhlig commented Jul 9, 2024 • edited Loading

Maria-12648430 commented Jul 10, 2024

mikpe commented Jul 10, 2024

Maria-12648430 commented Jul 10, 2024

juhlig commented Jul 10, 2024

`erpc:call` behavior is different on local and remote nodes #8641

`erpc:call` behavior is different on local and remote nodes #8641

juhlig commented Jul 3, 2024 •

edited

Loading

juhlig commented Jul 4, 2024 •

edited

Loading

Maria-12648430 commented Jul 9, 2024 •

edited

Loading

mikpe commented Jul 9, 2024 •

edited

Loading

juhlig commented Jul 9, 2024 •

edited

Loading