Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

erpc:call behavior is different on local and remote nodes #8641

Open
juhlig opened this issue Jul 3, 2024 · 21 comments · May be fixed by #8642
Open

erpc:call behavior is different on local and remote nodes #8641

juhlig opened this issue Jul 3, 2024 · 21 comments · May be fixed by #8642
Assignees
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments

@juhlig
Copy link
Contributor

juhlig commented Jul 3, 2024

Describe the bug
erpc:call usually runs the given function in a spawned process. However, if the given Node is the local node and if the given Timeout is infinity (implicit in erpc:call/2 and /4), an optimization is used that instead uses erlang:apply:

otp/lib/kernel/src/erpc.erl

Lines 255 to 260 in d05de4c

call(N, M, F, A, infinity) when node() =:= N, %% Optimize local call
is_atom(M),
is_atom(F),
is_list(A) ->
try
{return, Return} = execute_call(M,F,A),

otp/lib/kernel/src/erpc.erl

Lines 1267 to 1270 in d05de4c

-doc false.
execute_call(M,F,A) ->
{return, apply(M, F, A)}.

Using apply means that the given function is executed in the context of the process calling erpc:call, which can have a row of unintended consequences:

  • The function could accidentially corrupt the state or interfere with the workings of the calling process, eg by modifying the process dictionary or private/protected ets tables, closing files or sockets, stealing messages, changing the trap_exit flag, linking/monitoring (or unlinking/demonitoring) other processes, etc. All of this can not happen if the function executes in a separate process.
  • The function could cause memory leaks by opening but not closing files or sockets, creating ets tables, etc. If the function executes in a separate process, it can create quite a mess which would all be neatly cleaned up when it finishes. When executed in the calling process, the mess remains.
  • When the calling process exits (eg via a link, explicit exit/kill, etc) while executing the function, the function is basically interrupted at whatever it is doing. When executed in a separate process, it runs through to the end, no matter what happens to the calling process.

To Reproduce
Exemplified by creating an private ets table. The distinction between executing the function in the calling process (1> and 2>) vs a separate one (3> and 4>) is forced by the timeout.

1> erpc:call(node(), fun() -> ets:new(foo, [named_table, private]) end, infinity).
foo
2> ets:tab2list(foo).
[]
3> erpc:call(node(), fun() -> ets:new(bar, [named_table, private]) end, 1000).
bar
4> ets:tab2list(bar).
** exception error: bad argument
     in function  ets:match_object/2
        called as ets:match_object(bar,'_')
        *** argument 1: the table identifier does not refer to an existing ETS table
     in call from ets:tab2list/1 (ets.erl, line 2266)

As can be seen at 2>, the ets table created in the call at 1> still exists, and is also accessible by the calling process. Conversely at 4>, the ets table created in the call at 3> is gone.

Exemplifying the interruption of the function when the calling process crashes. A function sets a timer to kill itself after 1s, then uses erpc:call to execute a function which waits 2s before sending done back to the shell.

1> Self = self().                                                                                        
<0.81.0>
2> spawn(fun() -> timer:kill_after(1000, self()), erpc:call(node(), fun() -> timer:sleep(2000), Self ! done end, infinity) end).
<0.108.0>
3> receive done -> ok after 5000 -> error end.                                                                                  
error
4> spawn(fun() -> timer:kill_after(1000, self()), erpc:call(node(), fun() -> timer:sleep(2000), Self ! done end, 5000) end).
<0.104.0>
5> receive done -> done after 5000 -> error end.
done

As can be seen at 3>, the done message never arrives when the calling process exits in the wait period of the function that was started at 2>. Conversely, at 5> we get the done message even though the calling process exited.

Expected behavior
erpc call should behave the same, no matter the node or timeout. Specifically, it should always execute the given function in a separate process.

Affected versions
OTP 27, but probably going back all the way to OTP 23 when erpc was introduced.

Additional context
The optimization seems pointless to me. When the function is executed at the local node, it already is fast, no need to try to shave off a few microseconds. Also, it only takes place when the given Timeout is infinity, which indicates that the caller is prepared to wait for however long it takes, not in a hurry.

@juhlig juhlig added the bug Issue is reported as a bug label Jul 3, 2024
@juhlig juhlig linked a pull request Jul 3, 2024 that will close this issue
@mikpe
Copy link
Contributor

mikpe commented Jul 3, 2024

The documentation explicitly warns that no assumptions can be made about the process in which the function is called. There is no bug here.

(We've been bitten by this ourselves, but that was our fault.)

@essen
Copy link
Contributor

essen commented Jul 3, 2024

Running the function in the context of the calling process is quite unexpected for RPC. If there is one expectation one should be able to have it's that the code will execute in a separate process than the caller. As it stands it is unfortunately dangerous to use erpc if Node =:= node().

@dgud
Copy link
Contributor

dgud commented Jul 3, 2024

Mnesia uses this, i.e. it doesn't need to check that the table is on a local node when reading the ets table for example.
You can always use (e)rpc and know that you don't get the performance penalty for spawning a process.

@essen
Copy link
Contributor

essen commented Jul 3, 2024

It makes sense to want to fast-track if you know nothing can go wrong. But not otherwise. If in the executed function I open a file and have a crash after that, on the local node the file will remain open until the calling process exits. If I'm not aware of the subtleties of erpc, and we weren't until today, I'll be bitten by this. If I want to avoid this problem I'm not sure what I should do besides using a very large timeout.

The fast-track is there for a good reason but it should be requested explicitly and the default should be safe not fast. The current default is surprising and error-prone.

@max-au
Copy link
Contributor

max-au commented Jul 3, 2024

Both parties are somewhat right:

  1. If mnesia needs this specific optimisation, it can have a basic wrapper over erpc:call, for local node and infinity timeout.
  2. It is unusual to have infinity timeout for any RPC, and probably code that expects isolation should not be using infinity anyway.

Generally, I'm inclined to agree with @juhlig and apply the suggested change, together with documentation. However it might be more complex in terms of compatibility, because it changes existing contract (existing users expect to "be bitten by that").

@mikpe
Copy link
Contributor

mikpe commented Jul 3, 2024

I don't think there is a one-size-fits-all definition of "safe" here.

The "obvious" approach of always using a fresh process to call the target function and then terminate is not safe. I've seen rpcs cause failures exactly due to that. It all depends on what the target function does.

So the options then would be to (1) add options to (e)rpc to provide additional guarantees about the target execution environment, or (2) you establish that in your target function (or a wrapper for it).

(1) is doable but takes time to become widely available, (2) you can do today.

@essen
Copy link
Contributor

essen commented Jul 3, 2024

There's no blocker for us, we just wanted to bring this up as it was surprising (even if it is documented). We can set a large enough integer timeout explicitly and avoid this issue.

It is unusual to have infinity timeout for any RPC

It is both unusual... and the default! Unfortunately.

@juhlig
Copy link
Contributor Author

juhlig commented Jul 4, 2024

I have no strong feelings about this matter, as @essen said, we can work around it one way or another.

So, just a few IMOs and musings regarding the comments.

The documentation explicitly warns that no assumptions can be made about the process in which the function is called. There is no bug here.

(We've been bitten by this ourselves, but that was our fault.)

I know, however, I would not have taken the meaning of this paragraph as far as "... it may even be the calling process itself". The fact that you (and probably others) needed to be bitten to realize this shows that I'm not alone in this regard.

Mnesia uses this, i.e. it doesn't need to check that the table is on a local node when reading the ets table for example.
You can always use (e)rpc and know that you don't get the performance penalty for spawning a process.

So Mnesia does make a very strong assumption about the process in which the function is called, right? This is pretty dangerous IMO, the circumstances in which the calling process is the process executing the function are undocumented.

However it might be more complex in terms of compatibility, because it changes existing contract (existing users expect to "be bitten by that").

I wouldn't say that there is a contract. Mnesia relies on a little-known internal fact here, which actually violates what the documentation urges, ie to not make any assumptions regarding the executing process.
I also wouldn't say that users expect to be bitten by that. Some users were bitten by that, now they know.

The "obvious" approach of always using a fresh process to call the target function and then terminate is not safe.

It depends on what safeties you expect. A safety I would expect is that it is not possible that the function may accidentially corrupt or leak stuff into the calling process if the stars are right (ie, local node and finity timeout). This safety can be achieved by always using a fresh process.
It is ok if the function itself fails, or causes failures by whatever it explicitly does. But the calling process should not be part of this.

It is both unusual... and the default! Unfortunately.

Indeed. Skimming the results of a quick grep, I found only 2 instances of (e)rpc:call in OTP that do provide a timeout, everything else uses the /2//4 function which use the default of infinity.

@rickard-green
Copy link
Contributor

The documentation explicitly warns that no assumptions can be made about the process in which the function is called. There is no bug here.
(We've been bitten by this ourselves, but that was our fault.)

I know, however, I would not have taken the meaning of this paragraph as far as "... it may even be the calling process itself". The fact that you (and probably others) needed to be bitten to realize this shows that I'm not alone in this regard.

The following note is present in the erpc:call() documentation. That is, it is quite explicit that this can be the case:

You cannot make any assumptions about the process that will perform the apply(). It may be the calling process itself, a server, or a freshly spawned process.

This has also been the case for the legacy rpc:call() since dawn of time.

@juhlig
Copy link
Contributor Author

juhlig commented Jul 4, 2024

The following note is present in the erpc:call() documentation. That is, it is quite explicit that this can be the case:

You cannot make any assumptions about the process that will perform the apply(). It may be the calling process itself, a server, or a freshly spawned process.

Ok, I stand corrected.

@Maria-12648430
Copy link
Contributor

This has also been the case for the legacy rpc:call() since dawn of time.

This does't mean that it is a good thing 😐
Especially the possible pollution of the calling process sounds a bit dangerous to me.

@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Jul 8, 2024
@Maria-12648430
Copy link
Contributor

Maria-12648430 commented Jul 9, 2024

Another option which would not break any existing code (like the linked PR #8642 could) would be to provide a way to force the call to be always executed in a separate process, @mikpe already pointed this out. This could be done by way of a new call-like function (naming would be hard here), or by allowing an alternative to the infinity timeout value here (say, undefined) which would behave like infinity but not go the apply route (documenting this would be somewhat awkward I guess).

@mikpe
Copy link
Contributor

mikpe commented Jul 9, 2024

Cleaner than messing with "magic" timeout values would be to add an Options parameter, where you could pass e.g. {always_spawn, true} to get that behaviour without breaking existing code. With that in place it would also be possible to change the default since code could pass {always_spawn, false} to restore the current behaviour.

@Maria-12648430
Copy link
Contributor

Of course. But that would put two semantics on call/3,4,5, where the last arg could be either a timeout or options. Personally, I don't like that much. OTOH, I don't like "magic" (as you put it) much, either 😅

@Maria-12648430
Copy link
Contributor

Specifically...

  • call/2 would remain unchanged
  • call/3 could be either call(Node, Fun, Timeout) or call(Node, Fun, Options)
  • call/4 could be either call(Node, M, F, A) or call(Node, Fun, Timeout, Options)
  • call/5 could be either call(Node, M, F, A, Timeout) or call(Node, M, F, A, Options)
  • call/6 would have to be introduced to be call(Node, M, F, A, Timeout, Options)

@mikpe
Copy link
Contributor

mikpe commented Jul 9, 2024

You don't have to change call/[3-5], it's enough to add call/6. Having said that, I could live with passing either Timeout or Options as the last parameter in call/[3-5] as their types are disjoint. Or you add call_opt/5 where Timeout is passed inside Options.

But these are minor details that could be ironed out during review. It is however clear that there is a way to introduce new behaviour in erpc without breaking existing code.

@juhlig
Copy link
Contributor Author

juhlig commented Jul 9, 2024

Ok, sounds good. I'll change the linked PR accordingly.

I'll go the call_opt(Node, Fun, Options)/call(Node, M, F, A, Options) route, the other is practically undocumentable with the new system 🙄

@Maria-12648430
Copy link
Contributor

You cannot make any assumptions about the process that will perform the apply(). It may be the calling process itself, a server, or a freshly spawned process.

@rickard-green, what does "a server" mean here? Is this some legacy in the documentation from the old rpc module? I understand that in there gen_servers were involved, but in erpc it is only either a spawned or the calling process, no servers anywhere (AFAICS).

@mikpe
Copy link
Contributor

mikpe commented Jul 10, 2024

I'm not Rickard, but there is or used to be a rex server process on each node to handle incoming RPCs.

@Maria-12648430
Copy link
Contributor

I'm not Rickard

Good enough 😁 j/k

but there is or used to be a rex server process on each node to handle incoming RPCs.

There still is, this seems to be the server involved in the old rpc. However, this doesn't seem to be used in erpc, where the reference to "a server" is.

@juhlig
Copy link
Contributor Author

juhlig commented Jul 10, 2024

#8642 has been updated according to recent suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants