pid weirdness with gproc_dist #141

Open
uwiger opened this issue Sep 5, 2017 · 7 comments

@uwiger
Owner

uwiger commented Sep 5, 2017

I came across this. I won't claim that it's a bug in Erlang - I assume there was a hiccup in the networking on my Mac - but I must say that I can't even remember having seen this before.

(I believe Hans Svensson talked about something similar in Freiburg 2007).

Running some gproc tests with two local nodes. OTP 18, for no particular reason.

Node A:

Erlang/OTP 18 [erts-7.3.1] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3.1  (abort with ^G)
(a@uwpro-2)1> application:ensure_all_started(gproc).
{ok,[gproc]}

Node B:

Eshell V7.3.1  (abort with ^G)
(b@uwpro-2)1> net:ping('a@uwpro-2').
pong
(b@uwpro-2)2> application:ensure_all_started(gproc).
{ok,[gproc]}
(b@uwpro-2)3> gproc:monitor({n,g,a},follow).
#Ref<6830.0.4.170>
(b@uwpro-2)4> flush().
Shell got {gproc,unreg,#Ref<6830.0.4.170>,{n,g,a}}
ok

Node A:

(a@uwpro-2)2> ets:tab2list(gproc).
[{{<6946.39.0>,{n,g,a}},[]},
 {{<6946.39.0>,{n,g,a}},[]},
 {{{n,g,a},n},[{<6946.39.0>,#Ref<0.0.4.170>,follow}]}]
  • Oops! The 'gproc' table is an ordered_set, so the above looks impossible.
(a@uwpro-2)4> ets:lookup(gproc, {pid(6946,39,0),{n,g,a}}).
[]
  • Uhmm ...
(a@uwpro-2)8> [A,B,C] = v(2).                   
[{{<6946.39.0>,{n,g,a}},[]},
 {{<6946.39.0>,{n,g,a}},[]},
 {{{n,g,a},n},[{<6946.39.0>,#Ref<0.0.4.170>,follow}]}]
(a@uwpro-2)9> ets:lookup(gproc,element(1,A)).
[{{<6946.39.0>,{n,g,a}},[]}]
(a@uwpro-2)10> ets:lookup(gproc,element(1,B)).
[{{<6946.39.0>,{n,g,a}},[]}]
  • Ok, so the two objects are both individually accessible.
(a@uwpro-2)11> A == B.
false

(a@uwpro-2)13> Pa = element(1,element(1,A)).
<6946.39.0>
(a@uwpro-2)14> Pb = element(1,element(1,B)).
<6946.39.0>
(a@uwpro-2)15> Pa == Pb.
false
  • The two pids are not the same.
(a@uwpro-2)18> term_to_binary(Pa).
<<131,103,100,0,9,98,64,117,119,112,114,111,45,50,0,0,0,
  39,0,0,0,0,1>>
(a@uwpro-2)19> term_to_binary(Pb).
<<131,103,100,0,9,98,64,117,119,112,114,111,45,50,0,0,0,
  39,0,0,0,0,2>>
  • The pids have different serials. Presumably this means that the nodes have disconnected and reconnected (and gproc should have cleaned up, but didn't)
(a@uwpro-2)20> Pa ! hi_a.
hi_a
(a@uwpro-2)21> Pb ! hi_b.
hi_b

Node B:

(b@uwpro-2)5> flush().
Shell got {gproc,unreg,#Ref<6830.0.4.170>,{n,g,a}}
Shell got hi_b
ok
  • So one pid worked, the other one didn't.

A bit interesting to try to reproduce, I guess.
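
The two term_to_binary/1 results can be taken apart field by field. A minimal sketch, assuming the PID_EXT layout of the Erlang external term format (tag 103, node atom, 32-bit ID, 32-bit Serial, 8-bit Creation); the module name pid_fields is made up for illustration:

```erlang
-module(pid_fields).
-export([decode/1]).

%% Decode a pid as printed by term_to_binary/1 in the transcript above.
%% Assumed layout (old PID_EXT encoding):
%%   131          version magic
%%   103          PID_EXT tag
%%   100, Len:16  ATOM_EXT tag and length of the node name
%%   Node         Len bytes of node name
%%   Id:32, Serial:32, Creation:8
decode(<<131, 103, 100, Len:16, Node:Len/binary,
         Id:32, Serial:32, Creation:8>>) ->
    #{node => binary_to_atom(Node, latin1),
      id => Id,
      serial => Serial,
      creation => Creation}.
```

Applied to the two binaries above, both decode to ID 39 and Serial 0 on node 'b@uwpro-2'. Only the trailing byte differs (1 versus 2), and in this layout that byte is the Creation field.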

@jlouis
Contributor

jlouis commented Sep 6, 2017

Two immediate observations:

  • These often happen if the internal representation of the object differs from the viewable representation. We had 0.0 and -0.0 in ETS earlier, and also -2^60 +/- 1 (can't remember exactly which it was). Transferring a Pid could mean that the internal bit structure is different, so the compare function on the ordered set runs into trouble.

  • This sounds like a bug for QuickCheck. It is hard to reproduce, but chances are that generating lots of tests will find the bug and shrink it. Once you have it shrunk, it should be possible to reproduce it.

@dszoboszlay

"The pids have different serials. Presumably this means that the nodes have disconnected and reconnected (and gproc should have cleaned up, but didn't)"

No, it means the node was stopped and restarted. It is actually quite easy to generate identical-looking pairs of pids:

erl -sname receiver
Erlang/OTP 18 Klarna-g16e0e6a [erts-7.3.1.3] [source-16e0e6a] [64-bit] [smp:4:4] [async-threads:10] [kernel-poll:false]

Eshell V7.3.1.3  (abort with ^G)
(receiver@dszoboszlay)1> register(receiver, self()).
true
(receiver@dszoboszlay)2> os:cmd("erl -sname sender -noinput -eval '{receiver, receiver@dszoboszlay} ! self(), init:stop().'").
[]
(receiver@dszoboszlay)3> os:cmd("erl -sname sender -noinput -eval '{receiver, receiver@dszoboszlay} ! self(), init:stop().'").
[]
(receiver@dszoboszlay)4> receive P1 -> P1 end.
<7411.2.0>
(receiver@dszoboszlay)5> receive P2 -> P2 end.
<7411.2.0>
(receiver@dszoboszlay)6> P1 == P2.
false

I don't know how gproc (and you) could have missed the node's restart, however.
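
For illustration, one way a registry could notice a peer going away is to subscribe to node status messages and drop the entries owned by that node. This is a rough sketch only and not gproc's actual code: the module name node_purge and both functions are hypothetical, and the only assumption taken from this thread is the {{Pid, Key}, Value} row shape seen in the ets:tab2list(gproc) output.

```erlang
-module(node_purge).
-export([watch/1, purge_node/2]).

%% Subscribe the calling process to {nodeup, Node} / {nodedown, Node}
%% messages and purge Tab every time a node goes down.
watch(Tab) ->
    _ = net_kernel:monitor_nodes(true),
    loop(Tab).

loop(Tab) ->
    receive
        {nodedown, Node} ->
            _ = purge_node(Tab, Node),
            loop(Tab);
        {nodeup, _Node} ->
            loop(Tab)
    end.

%% Delete every {{Pid, Key}, Value} row whose Pid lives on Node.
%% Rows with a non-pid first key element are left alone.
%% Returns the number of deleted rows.
purge_node(Tab, Node) ->
    ets:select_delete(Tab, [{{{'$1', '_'}, '_'},
                             [{is_pid, '$1'}, {'==', {node, '$1'}, Node}],
                             [true]}]).
```

The sketch leaves the {{Key, n}, Monitors} rows untouched, so monitor lists that mention pids from the dead node would need separate handling.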

@uwiger
Owner Author

uwiger commented Sep 7, 2017 via email

@hanssv

hanssv commented Sep 14, 2017

I wrote a simple QuickCheck property, basically starting a bunch of nodes (with a placeholder process that makes sure gproc is started) and commands to do monitor(..., follow) and ets:tab2list(gproc) on the nodes randomly.

Could not provoke the behaviour Ulf observed. Ran it for an hour, so some 100k node starts were probably made.

Then I added stopping nodes to the mix, and could (not too surprisingly) mimic the behaviour:

gproc_eqc:start_node(b, []) -> <22171.68.0>
gproc_eqc:start_node(a, [b]) -> <22070.72.0>
gproc_eqc:monitor(#node{ id = a, worker = <22070.72.0>, monitors = []}, a) ->
  #Ref<22171.852298202.3371433987.64237>
gproc_eqc:stop_node(a) -> ok
gproc_eqc:start_node(a, [b]) -> <22070.72.0>
gproc_eqc:monitor(#node{ id = a, worker = <22070.72.0>, monitors = []}, a) ->
  #Ref<22171.852298202.3371433987.64255>
gproc_eqc:check_node(#node{ id = b, worker = <22171.68.0>, monitors = []}) ->
  {ok,
     [{{<22070.72.0>, {n, g, a}}, []},
      {{<22070.72.0>, {n, g, a}}, []},
      {{{n, g, a}, n},
       [{<22070.72.0>, #Ref<22171.852298202.3371433987.64255>, follow},
        {<22070.72.0>, #Ref<22171.852298202.3371433987.64237>,  follow}]}]}

Reason:
  Post-condition failed:
  [{["<22070.72.0>"], [<22070.72.0>, <22070.72.0>]}] /= []

With the difference that I have two entries in the last element in the list... So it is not exactly the same. However, a node restart will lead to Pid reuse as observed already in Freiburg just a hair over ten years ago :-) (Time flies!!)
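
The post-condition above appears to group pids by their printed form and to fail when one printed form maps to more than one distinct pid. A rough sketch of such a check, which is not the code from the actual QuickCheck model; the module name pid_dups and the function lookalikes/1 are hypothetical:

```erlang
-module(pid_dups).
-export([lookalikes/1]).

%% Return [{Printed, Pids}] for every printed form that is shared by
%% two or more distinct pids in the input list. An empty result means
%% no look-alike pids were found.
lookalikes(Pids) ->
    %% usort drops exact duplicates, so only distinct pids are grouped.
    Groups = lists:foldl(fun group/2, #{}, lists:usort(Pids)),
    [{Printed, Ps} || {Printed, Ps} <- maps:to_list(Groups), length(Ps) > 1].

%% Add Pid to the group keyed by its printed representation.
group(Pid, Acc) ->
    Printed = pid_to_list(Pid),
    maps:put(Printed, [Pid | maps:get(Printed, Acc, [])], Acc).
```

Fed with the pids collected from the first elements of the {Pid, Key} keys in ets:tab2list(gproc), a non-empty result would flag the situation shown in this issue.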

@uwiger
Owner Author

uwiger commented Sep 14, 2017

Ah, very good! :)

I will concede that I must have forgotten to stop both nodes before making a new attempt. I see an opportunity for some more tests here, and the likelihood that gproc isn't doing what it's supposed to. The locks_leader branch, OTOH, has support for split-brain healing. I'll have to check how it behaves in the same situation.

@hanssv

hanssv commented Sep 14, 2017

I'll see if I get time to clean up that model into a non-embarrassing state; if so, I will share it and it can be extended to do something useful :-)

@hanssv

hanssv commented Sep 15, 2017

Made a pull request (#144)

For me the existing QuickCheck property failed horribly (made some changes in the pull request) but I guess that is expected?
