Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when fetching from slow HTTP server: shotgun:wait_response/3 not exported #96

Closed
kennethlakin opened this issue Sep 19, 2015 · 6 comments
Labels

Comments

@kennethlakin
Copy link
Contributor

This bug can be triggered by doing the following:

  • Find a HTTP server that takes many seconds to connect, or simply find one with a single IP address, and add an iptables DROP rule for that server
  • Compile shotgun e5ae062
  • Start an erlang shell, passing in the proper -pa args
  • Run the following, replacing very.slow.website.or.blocked.website with the FQDN or IP of the server in question:
application:ensure_all_started(shotgun).
{ok, Pid}=shotgun:open("very.slow.website.or.blocked.website", 80).
TestFun=fun(F) ->
            case shotgun:get(Pid, "/") of 
              {ok, Response} -> io:format("ok~n");
              {error, {timeout, _}} -> F(F)
            end
        end.
TestFun(TestFun).
f().

Lightly redacted, but otherwise complete conversation with an Erlang shell follows:

Erlang/OTP 18 [erts-7.0] [source] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.0  (abort with ^G)
(erl-node@host)1> application:ensure_all_started(shotgun).
{ok,[ranch,crypto,cowlib,asn1,public_key,ssl,gun,shotgun]}
(erl-node@host)2> {ok, Pid}=shotgun:open("very.slow.website.or.blocked.website", 80).
{ok,<0.68.0>}
(erl-node@host)3> TestFun=fun(F) ->
(erl-node@host)3>             case shotgun:get(Pid, "/") of 
(erl-node@host)3>               {ok, Response} -> io:format("ok~n");
(erl-node@host)3>               {error, {timeout, _}} -> F(F)
(erl-node@host)3>             end
(erl-node@host)3>         end.
#Fun<erl_eval.6.54118792>
(erl-node@host)4> TestFun(TestFun).

=ERROR REPORT==== 19-Sep-2015::04:13:11 ===
** State machine <0.68.0> terminating 
** Last message in was {'$gen_sync_event',
                           {<0.41.0>,#Ref<0.0.2.128>},
                           {get,{"/",[],[]}}}
** When State == wait_response
**      Data  == #{async => false,
                   async_mode => binary,
                   buffer => <<>>,
                   data => <<>>,
                   from => {<0.41.0>,#Ref<0.0.2.122>},
                   handle_event => undefined,
                   headers => undefined,
                   pid => <0.69.0>,
                   responses => {[],[]},
                   status_code => undefined,
                   stream => #Ref<0.0.2.123>}
** Reason for termination = 
** {'function not exported',
       [{shotgun,wait_response,
            [{get,{"/",[],[]}},
             {<0.41.0>,#Ref<0.0.2.128>},
             #{async => false,
               async_mode => binary,
               buffer => <<>>,
               data => <<>>,
               from => {<0.41.0>,#Ref<0.0.2.122>},
               handle_event => undefined,
               headers => undefined,
               pid => <0.69.0>,
               responses => {[],[]},
               status_code => undefined,
               stream => #Ref<0.0.2.123>}],
            []},
        {gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,518}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
** exception error: no case clause matching {error,{{undef,[{shotgun,wait_response,
                                                                     [{get,{"/",[],[]}},
                                                                      {<0.41.0>,#Ref<0.0.2.128>},
                                                                      #{async => false,
                                                                        async_mode => binary,
                                                                        buffer => <<>>,
                                                                        data => <<>>,
                                                                        from => {<0.41.0>,#Ref<0.0.2.122>},
                                                                        handle_event => undefined,
                                                                        headers => undefined,
                                                                        pid => <0.69.0>,
                                                                        responses => {[],[]},
                                                                        status_code => undefined,
                                                                        stream => #Ref<0.0.2.123>}],
                                                                     []},
                                                            {gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,518}]},
                                                            {proc_lib,init_p_do_apply,3,
                                                                      [{file,"proc_lib.erl"},{line,239}]}]},
                                                    {gen_fsm,sync_send_event,
                                                             [<0.68.0>,{get,{"/",[],[]}},5000]}}}
(erl-node@host)5> f().
ok
(erl-node@host)6> 

After making and exporting this zero-effort change in shotgun.erl

wait_response(Event, _, StateData) ->
  wait_response(Event, StateData).

I still get an error, which is as follows:

=ERROR REPORT==== 19-Sep-2015::04:26:54 ===
** State machine <0.68.0> terminating 
** Last message in was {'$gen_sync_event',
                           {<0.41.0>,#Ref<0.0.2.128>},
                           {get,{"/",[],[]}}}
** When State == wait_response
**      Data  == #{async => false,
                   async_mode => binary,
                   buffer => <<>>,
                   data => <<>>,
                   from => {<0.41.0>,#Ref<0.0.2.122>},
                   handle_event => undefined,
                   headers => undefined,
                   pid => <0.69.0>,
                   responses => {[],[]},
                   status_code => undefined,
                   stream => #Ref<0.0.2.123>}
** Reason for termination = 
** {unexpected,{get,{"/",[],[]}}}
** exception error: no case clause matching {error,{{unexpected,{get,{"/",[],[]}}},
                                                    {gen_fsm,sync_send_event,
                                                             [<0.68.0>,{get,{"/",[],[]}},5000]}}}

If I handle that case in the test program, then it looks like either shotgun or gun has crashed, as every subsequent call to shotgun:get returns a tuple that looks like:

{error,{noproc,{gen_fsm,sync_send_event,
                                   [<0.69.0>,{get,{"/",[],[]}},5000]}}}
@kennethlakin
Copy link
Contributor Author

I ran into this because I work with an Internet link that is sometimes flaky. I would like to have a chance to detect a bunch of failed connections and restart -just- the link, rather than having to contain the crashing client, and restart it and the Internet link.

(httpc:request also behaves badly in at least one circumstance; hanging forever despite being handed a timeout value. It is this behavior that set me on the search for a replacement.)

@jfacorro
Copy link
Contributor

jfacorro commented Oct 8, 2015

@kennethlakin Sorry for the late response. This is definitely a bug, but I'm not sure what the correct behavior should be though. Do you have any suggestions?

@jfacorro jfacorro added the bug label Oct 8, 2015
kennethlakin added a commit to kennethlakin/shotgun that referenced this issue Oct 11, 2015
This commit does two things:

* Addresses upstream issue inaka#96.
* Changes shotgun:open to optionally accept a map containing Ranch transport
  options (keyed to transport_opts) and/or a gun:await_up timeout (keyed to
  timeout). The default values are [] and 5000, respectively.

Issue inaka#96 is partially caused by calling gun:request before gun:open has
succeeded. shotgun:open now *blocks* until either gun establishes a connection,
or the user-specified timeout.

If the timeout is reached, the FSM will be stopped with reason
'gun_open_timeout'. If gun fails while attempting to establish a connection,
the FSM will be stopped with reason 'gun_open_failed'. See the comments near
the code that handles this stuff for some somewhat important information.

In addition to the changes surrounding the use of gun:await_up, one can now
pass options through to the underlying Ranch transport. This allows you to
-for instance- bind a particular shotgun connection to a particular interface
on a multi-homed machine.

Existing users of open/2 or open/3 will not be broken by this change.
@kennethlakin
Copy link
Contributor Author

This is definitely a bug, but I'm not sure what the correct behavior should be though. Do you have any suggestions?

I would expect shotgun:request to return an error of some kind, rather than having the FSM blow up. How should this be done? Sadly, I have no idea. I spent a little bit of time looking at both shotgun and gun's code, before coming up with a partial solution that does _not_ solve the root problem.

Figuring out the root cause of the error I reported will take a fair bit of work. When I was investigating, it looked like part of the problem was related to calling gun:request before gun:open had succeeded. Calling shotgun:request a second time after the previous request failed with a timeout causes the failure I reported.

I have made some changes that make it easier to fail to call shotgun:request before gun:open has succeeded:

shotgun:open now blocks on gun:await_up to ensure that gun:open succeeds. If gun:open fails shotgun:open returns {error, gun_open_failed}. If gun:open fails to complete before the specified timeout, shotgun:open returns {error, gun_open_timeout}. I'm not sure how useful this level of error reporting granularity will turn out to be, but a failure of gun to connect does ensure that the shotgun FSM shuts down.

PR #99 contains these changes, along with a few other useful modifications.

I hope to soon get some extra time, so I can see if I can strike at the root cause of the problem.

@kennethlakin
Copy link
Contributor Author

Okay. This might already be obvious, but here's what I've found.

Maybe I'm wrong, but the root cause appears to be our ability to send HTTP verb events to the shotgun FSM when it's in a state other than at_rest. It looks like the FSM-killing crash I reported happens when the FSM receives a GET request while it is in the wait_response state. The first crash I reported happens when the FSM gets a GET request while it's in the wait_response state. I suspect that if we were to send a GET request while we were in any other state (except for at_rest), we would just crash. (It looks like the other state handlers have function heads that only cover gun messages or 'DOWN' messages.)

So, I'm just spitballing here, and haven't given this a lot of thought, but what if these changes were made?

  • Add function heads to each state (that's not at_rest) that handle all of the available HTTP verbs.
    • In these function heads, reply to the sender with an error that tells the sender to wait for a bit and retry the request, 'cause we're not in a state to handle it right now.
  • Modify request/6 to understand this error and loop (maybe with a small sleep) until it gets a success reply.

Alternatively, what if those newly-added function heads appended the request to a request list in the State variable? at_rest could then be modified to check this list and -I guess- drain it before handling any other tasks. I can't immediately figure out how this would work, seeing as how shotgun:request and the HTTP verb calls blocks and returns the result. Maybe they it would perform an async call internally, and wait for wait_response to send the result? Perhaps at_rest would be in charge of populating the from variable in the FSM state from the oldest request in the requests list in the FSM state.

Does either one of these modifications sound reasonable? Like I said, I haven't spent all that much time thinking about them.

Oh. The kinda crummy graphviz diagram that I created to help understand the flow of the shotgun FSM is attached. Rectangles represent function calls (or returning {stop, Reason}). Ellipses represent FSM states. Arrow labels represent either a high-level description of the conditions for the transition, or the most pertinent arguments or conditions that lead to the transition.

I can provide the dot file if needed, but I'm not sure that it'll be any clearer than the graph.

shotgunstates

@kennethlakin
Copy link
Contributor Author

Okay. PR #100 fixes this issue.

As alluded to in the PR, I still have some concerns surrounding leftover mailbox messages from timed-out requests, and our inability to cancel in-progress gun requests. The documentation for gun:cancel states that there's no way to cancel an HTTP 1.1 stream, but gun will stop reporting events for it if you "cancel" a stream. This could be combined with a purge of the timed-out request from the shotgun work queue and a reset back to the at_rest work checking state. Or, one could maybe handle the gen_fsm call made in shotgun:request in a child process that is either killed by the exception from the timeout, or reports back if the request succeeds.

I created and ran a little test program that exercises the additions to shotgun:open and the shotgun FSM fixes, and demonstrates that shotgun now behaves more like I would expect. Here's an annotated conversation with an erlang shell from the run. Lengthy, useless spew is elided.

1> application:ensure_all_started(shotgun).
{ok,[ranch,cowlib,asn1,public_key,ssl,gun,shotgun]}
2> %Test program executed
Got timeout from shotgun:get.
Got timeout from shotgun:get.
%At this point, I remove the iptables DROP rule. A few seconds later...
Response get!
ok
3> os:cmd("sudo -k iptables -D OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction), 
3> flush().
Shell got {#Ref<0.0.2.443>,
%spew elided
Shell got {#Ref<0.0.2.447>,
%spew elided
ok
4> f().
ok
5> application:stop(shotgun).
ok
6> 
=INFO REPORT==== 14-Oct-2015::01:18:01 ===
    application: shotgun
    exited: stopped
    type: temporary
6> 

The little test program follows:

%This is intended to be pasted into an Erlang shell
%that already has shotgun and its libraries loaded.

application:ensure_all_started(shotgun).
IptablesAction="DROP",
Host="example.com",
Timeout=timer:seconds(5), %infinity works.
%%Test that our open timeout works correctly, or test
%%that our open failure detection code works correctly:
%os:cmd("sudo -k iptables -A OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction),
case shotgun:open(Host, 80, #{timeout => Timeout}) of
  {ok, Pid} ->
    %%Test that our get failure paths work as expected.
    os:cmd("sudo -k iptables -A OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction),
    TestFun=fun(F) ->
                case shotgun:get(Pid, "/", #{}, #{timeout => Timeout}) of 
                  {ok, Response} -> 
                    io:format("Response get!~n");
                  {error, {closed, Message}} -> 
                    io:format("Gun connection was closed. Message ~p~n", [Message]);
                  {error, {timeout, _}} -> 
                    io:format("Got timeout from shotgun:get.~n"),
                    F(F)
                end
            end,
    TestFun(TestFun);
   {error, gun_open_failed} -> bah_gun_open_failed;
   {error, gun_open_timeout} -> bah_gun_open_timeout
end.
os:cmd("sudo -k iptables -D OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction),
flush().
f().
application:stop(shotgun).
application:stop(gun).

@jfacorro
Copy link
Contributor

jfacorro commented Nov 2, 2015

@kennethlakin We are closing this issue. We can address the messages from timed-out requests in issue #109.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants