lag metric reporting incorrect statistics #1206
so... I took a look at sarra/get_airnow_http:
It retrieves a message from the retry queue, the retrieve fails, and the message goes back on the retry queue, every five minutes.
so that is causing the huge lag report. Question: should a 404 cause us to give up on retries? If the server doesn't have the file now, when would we expect it to have it? The new "stat" code just merged got rid of some cases where putting stuff on the retry queue is useless... this might be another. otoh, the danger is that, when a server is sick, it gives bogus 404's that eventually get fixed. Probably the most flexible way is to have a set of codes that make us retry, and others that make us give up... so it's analyst-tunable. |
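A minimal sketch of what such an analyst-tunable policy could look like (the function name and the default code sets are hypothetical, not sr3's actual implementation):

```python
# Hypothetical sketch of an analyst-tunable retry policy.
# The default code sets below are assumptions an operator could
# override per configuration.

RETRY_CODES = {408, 429, 500, 502, 503, 504}   # transient: server sick/busy
GIVE_UP_CODES = {401, 403, 404, 410}           # usually permanent

def should_retry(status_code: int) -> bool:
    """Decide whether a failed retrieve goes back on the retry queue."""
    if status_code in GIVE_UP_CODES:
        return False
    if status_code in RETRY_CODES:
        return True
    # unknown codes: retry by default, the conservative choice
    return True
```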
one problem with giving up is if it is a permission problem... sometimes, for "security" reasons, servers report permission denied as 404, to avoid bad people mapping the URLs that exist on a site (perm denied for the ones that do exist, 404 for those that don't... lets you figure out, albeit very painfully, a site map.) but 99% of the time, this type of error is permanent. |
I'm marking this as invalid, as the lag metric is actually correct; there is nothing wrong with it. It is bringing attention to some other problem, and I have a patch for that... but I dunno... it's not really a bug, and what to do is debatable. The patch should "fix" things so that 404 failures, instead of being retried, are discarded, i.e. considered permanent. |
work-arounds:
|
I found a similar lag output in the
It's reporting the lag as being over 6 days again, but the last sent file was about 14 hours ago.
|
In this case, having a retry longer than the retention period of the server is counter-productive... an indication of a mis-configuration... if the retention period (number of days files are kept on the server) is < retry_ttl, then this will happen a lot. I'm working on a 404 fix for senders anyways though. |
@andreleblanc11 So this latest patch should address the second improper(?) lag mode... the file has disappeared on our machine, so retries fail.

Changed signature of the send() routine

Hey @reidsunderland, input welcome on this aspect... the flow download() and send() routines used to return booleans indicating success or failure. But some failures are permanent and others are temporary. I changed download() a while ago to return: -1 permanent failure, 0 temporary failure, 1 success. In this branch, I changed the boolean to a tuple of booleans: (succeeded, permanently). If failed and not permanently, then queue for retry; if failed permanently, then discard (self.worklist.rejected.)

With the above patch, the signature of the send() routine is changed from bool=ok (aka succeeded or not) to (ok, permanent): both bools, indicating that the failure can be temporary (so it gets queued for retry) or permanent (so it gets discarded.) This should, but does not yet, change the signature of plugins to support the same thing... ugh... did it half-way... need to discuss. Anyways, I changed some flowcb signatures, but the thing that calls them wasn't changed, so I changed it back. The root thing is... I changed the download() signature... and I'm wondering:
|
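For concreteness, here is a rough sketch of the (ok, permanent) contract described in the comment above. The class, the _transfer helper, and the worklist field names are illustrative assumptions; only the tuple shape and the retry/discard dispatch come from the patch description:

```python
class Sender:
    def send(self, msg):
        """Return (ok, permanent): ok=True on success; when ok is False,
        permanent=True means discard the message, False means retry later."""
        try:
            self._transfer(msg)          # hypothetical transfer helper
            return (True, False)
        except FileNotFoundError:        # file vanished locally: permanent
            return (False, True)
        except ConnectionError:          # server unreachable: temporary
            return (False, False)

    def dispatch(self, msg, worklist):
        # mirrors the dispatch described above
        ok, permanent = self.send(msg)
        if ok:
            worklist.ok.append(msg)
        elif permanent:
            worklist.rejected.append(msg)    # discarded, never retried
        else:
            worklist.failed.append(msg)      # queued for retry
```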
I agree that download and send should be consistent. One argument in favour of using integer return values in this case is that plugins don't need to be updated unless they need to return permanent failure? And maybe in some cases it would be useful to return a report code? Return a negative code for failures (e.g. -404, -403, etc.) and any positive code is a success (e.g. 200). If we use |
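A sketch of that integer convention (the handler and the code set are hypothetical; the -404/200 idea is from the comment above):

```python
GIVE_UP_CODES = {401, 403, 404, 410}     # assumed set of permanent failures

def handle(code, msg, worklist):
    """code > 0: success report code (e.g. 200);
    code < 0: failure report code (e.g. -404)."""
    if code > 0:
        worklist.ok.append(msg)
    elif -code in GIVE_UP_CODES:
        worklist.rejected.append(msg)    # permanent: discard
    else:
        worklist.failed.append(msg)      # temporary: retry later
```

One nice property of this convention: since Python's bool is a subclass of int, a legacy plugin returning True (== 1) still counts as success, and False (== 0) falls through to the retry path, so existing plugins keep their old behaviour unchanged.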
I like using -num to communicate the error code... |
Another option... maybe the right thing to do is actually to change how lag is calculated, and exclude retries? but if they all fail, there is no lag? dunno... That does not sound right. |
does the lag calculation also include rejected messages? |
lag is calculated in flowcb/log.py/after_accept() ... it goes over the incoming list at that point, which is after the accept/reject clauses have been evaluated... so my guess is rejected messages will not be counted. |
oh... and log is at the end of after_accept... so exclusions via duplicate suppression will also not be counted. |
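A simplified sketch of that calculation, based only on the description above (not copied from flowcb/log.py; the pubTime format is an assumption about sr3's v03 messages):

```python
import time
from datetime import datetime, timezone

def pubtime_to_epoch(pubtime: str) -> float:
    """Assumed sr3 pubTime shape: '20240101T120000' plus optional fraction."""
    return datetime.strptime(pubtime[:15], "%Y%m%dT%H%M%S").replace(
        tzinfo=timezone.utc).timestamp()

def after_accept(worklist):
    # runs after the accept/reject clauses, so rejected (and
    # duplicate-suppressed) messages never appear in worklist.incoming here
    now = time.time()
    for msg in worklist.incoming:
        lag = now - pubtime_to_epoch(msg['pubTime'])
        # lagMax / lagTotal statistics would accumulate from 'lag' here
```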
Looking at sarra/get_airnow ... the original error is that the file is bigger than what the poll saw, so it is erroring on a size mismatch:
So we can just put AcceptSizeWrong on. |
Also, the retrieve path involves today, so we know any links won't be any good after at most 24 hours... so lowering retry_ttl to at most 24 hours would be helpful as well. |
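Putting those two suggestions together, the configuration change might look like this (a sketch: option spellings as mentioned in the comments, and the h duration suffix is assumed):

```
# accept files whose size differs from what the poll advertised
acceptSizeWrong on

# retrieval links embed today's date and go stale within a day,
# so don't keep retrying longer than that
retry_ttl 24h
```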
Sometimes, some configurations will report a high lag time, which triggers the lag state in sr3 status.

Example

(example status output not captured)

The first value looks to make sense. The other 4 values span from 3 to 6 days, which doesn't make sense when looking at their last values.

So far this problem looks to only be affecting sarra components. Still unsure why.

We checked the output of sr3 dump; it also looked to report a similar story. lagMax and lagTotal have huge values.

Work around

The work-around for this is to increase the value of runStateThreshold_lag to an astronomical value on each of the configurations, so that the status is never reported.