
lag metric reporting incorrect statistics #1206

Open
andreleblanc11 opened this issue Sep 4, 2024 · 15 comments
Labels
  • Design: impacts API, or code structure changes
  • invalid: This doesn't seem right
  • Priority 5 - Defect: Missing and/or broken functionality - do not forget
  • v3only: Only affects v3 branches.
  • work-around: a work-around is provided, mitigating the issue.

Comments

andreleblanc11 (Member) commented Sep 4, 2024:

Sometimes, some configurations will report a high lag time, which triggers the lag state in sr3 status.

Example

Component/Config                              Processes   Connection        Lag                              Rates
                                              State   Run Retry  msg data   Que   LagMax   LagAvg  Last  %rej     pubsub   messages     RxData     TxData
sarra/get-cmems-oceanobs                      lag     1/1     0 100%   0%     0  171.24s   90.80s 2.43s  0.0%    564 B/s      0 m/s 40.3 KiB/s      0 B/s
sarra/get_airnow_http                         lag     1/1     1 100%  67%     0 541312.32s 541312.32s 4196s  0.0%      0 B/s      0 m/s      0 B/s      0 B/s
sarra/get_airnow_prov_depot                   lag     2/2    13 100%  69%     0 586328.54s 310211.80s 6940s  0.0%      0 B/s      0 m/s      0 B/s      0 B/s
sarra/get_marine_weather                      lag     1/1    53 100%  72%     0 533804.73s 212781.40s 8549s  0.0%      0 B/s      0 m/s      0 B/s      0 B/s
sarra/get_rvas_rem_pxpaz                      lag     1/1     1 100%  91%     0 270340.14s 270340.14s 6424s  0.0%      0 B/s      0 m/s      0 B/s      0 B/s

The first row looks reasonable. The other four values span from 3 to 6 days, which doesn't make sense when compared with their Last values.
So far this problem only seems to affect sarra components. Still unsure why.

We checked the output of sr3 dump and it tells a similar story: lagMax and lagTotal are huge.

"get_marine_weather" :  {"instance_pids": {"1": 16077}, "queueName": "q_WXO-DD.sarra.get_marine_weather.ddsr.cmc.ec.gc.ca", "has_state": false, "noVip": null, "instance_metrics": {"1": {"flow": {"stop_requested": false, "last_housekeeping": 1725462788.3809488, "transferConnected": true, "transferConnectStart": 1725462900.03323, "transferConnectTime": 62.7959840297699, "transferRxBytes": 0, "transferTxBytes": 0, "transferRxFiles": 0, "transferTxFiles": 0, "last_housekeeping_cpuTime": 27725.94, "cpuTime": 12.299999999999272, "next_housekeeping": 1725463088.3809488, "msgRate": 0.005914938262384488, "msgRateCpu": 0.1625919536811215, "current_sleep": 6.4}, "post.message": {"connected": true, "disconnectLast": 0, "disconnectTime": 0, "disconnectCount": 0, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 0, "txGoodCount": 0, "txBadCount": 0, "txLast": "20240904T124839.0826826096"}, "gather.message": {"connected": true, "disconnectLast": 0, "disconnectTime": 0, "disconnectCount": 0, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 0, "txGoodCount": 0, "txBadCount": 0, "brokerQueuedMessageCount": 0, "rxLast": "20240904T124838.50991869"}, "retry": {"msgs_in_download_retry": 53, "msgs_in_post_retry": 0}, 
"log": {"lagMax": 534114.6203896999, "lagTotal": 11293876.948493004, "lagMessageCount": 53, "rejectCount": 0}, "sarracenia.transfer": {"byteRateInstant": 2300926.898529939}, "status": {"mtime": 1725462900.121718}}}, "logAge": {"1": 65.95925784111023}, "metrics": {"byteRate": 0.0, "cpuTime": 12.299999999999272, "rejectCount": 0, "last_housekeeping": 1725462788.3809488, "messagesQueued": 0, "lagMean": 213092.01789609442, "latestTransfer": "8786s", "rejectPercent": 0, "transferRxByteRate": 0.0, "transferTxByteRate": 0.0, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 0, "txGoodCount": 0, "txBadCount": 0, "lagMax": 534114.6203896999, "lagTotal": 11293876.948493004, "lagMessageCount": 53, "disconnectTime": 0.0, "transferConnectTime": 62.7959840297699, "transferRxLast": 0, "transferTxLast": 0, "rxLast": 1725454119.0826826, "txLast": 1725454119.0826826, "transferRxBytes": 0, "transferRxFiles": 0, "transferTxBytes": 0, "transferTxFiles": 0, "msgs_in_post_retry": 0, "msgs_in_download_retry": 53, "brokerQueuedMessageCount": 0, "time_base": 116.35145330429077, "byteTotal": 0, "msgRate": 0.0, "msgRateCpu": 0.0, "retry": 53, "transferLast": 0, "connectPercent": 100, "byteConnectPercent": 53, "messageLast": 1725454119.0826826, "transferRxFileRate": 0.0, "transferTxFileRate": 0.0}, "missing_instances": [], "hung_instances": [], "resource_usage": {"uss": 60354560, "rss": 76558336, "vms": 325898240, "user_cpu": 27219.12, "system_cpu": 519.13}}

Work-arounds

  • Increase runStateThreshold_lag to an astronomically high value on each affected configuration, so that the lag state is never reported (see the sketch after this list).
  • Do a cleanup.
  • Lower retry_ttl so queued retries age out faster?
  • Turn retry off (prevent retries completely): if this is a poll and the data is old, it may not be worth retrieving.
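
A sketch of the first work-around in an affected configuration file (the threshold value is arbitrary, chosen only so the lag state never triggers):

    # suppress the lag state by raising the threshold far beyond any observed lag
    runStateThreshold_lag 99999999
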
andreleblanc11 added the bug, work-around, Priority 5 - Defect, and v3only labels on Sep 4, 2024
petersilva (Contributor) commented:

so... I took a look at sarra/get_airnow_http:

2024-09-04 00:02:34,498 [DEBUG] sarracenia.diskqueue msg_get_from_file DEBUG /local/home/sarra/.cache/sr3/sarra/get_airnow_http/diskqueue_work_retry_01 open read
2024-09-04 00:02:34,500 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 473281.70 ) https://s3-us-west-1.amazonaws.com//files.airnowtech.org/airnow/

It is retrieving a message from the retry queue, and then the retrieve fails, and it puts it back on the retry queue, every five minutes.

2024-09-04 00:02:34,504 [DEBUG] sarracenia.transfer.https __open__ https://s3-us-west-1.amazonaws.com///files.airnowtech.org/airnow/today/HourlyData_2024082911.dat
2024-09-04 00:02:34,772 [ERROR] sarracenia.transfer.https __open__ Download failed 4 https://s3-us-west-1.amazonaws.com//files.airnowtech.org/airnow/today/HourlyData_2024082911.dat
2024-09-04 00:02:34,772 [ERROR] sarracenia.transfer.https __open__ Server couldn't fulfill the request. Error code: 404, Not Found

so that is causing the huge lag report.

Question: should a 404 cause us to give up on retries? If the server doesn't have the file now, when would we expect it to have it? The new "stat" code just merged got rid of some cases where putting things on the retry queue is useless... this might be another. On the other hand, the danger is that when a server is sick, it gives bogus 404's that eventually get fixed.

Probably the most flexible way is to have a set of codes that make us retry, and others that make us give up... so it's analyst-tunable.
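
A minimal sketch of what an analyst-tunable policy could look like (the code sets and the should_retry helper are illustrative, not existing sarracenia options):

    # hedged sketch: classify a failed HTTP status as retryable or permanent,
    # driven by two analyst-editable sets rather than hard-coded logic.
    GIVE_UP_CODES = {401, 403, 404, 410}            # usually permanent: drop from the retry queue
    RETRY_CODES = {408, 429, 500, 502, 503, 504}    # usually transient: keep retrying

    def should_retry(status: int) -> bool:
        """True if a failed transfer with this HTTP status should go back on the retry queue."""
        if status in GIVE_UP_CODES:
            return False
        if status in RETRY_CODES:
            return True
        return True                                 # unknown codes: err on the side of retrying

    print(should_retry(404), should_retry(503))     # -> False True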

petersilva (Contributor) commented:

One problem with giving up is that it might actually be a permission problem... for "security" reasons, some servers report permission denied as 404, to stop bad actors from mapping the URLs that exist on a site (permission denied for the ones that do exist, 404 for those that don't, lets you work out a site map, albeit very painfully).
So some 404's could get fixed, and then data queued for retry would start flowing again...

But 99% of the time, this type of error is permanent.

petersilva added a commit that referenced this issue Sep 4, 2024
petersilva added the invalid label and removed the bug label on Sep 4, 2024
petersilva (Contributor) commented:

I'm marking this as invalid, as the lag metric is actually correct: there is nothing wrong with it. It is bringing attention to some other problem, and I have a patch for that... but I dunno... it's not really a bug, and what to do is debatable.

The patch should "fix" things so that 404 failures, instead of being retried, are discarded and considered permanent.
This only fixes http... other protocols would need more fixes.

petersilva (Contributor) commented:

work-arounds:

  • Do a cleanup.
  • Lower retry_ttl so queued retries age out faster?
  • Turn retry off (prevent retries completely): if this is a poll and the data is old, it may not be worth retrieving.

andreleblanc11 (Member, Author) commented Sep 6, 2024:

I found a similar lag output in the sr3 status, this time in a sender. So it's getting the files locally.

ddsr-cmc-stage02:                                    State   Run Retry  msg data   Que   LagMax   LagAvg  Last  %rej     pubsub   messages     RxData     TxData
ddsr-cmc-stage02: sender/dnd_to_ice                  lag     1/1    17 100%   0%     0 563059.95s 212969.82s >9999  0.0%      0 B/s      0 m/s      0 B/s      0 B/s

It's again reporting a lag of over 6 days, but the last file was sent about 14 hours ago.
There are also old files being retried from the retry queue, even though they no longer exist locally.

ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [ERROR] sarracenia.flow send could not chdir to /apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05 to write: [Errno 2] No such file or directory: '/apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05'
ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [WARNING] sarracenia.flow do_send sending again, attempt 2
ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [ERROR] sarracenia.flow send could not chdir to /apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05 to write: [Errno 2] No such file or directory: '/apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05'
ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [WARNING] sarracenia.flow do_send sending again, attempt 3
ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [ERROR] sarracenia.flow send could not chdir to /apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05 to write: [Errno 2] No such file or directory: '/apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05'

petersilva (Contributor) commented:

In this case, having a retry lifetime longer than the retention period of the server is counter-productive... an indication of a mis-configuration. If the retention period (number of days on the server) is less than retry_ttl, then this will happen a lot. I'm working on a 404 fix for senders anyway, though.

petersilva added a commit that referenced this issue Sep 9, 2024
petersilva (Contributor) commented:

@andreleblanc11 So this latest patch should address the second (improper?) lag mode... the file has disappeared on our machine, so retries fail.

Changed Signature of send() routine

Hey @reidsunderland, input welcome on this aspect... the flow download() and send() routines used to return booleans indicating success or failure, but some failures are permanent and others are temporary. I changed download() a while ago to return -1 for permanent failure, 0 for temporary failure, and 1 for success. In this branch, I changed send()'s boolean to a tuple of booleans: (succeeded, permanently). If it failed but not permanently, the message is queued for retry; if it failed permanently, it is discarded (self.worklist.rejected).

With the above patch, the signature of the send routine changes from a single bool (ok, i.e. succeeded or not) to (ok, permanent): both bools, indicating whether a failure is temporary (so it gets queued for retry) or permanent (so it gets discarded). A sketch of the two conventions follows the list below.

This does not (but probably should) change the signature of plugins to support the same thing... ugh... I did it half-way... need to discuss. Anyway, I changed some flowcb signatures, but the code that calls them wasn't changed, so I changed them back.

Anyways... the root thing is... I changed the download() signature... and I'm wondering:

  • maybe make download and send consistent (either -1/0/1 for both, or the tuple for both)?
  • maybe change the flowcb overrides' signatures to match as well?
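
For discussion, a rough sketch of the two conventions side by side (standalone functions with hypothetical helper stubs, not the actual flow code):

    # hedged sketch, not the real sarracenia flow code.
    def deliver(msg) -> None: ...    # hypothetical stand-in for the real transfer call
    def fetch(msg) -> None: ...      # hypothetical stand-in for the real retrieval call

    # the (ok, permanent) tuple convention described above for send()
    def send(msg) -> tuple[bool, bool]:
        try:
            deliver(msg)
            return (True, False)     # succeeded
        except FileNotFoundError:
            return (False, True)     # failed permanently: discard (worklist.rejected)
        except OSError:
            return (False, False)    # failed temporarily: queue for retry

    # the -1 / 0 / 1 integer convention download() already uses
    def download(msg) -> int:
        try:
            fetch(msg)
            return 1                 # success
        except FileNotFoundError:
            return -1                # permanent failure
        except OSError:
            return 0                 # temporary failure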

petersilva added the Design label on Sep 9, 2024
reidsunderland (Member) commented:

I agree that download and send should be consistent. One argument in favour of using integer return values in this case is that plugins don't need to be updated unless they need to return permanent failure?

And maybe in some cases it would be useful to return a report code? Return a negative code for failures (e.g. -404, -403, etc.) and any positive code is a success (e.g. 200).

If we use result >= 1 to check for success, it still works if a plugin returns a boolean.
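
A minimal sketch of that convention (illustrative only; how the flow would consume the value is not shown):

    # hedged sketch: positive means success, 0 means temporary failure, and a negative
    # value is a permanent failure carrying the report code (e.g. -404).
    def classify(result) -> str:
        if result >= 1:                     # True, 1, or an HTTP-like code such as 200
            return 'ok'
        if result == 0:                     # False or 0: temporary failure, queue for retry
            return 'retry'
        return f'rejected ({-result})'      # permanent failure with its report code

    print(classify(True), classify(200), classify(-404))   # -> ok ok rejected (404)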

petersilva (Contributor) commented:

I like using -num to communicate the error code...

petersilva (Contributor) commented:

Another option: maybe the right thing to do is actually to change how lag is calculated, and exclude retries? But then, if they all fail, there is no lag? Dunno... that does not sound right.

reidsunderland (Member) commented:

does the lag calculation also include rejected messages?

petersilva (Contributor) commented:

lag is calculated in flowcb/log.py/after_accept() ... it goes over the incoming list at that point, which is after the accept/reject clauses have been evaluated... so my guess is rejected messages will not be counted.
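
Roughly, the calculation amounts to something like the sketch below (the field and helper names are approximations, not a quote of flowcb/log.py):

    # hedged sketch: lag per message = now minus the message's publication time,
    # accumulated into the lagMax / lagTotal / lagMessageCount metrics seen in sr3 dump.
    import time
    from datetime import datetime, timezone

    def parse_pubtime(t: str) -> float:
        """Parse a v03 timestamp like '20240904T124838.50991869' into epoch seconds (UTC assumed)."""
        base, _, frac = t.partition('.')
        dt = datetime.strptime(base, '%Y%m%dT%H%M%S').replace(tzinfo=timezone.utc)
        return dt.timestamp() + (float('0.' + frac) if frac else 0.0)

    def after_accept_lag(worklist) -> dict:
        now = time.time()
        stats = {'lagMax': 0.0, 'lagTotal': 0.0, 'lagMessageCount': 0}
        for msg in worklist.incoming:        # only messages that survived the accept/reject clauses
            lag = now - parse_pubtime(msg['pubTime'])
            stats['lagMax'] = max(stats['lagMax'], lag)
            stats['lagTotal'] += lag
            stats['lagMessageCount'] += 1
        return stats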

petersilva (Contributor) commented:

oh... and the log callback runs at the end of after_accept, so messages excluded via duplicate suppression will also not be counted.

petersilva (Contributor) commented:

Looking at sarra/get_airnow... the original error is that the file is bigger than what the poll saw, so it is failing on a size mismatch:

2024-09-19 15:07:27,589 [DEBUG] sarracenia.flow download hasAccel=True, thresh=0, len=50309, remote_off=0, local_off=0 inflight=HourlyData_2024091914.dat
2024-09-19 15:07:27,589 [DEBUG] sarracenia.transfer.https get get HourlyData_2024091914.dat HourlyData_2024091914.dat 0
2024-09-19 15:07:27,589 [DEBUG] sarracenia.transfer.https get sr_http self.path //files.airnowtech.org/airnow/today
2024-09-19 15:07:27,589 [DEBUG] sarracenia.transfer.https __open__ https://s3-us-west-1.amazonaws.com///files.airnowtech.org/airnow/today/HourlyData_2024091914.dat
2024-09-19 15:07:27,858 [DEBUG] sarracenia.transfer read_write sr_proto read_write
2024-09-19 15:07:27,939 [WARNING] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing HourlyData_2024091914.dat. Message said to expect 50309 bytes.  Got 76208 bytes.

So we can just turn acceptSizeWrong on.
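
In the config, that would be a one-line addition (hedged; assuming the option is spelled acceptSizeWrong):

    # accept the file even when its size differs from what the message announced
    acceptSizeWrong True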

petersilva (Contributor) commented:

Also, the retrieve path involves today, so we know any links won't be any good after at most 24 hours... so lowering retry_ttl to at most 24 hours would be helpful as well.
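
For example (a hedged config sketch; the duration suffix syntax is assumed):

    # age queued retries out after a day, matching the lifetime of the "today" directory
    retry_ttl 24h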

petersilva added the likely-fixed (likely fix is in the repository, success not confirmed yet) label on Sep 27, 2024
petersilva removed the likely-fixed label on Sep 27, 2024