
lag metric reporting incorrect statistics #1206

Open
andreleblanc11 opened this issue Sep 4, 2024 · 15 comments
Labels
  • Design: impacts API, or code structure changes
  • invalid: This doesn't seem right
  • Priority 5 - Defect: Missing and/or broken functionality - do not forget
  • v3only: Only affects v3 branches.
  • work-around: a work-around is provided, mitigating the issue.

Comments

andreleblanc11 (Member) commented Sep 4, 2024:

Sometimes, some configurations will report a high lag time, which triggers the lag state in sr3 status.

Example

Component/Config                              Processes   Connection        Lag                              Rates
                                              State   Run Retry  msg data   Que   LagMax   LagAvg  Last  %rej     pubsub   messages     RxData     TxData
sarra/get-cmems-oceanobs                      lag     1/1     0 100%   0%     0  171.24s   90.80s 2.43s  0.0%    564 B/s      0 m/s 40.3 KiB/s      0 B/s
sarra/get_airnow_http                         lag     1/1     1 100%  67%     0 541312.32s 541312.32s 4196s  0.0%      0 B/s      0 m/s      0 B/s      0 B/s
sarra/get_airnow_prov_depot                   lag     2/2    13 100%  69%     0 586328.54s 310211.80s 6940s  0.0%      0 B/s      0 m/s      0 B/s      0 B/s
sarra/get_marine_weather                      lag     1/1    53 100%  72%     0 533804.73s 212781.40s 8549s  0.0%      0 B/s      0 m/s      0 B/s      0 B/s
sarra/get_rvas_rem_pxpaz                      lag     1/1     1 100%  91%     0 270340.14s 270340.14s 6424s  0.0%      0 B/s      0 m/s      0 B/s      0 B/s

The first row looks reasonable. The other four values span from 3 to 6 days, which doesn't make sense when compared with their Last values.
So far this problem only seems to affect sarra components. Still unsure why.

We checked the output of sr3 dump and it tells a similar story: lagMax and lagTotal are huge.

"get_marine_weather" :  {"instance_pids": {"1": 16077}, "queueName": "q_WXO-DD.sarra.get_marine_weather.ddsr.cmc.ec.gc.ca", "has_state": false, "noVip": null, "instance_metrics": {"1": {"flow": {"stop_requested": false, "last_housekeeping": 1725462788.3809488, "transferConnected": true, "transferConnectStart": 1725462900.03323, "transferConnectTime": 62.7959840297699, "transferRxBytes": 0, "transferTxBytes": 0, "transferRxFiles": 0, "transferTxFiles": 0, "last_housekeeping_cpuTime": 27725.94, "cpuTime": 12.299999999999272, "next_housekeeping": 1725463088.3809488, "msgRate": 0.005914938262384488, "msgRateCpu": 0.1625919536811215, "current_sleep": 6.4}, "post.message": {"connected": true, "disconnectLast": 0, "disconnectTime": 0, "disconnectCount": 0, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 0, "txGoodCount": 0, "txBadCount": 0, "txLast": "20240904T124839.0826826096"}, "gather.message": {"connected": true, "disconnectLast": 0, "disconnectTime": 0, "disconnectCount": 0, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 0, "txGoodCount": 0, "txBadCount": 0, "brokerQueuedMessageCount": 0, "rxLast": "20240904T124838.50991869"}, "retry": {"msgs_in_download_retry": 53, "msgs_in_post_retry": 0}, 
"log": {"lagMax": 534114.6203896999, "lagTotal": 11293876.948493004, "lagMessageCount": 53, "rejectCount": 0}, "sarracenia.transfer": {"byteRateInstant": 2300926.898529939}, "status": {"mtime": 1725462900.121718}}}, "logAge": {"1": 65.95925784111023}, "metrics": {"byteRate": 0.0, "cpuTime": 12.299999999999272, "rejectCount": 0, "last_housekeeping": 1725462788.3809488, "messagesQueued": 0, "lagMean": 213092.01789609442, "latestTransfer": "8786s", "rejectPercent": 0, "transferRxByteRate": 0.0, "transferTxByteRate": 0.0, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 0, "txGoodCount": 0, "txBadCount": 0, "lagMax": 534114.6203896999, "lagTotal": 11293876.948493004, "lagMessageCount": 53, "disconnectTime": 0.0, "transferConnectTime": 62.7959840297699, "transferRxLast": 0, "transferTxLast": 0, "rxLast": 1725454119.0826826, "txLast": 1725454119.0826826, "transferRxBytes": 0, "transferRxFiles": 0, "transferTxBytes": 0, "transferTxFiles": 0, "msgs_in_post_retry": 0, "msgs_in_download_retry": 53, "brokerQueuedMessageCount": 0, "time_base": 116.35145330429077, "byteTotal": 0, "msgRate": 0.0, "msgRateCpu": 0.0, "retry": 53, "transferLast": 0, "connectPercent": 100, "byteConnectPercent": 53, "messageLast": 1725454119.0826826, "transferRxFileRate": 0.0, "transferTxFileRate": 0.0}, "missing_instances": [], "hung_instances": [], "resource_usage": {"uss": 60354560, "rss": 76558336, "vms": 325898240, "user_cpu": 27219.12, "system_cpu": 519.13}}

Work-arounds

  • Increase runStateThreshold_lag to an astronomically high value on each affected configuration, so that the lag state is never reported (see the sketch after this list).
  • Do a cleanup.
  • Lower retry_ttl so queued retries age out faster?
  • Turn retry off (prevent retries completely): if this is a poll and the data is old, it may not be worth retrieving.
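
A sketch of the first work-around in an affected configuration file (the threshold value is arbitrary, chosen only so the lag state never triggers):

    # suppress the lag state by raising the threshold far beyond any observed lag
    runStateThreshold_lag 99999999
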
andreleblanc11 added the bug, work-around, Priority 5 - Defect, and v3only labels on Sep 4, 2024
petersilva (Contributor) commented:

so... I took a look at sarra/get_airnow_http:

2024-09-04 00:02:34,498 [DEBUG] sarracenia.diskqueue msg_get_from_file DEBUG /local/home/sarra/.cache/sr3/sarra/get_airnow_http/diskqueue_work_retry_01 open read
2024-09-04 00:02:34,500 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 473281.70 ) https://s3-us-west-1.amazonaws.com//files.airnowtech.org/airnow/

It is retrieving a message from the retry queue, and then the retrieve fails, and it puts it back on the retry queue, every five minutes.

2024-09-04 00:02:34,504 [DEBUG] sarracenia.transfer.https __open__ https://s3-us-west-1.amazonaws.com///files.airnowtech.org/airnow/today/HourlyData_2024082911.dat
2024-09-04 00:02:34,772 [ERROR] sarracenia.transfer.https __open__ Download failed 4 https://s3-us-west-1.amazonaws.com//files.airnowtech.org/airnow/today/HourlyData_2024082911.dat
2024-09-04 00:02:34,772 [ERROR] sarracenia.transfer.https __open__ Server couldn't fulfill the request. Error code: 404, Not Found

so that is causing the huge lag report.

Question: should a 404 cause us to give up on retries? If the server doesn't have the file now, when would we expect it to have it? The new "stat" code just merged got rid of some cases where putting things on the retry queue is useless... this might be another. On the other hand, the danger is that when a server is sick, it gives bogus 404's that eventually get fixed.

Probably the most flexible way is to have a set of codes that make us retry, and others that make us give up... so it's analyst-tunable.
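
A minimal sketch of what an analyst-tunable policy could look like (the code sets and the should_retry helper are illustrative, not existing sarracenia options):

    # hedged sketch: classify a failed HTTP status as retryable or permanent,
    # driven by two analyst-editable sets rather than hard-coded logic.
    GIVE_UP_CODES = {401, 403, 404, 410}            # usually permanent: drop from the retry queue
    RETRY_CODES = {408, 429, 500, 502, 503, 504}    # usually transient: keep retrying

    def should_retry(status: int) -> bool:
        """True if a failed transfer with this HTTP status should go back on the retry queue."""
        if status in GIVE_UP_CODES:
            return False
        if status in RETRY_CODES:
            return True
        return True                                 # unknown codes: err on the side of retrying

    print(should_retry(404), should_retry(503))     # -> False True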

petersilva (Contributor) commented:

One problem with giving up is that it might actually be a permission problem... for "security" reasons, some servers report permission denied as 404, to stop bad actors from mapping the URLs that exist on a site (permission denied for the ones that do exist, 404 for those that don't, lets you work out a site map, albeit very painfully).
So some 404's could get fixed, and then data queued for retry would start flowing again...

But 99% of the time, this type of error is permanent.

petersilva added a commit that referenced this issue Sep 4, 2024
petersilva added the invalid label and removed the bug label on Sep 4, 2024
petersilva (Contributor) commented:

I'm marking this as invalid, as the lag metric is actually correct: there is nothing wrong with it. It is bringing attention to some other problem, and I have a patch for that... but I dunno... it's not really a bug, and what to do is debatable.

The patch should "fix" things so that 404 failures, instead of being retried, are discarded and considered permanent.
This only fixes http... other protocols would need more fixes.

petersilva (Contributor) commented:

work-arounds:

  • Do a cleanup.
  • Lower retry_ttl so queued retries age out faster?
  • Turn retry off (prevent retries completely): if this is a poll and the data is old, it may not be worth retrieving.

andreleblanc11 (Member, Author) commented Sep 6, 2024:

I found a similar lag output in the sr3 status, this time in a sender. So it's getting the files locally.

ddsr-cmc-stage02:                                    State   Run Retry  msg data   Que   LagMax   LagAvg  Last  %rej     pubsub   messages     RxData     TxData
ddsr-cmc-stage02: sender/dnd_to_ice                  lag     1/1    17 100%   0%     0 563059.95s 212969.82s >9999  0.0%      0 B/s      0 m/s      0 B/s      0 B/s

It's again reporting a lag of over 6 days, but the last file was sent about 14 hours ago.
There are also old files being retried from the retry queue, even though they no longer exist locally.

ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [ERROR] sarracenia.flow send could not chdir to /apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05 to write: [Errno 2] No such file or directory: '/apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05'
ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [WARNING] sarracenia.flow do_send sending again, attempt 2
ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [ERROR] sarracenia.flow send could not chdir to /apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05 to write: [Errno 2] No such file or directory: '/apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05'
ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [WARNING] sarracenia.flow do_send sending again, attempt 3
ddsr-cmc-stage01: 2024-09-06 18:50:40,445 [ERROR] sarracenia.flow send could not chdir to /apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05 to write: [Errno 2] No such file or directory: '/apps/sarra/public_data/20240903/DND/ICE/sfc-prog/05'

petersilva (Contributor) commented:

In this case, having a retry lifetime longer than the retention period of the server is counter-productive... an indication of a mis-configuration. If the retention period (number of days on the server) is less than retry_ttl, then this will happen a lot. I'm working on a 404 fix for senders anyway, though.

petersilva added a commit that referenced this issue Sep 9, 2024
petersilva (Contributor) commented:

@andreleblanc11 So this latest patch should address the second (improper?) lag mode... the file has disappeared on our machine, so retries fail.

Changed Signature of send() routine

Hey @reidsunderland, input welcome on this aspect... the flow download() and send() routines used to return booleans indicating success or failure, but some failures are permanent and others are temporary. I changed download() a while ago to return -1 for permanent failure, 0 for temporary failure, and 1 for success. In this branch, I changed send()'s boolean to a tuple of booleans: (succeeded, permanently). If it failed but not permanently, the message is queued for retry; if it failed permanently, it is discarded (self.worklist.rejected).

With the above patch, the signature of the send routine changes from a single bool (ok, i.e. succeeded or not) to (ok, permanent): both bools, indicating whether a failure is temporary (so it gets queued for retry) or permanent (so it gets discarded). A sketch of the two conventions follows the list below.

This does not (but probably should) change the signature of plugins to support the same thing... ugh... I did it half-way... need to discuss. Anyway, I changed some flowcb signatures, but the code that calls them wasn't changed, so I changed them back.

Anyways... the root thing is... I changed the download() signature... and I'm wondering:

  • maybe make download and send consistent (either -1/0/1 for both, or the tuple for both)?
  • maybe change the flowcb overrides' signatures to match as well?
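
For discussion, a rough sketch of the two conventions side by side (standalone functions with hypothetical helper stubs, not the actual flow code):

    # hedged sketch, not the real sarracenia flow code.
    def deliver(msg) -> None: ...    # hypothetical stand-in for the real transfer call
    def fetch(msg) -> None: ...      # hypothetical stand-in for the real retrieval call

    # the (ok, permanent) tuple convention described above for send()
    def send(msg) -> tuple[bool, bool]:
        try:
            deliver(msg)
            return (True, False)     # succeeded
        except FileNotFoundError:
            return (False, True)     # failed permanently: discard (worklist.rejected)
        except OSError:
            return (False, False)    # failed temporarily: queue for retry

    # the -1 / 0 / 1 integer convention download() already uses
    def download(msg) -> int:
        try:
            fetch(msg)
            return 1                 # success
        except FileNotFoundError:
            return -1                # permanent failure
        except OSError:
            return 0                 # temporary failure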

petersilva added the Design label on Sep 9, 2024
reidsunderland (Member) commented:

I agree that download and send should be consistent. One argument in favour of using integer return values in this case is that plugins don't need to be updated unless they need to return permanent failure?

And maybe in some cases it would be useful to return a report code? Return a negative code for failures (e.g. -404, -403, etc.) and any positive code is a success (e.g. 200).

If we use result >= 1 to check for success, it still works if a plugin returns a boolean.
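
A minimal sketch of that convention (illustrative only; how the flow would consume the value is not shown):

    # hedged sketch: positive means success, 0 means temporary failure, and a negative
    # value is a permanent failure carrying the report code (e.g. -404).
    def classify(result) -> str:
        if result >= 1:                     # True, 1, or an HTTP-like code such as 200
            return 'ok'
        if result == 0:                     # False or 0: temporary failure, queue for retry
            return 'retry'
        return f'rejected ({-result})'      # permanent failure with its report code

    print(classify(True), classify(200), classify(-404))   # -> ok ok rejected (404)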

petersilva (Contributor) commented:

I like using -num to communicate the error code...

petersilva (Contributor) commented:

Another option: maybe the right thing to do is actually to change how lag is calculated, and exclude retries? But then, if they all fail, there is no lag? Dunno... that does not sound right.

reidsunderland (Member) commented:

does the lag calculation also include rejected messages?

petersilva (Contributor) commented:

lag is calculated in flowcb/log.py/after_accept() ... it goes over the incoming list at that point, which is after the accept/reject clauses have been evaluated... so my guess is rejected messages will not be counted.
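
Roughly, the calculation amounts to something like the sketch below (the field and helper names are approximations, not a quote of flowcb/log.py):

    # hedged sketch: lag per message = now minus the message's publication time,
    # accumulated into the lagMax / lagTotal / lagMessageCount metrics seen in sr3 dump.
    import time
    from datetime import datetime, timezone

    def parse_pubtime(t: str) -> float:
        """Parse a v03 timestamp like '20240904T124838.50991869' into epoch seconds (UTC assumed)."""
        base, _, frac = t.partition('.')
        dt = datetime.strptime(base, '%Y%m%dT%H%M%S').replace(tzinfo=timezone.utc)
        return dt.timestamp() + (float('0.' + frac) if frac else 0.0)

    def after_accept_lag(worklist) -> dict:
        now = time.time()
        stats = {'lagMax': 0.0, 'lagTotal': 0.0, 'lagMessageCount': 0}
        for msg in worklist.incoming:        # only messages that survived the accept/reject clauses
            lag = now - parse_pubtime(msg['pubTime'])
            stats['lagMax'] = max(stats['lagMax'], lag)
            stats['lagTotal'] += lag
            stats['lagMessageCount'] += 1
        return stats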

petersilva (Contributor) commented:

oh... and the log callback runs at the end of after_accept, so messages excluded via duplicate suppression will also not be counted.

petersilva (Contributor) commented:

Looking at sarra/get_airnow... the original error is that the file is bigger than what the poll saw, so it is failing on a size mismatch:

2024-09-19 15:07:27,589 [DEBUG] sarracenia.flow download hasAccel=True, thresh=0, len=50309, remote_off=0, local_off=0 inflight=HourlyData_2024091914.dat
2024-09-19 15:07:27,589 [DEBUG] sarracenia.transfer.https get get HourlyData_2024091914.dat HourlyData_2024091914.dat 0
2024-09-19 15:07:27,589 [DEBUG] sarracenia.transfer.https get sr_http self.path //files.airnowtech.org/airnow/today
2024-09-19 15:07:27,589 [DEBUG] sarracenia.transfer.https __open__ https://s3-us-west-1.amazonaws.com///files.airnowtech.org/airnow/today/HourlyData_2024091914.dat
2024-09-19 15:07:27,858 [DEBUG] sarracenia.transfer read_write sr_proto read_write
2024-09-19 15:07:27,939 [WARNING] sarracenia.transfer read_writelocal util/writelocal mismatched file length writing HourlyData_2024091914.dat. Message said to expect 50309 bytes.  Got 76208 bytes.

So we can just turn acceptSizeWrong on.
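
In the config, that would be a one-line addition (hedged; assuming the option is spelled acceptSizeWrong):

    # accept the file even when its size differs from what the message announced
    acceptSizeWrong True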

petersilva (Contributor) commented:

Also, the retrieve path involves today, so we know any links won't be any good after at most 24 hours... so lowering retry_ttl to at most 24 hours would be helpful as well.
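
For example (a hedged config sketch; the duration suffix syntax is assumed):

    # age queued retries out after a day, matching the lifetime of the "today" directory
    retry_ttl 24h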

petersilva added the likely-fixed (likely fix is in the repository, success not confirmed yet) label on Sep 27, 2024
petersilva removed the likely-fixed label on Sep 27, 2024