Memory grows until OOM with slow telemetry collector and lots of data. #26

seiferteric · 2021-03-02T23:25:47Z

I am opening this issue to get an opinion on how to handle an issue I have observed. We had a telemetry collector (telegraf with custom gNMI plugin) that was running with limited CPU quota (I think limited to ~20% CPU time). The collector was using SAMPLE mode to get a large BGP table. Over time we noticed that the memory of the telemetry process was increasing until we hit OOM and the process was killed.

I traced the issue to the send() function in client_subscribe.go, where we call err = stream.Send(resp). This call blocks in the case I described above, when the collector is not processing data quickly enough. Meanwhile, the telemetry process keeps adding data to the PriorityQueue, which causes the memory to grow. To rectify this, I introduced a new "LimitedQueue" in place of the current PriorityQueue in our (Dell) sonic-telemetry. The LimitedQueue checks the size of the queue and rejects new items if the size is greater than a predefined maximum (I set the default to 100MB).
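To make the idea concrete, here is a minimal sketch of a size-capped queue. It is not the Dell implementation: the type, method names, and FIFO ordering are assumptions made for illustration (the real queue is a priority queue), and this version rejects an item when adding it would push the total past the limit.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrQueueFull is returned when adding an item would exceed the byte budget.
var ErrQueueFull = errors.New("limited queue: size limit exceeded")

// LimitedQueue is a hypothetical FIFO of encoded messages with a cap on the
// total number of queued bytes. All names here are illustrative.
type LimitedQueue struct {
	mu       sync.Mutex
	items    [][]byte
	size     int // bytes currently queued
	maxBytes int // upper bound on size
}

// NewLimitedQueue returns an empty queue that holds at most maxBytes bytes.
func NewLimitedQueue(maxBytes int) *LimitedQueue {
	return &LimitedQueue{maxBytes: maxBytes}
}

// Put appends item, or rejects it if the queue would grow past maxBytes.
func (q *LimitedQueue) Put(item []byte) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.size+len(item) > q.maxBytes {
		return ErrQueueFull
	}
	q.items = append(q.items, item)
	q.size += len(item)
	return nil
}

// Get removes and returns the oldest item; ok is false when the queue is empty.
func (q *LimitedQueue) Get() (item []byte, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return nil, false
	}
	item = q.items[0]
	q.items[0] = nil // drop the reference so the payload can be collected
	q.items = q.items[1:]
	q.size -= len(item)
	return item, true
}

func main() {
	q := NewLimitedQueue(8) // tiny limit for the demo; the issue mentions 100MB
	fmt.Println(q.Put([]byte("12345")))
	fmt.Println(q.Put([]byte("6789"))) // would make 9 bytes, so it is rejected
}
```

The caller decides what a rejected Put means: drop the update, or treat it as a reason to end the subscription.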

This works, but it means that the collector will start to miss telemetry updates. Recently Broadcom recommended that I instead close the connection with gRPC code RESOURCE_EXHAUSTED rather than silently dropping updates.
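A rough sketch of the "terminate instead of drop" behaviour, using only the standard library: a bounded channel stands in for the queue, and the producer stops at the first overflow and returns an error. The function names and the sentinel error are hypothetical. In a grpc-go stream handler, the equivalent would be to return status.Error(codes.ResourceExhausted, "...") from the handler, which ends the RPC with that status code.

```go
package main

import (
	"errors"
	"fmt"
)

// errResourceExhausted stands in for a gRPC RESOURCE_EXHAUSTED status in this
// stdlib-only sketch.
var errResourceExhausted = errors.New("RESOURCE_EXHAUSTED: subscriber too slow")

// enqueue hands an update to the sender without blocking. If the buffer is
// full, it reports an error instead of dropping the update silently.
func enqueue(queue chan<- string, update string) error {
	select {
	case queue <- update:
		return nil
	default:
		return errResourceExhausted
	}
}

// produce feeds updates into the queue and stops at the first overflow, so
// the caller can tear down the subscription and free the queue.
func produce(queue chan<- string, updates []string) (sent int, err error) {
	for _, u := range updates {
		if err := enqueue(queue, u); err != nil {
			return sent, err
		}
		sent++
	}
	return sent, nil
}

func main() {
	// Nothing reads from the channel, which simulates a stalled collector.
	queue := make(chan string, 2)
	sent, err := produce(queue, []string{"a", "b", "c", "d"})
	fmt.Println(sent, err)
}
```

With this approach the collector sees an explicit error rather than a gap in the data, and can decide when to resubscribe.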

I would like to know the community's preferred approach before opening a PR.


lguohan · 2021-03-03T20:02:17Z

If the connection is closed, will all allocated memory be released?

When will the telemetry reconnect?

seiferteric · 2021-03-04T00:03:59Z

Yes, the queue will be disposed, so the memory will be freed. Since this is dial-in telemetry, it would be up to the collector to reconnect after receiving the RESOURCE_EXHAUSTED error.
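On the collector side, reconnecting could look roughly like the sketch below: resubscribe with exponential backoff whenever the stream ends with the exhausted error. Everything here is an assumption for illustration, including the subscribeFunc type, the sentinel error (a real collector would inspect the gRPC status code), and the retry limits.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errResourceExhausted stands in for a stream that ended with gRPC code
// RESOURCE_EXHAUSTED.
var errResourceExhausted = errors.New("RESOURCE_EXHAUSTED")

// subscribeFunc is a hypothetical callback that runs one subscription and
// returns when the stream ends.
type subscribeFunc func() error

// runWithReconnect resubscribes while the server reports exhaustion, doubling
// the wait between attempts up to maxDelay. Any other result is returned
// immediately.
func runWithReconnect(subscribe subscribeFunc, maxAttempts int, base, maxDelay time.Duration) (attempts int, err error) {
	delay := base
	for attempts < maxAttempts {
		attempts++
		err = subscribe()
		if !errors.Is(err, errResourceExhausted) {
			return attempts, err
		}
		if attempts < maxAttempts {
			time.Sleep(delay)
			delay *= 2
			if delay > maxDelay {
				delay = maxDelay
			}
		}
	}
	return attempts, err
}

func main() {
	calls := 0
	// Simulated server: rejects the first two subscriptions, then succeeds.
	subscribe := func() error {
		calls++
		if calls < 3 {
			return errResourceExhausted
		}
		return nil
	}
	attempts, err := runWithReconnect(subscribe, 5, time.Millisecond, 10*time.Millisecond)
	fmt.Println(attempts, err)
}
```

Backing off matters here: a collector that reconnects immediately while still CPU-starved would just hit the limit again.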

macikgozwa · 2021-03-04T00:17:57Z

I think sending RESOURCE_EXHAUSTED and terminating the subscription is a good option. That would be analogous to an HTTP 429 (throttling) response from a REST API.

One potential concern is a client that mixes high-volume and low-volume data in the same SubscriptionList request. Since the queue is per SubscriptionList request, the telemetry service would end up canceling both the high-volume and the low-volume paths. The collector side should be aware of this case and not mix them in the same SubscriptionList request.

#### Why I did it

Fix https://github.com/Azure/sonic-telemetry/issues/71

#### How I did it

Added memory limit for telemetry docker. Historical docker memory usage shows telemetry docker consuming 150-200MB memory. Adding some extra buffer.

qiluo-msft · 2022-02-01T00:15:59Z

@seiferteric Could you help connect us to the right person at Broadcom?

> Recently Broadcom recommended instead I close the connection with gRPC code RESOURCE_EXHAUSTED instead of silently dropping updates

I think disconnecting the gRPC client is the better choice.

sneelam20 · 2024-07-24T15:59:50Z

Assigning this to @anand-kumar-subramanian based on the last comment.

anand-kumar-subramanian · 2024-07-24T20:59:02Z

This issue was already fixed by Dell. Assigning this to @kwangsuk to make sure this issue is fixed.

kwangsuk · 2024-07-25T16:27:34Z

> This issue was already fixed by Dell. Assigning this to @kwangsuk to make sure this issue is fixed.

The fix has not yet been made in the community version. We plan to replicate it.

pra-moh mentioned this issue Mar 15, 2021

[Telemetry docker] add memory and memory swap limits sonic-net/sonic-buildimage#7062

Merged

1 task

qiluo-msft closed this as completed in sonic-net/sonic-buildimage#7062 May 11, 2021

pra-moh reopened this May 18, 2021

kellyyeh closed this as completed in kellyyeh/sonic-buildimage@0c59278 Jun 3, 2021

qiluo-msft reopened this Jan 31, 2022

ganglyu transferred this issue from sonic-net/sonic-telemetry Aug 31, 2022

sneelam20 assigned anand-kumar-subramanian Jul 24, 2024

sneelam20 added Triage BRCM labels Jul 24, 2024

anand-kumar-subramanian assigned kwangsuk and unassigned anand-kumar-subramanian Jul 24, 2024

anand-kumar-subramanian added Dell and removed BRCM labels Jul 24, 2024
