Very large memory-consumption in SplunkHttp #265
Comments
Seeing the same issue. In our case, HTTP requests get stuck in the HTTP client's queue.
Experiencing the same issue. In our case it even caused our application running in AWS Fargate to crash: the G1 garbage collector would take up 100% of available CPU cycles, which eventually led to ECS shutting the application down because it wasn't responding to health checks in a timely fashion. Our temporary fix was to decrease the amount of logs we're sending to Splunk and increase the heap size, but the long-term solution will likely be to log to CloudWatch only and export the logs with a dedicated Lambda. I'd be interested if there are any other fixes for this issue, though.
As a workaround, batching log events helped in our case. That way, the number of requests to Splunk is reduced and they're less likely to pile up in the HTTP client.
Just curious, what batch size did you find useful? We raised the number from the original 17 to 2099 -- and that still leads to the above-charted heap usage... Our
We started off with the recommended values (see here). Looking at the settings you posted initially, it appears to me that you might be sending too fast.
We encountered this recently, so for anyone trying to understand the problem: the default behavior of the appender is to dispatch every logging event as a discrete HTTP request to the HEC endpoint. Those requests are enqueued in the OkHttp dispatcher, which by default (send_mode=serial) only uses one thread to send them to the API, so they pile up in the OkHttp buffer until the application hits an OOME. You fix this by configuring batch_size_count or batch_size_bytes to a nonzero value, by also configuring batch_interval to a nonzero number of milliseconds (to guarantee that an incomplete buffer is flushed if no events arrive to fill it within some period), and optionally by setting send_mode to "parallel" so that more threads can step in to help with bursts of logging. As I understand it, anyway.
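For illustration, a minimal log4j2.xml sketch of such a batched configuration -- the url, token, and index values are placeholders, the batch numbers are not tuned recommendations, and the attribute names should be checked against the splunk-library-javalogging version actually in use:

```xml
<Configuration>
  <Appenders>
    <!-- Placeholder url/token/index; batch values are illustrative only. -->
    <SplunkHttp name="splunk"
                url="https://hec.example.com:8088"
                token="YOUR-HEC-TOKEN"
                index="main"
                batch_size_count="100"
                batch_interval="500"
                send_mode="parallel">
      <PatternLayout pattern="%m%n"/>
    </SplunkHttp>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="splunk"/>
    </Root>
  </Loggers>
</Configuration>
```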
None of these methods will fix the problem. They'll help reduce it, yes, but not eliminate it completely. A fix would involve dropping events when the JVM is getting close to OOM, but no one dares to propose such data loss :(
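Not the memory-pressure-based dropping suggested above, but for anyone willing to accept bounded data loss today: Log4j2's built-in BurstFilter can be attached to the appender reference and will silently discard events at or below a chosen level once a rate threshold is exceeded, which caps what ever reaches the Splunk sender. A sketch, with illustrative (untuned) rate and burst values, assuming an appender named "splunk" as in the previous example:

```xml
<Loggers>
  <Root level="info">
    <AppenderRef ref="splunk">
      <!-- Once the sustained rate exceeds ~100 events/s, drop INFO-and-below
           events instead of letting them queue behind a slow or unreachable HEC.
           The rate and maxBurst values here are examples, not recommendations. -->
      <BurstFilter level="INFO" rate="100" maxBurst="200"/>
    </AppenderRef>
  </Root>
</Loggers>
```

Events above the configured level (WARN and ERROR here) still go through, which roughly approximates dropping by severity rather than by JVM memory pressure.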
Depending on whether Splunk-logging is enabled in log4j2.xml, our application's total heap usage rises by about 4 GB, as can be seen on these charts from jconsole. Granted, the application is verbose, with multiple events per second at times, but 4 GB still seems excessive. The receiving HEC is the same for the entire enterprise -- I doubt that's the bottleneck. Please advise.
Edit: #249, which seems to be about the same or a similar problem, is closed -- perhaps prematurely. There is no good way to handle the Splunk server being down (or slow); perhaps there should be an option to simply drop log entries when that happens -- to preserve heap -- based on the events' age and/or severity.
The total count of such dropped messages could be kept -- and logged on its own, when possible: "HEC latency necessitated dropping of %u events".