Very large memory-consumption in SplunkHttp #265
Comments
Seeing the same issue. In our case, HTTP requests get stuck in the HTTP client's queue.
Experiencing the same issue. In our case it even caused our application running in AWS Fargate to crash: the G1 garbage collector would take up 100% of available CPU cycles, which eventually led to ECS shutting the application down because it wasn't responding to health checks in a timely fashion. Our temporary fix was to decrease the amount of logs we're sending to Splunk and increase the heap size, but the long-term solution will likely be to log to CloudWatch only and export the logs with a dedicated Lambda. I'd be interested if there are any other fixes for this issue, though.
As a workaround, batching log events helped in our case. That way, the number of requests to Splunk is reduced and they're less likely to pile up in the HTTP client.
Just curious, what batch size did you find useful? We raised the number from the original 17 to 2099 -- and that still leads to the above-charted heap usage... Our
We started off with the recommended values (see here). Looking at the settings you posted initially, it appears to me that you might be sending too fast.
We encountered this recently, so for anyone trying to understand the problem: the default behavior of the appender is to dispatch every logging event as a discrete HTTP request to the HEC endpoint. Those requests are enqueued in the OkHttp dispatcher, which by default (send_mode=serial) only uses one thread to send them to the API, so they pile up in the OkHttp buffer until the application hits an OOME. You fix this by configuring batch_size_count or batch_size_bytes to a nonzero value, by also configuring batch_interval to a nonzero number of milliseconds (to guarantee that an incomplete buffer is flushed if no events arrive to fill it within some period), and optionally by setting send_mode to "parallel" so that more threads can step in to help with bursts of logging. As I understand it, anyway.
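For illustration, a minimal log4j2.xml sketch of such a batched configuration -- the url, token, and index values are placeholders, the batch numbers are not tuned recommendations, and the attribute names should be checked against the splunk-library-javalogging version actually in use:

```xml
<Configuration>
  <Appenders>
    <!-- Placeholder url/token/index; batch values are illustrative only. -->
    <SplunkHttp name="splunk"
                url="https://hec.example.com:8088"
                token="YOUR-HEC-TOKEN"
                index="main"
                batch_size_count="100"
                batch_interval="500"
                send_mode="parallel">
      <PatternLayout pattern="%m%n"/>
    </SplunkHttp>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="splunk"/>
    </Root>
  </Loggers>
</Configuration>
```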
None of these methods will fix the problem. They'll help reduce it, yes, but not eliminate it completely. A fix would involve dropping events when the JVM is getting close to OOM, but no one dares to propose such data loss :(
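Not the memory-pressure-based dropping suggested above, but for anyone willing to accept bounded data loss today: Log4j2's built-in BurstFilter can be attached to the appender reference and will silently discard events at or below a chosen level once a rate threshold is exceeded, which caps what ever reaches the Splunk sender. A sketch, with illustrative (untuned) rate and burst values, assuming an appender named "splunk" as in the previous example:

```xml
<Loggers>
  <Root level="info">
    <AppenderRef ref="splunk">
      <!-- Once the sustained rate exceeds ~100 events/s, drop INFO-and-below
           events instead of letting them queue behind a slow or unreachable HEC.
           The rate and maxBurst values here are examples, not recommendations. -->
      <BurstFilter level="INFO" rate="100" maxBurst="200"/>
    </AppenderRef>
  </Root>
</Loggers>
```

Events above the configured level (WARN and ERROR here) still go through, which roughly approximates dropping by severity rather than by JVM memory pressure.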
Depending on whether Splunk-logging is enabled in log4j2.xml, our application's total heap usage rises by about 4 GB, as can be seen on these charts from jconsole. Granted, the application is verbose, with multiple events per second at times, but 4 GB still seems excessive. The receiving HEC is the same for the entire enterprise -- I doubt that's the bottleneck. Please advise.
Edit: #249, which seems to be about the same or a similar problem, is closed -- perhaps prematurely. There is no good way to handle the Splunk server being down (or slow); perhaps there should be an option to simply drop log entries when that happens -- to preserve heap -- based on the events' age and/or severity.
The total count of such dropped messages could be kept -- and logged on its own, when possible: "HEC latency necessitated dropping of %u events".