-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow to send multiple log messages in a single HTTP request #332
Comments
We are open to contribute this feature if our proposal is accepted. Since the syslog RFC 5424 doesn't specify a batched format we have to define a structure that is accepted by the community. From our point of view, the easiest implementation would be to use multi-line batches. |
Hi @bgoerzig, thanks for taking the time to write up this proposal. I think My first take is that we would be open to a PR to change the I need to re-read up on these schemes, I can no longer recall exactly why |
We've discussed the same issue in our team when @fhambrec and @dark5un were doing Syslog Agent performance tests. The difference between the unencrypted and encrypted traffic and the protocol https vs syslog-tls is significant. That's why at the end the recommendation is to use syslog-tls scheme for higher loads. At the end, our discussion went into direction of refactoring Syslog Agent's egress so that we have roughly speaking a connection poll with load balancing. The Syslog agent opens multiple connections and sends the messages in parallel. We can revive topic in our team ans see what can we do about it. |
The reason that http drains perform badly is that the requests are sent one after the other over one tcp connection. The throughput is therefore 1/(roundtrip time) which is less than 200/s inside an AWS datacenter. We reproduced that in performance tests. With syslog > 4000/s could be sent (independent if it is with (m)tls or not). I see three options to overcome this, maybe there are more:
I would favor http2. We had already planned to look into it on SAP side. This is in my opinion the most standard way. We could e.g. add the scheme "https2" to the list addiotionally to https, syslog and syslog+tls
I would go one step further and would request that the feature is per drain. We are running multi-tenant foundations and cannot break existing customers using an http syslog drain. Maybe an option would be to configure it with a query parameter in the syslog drain url (like the drain-type). |
Thank you for the refresher @stefanlay!
I agree with this. Using http2 seems the more "standard" way to pump up the performance of this drain option.
That's a very reasonable request. I'm in favor of this as well. |
Howdy! Just want to inject some context and thoughts. Firstly, honestly speaking, I'm very much not opposed to adding the ability to batch into the syslog-agent code base. Not only do i suspect that batching will give us huge performance gains for https drains, if and when we start to look more into adding additional protocol support to the drain-agent(syslog-agent) we'd have a reusable consistant tool which we could easily add to other interfaces that require or would benefit from batching(such as for example, otlp). Back in cf-for-k8s times, we spent a lot of time looking into our path to shared nothing for log-cache. As we were working on deprecating the firehose and moving over to syslog, and we saw that the performance of syslog ingress for log-cache was somewhat abysmal. The syslog server could accept traffic quite rapidly and efficiently but the GRPC throughput from syslog-server to log-cache was horrendous. The way that we fixed that performance bottleneck is to batch envelopes from the syslog-server to the log-cache rather then send them one by one. I think it's easy to look at this and apply the same lesson to http drains, and while performance testing seems reasonable, I highly suspect based on this that our performance will show marked increase. The main reason I feel like we have not pursued this is just the general lack of priority and interest in https drains in general. Https drains are I think still largely something I see as a "proof of concept" that happens to be present in production code. I'm not aware of a rfc supported method to send syslog over https in general. It seems like we "send syslog formatted messages" over https, and so does heroku(!?!?!?), but you'd think anyone who uses https as their interface would prefer fully formatted json messages. As I say this, there's some point about the concern of adding performance cost to the drain agent from marshalling messages into json(something we've seen can be extremely costly), but I digress. As long as we're shoving syslog formatted messages into a json message, https feels a bit awkward to me. There's a second thing that makes it feel like a proof of concept interface. Internally, the model use case for http drains is and somewhat has been sending logs and metrics to a aws lambda service. Now I did a little research and one thing I noticed is that lambda doesn't support http2. Now this isn't necessarily a problem, as far as I understand no actual user has set up a aws lambda service to accept their logs and metrics. One major problem is I'd suspect such a use-case would be extremely extremely expensive. But I'd be interested in some sort of mental model and use-case explanation to replace the existing one I have, and I wonder if in pursuing that mental model and use-case we could come up with possibly more acceptable models and solutions. Right now we're talking about how to better support performance with http, and we might go into a solution workflow where we do continue to support https, or support http/2(s), but I could imagine us also talking about other protocols that still don't require tcp level routing but still would be a good community standard, for example. Now as for the configurability of batching, I really do think that unconfigurable could and probubly would be fine. We've gotten away with universal batching size for v2 loggregator for a long time now, and in fact, not configuring the size of the batches also means we don't have to configure the size of our grpc buffers either(although we did have to change them to be accurate). I don't even think we have to specifically tune them per say to be the most performant, I think in many cases good enough will be good enough. The biggest risk in my mind would be users who expect 1 envelope and suddenly start receiving say, 100, and stop processing lots of envelopes. Let me know what you think though. |
Sorry for taking so long to move on this topic, we now have a draft proposal ready to talk about this improvement once more. We did some testing on our side with a custom built loggr-syslog-agent binary which contains the changes I provided in this draft PR: These changes only implement the syslog-batching approach we described above, with a pretty minimal and non invasive code change to be enabled via query parameter for a user-provided-service. Please let me know what you think of this approach, and how we could improve it to make it a contributable part of the coming loggregator versions. Greetings, Nicklas |
It's probably not going to have as big an impact as batching, but I found that we were not using fasthttp appropriately in our HTTPSWriter. That may have had a negative impact on the performance of Syslog Agent's HTTPS drains. Should be resolved in this PR: #573. |
@nicklas-dohrn I've re-read through this issue and the current implementation of Syslog Agent, and I think #491 is slightly off track. Instead of using a new The reason I think a new protocol would be better is twofold:
|
Yes, looking from a more broad perspective, I see the appeal of having it split out from a higher up point in the call chain. I refactored all the respective code, and the approach looks rather appealing so far. |
@nicklas-dohrn and other folks on this issue, what do you think of this change? |
I think that might prove helpful on the issue, but will for sure not resolve our issues, as we are currently allowing cross consumption into different availability zones, like from US to EU or else. I will nonetheless try out this approach with my dev setup and see how it performs in our usecase, as the only comparison I have found so far is: From what I guess so far, is that it would increase the throuput probably quite substantially, but I doubt it will increase by more than 200%, and we unfortunately likely need way more than that (at least 5x). but I will come back to you with my results, once I have some :) |
Thank you! I understand, the batching PR is still desirable. |
@nicklas-dohrn have you been able to try out #573 in your dev setup to see whether it improves performance for your usecase? |
I tried it, yes, was not really conclusive yet. |
I meant to close #573 not this Issue. My bad. This issue is prioritized. |
Problem Statement
The maximum throughput for https-based syslog drains is limited by the message roundtrip time:
This is a consequence of the syslog agent using one connection per drain and sending http-requests sequentially.
This limits the throughput especially in high latency scenarios. Messages are buffered if the message rate exceeds the throughput. Once the the syslog agent buffer is full, all messages are dropped.
Proposal
To increase the throughput and avoid loss we propose to introduce an option that allows the agent to batch multiple log messages into a single HTTP request.
Pull Request by @nicklas-dohrn: #491
@bgoerzig @juergen-walter @nicklas-dohrn
The text was updated successfully, but these errors were encountered: