Allow to send multiple log messages in a single HTTP request #332

bgoerzig · 2023-06-21T12:47:51Z

Problem Statement

The maximum throughput for https-based syslog drains is limited by the message roundtrip time:

max_logs_per_second = 1 / message_roundtrip_time_in_seconds.

This is a consequence of the syslog agent using one connection per drain and sending http-requests sequentially.
This limits the throughput especially in high latency scenarios. Messages are buffered if the message rate exceeds the throughput. Once the the syslog agent buffer is full, all messages are dropped.

Proposal

To increase the throughput and avoid loss we propose to introduce an option that allows the agent to batch multiple log messages into a single HTTP request.

Pull Request by @nicklas-dohrn: #491

@bgoerzig @juergen-walter @nicklas-dohrn

The text was updated successfully, but these errors were encountered:

bgoerzig · 2023-06-21T12:54:36Z

We are open to contribute this feature if our proposal is accepted. Since the syslog RFC 5424 doesn't specify a batched format we have to define a structure that is accepted by the community. From our point of view, the easiest implementation would be to use multi-line batches.

ctlong · 2023-06-21T21:20:40Z

Hi @bgoerzig, thanks for taking the time to write up this proposal. I think https is the least optimal of the schemes that Syslog Agent offers for streaming syslog messages. Because of that we typically encourage use of the other schemes. May I ask why you're interested in using this scheme in particular?

My first take is that we would be open to a PR to change the https scheme to batch messages, as long as the change is feature-flagged. AFAIK there's no RFC for syslog over HTTPS, so our implementation of it is already special and there's nothing stopping us from making it more special. Feature-flagging the change would be necessary because otherwise we're talking about a breaking change for pre-existing https drains. Multi-line seems like a reasonable implementation approach as well.

I need to re-read up on these schemes, I can no longer recall exactly why https is less optimal than the others 😅.

chombium · 2023-06-22T05:57:55Z

We've discussed the same issue in our team when @fhambrec and @dark5un were doing Syslog Agent performance tests. The difference between the unencrypted and encrypted traffic and the protocol https vs syslog-tls is significant. That's why at the end the recommendation is to use syslog-tls scheme for higher loads.
We've talked about the possible improvements, thought about batching messages in single requests as well, but the we think that's not the way to go, as there is nothing defined in the Syslog RFC and the client (Syslog Agent in our case) and the server (generally speaking, unknown 3rd party Syslog server) would both have to know how to pack and unpack the request body. As we only control the client in this case, we've concluded that this solution is not the right one.

At the end, our discussion went into direction of refactoring Syslog Agent's egress so that we have roughly speaking a connection poll with load balancing. The Syslog agent opens multiple connections and sends the messages in parallel. We can revive topic in our team ans see what can we do about it.

stefanlay · 2023-06-26T06:51:56Z

I need to re-read up on these schemes, I can no longer recall exactly why https is less optimal than the others

The reason that http drains perform badly is that the requests are sent one after the other over one tcp connection. The throughput is therefore 1/(roundtrip time) which is less than 200/s inside an AWS datacenter. We reproduced that in performance tests. With syslog > 4000/s could be sent (independent if it is with (m)tls or not).

I see three options to overcome this, maybe there are more:

Using more than one connection for an http sylog drain.

Advantage: Server behaviour unchanged, they can just receive one line per request like currently
Disadvantage: Handling of connection pool, unclear how big it should be; order of logs may get lost

Batching the requests, sending all available requests via one http request

Advantage: (Relatively) simple to implement on syslog agent side
Disadvantage: Non-standard, therefore adaption on server side needed

Using http2

Advantage: Standard way to de-serialize http request over one connection, simple to implement on syslog agent side (just using https://pkg.go.dev/golang.org/x/net/http2)
Disadvantage: Needs to be supported on server side. I do not have experience there.

I would favor http2. We had already planned to look into it on SAP side. This is in my opinion the most standard way. We could e.g. add the scheme "https2" to the list addiotionally to https, syslog and syslog+tls

My first take is that we would be open to a PR to change the https scheme to batch messages, as long as the change is feature-flagged.

I would go one step further and would request that the feature is per drain. We are running multi-tenant foundations and cannot break existing customers using an http syslog drain. Maybe an option would be to configure it with a query parameter in the syslog drain url (like the drain-type).

ctlong · 2023-06-26T16:37:01Z

Thank you for the refresher @stefanlay!

I would favor http2.

I agree with this. Using http2 seems the more "standard" way to pump up the performance of this drain option.

I would go one step further and would request that the feature is per drain... Maybe an option would be to configure it with a query parameter in the syslog drain url (like the drain-type).

That's a very reasonable request. I'm in favor of this as well.

Benjamintf1 · 2023-07-05T17:21:35Z

Howdy!

Just want to inject some context and thoughts.

Firstly, honestly speaking, I'm very much not opposed to adding the ability to batch into the syslog-agent code base. Not only do i suspect that batching will give us huge performance gains for https drains, if and when we start to look more into adding additional protocol support to the drain-agent(syslog-agent) we'd have a reusable consistant tool which we could easily add to other interfaces that require or would benefit from batching(such as for example, otlp).

Back in cf-for-k8s times, we spent a lot of time looking into our path to shared nothing for log-cache. As we were working on deprecating the firehose and moving over to syslog, and we saw that the performance of syslog ingress for log-cache was somewhat abysmal. The syslog server could accept traffic quite rapidly and efficiently but the GRPC throughput from syslog-server to log-cache was horrendous. The way that we fixed that performance bottleneck is to batch envelopes from the syslog-server to the log-cache rather then send them one by one. I think it's easy to look at this and apply the same lesson to http drains, and while performance testing seems reasonable, I highly suspect based on this that our performance will show marked increase.

The main reason I feel like we have not pursued this is just the general lack of priority and interest in https drains in general. Https drains are I think still largely something I see as a "proof of concept" that happens to be present in production code. I'm not aware of a rfc supported method to send syslog over https in general. It seems like we "send syslog formatted messages" over https, and so does heroku(!?!?!?), but you'd think anyone who uses https as their interface would prefer fully formatted json messages. As I say this, there's some point about the concern of adding performance cost to the drain agent from marshalling messages into json(something we've seen can be extremely costly), but I digress. As long as we're shoving syslog formatted messages into a json message, https feels a bit awkward to me.

There's a second thing that makes it feel like a proof of concept interface. Internally, the model use case for http drains is and somewhat has been sending logs and metrics to a aws lambda service. Now I did a little research and one thing I noticed is that lambda doesn't support http2. Now this isn't necessarily a problem, as far as I understand no actual user has set up a aws lambda service to accept their logs and metrics. One major problem is I'd suspect such a use-case would be extremely extremely expensive. But I'd be interested in some sort of mental model and use-case explanation to replace the existing one I have, and I wonder if in pursuing that mental model and use-case we could come up with possibly more acceptable models and solutions. Right now we're talking about how to better support performance with http, and we might go into a solution workflow where we do continue to support https, or support http/2(s), but I could imagine us also talking about other protocols that still don't require tcp level routing but still would be a good community standard, for example.

Now as for the configurability of batching, I really do think that unconfigurable could and probubly would be fine. We've gotten away with universal batching size for v2 loggregator for a long time now, and in fact, not configuring the size of the batches also means we don't have to configure the size of our grpc buffers either(although we did have to change them to be accurate). I don't even think we have to specifically tune them per say to be the most performant, I think in many cases good enough will be good enough. The biggest risk in my mind would be users who expect 1 envelope and suddenly start receiving say, 100, and stop processing lots of envelopes.

Let me know what you think though.

nicklas-dohrn · 2024-03-27T09:46:38Z

Sorry for taking so long to move on this topic, we now have a draft proposal ready to talk about this improvement once more.

We did some testing on our side with a custom built loggr-syslog-agent binary which contains the changes I provided in this draft PR:
#491

These changes only implement the syslog-batching approach we described above, with a pretty minimal and non invasive code change to be enabled via query parameter for a user-provided-service.

Please let me know what you think of this approach, and how we could improve it to make it a contributable part of the coming loggregator versions.

Greetings, Nicklas

ctlong · 2024-05-03T17:47:53Z

It's probably not going to have as big an impact as batching, but I found that we were not using fasthttp appropriately in our HTTPSWriter. That may have had a negative impact on the performance of Syslog Agent's HTTPS drains. Should be resolved in this PR: #573.

ctlong · 2024-05-03T22:07:35Z

@nicklas-dohrn I've re-read through this issue and the current implementation of Syslog Agent, and I think #491 is slightly off track.

Instead of using a new batching=true query param in the syslog drain URL to indicate to the HTTPSWriter that batching is enabled, a new protocol should be defined. I like https-batch, but open to other naming schemes. Then syslog drain URLs matching https-batch://..., for example, will be redirected to a HTTPSBatchWriter, or equivalent, that will batch messages over HTTPS using \n as a separator.

The reason I think a new protocol would be better is twofold:

The current implementation doesn't have other examples of using query params to change the behaviour of the writers. That all happens at higher levels.
Conceptually, we are defining a new communication protocol here: multiple syslog messages separated by newlines delivered in the same HTTPS request.

nicklas-dohrn · 2024-05-06T08:13:51Z

Yes, looking from a more broad perspective, I see the appeal of having it split out from a higher up point in the call chain.
I will look into how I can incorporate this way of switching Writers, but from thinking about it from the top of my head right now, this should be rather straight forward to be implemented.

I refactored all the respective code, and the approach looks rather appealing so far.
Let me know what you think so far.

ctlong · 2024-05-06T22:15:52Z

It's probably not going to have as big an impact as batching, but I found that we were not using fasthttp appropriately in our HTTPSWriter. That may have had a negative impact on the performance of Syslog Agent's HTTPS drains. Should be resolved in this PR: #573.

@nicklas-dohrn and other folks on this issue, what do you think of this change?

nicklas-dohrn · 2024-05-07T13:35:09Z

I think that might prove helpful on the issue, but will for sure not resolve our issues, as we are currently allowing cross consumption into different availability zones, like from US to EU or else.
Sending single log entries with single https requests is depending on the size of the logs mostly inefficient, and might even be costly on datacenters with high IO costs.
We also have seen a positive impact on our routing infrastructure, as having bundled logs sent through nginx is way more efficient.

I will nonetheless try out this approach with my dev setup and see how it performs in our usecase, as the only comparison I have found so far is:
https://github.com/nurmohammed840/web-benchmark?tab=readme-ov-file#go-v116-linux

From what I guess so far, is that it would increase the throuput probably quite substantially, but I doubt it will increase by more than 200%, and we unfortunately likely need way more than that (at least 5x).

but I will come back to you with my results, once I have some :)

ctlong · 2024-05-07T15:21:06Z

Thank you!

I understand, the batching PR is still desirable.

ctlong · 2024-06-24T23:14:54Z

@nicklas-dohrn have you been able to try out #573 in your dev setup to see whether it improves performance for your usecase?

nicklas-dohrn · 2024-06-25T04:49:03Z

I tried it, yes, was not really conclusive yet.

ctlong · 2024-09-05T20:36:31Z

I meant to close #573 not this Issue. My bad. This issue is prioritized.

cf-foundation-community-automation bot added this to Application Runtime Platform Working Group Jul 26, 2023

cf-foundation-community-automation bot moved this to Inbox in Application Runtime Platform Working Group Jul 26, 2023

cf-foundation-community-automation bot added this to Application Runtime Platform Working Group Sep 19, 2023

cf-foundation-community-automation bot moved this to Inbox in Application Runtime Platform Working Group Sep 19, 2023

ctlong moved this from Inbox to Pending Review | Discussion in Application Runtime Platform Working Group Oct 4, 2023

ctlong moved this from Pending Review | Discussion to Waiting for Changes | Open for Contribution in Application Runtime Platform Working Group Nov 2, 2023

ctlong assigned nicklas-dohrn May 3, 2024

ctlong closed this as not planned Won't fix, can't repro, duplicate, stale Sep 5, 2024

github-project-automation bot moved this from Waiting for Changes | Open for Contribution to Done in Application Runtime Platform Working Group Sep 5, 2024

ctlong reopened this Sep 5, 2024

ctlong closed this as completed in 75f9a92 Oct 3, 2024

github-project-automation bot moved this from Pending Merge | Prioritized to Done in Application Runtime Platform Working Group Oct 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow to send multiple log messages in a single HTTP request #332

Allow to send multiple log messages in a single HTTP request #332

bgoerzig commented Jun 21, 2023 •

edited

Loading

bgoerzig commented Jun 21, 2023 •

edited

Loading

ctlong commented Jun 21, 2023 •

edited

Loading

chombium commented Jun 22, 2023 •

edited

Loading

stefanlay commented Jun 26, 2023 •

edited

Loading

ctlong commented Jun 26, 2023

Benjamintf1 commented Jul 5, 2023

nicklas-dohrn commented Mar 27, 2024

ctlong commented May 3, 2024

ctlong commented May 3, 2024

nicklas-dohrn commented May 6, 2024 •

edited

Loading

ctlong commented May 6, 2024

nicklas-dohrn commented May 7, 2024

ctlong commented May 7, 2024

ctlong commented Jun 24, 2024

nicklas-dohrn commented Jun 25, 2024

ctlong commented Sep 5, 2024

Allow to send multiple log messages in a single HTTP request #332

Allow to send multiple log messages in a single HTTP request #332

Comments

bgoerzig commented Jun 21, 2023 • edited Loading

Problem Statement

Proposal

bgoerzig commented Jun 21, 2023 • edited Loading

ctlong commented Jun 21, 2023 • edited Loading

chombium commented Jun 22, 2023 • edited Loading

stefanlay commented Jun 26, 2023 • edited Loading

ctlong commented Jun 26, 2023

Benjamintf1 commented Jul 5, 2023

nicklas-dohrn commented Mar 27, 2024

ctlong commented May 3, 2024

ctlong commented May 3, 2024

nicklas-dohrn commented May 6, 2024 • edited Loading

ctlong commented May 6, 2024

nicklas-dohrn commented May 7, 2024

ctlong commented May 7, 2024

ctlong commented Jun 24, 2024

nicklas-dohrn commented Jun 25, 2024

ctlong commented Sep 5, 2024

bgoerzig commented Jun 21, 2023 •

edited

Loading

bgoerzig commented Jun 21, 2023 •

edited

Loading

ctlong commented Jun 21, 2023 •

edited

Loading

chombium commented Jun 22, 2023 •

edited

Loading

stefanlay commented Jun 26, 2023 •

edited

Loading

nicklas-dohrn commented May 6, 2024 •

edited

Loading