Add syslog batching implementation #491

nicklas-dohrn · 2024-02-13T08:36:30Z

Description

This is our proposal to implement syslog batching for sending logs via HTTPS.
It includes a switch between the normal one-log-per-request syslog mode and the batching mode.
Batching is enabled by using the https-batch scheme in the drain URL:

https-batch://<your-drain-url>/syslog

If you enable the syslog batching behaviour, it will currently write syslog batches in which the single messages are newline-delimited (\n).
Currently, the batch size is hardwired to around 256 KB, which is already sufficient to speed up throughput by a factor of at least 10.
Making it configurable would be an option, but I have not seen the need so far.
Please let me know what you think of the current approach.
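
A minimal sketch of how a drain URL scheme could select the write mode, and of what a newline-delimited batch body looks like. The names `modeForDrain`, `buildBatch` and `maxBatchBytes` are hypothetical and not taken from this PR; only the https-batch scheme, the newline delimiting and the roughly 256 KB limit come from the description above.

```go
package main

import (
	"bytes"
	"fmt"
	"net/url"
)

// maxBatchBytes mirrors the "around 256kb" limit from the description.
// The exact value is an assumption.
const maxBatchBytes = 256 * 1024

// modeForDrain reports which write mode a drain URL asks for, based on its scheme.
func modeForDrain(drain string) (string, error) {
	u, err := url.Parse(drain)
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "https":
		return "single", nil
	case "https-batch":
		return "batch", nil
	default:
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
}

// buildBatch concatenates already newline-terminated messages until adding the
// next one would exceed maxBatchBytes. It returns the batch body and the number
// of messages it consumed. A single oversized message is still sent on its own.
func buildBatch(msgs [][]byte) ([]byte, int) {
	var buf bytes.Buffer
	n := 0
	for _, m := range msgs {
		if buf.Len() > 0 && buf.Len()+len(m) > maxBatchBytes {
			break
		}
		buf.Write(m)
		n++
	}
	return buf.Bytes(), n
}

func main() {
	mode, _ := modeForDrain("https-batch://drain.example.com/syslog")
	body, n := buildBatch([][]byte{[]byte("first\n"), []byte("second\n")})
	fmt.Printf("mode=%s messages=%d body=%q\n", mode, n, body)
}
```

The host `drain.example.com` is a placeholder. Selecting the mode by scheme keeps existing https drains on the one-log-per-request path unless the operator opts in.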

linux-foundation-easycla · 2024-02-13T08:36:34Z

The committers listed above are authorized under a signed CLA.

✅ login: nicklas-dohrn / name: Nicklas Dohrn (0d7fb2f, 5432918, 2b31603, b1fd1a7, 21162e3, 731bb04, de826d4, 419f450, b78b61d, 8fcac9f, dd4169c, 3a0e43e, 8de9bf2, ae18981)

ctlong

In general, it looks fine to me. I don't see where it adds the newline character to delimit between syslog lines though...

Would love to see a demo at the next ARP WG meeting.

(I sort of disregarded that this was a POC at points in time and some of my comments are more implementation-focused, sorry about that 😅 )

src/pkg/egress/syslog/https.go

nicklas-dohrn · 2024-04-11T07:59:39Z

The newline is already part of the syslog messages; it is added beforehand by a method (linked for anyone curious):
msg := appendNewline(removeNulls(env.GetLog().Payload))

This is true for all possible syslog messages, so I do not even need to add it myself, which is really convenient.
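
For readers without the linked source at hand, the two helpers on that line could look roughly like this. This is a sketch of the assumed behaviour (strip NUL bytes, then guarantee a trailing newline); the function bodies are not copied from the repository.

```go
package main

import (
	"bytes"
	"fmt"
)

// removeNulls drops NUL bytes from the payload (assumed behaviour).
func removeNulls(msg []byte) []byte {
	return bytes.ReplaceAll(msg, []byte{0}, nil)
}

// appendNewline makes sure the message ends with "\n" so that messages can be
// concatenated into a batch without an extra delimiter (assumed behaviour).
func appendNewline(msg []byte) []byte {
	if !bytes.HasSuffix(msg, []byte("\n")) {
		msg = append(msg, '\n')
	}
	return msg
}

func main() {
	fmt.Printf("%q\n", appendNewline(removeNulls([]byte("hello\x00 world"))))
}
```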

nicklas-dohrn · 2024-04-11T08:29:17Z

Addressed all the comments and additions by @ctlong above. If that is sufficient, please close the threads :)

ctlong

I've still some specific concerns, which I've left as comments in this review.

In general, the implementation looks fine, though I'm not sure that I understand the necessity of the new TriggerTimer struct.

src/pkg/egress/syslog/triggerTimer.go

src/pkg/egress/syslog/https.go

ctlong · 2024-04-17T17:29:41Z

@nicklas-dohrn can you please sign the CLA. We can't merge this unless you've done so.

chombium · 2024-04-17T18:52:56Z

@ctlong I will take care of the CLA. @nicklas-dohrn has to be added to one of our GitHub orgs.

ctlong

Conceptually, I think this proof of concept is correct. Implementation-wise the timer still has some issues.

Once those are fixed, I would suggest rebasing this off #573 and testing the two changes together to see if it achieves the throughput you want. Then we're all ready for a real implementation (with tests).

🙏 Could you please also update the PR description, thanks.

src/pkg/egress/syslog/https_batch.go

src/pkg/egress/syslog/https_batch_test.go

src/pkg/egress/syslog/syslog_connector.go

src/pkg/egress/syslog/https_batch.go

nicklas-dohrn · 2024-06-26T04:57:22Z

I reimplemented the changes using a similar approach to what @ctlong proposed.
This makes the change much more concise and also gets rid of all the data races reported by go test -race.
There are some race reports left, but these only refer to the implementation of the tests themselves, which contain data races by design.
@ctlong and @acrmp, if your concerns above are addressed, please close everything that is done, so we can keep this organised.

nicklas-dohrn · 2024-08-05T10:45:47Z

I did some extensive testing of the current and the new approach for syslog batching, sending from our dev CF landscape with 4 Diego cells and 4 Loggregator agents to a cls instance.
I also tested #573 (HTTPS drains reuse and release fasthttp).
There were no major throughput improvements to be seen; CPU consumption of both the new and the old version is minimal because the workload is network-bound.
There is some good news concerning batching: it considerably speeds up throughput and also reduces drops.
(see attached table for information)
"Concurrent" refers to a version of the https drain that I modified with a goroutine to allow usage of more than one CPU.
The current approach would increase the throughput considerably, but it results in a new issue:
If multiple applications on one Diego cell bind to a cls instance, they share the throughput constraints of one CPU (I tested that this is indeed the behaviour).
This leads to bottlenecks on bigger CF landscapes, as these use much bigger Diego cells and consequently bigger Loggregator instances with more CPU cores, so this approach does not scale there at all.
@ctlong and @acrmp,
I would like to hear your thoughts on this issue, and how we can proceed to make this work at scale.
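
To illustrate what spreading sends over more than one CPU can mean in general, here is a small worker-pool sketch. It is not the modified drain from the tests above; `sendAll`, the worker count and the `send` callback are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// sendAll distributes batches across a fixed number of worker goroutines and
// returns how many batches were sent without error.
func sendAll(batches [][]byte, workers int, send func([]byte) error) int {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan []byte)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		sent int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range jobs {
				if send(b) == nil {
					mu.Lock()
					sent++
					mu.Unlock()
				}
			}
		}()
	}
	for _, b := range batches {
		jobs <- b
	}
	close(jobs)
	wg.Wait()
	return sent
}

func main() {
	batches := [][]byte{[]byte("a\n"), []byte("b\n"), []byte("c\n")}
	n := sendAll(batches, 2, func([]byte) error { return nil })
	fmt.Println("sent", n)
}
```

Note that concurrent sends give up the ordering of batches, which may or may not be acceptable for a given drain.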

juergen-walter · 2024-08-05T11:12:53Z

@ctlong and @acrmp, we have many customers suffering from and complaining about log drops. We would highly appreciate it if this PR could be finalized and merged in a timely manner. We appreciate your efforts so far. Best regards.

chombium

Generally it looks fine, I found two little things.

I will wait on @ctlong for his review

src/pkg/egress/syslog/https_batch_test.go

src/pkg/egress/syslog/https_test.go

src/pkg/egress/syslog/https.go

This is a new approach to switching between HTTP and HTTP batching. It differs from the previous attempts only in this regard; apart from this change, it contains only refactorings.

nicklas-dohrn · 2024-09-12T04:51:12Z

I rebased off the current main of loggregator agent, and adapted all changes to work with the HTTPS batching.

ctlong · 2024-09-12T17:00:11Z

@nicklas-dohrn the tests and linting are still failing.

Also, I'm not sure what's going on with the history of this branch but the full PR does not reflect the latest commit you added to adopt the changes I'd suggested. Like if you go to Files Changed tab or pull the target branch they're not there 🤔

nicklas-dohrn · 2024-09-16T05:09:04Z

Will try getting it to work now.

nicklas-dohrn · 2024-09-16T14:32:49Z

I got all the tests to work again. The failures were only due to settings that I had changed for testing purposes and simply forgot to change back.

nicklas-dohrn · 2024-09-16T14:37:16Z

The linting errors turned up a real issue:
The write function for the batching implementation cannot return an error because of two issues with the changed logic:

1. Returning a send error for async sends would need a callback function.
2. Even if returning it worked, the error would be for the batch and not for the single message to which it would be returned, breaking the retry logic.

This is unsolvable in the current architecture; inverting the order of batch creation and retry logic would be the much better approach. That is difficult because the batches are created before the envelopes are unmarshalled.
This also turned up an issue with the current retry implementation, which I raised in a different issue to keep topics on track.

@ctlong what is your opinion on this issue?

P.S.: I could disable it by appending //nolint:errcheck, but this does not seem like an unnecessary error check, as it triggers retry attempts.

ctlong · 2024-09-17T22:41:09Z

I see what you mean and I agree that finding a way to return an error does seem like something that should be added eventually. However, how badly do you want retries and error logs on write failures? If you want to disable this error with a TODO to come back and refactor it, I'm willing to approve that to get this change through since you've been waiting a long time for it.

I don't think the "right" fix is very straightforward unfortunately. It seems like this writer should either have its own retry and logging functionality, or else inline batch writes similar to how you had them before. The former approach would be more complex while the latter approach comes with the obvious downside of envelopes potentially never being sent if there isn't a constant stream of them. What do you think? Maybe you have a better idea?

nicklas-dohrn · 2024-09-18T12:51:50Z

For now, I would implement the retry directly within the HTTP batching code as a temporary solution, omitting the error reporting to the retry writer and essentially making it inert.
That would lead to the desired behaviour without making too many changes in other parts of the program.
I still need to try out the effects of retries, as the current implementation's handling of error cases might cause issues for throughput.

nicklas-dohrn · 2024-09-26T07:16:23Z

Hey @ctlong,
I was finally able to create a final draft, so I would like to merge the current state, and I will of course implement any remarks you have.

I opened an issue to discuss the state of the retry stack, so that this PR is decoupled from it, as it is a different problem to solve:
#613

ctlong · 2024-09-30T21:36:09Z

@nicklas-dohrn this PR is still failing linting and unit tests

nicklas-dohrn · 2024-10-01T07:32:22Z

Will fix that; I thought it had been fixed by the additions I made.

Addressed

nicklas-dohrn mentioned this pull request Mar 27, 2024

Allow to send multiple log messages in a single HTTP request #332

Closed

nicklas-dohrn marked this pull request as ready for review April 4, 2024 05:19

nicklas-dohrn requested a review from a team as a code owner April 4, 2024 05:19

ctlong requested changes Apr 9, 2024

src/pkg/egress/syslog/https.go (outdated)

src/pkg/egress/syslog/https.go (outdated)

src/pkg/egress/syslog/https.go (outdated)

src/pkg/egress/syslog/https.go (outdated)

acrmp reviewed Apr 10, 2024

src/pkg/egress/syslog/https.go (outdated)

nicklas-dohrn requested a review from ctlong April 11, 2024 08:28

ctlong requested changes Apr 17, 2024

src/pkg/egress/syslog/triggerTimer.go (outdated)

src/pkg/egress/syslog/https.go (outdated)

ctlong reviewed May 7, 2024

src/pkg/egress/syslog/https_batch.go (outdated)

src/pkg/egress/syslog/https_batch.go

src/pkg/egress/syslog/https_batch.go (outdated)

nicklas-dohrn force-pushed the main branch 3 times, most recently from c937231 to 21666c8 Compare May 12, 2024 06:32

ctlong requested changes Jun 17, 2024

src/pkg/egress/syslog/https_batch.go

src/pkg/egress/syslog/https_batch.go

src/pkg/egress/syslog/https_batch.go

nicklas-dohrn requested a review from ctlong June 20, 2024 04:47

nicklas-dohrn force-pushed the main branch from 2474ffb to 3c9934f Compare June 20, 2024 05:25

acrmp previously requested changes Jun 22, 2024

nicklas-dohrn force-pushed the main branch from 3073a80 to 519dda6 Compare June 26, 2024 04:49

nicklas-dohrn requested a review from acrmp June 26, 2024 04:57

nicklas-dohrn force-pushed the main branch from 3a62859 to 52fbfbe Compare July 17, 2024 15:00

nicklas-dohrn changed the title from "Add syslog batching poc implementation" to "Add syslog batching implementation" Aug 5, 2024

chombium previously requested changes Aug 15, 2024

src/pkg/egress/syslog/https_batch_test.go (outdated)

src/pkg/egress/syslog/https_test.go (outdated)

nicklas-dohrn requested a review from chombium August 21, 2024 05:02

ctlong reviewed Sep 5, 2024

src/pkg/egress/syslog/https.go

src/pkg/egress/syslog/https.go (outdated)

nicklas-dohrn added 5 commits September 12, 2024 06:45

Refactor approach to use different protocol instead of parameter.

b1fd1a7

This is a new approach to switching between HTTP and HTTP batching. It differs from the previous attempts only in this regard; apart from this change, it contains only refactorings.

Fix trigger timer issues

de826d4

Add tests and fix test related issues

21162e3

Change batch dispatch implementation

419f450

Rebase and adopt changes from upstream

8fcac9f

nicklas-dohrn force-pushed the main branch from f010a31 to 8fcac9f Compare September 12, 2024 04:49

nicklas-dohrn requested a review from ctlong September 12, 2024 05:04

nicklas-dohrn added 3 commits September 16, 2024 15:08

Reset sending interval for https_batching to 1 second

ae18981

Fix remarks from @chombium

8de9bf2

Rollback httpClient changes and address feedback

dd4169c

Added Error messages and cleaned up some code

3a0e43e

nicklas-dohrn force-pushed the main branch from a0f80f0 to 3a0e43e Compare September 26, 2024 07:26

Fix testing and linting in tests

b78b61d

ameowlia removed the request for review from acrmp October 3, 2024 14:34

ctlong approved these changes Oct 3, 2024

ctlong assigned ctlong and unassigned nicklas-dohrn Oct 3, 2024

ctlong merged commit 75f9a92 into cloudfoundry:main Oct 3, 2024
6 checks passed
