
Add byte-based ingestion limits to the queue and output #39776

Draft · faec wants to merge 121 commits into main

Conversation

@faec (Contributor) commented May 30, 2024

Allow the memory queue's size to be specified in bytes rather than event count. Add bulk_max_bytes to the Elasticsearch output config, to specify ingest request sizes in bytes.

The main technical difficulties in this change were:

  • dynamically growing the size of the memory queue's circular buffer, since there is no longer a hard limit on its length. This is now implemented with a new circularBuffer helper that handles the index arithmetic as the buffer grows, so that event indices can be used unchanged no matter the buffer's current size (a sketch of this index arithmetic follows the list).
  • letting incoming insert requests accumulate when they need to block. This involved linked-list boilerplate so similar to existing helpers that I merged them into a generic FIFO helper.
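
As a rough sketch of the circularBuffer idea (illustrative only, assuming a simple doubling policy; the names and details below are not taken from the PR): an entry's absolute index i always maps to slot i % len(buf), and the growth step copies each live entry to the slot its unchanged index maps to in the larger buffer.

// Minimal sketch of a growable ring buffer addressed by absolute entry
// index. Names and the doubling policy are illustrative, not the PR's code.
type circularBuffer struct {
	buf   []any
	first int // absolute index of the oldest stored entry
	count int // number of entries currently stored
}

// newCircularBuffer creates a buffer with a nonzero initial size.
func newCircularBuffer(initialSize int) *circularBuffer {
	return &circularBuffer{buf: make([]any, initialSize)}
}

// entry returns the slot for absolute index i; callers can keep using the
// same index before and after the buffer grows.
func (cb *circularBuffer) entry(i int) *any {
	return &cb.buf[i%len(cb.buf)]
}

// add stores value at the next absolute index, doubling the buffer when full.
// Growth copies each live entry to the slot its unchanged index maps to in
// the larger buffer, so existing indices remain valid.
func (cb *circularBuffer) add(value any) int {
	if cb.count == len(cb.buf) {
		grown := make([]any, 2*len(cb.buf))
		for i := cb.first; i < cb.first+cb.count; i++ {
			grown[i%len(grown)] = cb.buf[i%len(cb.buf)]
		}
		cb.buf = grown
	}
	i := cb.first + cb.count
	cb.buf[i%len(cb.buf)] = value
	cb.count++
	return i
}

Because indices stay stable across growth, producers and consumers can keep handing around plain entry indices without caring how large the buffer currently is.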

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@faec added the enhancement and Team:Elastic-Agent-Data-Plane labels on May 30, 2024
@faec self-assigned this on May 30, 2024
@botelastic (bot) added and then removed the needs_team label on May 30, 2024
mergify bot (Contributor) commented May 30, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
If so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch, where /d is the version digit

mergify bot (Contributor) commented May 31, 2024

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See the documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b queue-byte-limits upstream/queue-byte-limits
git merge upstream/main
git push upstream queue-byte-limits

@faec marked this pull request as ready for review on June 18, 2024 22:03
@faec requested a review from a team as a code owner on June 18, 2024 22:03
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@pierrehilbert requested review from leehinman and removed the review request for rdner on June 19, 2024 06:41
@leehinman (Contributor) left a comment

I still need to do a thorough review, but a couple of test cases came to mind, and I don't think we currently have tests covering them (a rough sketch follows the list):

  1. the queue is exactly full (from a byte perspective) and we try to add another event
  2. the queue is empty and we try to add an event that is larger than the queue can hold
  3. the queue is partially full and we try to add an event that would put us over the limit
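
A minimal, self-contained sketch of those three cases; it uses a tiny byteBudget stand-in for the queue's byte accounting rather than the real memqueue API, and assumes the intended behavior is to refuse the insert (the real tests would exercise the queue itself and pin down the actual semantics):

import "testing"

// byteBudget is a toy model of the queue's byte accounting, used only to
// spell out the boundary cases; it is not the queue implementation.
type byteBudget struct{ used, limit int }

// tryAdd accepts an event of n bytes only if it fits under the limit.
func (b *byteBudget) tryAdd(n int) bool {
	if b.used+n > b.limit {
		return false
	}
	b.used += n
	return true
}

func TestByteLimitBoundaries(t *testing.T) {
	// 1. Exactly full: any further event must be refused.
	full := byteBudget{used: 1024, limit: 1024}
	if full.tryAdd(1) {
		t.Error("expected add to fail on an exactly full queue")
	}

	// 2. Empty queue, single event larger than the whole limit.
	empty := byteBudget{limit: 1024}
	if empty.tryAdd(2048) {
		t.Error("expected an oversized event to be rejected")
	}

	// 3. Partially full, event would push usage over the limit.
	partial := byteBudget{used: 700, limit: 1024}
	if partial.tryAdd(400) {
		t.Error("expected add to fail when it would exceed the limit")
	}
}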

@cmacknz (Member) commented Jun 19, 2024

I am also interested in seeing if there are any performance changes for these two scenarios:

  1. A baseline comparison of this branch vs. the commit on main it was branched from, when using event-based parameters.
  2. A build from this branch configured to use event-based parameters vs. the same build configured to use byte-based parameters.

@@ -80,12 +81,12 @@ func NetworkClients(netclients []NetworkClient) []Client {
 // The first argument is expected to contain a queue config.Namespace.
 // The queue config is passed to assign the queue factory when
 // elastic-agent reloads the output.
-func SuccessNet(cfg config.Namespace, loadbalance bool, batchSize, retry int, encoderFactory queue.EncoderFactory, netclients []NetworkClient) (Group, error) {
+func SuccessNet(cfg config.Namespace, loadbalance bool, batchEvents, batchBytes, retry int, encoderFactory queue.EncoderFactory, netclients []NetworkClient) (Group, error) {
@cmacknz (Member) commented Jun 19, 2024

Is it worth defining a wrapper struct for batchEvents and batchBytes and passing through a copy of that so that nobody can ever accidentally reverse them in the argument list anywhere they are passed together like this?
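
A minimal sketch of that suggestion (the struct and field names here are illustrative, not from the PR):

// batchLimits bundles the two batch-sizing parameters so that call sites
// can't accidentally swap them in a long argument list.
type batchLimits struct {
	Events int // maximum number of events per batch; 0 means no event limit
	Bytes  int // maximum encoded size per batch in bytes; 0 means no byte limit
}

// SuccessNet (and similar functions) would then take a single batchLimits
// value instead of two adjacent ints.

Passing the struct by value keeps call sites roughly as cheap as passing two ints while making the meaning of each field explicit.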


func (c *config) Validate() error {
	if c.MaxGetRequest > c.Events {
		return errors.New("flush.min_events must be less events")
	if c.Bytes != nil && *c.Bytes < minQueueBytes {
Member

The validation and configuration could be unit tested; it is now more complex than before.

if broker.useByteLimits() {
	// The queue is using byte limits, start with a buffer of 2^10 and
	// we will expand it as needed.
	eventBufSize = 1 << 10
Member

If your memory no longer has space for the powers of two, this reads bigger than it is. Why not just use 1024 directly?

// The buffer position is the entry's index modulo the buffer size: for
// a queue with buffer size N, the entries stored in buf[0] will have
// entry indices 0, N, 2*N, 3*N, ...
func (l *runLoop) growEventBuffer() {
@cmacknz (Member) commented Jun 19, 2024

This seems reasonable, but it also looks like a good reason to introduce benchmark tests (testing.B).

@faec (Contributor, Author) replied:

Benchmarking growEventBuffer specifically? We could do that, although (since buffer growth is one-way) we expect this to be called a roughly constant number of times on any run. (If the settings and event sizes are such that the queue needs to store a million events simultaneously, that's still only 10 calls to growEventBuffer over the lifetime of the program; the main bottlenecks are still in actually processing the events.)

Member

Not necessarily growEventBuffer; I just wanted us to stop and evaluate whether there is any place where microbenchmarking would be beneficial and save us time going through iterations of the end-to-end performance framework. If the answer is that they aren't beneficial, that's fine.

I think in this case the most impactful choice is probably the starting size of the buffer, since that defines how many initial doublings need to happen to reach whatever the steady state is.
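
For what it's worth, a self-contained sketch of such a micro-benchmark (it exercises the double-and-copy pattern on a plain slice for a few starting sizes rather than calling the unexported growEventBuffer; the target and starting sizes are arbitrary):

import (
	"fmt"
	"testing"
)

func BenchmarkBufferGrowth(b *testing.B) {
	const target = 1 << 20 // steady-state capacity to reach
	for _, start := range []int{1 << 7, 1 << 10, 1 << 13} {
		b.Run(fmt.Sprintf("start=%d", start), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				buf := make([]int, start)
				// Repeat the double-and-copy step until the target is
				// reached, mirroring the work a growing ring buffer does.
				for len(buf) < target {
					grown := make([]int, 2*len(buf))
					copy(grown, buf)
					buf = grown
				}
			}
		})
	}
}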

@cmacknz (Member) commented Jul 19, 2024

As discussed, exposing these configurations immediately may complicate some of our future plans. We could still merge this while we finalize those plans so the code doesn't rot, but leave the options undocumented and explicitly mark them as technical preview so that we are free to change them if we need to.

@cmacknz (Member) commented Jul 19, 2024

I think if we merge this, any use of the byte-based options should log a warning that the feature is in technical preview.

@pierrehilbert marked this pull request as draft on August 30, 2024 16:39
Labels: enhancement, Team:Elastic-Agent-Data-Plane

4 participants