Flush queues on collector shutdown #125

Open

maxamins wants to merge 2 commits into main from macxamin/flush_queues
Conversation

maxamins (Collaborator):

  1. When shutdown is initiated, stop more messages from being sent.
  2. Drain the shards and send the batches in under 15 seconds (see the sketch below).
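
For orientation, here is a minimal, self-contained sketch of the two steps above, under stated assumptions: the exporter type, its queue field, and the Shutdown method below are illustrative stand-ins, not the actual pkg/export types. It rejects new data once shutdown starts and then drains whatever is buffered within a 15-second budget.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

type exporter struct {
	closed atomic.Bool   // set once shutdown starts so Export drops new data
	queue  chan []string // stands in for the per-shard sample buffers
}

// Export enqueues a batch unless shutdown has already begun.
func (e *exporter) Export(batch []string) {
	if e.closed.Load() {
		return // step 1: stop more messages from being sent
	}
	e.queue <- batch
}

// Shutdown drains whatever is still buffered, bounded by a 15-second budget.
func (e *exporter) Shutdown() {
	e.closed.Store(true)
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	for {
		select {
		case batch := <-e.queue:
			fmt.Println("sending final batch of", len(batch), "samples")
		case <-ctx.Done():
			return // time budget exhausted; give up on the rest
		default:
			return // step 2: queue fully drained
		}
	}
}

func main() {
	e := &exporter{queue: make(chan []string, 16)}
	e.Export([]string{"up", "scrape_duration_seconds"})
	e.Shutdown()
	e.Export([]string{"dropped"}) // ignored once shutdown has begun
}
```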

@maxamins maxamins force-pushed the macxamin/flush_queues branch 6 times, most recently from 51048be to 828c352 Compare January 25, 2022 05:00
@maxamins maxamins requested a review from fabxc January 25, 2022 05:03
@andysim3d andysim3d added the bug Something isn't working label Mar 23, 2022
@pintohutch (Collaborator):

What's the status on this PR?

@maxamins maxamins force-pushed the macxamin/flush_queues branch 2 times, most recently from ad99eb4 to df56956 Compare May 25, 2022 06:03
@maxamins maxamins requested a review from pintohutch May 25, 2022 06:04
pkg/export/export_test.go (review thread outdated, resolved)
@maxamins maxamins force-pushed the macxamin/flush_queues branch 2 times, most recently from c14aba0 to 29e9302 Compare May 25, 2022 21:55
@pintohutch (Collaborator):

Aside from rebasing, we'll want to benchmark this change to catch any performance regressions. cc @saketjajoo

@maxamins maxamins force-pushed the macxamin/flush_queues branch from 29e9302 to f8cbdad Compare July 17, 2023 20:37
@github-actions github-actions bot requested a review from shishichen July 17, 2023 20:37
@maxamins maxamins force-pushed the macxamin/flush_queues branch from f8cbdad to 326d4a9 Compare July 18, 2023 22:01
1. When shutdown is initiated stop more messages from being sent
2. Drain the shards and send the batches in under 15 seconds
@ridwanmsharif ridwanmsharif requested review from bwplotka and removed request for fabxc and shishichen September 30, 2024 18:09
@ridwanmsharif (Collaborator):

@pintohutch @bwplotka PTAL. GoogleCloudPlatform/prometheus#203 has some benchmarking results, but spoilers: there is no discernible change in performance, which makes sense because this code only runs on shutdown.

@pintohutch (Collaborator):

That's great to hear! I'll defer to @bwplotka as the SME to review.

@pintohutch pintohutch removed their request for review October 3, 2024 21:34
// This avoids data loss on shutdown.
cancelTimeout = 15 * time.Second
// Time after the final shards are drained before the exporter is closed on shutdown.
flushTimeout = 100 * time.Millisecond
Collaborator:

is 100ms making any difference?

for {
select {
// NOTE(freinartz): we will terminate once context is cancelled and not flush remaining
// buffered data. In-flight requests will be aborted as well.
// This is fine once we persist data submitted via Export() but for now there may be some
// data loss on shutdown.
case <-e.ctx.Done():
// on termination, try to drain the remaining shards within the CancelTimeout.
Collaborator:

Suggested change:
- // on termination, try to drain the remaining shards within the CancelTimeout.
+ // On termination, try to drain the remaining shards within the CancelTimeout.

}
}
}

for {
select {
// NOTE(freinartz): we will terminate once context is cancelled and not flush remaining
Collaborator:

Let's update this commentary; it's incorrect with the new logic 🤗

@@ -765,13 +783,61 @@ func (e *Exporter) Run() error {
curBatch = newBatch(e.logger, opts.Efficiency.ShardCount, opts.Efficiency.BatchSize)
}

// Try to drain the remaining data before exiting, or until the time limit (15 seconds) expires.
// A sleep timer is added after draining the shards to ensure the data has time to be sent.
drainShardsBeforeExiting := func() {
Collaborator:

nit: Not a must, but it might be easier to iterate on if we move send and drain to separate Exporter private methods 🤔
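
For illustration only, a rough sketch of what that split could look like, with drain and send as separate private methods; the sample type, sendBatch, drainShards, and their bodies here are assumptions rather than the PR's actual code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type sample struct{ value float64 }

// Exporter is a stand-in for pkg/export's Exporter, with just enough state
// to show the suggested split between draining and sending.
type Exporter struct {
	shards []chan sample
}

// sendBatch ships one batch; the real code would call the GCM write API here.
func (e *Exporter) sendBatch(ctx context.Context, batch []sample) error {
	_ = ctx
	fmt.Println("sent batch of", len(batch))
	return nil
}

// drainShards empties every shard into a batch and sends it, stopping early
// if ctx expires (e.g. the 15s cancelTimeout used on shutdown).
func (e *Exporter) drainShards(ctx context.Context) error {
	for _, sh := range e.shards {
		var batch []sample
	loop:
		for {
			select {
			case s := <-sh:
				batch = append(batch, s)
			case <-ctx.Done():
				return ctx.Err()
			default:
				break loop
			}
		}
		if len(batch) == 0 {
			continue
		}
		if err := e.sendBatch(ctx, batch); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	e := &Exporter{shards: []chan sample{make(chan sample, 4)}}
	e.shards[0] <- sample{value: 1}
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	_ = e.drainShards(ctx) // Run() would call this from its shutdown path
}
```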

@bwplotka (Collaborator) left a comment:

Thanks!

Some suggestions might require some work, so let me know if this is something you have time to do, or we accept some imperfection/code spaghetti.

The main issue for me is proceeding with the send work while reusing e.ctx, which was already cancelled. 🤔

for {
select {
// NOTE(freinartz): we will terminate once context is cancelled and not flush remaining
// buffered data. In-flight requests will be aborted as well.
// This is fine once we persist data submitted via Export() but for now there may be some
// data loss on shutdown.
case <-e.ctx.Done():
// on termination, try to drain the remaining shards within the CancelTimeout.
// This is done to prevent data loss during a shutdown.
drainShardsBeforeExiting()
Collaborator:

Are we sure this works? All sends normally take the e.ctx context, so they will be cancelled? I assume here you (intentionally?) accept that and drain the buffer with new sends using a separate context.

Would it be cleaner to have a custom, separate context from the beginning? 🤔
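
A small runnable sketch of the separate-context idea: the already-cancelled context below stands in for e.ctx, and the drain gets a fresh context with its own 15-second deadline so the final sends are not aborted (variable names are illustrative).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Simulate the situation in the comment above: the exporter's main
	// context (e.ctx) is already cancelled once shutdown starts.
	ectx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println("send bound to e.ctx would fail:", ectx.Err()) // context.Canceled

	// A separate drain context, not derived from e.ctx, stays usable for up
	// to the drain budget (15s in the PR), so the final sends can complete.
	drainCtx, stop := context.WithTimeout(context.Background(), 15*time.Second)
	defer stop()
	fmt.Println("send bound to drainCtx is fine:", drainCtx.Err()) // <nil>
}
```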

Collaborator:

In fact, we always use e.ctx even in drain if I'm reading this correctly. Are we sure this code works? 🤔

}
}
if totalRemaining == 0 && !pending {
// NOTE(ridwanmsharif): the sending of the batches happens asynchronously
Collaborator:

Yea, can we improve this and get that result state from send (or create a separate send)?

}
}
}()
for {
Collaborator:

I would suggest we don't spawn another goroutine and instead do it all in this for loop? Send is async anyway?
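
A minimal sketch of handling shutdown inline in the run loop rather than in an extra goroutine; runLoop, sendAsync, and the channel plumbing here are assumed names for illustration, and sends remain asynchronous as they are in the PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sendAsync stands in for the exporter's asynchronous batch send.
func sendAsync(batch []int) {
	go fmt.Println("sent batch of", len(batch), "samples")
}

// runLoop batches incoming samples and, on cancellation, flushes whatever is
// pending directly in the same loop, with no dedicated drain goroutine.
func runLoop(ctx context.Context, in <-chan int) {
	var batch []int
	for {
		select {
		case v := <-in:
			batch = append(batch, v)
			if len(batch) >= 3 {
				sendAsync(batch)
				batch = nil
			}
		case <-ctx.Done():
			if len(batch) > 0 {
				sendAsync(batch) // final flush handled inline
			}
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	in := make(chan int, 8)
	in <- 1
	in <- 2
	go func() { time.Sleep(10 * time.Millisecond); cancel() }()
	runLoop(ctx, in)
	time.Sleep(50 * time.Millisecond) // give the async send time to print
}
```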

@@ -548,15 +561,15 @@ func (e *Exporter) SetLabelsByIDFunc(f func(storage.SeriesRef) labels.Labels) {

// Export enqueues the samples and exemplars to be written to Cloud Monitoring.
func (e *Exporter) Export(metadata MetadataFunc, batch []record.RefSample, exemplarMap map[storage.SeriesRef]record.RefExemplar) {
if e.opts.Disable {
gcmExportCalledWhileDisabled.Inc()
Collaborator:

That's unrelated to the main PR goal. I see Max did this; let's at least update the description.

// This is done to prevent data loss during a shutdown.
drainShardsBeforeExiting()
// This channel is only used by unit tests.
e.exitc <- struct{}{}
Collaborator:

Ideally we don't have testing-specific code in the critical, production flow. However, IF we make sure sending propagates some result back to us, we could have nice production logic that emits a log line telling us everything was flushed successfully. Can we do this?
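
A hedged sketch of that idea: the final sends report their outcome back so shutdown can log whether everything was flushed, rather than writing to a test-only channel; flushAll and its signature are assumptions, not the PR's actual API.

```go
package main

import (
	"log"
	"sync"
)

type batch []float64

// flushAll sends every remaining batch asynchronously and collects per-batch
// errors so the caller can tell whether the shutdown flush fully succeeded.
func flushAll(batches []batch, send func(batch) error) []error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, b := range batches {
		wg.Add(1)
		go func(b batch) {
			defer wg.Done()
			if err := send(b); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(b)
	}
	wg.Wait()
	return errs
}

func main() {
	send := func(b batch) error { return nil } // stand-in for the GCM write
	batches := []batch{{1, 2}, {3}}
	if errs := flushAll(batches, send); len(errs) == 0 {
		// A production-friendly signal instead of a test-only channel write.
		log.Print("all remaining batches flushed successfully on shutdown")
	} else {
		log.Printf("shutdown flush incomplete: %d batches failed", len(errs))
	}
}
```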
