Compute WAL size and use it during retention size checks #5886

Merged 6 commits into prometheus:master from dipack95:wal_size on Nov 12, 2019
Conversation

@dipack95 (Contributor) commented Aug 13, 2019

Signed-off-by: Dipack P Panjabi [email protected]

Closes: #5771 (storage.tsdb.retention.size exceeds storage)

Compute the size of the WAL and use it while calculating if exceeding the max retention size limit.

(Port of prometheus-junkyard/tsdb#651)
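In essence, the change makes size-based retention consider the WAL directory in addition to the persisted blocks. A minimal sketch of the decision (with illustrative names, not the exact identifiers merged in this PR):

```go
// Illustrative sketch only; blocksSize, walSize and maxBytes are placeholder
// names, not the identifiers used in the actual change.
package main

// exceedsSizeRetention reports whether the on-disk footprint, now including
// the WAL, is over the configured --storage.tsdb.retention.size limit.
func exceedsSizeRetention(blocksSize, walSize, maxBytes int64) bool {
	if maxBytes <= 0 { // size-based retention disabled
		return false
	}
	// Before this change, only blocksSize was compared against the limit.
	return blocksSize+walSize > maxBytes
}
```

When the limit is exceeded, the oldest blocks are dropped as before; the WAL itself is never deleted by size-based retention, it simply counts toward the total.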

@krasi-georgiev self-assigned this on Aug 22, 2019
@krasi-georgiev (Contributor) commented Aug 22, 2019

Unless someone has a better idea, this LGTM after the conflict resolution.

@dipack95 (Contributor, Author)

@krasi-georgiev I've corrected the merge conflicts!

@krasi-georgiev (Contributor)

LGTM

ping @bwplotka @codesome

@codesome (Member)

Sorry, I have not been following PRs lately.

I had only one comment on the old PR:

> I am wondering if it would error out if we are calculating the size while files are being added or removed

(Resolved review thread on tsdb/db.go)
@dipack95 (Contributor, Author)

@codesome In your scenario, are you referring to when WAL chunks are added/removed? If so, I tested for that particular scenario, and it does not crash.

@codesome (Member)

> WAL chunks are added/removed

Not chunks, but when new files are being added/removed (mostly the case of removing files while we are already iterating over the items in the directory).

@dipack95 (Contributor, Author)

From what I can tell, the size computation logic is a blocking call, so no files are added/removed while it iterates over the contents of the wal/ dir.
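For reference, a directory-size computation of this kind usually looks like the sketch below (a minimal approximation, not the helper actually added by this PR):

```go
package main

import (
	"os"
	"path/filepath"
)

// walDirSize sums the sizes of all regular files under the wal/ directory.
// Note that filepath.Walk iterates over a snapshot of each directory listing
// it reads; it does not lock the directory against concurrent changes (see
// the discussion that follows).
func walDirSize(dir string) (int64, error) {
	var size int64
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			size += info.Size()
		}
		return nil
	})
	return size, err
}
```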

@krasi-georgiev (Contributor)

@codesome did you double check about this race?

@codesome (Member)

I don't think I will be able to investigate that race right now.

@codesome (Member)

Had a look now; the possibility of a race is very slim here and not critical. There won't be any issue with the deletion of segments, as it doesn't happen while the size is being computed (the size of the WAL is checked inside reload(), which doesn't run in parallel with WAL truncation). And the addition of a WAL segment should not be a problem for filepath.Walk (I guess?).

> the size computation logic is a blocking call, so no files are added/removed while it iterates over the contents of the wal/ dir

Having gone through the source of filepath.Walk, I'm not sure it is blocking with respect to changes on disk. But it's not a problem in this case now.
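To illustrate what filepath.Walk actually does when an entry disappears mid-walk: it reports the failed stat to the callback rather than failing outright, and the callback can choose to skip it. A small self-contained demonstration (not Prometheus code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir, _ := os.MkdirTemp("", "walkdemo")
	defer os.RemoveAll(dir)
	for i := 0; i < 3; i++ {
		os.WriteFile(filepath.Join(dir, fmt.Sprintf("seg%05d", i)), []byte("x"), 0o644)
	}
	first := true
	filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			// The entry vanished between the directory listing and the stat
			// call; tolerate it instead of aborting the size computation.
			fmt.Println("walk error (tolerated):", err)
			return nil
		}
		if first && !info.IsDir() {
			first = false
			// Simulate a concurrent truncation deleting a later segment.
			os.Remove(filepath.Join(dir, "seg00002"))
		}
		return nil
	})
}
```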

@dipack95 (Contributor, Author)

Sorry, I haven't been following the conversation on this PR for a while now; is there a change required in the WAL truncation logic? I can see that @codesome has tested for the race condition during size computation and found that it shouldn't be an issue.

@krasi-georgiev (Contributor)

@dipack95 We just need to add a test that involves checkpointing, and then it's ready for merging.

https://github.com/prometheus/prometheus/pull/5886/files#r327057583

@krasi-georgiev (Contributor) left a review comment

Now that I see it, can't we remove a lot of the code by just using (h *Head) Truncate?

@dipack95 (Contributor, Author)

I suppose so. Considering this was in the WAL test file, I decided to just use the WAL functions directly instead of going through Head or its appenders.

@krasi-georgiev (Contributor)

Can you check whether my suggestion would remove all that boilerplate, then?

@krasi-georgiev (Contributor)

ping @dipack95

@dipack95 (Contributor, Author)

Hey, sorry, I haven't had the time for the last week. I should have time to update this and #5887 in a few days.

@krasi-georgiev (Contributor)

Thanks. Appreciated

@krasi-georgiev (Contributor)

ping @dipack95

@dipack95 (Contributor, Author)

@krasi-georgiev When I create a Head and try to go about it that way, it runs into an import cycle. Looks like we might have to keep the test the way it is right now, ugly as it is!

@krasi-georgiev (Contributor)

Good point, thanks for trying.

Had another look, and I would say that we want to add all this to TestSizeRetention; this way we can ensure that size-based retention works as expected and accounts for the size of the WAL.

@dipack95 (Contributor, Author) commented Oct 29, 2019

Do you mean duplicating the entire test into TestSizeRetention in db_test.go?

@krasi-georgiev (Contributor)

I don't think it needs to be duplicated; just add some more logic to the existing TestSizeRetention. It needs to test that (see the sketch after this list):
- the total DB size (blocks + WAL) matches the expected size (exp == act)
- retention accounts for the WAL size, so that if the blocks + WAL size exceeds the limit, it deletes a block
- when a WAL checkpoint runs, retention still works as expected, and the total DB size after the checkpoint also matches the expected size (exp == act)
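A minimal sketch of the retention decision the second bullet has to exercise (plain Go with made-up sizes, not the actual TestSizeRetention code): the WAL size is added to the running total, but only blocks are ever dropped to get back under the limit.

```go
package main

import "fmt"

// blocksToDelete returns how many of the oldest blocks must be dropped so
// that blocks + WAL fit under maxBytes. blockSizes is ordered oldest first.
func blocksToDelete(blockSizes []int64, walSize, maxBytes int64) int {
	total := walSize
	for _, s := range blockSizes {
		total += s
	}
	n := 0
	for i := 0; i < len(blockSizes) && maxBytes > 0 && total > maxBytes; i++ {
		total -= blockSizes[i]
		n++
	}
	return n
}

func main() {
	// Three 100-byte blocks plus a 50-byte WAL against a 240-byte limit:
	// 350 > 240, dropping one block leaves 250 > 240, dropping two leaves
	// 150 <= 240, so two blocks must be deleted.
	fmt.Println(blocksToDelete([]int64{100, 100, 100}, 50, 240)) // prints 2
}
```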

@dipack95 (Contributor, Author)

@krasi-georgiev That makes sense; I've added the WAL size to the expected size as well.

However, I've noticed that over the course of that test, blocks are written directly to disk instead of going through the Head (and consequently the WAL), so the WAL size is always zero. That is why the test did not fail when this PR's change was introduced.

@krasi-georgiev (Contributor)

Yep, you will need to add some records to the WAL as well. Look at some other tests for ideas.
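One way to do that in a test (a sketch: the tsdb/wal package's New and Log signatures are recalled from that era of the codebase and may not match the repository exactly) is to open a WAL in the DB's wal/ directory and log some opaque records so the directory has a nonzero size:

```go
package main

import (
	"path/filepath"

	"github.com/prometheus/prometheus/tsdb/wal"
)

// growWAL writes a batch of dummy records so that wal/ contributes to the
// total DB size measured by the test.
func growWAL(dbDir string) error {
	w, err := wal.New(nil, nil, filepath.Join(dbDir, "wal"), false)
	if err != nil {
		return err
	}
	defer w.Close()
	for i := 0; i < 100; i++ {
		if err := w.Log(make([]byte, 128)); err != nil {
			return err
		}
	}
	return nil
}
```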

@dipack95 (Contributor, Author) commented Nov 2, 2019

@krasi-georgiev Please take a look, I've updated the test.

tsdb/wal/wal_test.go (outdated review thread, resolved)
Signed-off-by: Dipack P Panjabi <[email protected]>
tsdb/db_test.go (outdated review thread, resolved)
Signed-off-by: Dipack P Panjabi <[email protected]>
@dipack95 (Contributor, Author) commented Nov 6, 2019

@krasi-georgiev Please take a look now.

@krasi-georgiev (Contributor)

LGTM, Thanks

Unless @codesome, @brian-brazil, or anyone else has any other comments, I will merge this in a few days.

@krasi-georgiev merged commit ce7bab0 into prometheus:master on Nov 12, 2019
@dipack95 deleted the wal_size branch on November 12, 2019 02:54
philandstuff added a commit to alphagov/gsp that referenced this pull request Feb 25, 2020
It's good to keep things up to date generally, but there's a specific
thing I want here: this chart upgrade also upgrades prometheus from
2.13.2 to 2.15.2.  Prometheus 2.15 has a feature which considers the
WAL size when calculating how much disk it's using (see
prometheus/prometheus#5886 ).
@life0215

Not sure if it's only me, but it looks like the WAL size is still not being used during retention size checks:

/prometheus $ prometheus --version
prometheus, version 2.17.1 (branch: HEAD, revision: ae041f97cfc6f43494bed65ec4ea4e3a0cf2ac69)
  build user:       root@806b02dfe114
  build date:       20200326-16:18:19
  go version:       go1.13.9
/prometheus $ 
/prometheus $ ps
PID   USER     TIME  COMMAND
    1 1000      8d02 /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --storage.tsdb.retention.size=15GB --config.file=/etc/prometheus/config_out/prometheus.env.yaml --
   79 1000      0:00 sh
   87 1000      0:00 sh
   99 1000      0:00 ps
/prometheus $ du -h -d 1
79.5M   ./01EB7D47PPV49QCNY2MZVJCYTE.tmp
524.5M  ./01EBBR0J3VHNPNBYM6CZXM56T9.tmp
62.1M   ./01EJFJF03NFF5J29VE9JKEE2MS
512.5M  ./01EBBPV9FS1Z0QY4EK9QC7RMQA.tmp
1.2G    ./01EJ1DCVBP2C3XS5WVJE6H85AE
1.0G    ./01EHG166DB6JYRB79VEMWME7Z6
158.7M  ./01EJFJG3Y1CQ3PRRPDQH3NACNQ
512.5M  ./01EBAM8DHP0GAKCWWS6P3F7D0E.tmp
1.1G    ./01EHVM027KCG4CAT3VGZW6VMJ6
516.5M  ./01EBBQ7Q7CG0J99RHM57FE0TF0.tmp
57.5M   ./01EJFSAQGFA3PT9BW8JMDKN824
1.0G    ./01EGYN0SAP6C43V8RF335SNFFW
1.1G    ./01EHA7SGSXJ4CR88GQ4B936SNZ
79.5M   ./01EB7CYVEXJENABPGR037E14P8.tmp
453.4M  ./01EJEXYPAFCM86Y7T5REJXNJHA
8.0K    ./01EBCB2SVA5XVV84EC54GR2S0T.tmp
8.0K    ./01EBBS9M0417RNJVVTCYEK33RH.tmp
1.3G    ./01EJD06W4FTYPFAKYVM7FZNE01
1.1G    ./01EH4EE1ZHCXEWP80NJJJJFSW5
516.5M  ./01EBAM3540NXPQM49TSAPHSDBH.tmp
1.1G    ./01EHNTJTJ7CGDRPSB4H6TA7YE0
516.5M  ./01EBATZ99KJ4PX2FG44TW0SZAE.tmp
1.0G    ./01EGRVM1JBVNQWD1Y7N5FYN4GN
1.9G    ./wal
1.2G    ./01EJ76T4CNYNDX6EJ5KMWXPKMZ
16.8G   .

Please note that we've set --storage.tsdb.retention.size=15GB, but the actual storage usage is 16.8G.

@codesome (Member)

I don't think the *.tmp directories count towards the size calculation. If they are leftovers from earlier errors, they can be deleted. Ideally, we should clean up those temporary directories on startup if any exist.
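A sketch of that kind of startup cleanup (assuming leftover directories use the plain .tmp suffix visible in the du listing above; this is not the actual Prometheus implementation):

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// removeTmpDirs deletes leftover *.tmp block directories under the data
// directory so they neither waste disk nor confuse manual size checks.
func removeTmpDirs(dbDir string) error {
	entries, err := os.ReadDir(dbDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() && strings.HasSuffix(e.Name(), ".tmp") {
			if err := os.RemoveAll(filepath.Join(dbDir, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```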
