Compute WAL size and use it during retention size checks #5886

Merged 6 commits into prometheus:master from dipack95:wal_size on Nov 12, 2019
Conversation

@dipack95 (Contributor) commented Aug 13, 2019

Signed-off-by: Dipack P Panjabi [email protected]

Closes: #5771 (storage.tsdb.retention.size exceeds storage)

Compute the size of the WAL and use it while calculating if exceeding the max retention size limit.

(Port of prometheus-junkyard/tsdb#651)
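In essence, the change makes size-based retention consider the WAL directory in addition to the persisted blocks. A minimal sketch of the decision (with illustrative names, not the exact identifiers merged in this PR):

```go
// Illustrative sketch only; blocksSize, walSize and maxBytes are placeholder
// names, not the identifiers used in the actual change.
package main

// exceedsSizeRetention reports whether the on-disk footprint, now including
// the WAL, is over the configured --storage.tsdb.retention.size limit.
func exceedsSizeRetention(blocksSize, walSize, maxBytes int64) bool {
	if maxBytes <= 0 { // size-based retention disabled
		return false
	}
	// Before this change, only blocksSize was compared against the limit.
	return blocksSize+walSize > maxBytes
}
```

When the limit is exceeded, the oldest blocks are dropped as before; the WAL itself is never deleted by size-based retention, it simply counts toward the total.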

@krasi-georgiev self-assigned this on Aug 22, 2019
@krasi-georgiev (Contributor) commented Aug 22, 2019

Unless someone has a better idea, this LGTM after the conflict resolution.

@dipack95 (Contributor, Author)

@krasi-georgiev I've corrected the merge conflicts!

@krasi-georgiev (Contributor)

LGTM

ping @bwplotka @codesome

@codesome (Member)

Sorry, I have not been following PRs lately.

I had only one comment on the old PR:

> I am wondering if it would error out if we are calculating the size while files are being added or removed

(Resolved review thread on tsdb/db.go)
@dipack95 (Contributor, Author)

@codesome In your scenario, are you referring to when WAL chunks are added/removed? If so, I tested for that particular scenario, and it does not crash.

@codesome (Member)

> WAL chunks are added/removed

Not chunks, but when new files are being added/removed (mostly the case of removing files while we are already iterating over the items in the directory).

@dipack95 (Contributor, Author)

From what I can tell, the size computation logic is a blocking call, so no files are added/removed while it iterates over the contents of the wal/ dir.
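For reference, a directory-size computation of this kind usually looks like the sketch below (a minimal approximation, not the helper actually added by this PR):

```go
package main

import (
	"os"
	"path/filepath"
)

// walDirSize sums the sizes of all regular files under the wal/ directory.
// Note that filepath.Walk iterates over a snapshot of each directory listing
// it reads; it does not lock the directory against concurrent changes (see
// the discussion that follows).
func walDirSize(dir string) (int64, error) {
	var size int64
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			size += info.Size()
		}
		return nil
	})
	return size, err
}
```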

@krasi-georgiev (Contributor)

@codesome did you double check about this race?

@codesome (Member)

I don't think I will be able to investigate that race right now.

@codesome (Member)

Had a look now; the possibility of a race is very slim here and not critical. There won't be any issue with the deletion of segments, as it doesn't happen while the size is being computed (the size of the WAL is checked inside reload(), which doesn't run in parallel with WAL truncation). And the addition of a WAL segment should not be a problem for filepath.Walk (I guess?).

> the size computation logic is a blocking call, so no files are added/removed while it iterates over the contents of the wal/ dir

Having gone through the source of filepath.Walk, I'm not sure it is blocking with respect to changes on disk. But it's not a problem in this case now.
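To illustrate what filepath.Walk actually does when an entry disappears mid-walk: it reports the failed stat to the callback rather than failing outright, and the callback can choose to skip it. A small self-contained demonstration (not Prometheus code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir, _ := os.MkdirTemp("", "walkdemo")
	defer os.RemoveAll(dir)
	for i := 0; i < 3; i++ {
		os.WriteFile(filepath.Join(dir, fmt.Sprintf("seg%05d", i)), []byte("x"), 0o644)
	}
	first := true
	filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			// The entry vanished between the directory listing and the stat
			// call; tolerate it instead of aborting the size computation.
			fmt.Println("walk error (tolerated):", err)
			return nil
		}
		if first && !info.IsDir() {
			first = false
			// Simulate a concurrent truncation deleting a later segment.
			os.Remove(filepath.Join(dir, "seg00002"))
		}
		return nil
	})
}
```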

@dipack95 (Contributor, Author)

Sorry, I haven't been following the conversation on this PR for a while now; is there a change required in the WAL truncation logic? I can see that @codesome has tested for the race condition during size computation and found that it shouldn't be an issue.

@krasi-georgiev (Contributor)

@dipack95 We just need to add a test that involves checkpointing, and then it's ready for merging.

https://github.com/prometheus/prometheus/pull/5886/files#r327057583

@krasi-georgiev (Contributor) left a review comment

Now that I see it, can't we remove a lot of the code by just using (h *Head) Truncate?

@dipack95 (Contributor, Author)

I suppose so. Considering this was in the WAL test file, I decided to just use the WAL functions directly instead of going through Head or its appenders.

@krasi-georgiev (Contributor)

Can you check whether my suggestion would remove all that boilerplate, then?

@krasi-georgiev (Contributor)

ping @dipack95

@dipack95 (Contributor, Author)

Hey, sorry, I haven't had the time for the last week. I should have time to update this and #5887 in a few days.

@krasi-georgiev (Contributor)

Thanks. Appreciated

@krasi-georgiev (Contributor)

ping @dipack95

@dipack95 (Contributor, Author)

@krasi-georgiev When I create a Head and try to go about it that way, it runs into an import cycle. Looks like we might have to keep the test the way it is right now, ugly as it is!

@krasi-georgiev (Contributor)

Good point, thanks for trying.

Had another look, and I would say that we want to add all this to TestSizeRetention; this way we can ensure that size-based retention works as expected and accounts for the size of the WAL.

@dipack95 (Contributor, Author) commented Oct 29, 2019

Do you mean duplicating the entire test into TestSizeRetention in db_test.go?

@krasi-georgiev (Contributor)

I don't think it needs to be duplicated; just add some more logic to the existing TestSizeRetention. It needs to test that (see the sketch after this list):
- the total DB size (blocks + WAL) matches the expected size (exp == act)
- retention accounts for the WAL size, so that if the blocks + WAL size exceeds the limit, it deletes a block
- when a WAL checkpoint runs, retention still works as expected, and the total DB size after the checkpoint also matches the expected size (exp == act)
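A minimal sketch of the retention decision the second bullet has to exercise (plain Go with made-up sizes, not the actual TestSizeRetention code): the WAL size is added to the running total, but only blocks are ever dropped to get back under the limit.

```go
package main

import "fmt"

// blocksToDelete returns how many of the oldest blocks must be dropped so
// that blocks + WAL fit under maxBytes. blockSizes is ordered oldest first.
func blocksToDelete(blockSizes []int64, walSize, maxBytes int64) int {
	total := walSize
	for _, s := range blockSizes {
		total += s
	}
	n := 0
	for i := 0; i < len(blockSizes) && maxBytes > 0 && total > maxBytes; i++ {
		total -= blockSizes[i]
		n++
	}
	return n
}

func main() {
	// Three 100-byte blocks plus a 50-byte WAL against a 240-byte limit:
	// 350 > 240, dropping one block leaves 250 > 240, dropping two leaves
	// 150 <= 240, so two blocks must be deleted.
	fmt.Println(blocksToDelete([]int64{100, 100, 100}, 50, 240)) // prints 2
}
```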

@dipack95 (Contributor, Author)

@krasi-georgiev That makes sense; I've added the WAL size to the expected size as well.

However, I've noticed that over the course of that test, blocks are written directly to disk instead of going through the Head (and consequently the WAL), so the WAL size is always zero. That is why the test did not fail when this PR's change was introduced.

@krasi-georgiev (Contributor)

Yep, you will need to add some records to the WAL as well. Look at some other tests for ideas.
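One way to do that in a test (a sketch: the tsdb/wal package's New and Log signatures are recalled from that era of the codebase and may not match the repository exactly) is to open a WAL in the DB's wal/ directory and log some opaque records so the directory has a nonzero size:

```go
package main

import (
	"path/filepath"

	"github.com/prometheus/prometheus/tsdb/wal"
)

// growWAL writes a batch of dummy records so that wal/ contributes to the
// total DB size measured by the test.
func growWAL(dbDir string) error {
	w, err := wal.New(nil, nil, filepath.Join(dbDir, "wal"), false)
	if err != nil {
		return err
	}
	defer w.Close()
	for i := 0; i < 100; i++ {
		if err := w.Log(make([]byte, 128)); err != nil {
			return err
		}
	}
	return nil
}
```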

@dipack95 (Contributor, Author) commented Nov 2, 2019

@krasi-georgiev Please take a look, I've updated the test.

tsdb/wal/wal_test.go (outdated review thread, resolved)
Signed-off-by: Dipack P Panjabi <[email protected]>
tsdb/db_test.go (outdated review thread, resolved)
Signed-off-by: Dipack P Panjabi <[email protected]>
@dipack95 (Contributor, Author) commented Nov 6, 2019

@krasi-georgiev Please take a look now.

@krasi-georgiev (Contributor)

LGTM, Thanks

Unless @codesome, @brian-brazil, or anyone else has any other comments, I will merge this in a few days.

@krasi-georgiev merged commit ce7bab0 into prometheus:master on Nov 12, 2019
@dipack95 deleted the wal_size branch on November 12, 2019 02:54
philandstuff added a commit to alphagov/gsp that referenced this pull request Feb 25, 2020
It's good to keep things up to date generally, but there's a specific
thing I want here: this chart upgrade also upgrades prometheus from
2.13.2 to 2.15.2.  Prometheus 2.15 has a feature which considers the
WAL size when calculating how much disk it's using (see
prometheus/prometheus#5886 ).
@life0215

Not sure if it's only me, but it looks like the WAL size is still not being used during retention size checks:

/prometheus $ prometheus --version
prometheus, version 2.17.1 (branch: HEAD, revision: ae041f97cfc6f43494bed65ec4ea4e3a0cf2ac69)
  build user:       root@806b02dfe114
  build date:       20200326-16:18:19
  go version:       go1.13.9
/prometheus $ 
/prometheus $ ps
PID   USER     TIME  COMMAND
    1 1000      8d02 /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --storage.tsdb.retention.size=15GB --config.file=/etc/prometheus/config_out/prometheus.env.yaml --
   79 1000      0:00 sh
   87 1000      0:00 sh
   99 1000      0:00 ps
/prometheus $ du -h -d 1
79.5M   ./01EB7D47PPV49QCNY2MZVJCYTE.tmp
524.5M  ./01EBBR0J3VHNPNBYM6CZXM56T9.tmp
62.1M   ./01EJFJF03NFF5J29VE9JKEE2MS
512.5M  ./01EBBPV9FS1Z0QY4EK9QC7RMQA.tmp
1.2G    ./01EJ1DCVBP2C3XS5WVJE6H85AE
1.0G    ./01EHG166DB6JYRB79VEMWME7Z6
158.7M  ./01EJFJG3Y1CQ3PRRPDQH3NACNQ
512.5M  ./01EBAM8DHP0GAKCWWS6P3F7D0E.tmp
1.1G    ./01EHVM027KCG4CAT3VGZW6VMJ6
516.5M  ./01EBBQ7Q7CG0J99RHM57FE0TF0.tmp
57.5M   ./01EJFSAQGFA3PT9BW8JMDKN824
1.0G    ./01EGYN0SAP6C43V8RF335SNFFW
1.1G    ./01EHA7SGSXJ4CR88GQ4B936SNZ
79.5M   ./01EB7CYVEXJENABPGR037E14P8.tmp
453.4M  ./01EJEXYPAFCM86Y7T5REJXNJHA
8.0K    ./01EBCB2SVA5XVV84EC54GR2S0T.tmp
8.0K    ./01EBBS9M0417RNJVVTCYEK33RH.tmp
1.3G    ./01EJD06W4FTYPFAKYVM7FZNE01
1.1G    ./01EH4EE1ZHCXEWP80NJJJJFSW5
516.5M  ./01EBAM3540NXPQM49TSAPHSDBH.tmp
1.1G    ./01EHNTJTJ7CGDRPSB4H6TA7YE0
516.5M  ./01EBATZ99KJ4PX2FG44TW0SZAE.tmp
1.0G    ./01EGRVM1JBVNQWD1Y7N5FYN4GN
1.9G    ./wal
1.2G    ./01EJ76T4CNYNDX6EJ5KMWXPKMZ
16.8G   .

Please note that we've set --storage.tsdb.retention.size=15GB, but the actual storage usage is 16.8G.

@codesome (Member)

I don't think the *.tmp directories count towards the size calculation. If they are leftovers from earlier errors, they can be deleted. Ideally, we should clean up those temporary directories on startup if any exist.
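A sketch of that kind of startup cleanup (assuming leftover directories use the plain .tmp suffix visible in the du listing above; this is not the actual Prometheus implementation):

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// removeTmpDirs deletes leftover *.tmp block directories under the data
// directory so they neither waste disk nor confuse manual size checks.
func removeTmpDirs(dbDir string) error {
	entries, err := os.ReadDir(dbDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() && strings.HasSuffix(e.Name(), ".tmp") {
			if err := os.RemoveAll(filepath.Join(dbDir, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```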
