
Experiment: find the best BigTable batch write size when Heroic is under load #707

Open
sming opened this issue Oct 21, 2020 · 1 comment

Comments

@sming
Contributor

sming commented Oct 21, 2020

This is dependent on #703. Please see that issue for the motivation and context.

Ideally, we'd reproduce one of Heroic's metrics-lag episodes with varying batch sizes, narrowing in on the best value binary-search style. We'd score each setting by how much it reduced the lag during Heroic's time of need.

However, back in the Real World ™️:

  • we cannot replicate the lag episodes
  • we have already deployed changes aimed at mitigating the metrics-lag episodes (ILM was reverted), so we would not be comparing apples with apples
  • we do not have a staging environment that is a realistic copy of production

Hence my proposal is to simulate random BigTable write (a.k.a. Mutation) timeouts at varying frequencies, e.g. 0.01, 0.1, 1.0, 3.0, 10.0 and 25.0 %. We could use the http://wiremock.org/ stubbing library to stub out the BigTable API calls cleanly; a sketch of what that stubbing could look like is below. Note: we should try to get from Google the actual % of Mutation API requests that failed during an episode. @malish8632 - how do I do that, any ideas?
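
A minimal sketch of that stubbing, assuming WireMock 2.x and assuming the test harness can point the BigTable client at an HTTP stub (the real client speaks gRPC, so an adapter or HTTP gateway would be needed). The mutateRows path, port and failure rate below are illustrative only:

```java
import com.github.tomakehurst.wiremock.WireMockServer;
import com.github.tomakehurst.wiremock.client.ResponseDefinitionBuilder;
import com.github.tomakehurst.wiremock.client.WireMock;
import com.github.tomakehurst.wiremock.common.FileSource;
import com.github.tomakehurst.wiremock.core.WireMockConfiguration;
import com.github.tomakehurst.wiremock.extension.Parameters;
import com.github.tomakehurst.wiremock.extension.ResponseDefinitionTransformer;
import com.github.tomakehurst.wiremock.http.Request;
import com.github.tomakehurst.wiremock.http.ResponseDefinition;

import java.util.concurrent.ThreadLocalRandom;

// Fails a configurable fraction of stubbed requests to emulate BigTable
// mutation timeouts; everything else passes through unchanged.
public class RandomTimeoutTransformer extends ResponseDefinitionTransformer {

    private final double failureRate; // fraction of requests to fail, e.g. 0.01 == 1%

    public RandomTimeoutTransformer(double failureRate) {
        this.failureRate = failureRate;
    }

    @Override
    public ResponseDefinition transform(Request request,
                                        ResponseDefinition responseDefinition,
                                        FileSource files,
                                        Parameters parameters) {
        // Fail the configured fraction of mutations with a gateway timeout.
        if (ThreadLocalRandom.current().nextDouble() < failureRate) {
            return new ResponseDefinitionBuilder()
                .withStatus(504)
                .withBody("simulated BigTable mutation timeout")
                .build();
        }
        return responseDefinition;
    }

    @Override
    public String getName() {
        return "random-timeout";
    }

    public static void main(String[] args) {
        WireMockServer server = new WireMockServer(
            WireMockConfiguration.options()
                .port(8089)                                        // illustrative port
                .extensions(new RandomTimeoutTransformer(0.01)));  // 1% failure rate
        server.start();

        // Stub the (assumed) mutateRows endpoint; successful calls return 200.
        server.stubFor(WireMock.post(WireMock.urlPathMatching("/v2/.+/tables/.+:mutateRows"))
            .willReturn(WireMock.aResponse().withStatus(200)));
    }
}
```

Since the transformer applies globally, every stubbed mutation has the same independent probability of timing out, which matches the "random time-outs at frequency X%" idea.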

Then we see which batch size performs best overall.
Finally, we set DEFAULT_BATCH_SIZE to that value and deploy to production.
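
For the comparison step, a rough outline could look like the loop below. runLoadTest(), the candidate batch sizes and the scoring metric are placeholders made up for illustration, not anything that exists in Heroic today; only the failure rates mirror the proposal above:

```java
import java.util.Arrays;
import java.util.List;

public class BatchSizeExperiment {
    public static void main(String[] args) {
        // Hypothetical candidate batch sizes, and the failure rates from the proposal.
        List<Integer> batchSizes = Arrays.asList(100, 500, 1000, 5000, 10000);
        List<Double> failureRates = Arrays.asList(0.0001, 0.001, 0.01, 0.03, 0.10, 0.25);

        int bestBatchSize = -1;
        double bestScore = Double.MAX_VALUE;

        for (int batchSize : batchSizes) {
            double totalScore = 0.0;
            for (double rate : failureRates) {
                // Placeholder: run Heroic with this batch size against the WireMock
                // stub failing `rate` of mutations, and return a lag/latency score
                // (lower is better), e.g. p99 write latency or peak consumer lag.
                totalScore += runLoadTest(batchSize, rate);
            }
            if (totalScore < bestScore) {
                bestScore = totalScore;
                bestBatchSize = batchSize;
            }
        }
        System.out.println("Best overall batch size: " + bestBatchSize);
    }

    private static double runLoadTest(int batchSize, double failureRate) {
        // Placeholder for the real load-test harness.
        throw new UnsupportedOperationException("not implemented");
    }
}
```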

@sming sming self-assigned this Oct 21, 2020
@sming
Contributor Author

sming commented Oct 21, 2020

Hello @malish8632, @hexedpackets, @lmuhlha, what do you think of the proposal above? Crap? Genius? Meh?
Is there a cleaner/easier way of determining the best batch write size?
Is there something significant I've not considered?
Cheers!
