0dt: in preflight checks, notice DDL changes and restart read-only envd #31293
@@ -190,6 +241,19 @@ pub async fn preflight_0dt(
// Take over the catalog.
info!("promoted; attempting takeover");

// NOTE: There _is_ a window where DDL can happen in the old
@alex-hunt-materialize If cloud is set up for this, or wants to make that change, we could try and shorten the window for missed DDL a bit more.
I don't think we should worry too much about DDL made in the last few seconds before promotion. At most, only a few seconds of hydration will be lost.
On it, will push the new test into this PR.
5ad597e to cbc391e
The existing parallel-workload test is already failing: https://buildkite.com/materialize/nightly/builds/11058#_
I'm seeing a very similar error in my new 0dt ddl scenario. Just pushed it, reproduces with Edit: Also an unexpected (?) panic in Checks 0dt upgrade across four versions:
Maybe just another instance of https://github.com/MaterializeInc/database-issues/issues/8836
Hmm, maybe, what happens mechanically is that the new |
@def- No, forget what I said. I'm pretty sure it's because of a known bug that is fixed in #30576. I even meant to talk to you about that other fix, and how the 0dt tests should already be a good enough test for it, once they have restarts of the read-only
32b38b1 to 0f91a42
Did you mean to include the storage status reporting changes here? Everything LGTM, but I just wanted to double-check.
() = caught_up_max_wait_fut => {
if panic_after_timeout {
panic!("not caught up within {:?}", caught_up_max_wait);
let mut check_ddl_changes_interval = tokio::time::interval(Duration::from_secs(5 * 60));
Is it possible to make this configurable via LaunchDarkly? Alternatively, could we make this an optional CLI arg that can be provided to environmentd? Having a knob here to tune this seems useful.
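A minimal sketch of what such a knob could look like, assuming the override arrives as an optional string of seconds (for example from a CLI arg or a dynamic config value). The names `DEFAULT_DDL_CHECK_INTERVAL` and `ddl_check_interval` are hypothetical and not taken from the PR; only the 5-minute default comes from the diff above.

```rust
use std::time::Duration;

/// Default taken from the diff under review: check for DDL every 5 minutes.
pub const DEFAULT_DDL_CHECK_INTERVAL: Duration = Duration::from_secs(5 * 60);

/// Hypothetical helper: resolve the DDL check interval from an optional
/// override given in seconds. Missing, unparsable, or zero values fall back
/// to the default. Zero is rejected because `tokio::time::interval` panics
/// when given a zero period.
pub fn ddl_check_interval(override_secs: Option<&str>) -> Duration {
    match override_secs.and_then(|s| s.trim().parse::<u64>().ok()) {
        Some(secs) if secs > 0 => Duration::from_secs(secs),
        _ => DEFAULT_DDL_CHECK_INTERVAL,
    }
}
```

The resolved value would then replace the hard-coded `Duration::from_secs(5 * 60)` passed to `tokio::time::interval`.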
.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);

loop {
tokio::select! {
Maybe make this a biased select! loop? The chance of two things resolving at once is low, but a defined order would make it easier to reason about.
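To illustrate the point about a defined order: by default `tokio::select!` picks randomly among branches that are ready at the same time, while putting `biased;` first makes it poll branches top to bottom. The std-only sketch below is not tokio code; it only models that top-to-bottom priority. The `Event` and `Ready` types and the branch order are assumptions for illustration, not the order used in the PR.

```rust
/// Hypothetical events the preflight loop waits on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    CaughtUp,
    MaxWaitElapsed,
    DdlCheckTick,
}

/// Which branches are ready at a single poll.
#[derive(Debug, Clone, Copy, Default)]
pub struct Ready {
    pub caught_up: bool,
    pub max_wait_elapsed: bool,
    pub ddl_check_tick: bool,
}

/// Model of a biased select: branches are checked in declaration order, so
/// when several are ready at once the first one listed always wins.
pub fn pick_biased(ready: Ready) -> Option<Event> {
    if ready.caught_up {
        Some(Event::CaughtUp)
    } else if ready.max_wait_elapsed {
        Some(Event::MaxWaitElapsed)
    } else if ready.ddl_check_tick {
        Some(Event::DdlCheckTick)
    } else {
        // Nothing is ready; a real select would keep waiting.
        None
    }
}
```

With a fixed order, the outcome of a simultaneous wake-up is deterministic, which is what makes the loop easier to reason about.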
@def- what do you make of these remaining nightly failures?
It's included here because it's a prerequisite fix, and I wanted to get a nightly run that has both that fix and the changes in this PR.
@ParkMyCar I pushed individual commits that implement your suggestions, ptal 🙏
f4ca720 to 89b747d
Sorry, just saw your message. So far the new run is looking good, but I haven't seen the SQLancer error before (but probably not related to this PR): https://buildkite.com/materialize/nightly/builds/11085
Before, a hydrating read-only environment would read a catalog snapshot on bootstrap and then not subscribe to further changes. This means that we would not be hydrating collections/replicas that are created in the old version that is still running, and those would then have to be hydrated after cutting over to the new version.
With this change, we periodically check whether new collections/replicas were created, and we also check right before announcing that we are ready to promote. When we _do_ notice there was relevant DDL, we halt. This causes us to be restarted in read-only mode again, so that we can read an up-to-date catalog snapshot.
It's important to note that any running clusters are not restarted, so any work that has already gone into hydration will not be lost.
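The decision described above can be sketched as follows. This is a simplified model under stated assumptions, not the code in the PR: `CatalogSnapshot`, `PreflightAction`, `has_relevant_ddl`, and `next_action` are hypothetical names, objects are reduced to plain numeric IDs, and "relevant DDL" is taken to mean only newly created collections or replicas.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the object IDs seen in a catalog snapshot.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct CatalogSnapshot {
    pub collections: BTreeSet<u64>,
    pub replicas: BTreeSet<u64>,
}

/// What the read-only environment does after a check.
#[derive(Debug, PartialEq, Eq)]
pub enum PreflightAction {
    /// No relevant DDL and not caught up yet: keep hydrating.
    KeepHydrating,
    /// No relevant DDL and caught up: announce ready to promote.
    AnnounceReady,
    /// Relevant DDL happened: halt, so that the process is restarted in
    /// read-only mode and reads a fresh catalog snapshot.
    HaltAndRestartReadOnly,
}

/// True if `current` has a collection or replica that `bootstrap` lacks.
/// Assumption: dropped objects do not count as relevant DDL here.
pub fn has_relevant_ddl(bootstrap: &CatalogSnapshot, current: &CatalogSnapshot) -> bool {
    !current.collections.is_subset(&bootstrap.collections)
        || !current.replicas.is_subset(&bootstrap.replicas)
}

/// Run on every periodic tick and once more right before announcing
/// readiness, so the DDL check always comes first.
pub fn next_action(
    bootstrap: &CatalogSnapshot,
    current: &CatalogSnapshot,
    caught_up: bool,
) -> PreflightAction {
    if has_relevant_ddl(bootstrap, current) {
        PreflightAction::HaltAndRestartReadOnly
    } else if caught_up {
        PreflightAction::AnnounceReady
    } else {
        PreflightAction::KeepHydrating
    }
}
```

Checking for DDL before the caught-up state means that an environment that looks ready but has a stale snapshot restarts instead of announcing readiness.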
89b747d to 1f69ebd
> platform-checks-mz_2-1 | environmentd: 2025-02-10T15:49:48.328178Z WARN mz_environmentd::deployment::preflight: halting process: there have been DDL that we need to react to; rebooting in read-only mode

Seen in https://buildkite.com/materialize/nightly/builds/11099

This was part of MaterializeInc#31293, but somehow got lost in main after merge.
I think there's no one left besides me who has seen this code, but @ParkMyCar might be best to review. @alex-hunt-materialize there's a question in there for you about what cloud can or wants to do. I'll tag you on the specific code line.
@def- We probably want more testing. The new behavior is described above. I imagine a test would create tables or other objects while hydrating and make sure that they show up in the read-only environment.