
0dt: in preflight checks, notice DDL changes and restart read-only envd #31293

Merged
aljoscha merged 7 commits into MaterializeInc:main from 0dt-notice-ddl-changes on Feb 10, 2025

Conversation

@aljoscha (Contributor) commented Feb 5, 2025

Before, a hydrating read-only environment would read a catalog snapshot on bootstrap and then not subscribe to further changes. This means that we would not be hydrating collections/replicas that are created in the old version that is still running, and those would then have to be hydrated after cutting over to the new version.

With this change, we periodically check whether new collections/replicas were created, and we also check right before announcing as ready to promote. When we do notice relevant DDL, we halt. This causes us to be restarted in read-only mode again, where we can read an up-to-date catalog snapshot.

It's important to note that any running clusters are not restarted, so any work that has already gone into hydration will not be lost.
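For illustration, here is a minimal sketch of the shape of that check loop, with hypothetical helpers (`catalog_has_new_ddl`, `halt_and_restart_read_only`) standing in for the real preflight and catalog APIs:

```rust
use std::time::Duration;

// Hypothetical stand-ins for illustration; the real logic lives in
// environmentd's preflight code and uses its own catalog types.
async fn catalog_has_new_ddl() -> bool {
    // Compare the current catalog contents against the snapshot that was
    // read on bootstrap.
    false
}

fn halt_and_restart_read_only() -> ! {
    // In the real code the process halts; the orchestrator restarts envd,
    // which then boots read-only from a fresh catalog snapshot.
    std::process::exit(1)
}

async fn watch_for_ddl_while_hydrating() {
    // Re-check periodically, and once more right before announcing
    // ReadyToPromote.
    let mut check_interval = tokio::time::interval(Duration::from_secs(5 * 60));
    check_interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
    loop {
        check_interval.tick().await;
        if catalog_has_new_ddl().await {
            // Running clusters are not restarted, so hydration work done so
            // far is not lost.
            halt_and_restart_read_only();
        }
    }
}
```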

I think there's no one left besides me who has seen this code, but @ParkMyCar might be best to review. @alex-hunt-materialize there's a question in there for you about what cloud can or wants to do. I'll tag you on the specific code line.

@def- We probably want more testing. The new behavior is described above, I imagine a test would be to create tables or things while hydrating, and make sure that they show up in the read-only environment.

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@aljoscha aljoscha requested a review from a team as a code owner February 5, 2025 16:05
@aljoscha aljoscha requested a review from ParkMyCar February 5, 2025 16:05
```
@@ -190,6 +241,19 @@ pub async fn preflight_0dt(
    // Take over the catalog.
    info!("promoted; attempting takeover");

    // NOTE: There _is_ a window where DDL can happen in the old
```

@aljoscha (Contributor, Author) commented on this line:

@alex-hunt-materialize If cloud is set up for this, or wants to make that change, we could try and shorten the window for missed DDL a bit more.

Reply from a Contributor:

I don't think we should worry too much about DDL made in the last few seconds before promotion. At most, it will only be a few seconds lost of hydration.

@def- (Contributor) commented Feb 5, 2025

> @def- We probably want more testing. The new behavior is described above, I imagine a test would be to create tables or things while hydrating, and make sure that they show up in the read-only environment.

On it, will push the new test into this PR.

@def- def- force-pushed the 0dt-notice-ddl-changes branch from 5ad597e to cbc391e on February 5, 2025 17:39
@def- (Contributor) commented Feb 5, 2025

The existing parallel-workload test is already failing: https://buildkite.com/materialize/nightly/builds/11058#_

materialize.ui.UIError: Timed out waiting for materialized2 to reach Mz deployment status ReadyToPromote, still in status Initializing

I'm seeing a very similar error in my new 0dt ddl scenario. Just pushed it; it reproduces with bin/mzcompose --find 0dt down && bin/mzcompose --find 0dt run ddl: after running a CREATE TABLE t1 (c INT) it never manages to promote. For the tests, do I need to set Materialize to auto-restart? It's currently restart="on-failure".

Edit: Also an unexpected (?) panic in Checks 0dt upgrade across four versions:

platform-checks-mz_5-1              | thread 'coordinator' panicked at src/compute-client/src/as_of_selection.rs:392:25: failed to apply hard as-of constraint (id=u519, bounds=[[] .. []], constraint=Constraint { type_: Hard, bound_type: Upper, frontier: Antichain { elements: [1738777129450] }, reason: "storage export u519 write frontier" })

Maybe just another instance of https://github.com/MaterializeInc/database-issues/issues/8836

@def- def- requested a review from a team as a code owner February 5, 2025 17:41
@aljoscha (Contributor, Author) commented Feb 5, 2025

> I'm seeing a very similar error in my new 0dt ddl scenario. Just pushed it; it reproduces with bin/mzcompose --find 0dt down && bin/mzcompose --find 0dt run ddl: after running a CREATE TABLE t1 (c INT) it never manages to promote. For the tests, do I need to set Materialize to auto-restart? It's currently restart="on-failure".

Hmm, maybe. What happens mechanically is that the new envd will halt and then expect to be restarted, so tentatively I'd say yes, it needs restart="on-failure".

@aljoscha (Contributor, Author) commented Feb 5, 2025

@def- No, forget what I said. I'm pretty sure it's because of a known bug that is fixed in #30576. I even meant to talk to you about that other fix, and how the 0dt tests should already be a good enough test for it once they have restarts of the read-only envd (which is what this PR introduces). It's clearly getting late... 🙈

@aljoscha aljoscha force-pushed the 0dt-notice-ddl-changes branch from 32b38b1 to 0f91a42 on February 6, 2025 13:00
@aljoscha aljoscha requested a review from a team as a code owner February 6, 2025 13:00
@ParkMyCar (Member) left a comment:

Did you mean to include the storage status reporting changes here? Everything LGTM but just wanted to double check

```
() = caught_up_max_wait_fut => {
    if panic_after_timeout {
        panic!("not caught up within {:?}", caught_up_max_wait);
let mut check_ddl_changes_interval = tokio::time::interval(Duration::from_secs(5 * 60));
```

@ParkMyCar (Member) commented on this line:

Is it possible to make this configurable via LaunchDarkly? Alternatively, could we make this an optional CLI arg that can be provided to environmentd? Having a knob here to tune this seems useful.
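
For what it's worth, a minimal sketch of what such an optional CLI knob could look like (the flag name, default, and wiring are assumptions for illustration, not the real environmentd argument set):

```rust
use std::time::Duration;

use clap::Parser;

#[derive(Debug, Parser)]
struct ReadOnlyArgs {
    /// How often (in seconds) a read-only environmentd re-checks the catalog
    /// for DDL it has not yet seen. Hypothetical flag; defaults to five minutes.
    #[clap(long, default_value_t = 300)]
    ddl_check_interval_seconds: u64,
}

fn main() {
    let args = ReadOnlyArgs::parse();
    let check_ddl_changes_interval =
        Duration::from_secs(args.ddl_check_interval_seconds);
    println!("re-checking the catalog every {check_ddl_changes_interval:?}");
}
```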

```
    .set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);

loop {
    tokio::select! {
```

@ParkMyCar (Member) commented on this line:

Maybe make this a biased select! loop? The chance of two things resolving at once is low, but a defined order would make it easier to reason about.
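
For reference, a minimal sketch of a biased version of such a loop in tokio; the futures and their names here are placeholders rather than the actual preflight code:

```rust
use std::time::Duration;

async fn preflight_loop_sketch() {
    // Placeholder for the "caught up / max wait elapsed" future.
    let caught_up_max_wait_fut = tokio::time::sleep(Duration::from_secs(60));
    tokio::pin!(caught_up_max_wait_fut);

    let mut check_ddl_changes_interval =
        tokio::time::interval(Duration::from_secs(5 * 60));
    check_ddl_changes_interval
        .set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);

    loop {
        tokio::select! {
            // `biased;` makes select! poll branches top to bottom instead of
            // in random order, so the priority between "caught up" and
            // "check for DDL" is explicit if both happen to be ready at once.
            biased;

            () = &mut caught_up_max_wait_fut => {
                break;
            }
            _ = check_ddl_changes_interval.tick() => {
                // Re-check the catalog for DDL here; halt if anything changed.
            }
        }
    }
}
```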

@aljoscha (Contributor, Author) commented Feb 6, 2025

@def- what do you make of these remaining nightly failures?

@aljoscha (Contributor, Author) commented Feb 6, 2025

@ParkMyCar

> Did you mean to include the storage status reporting changes here? Everything LGTM but just wanted to double check

It's included here because it's a prerequisite fix, and I wanted to get a nightly run that has both that fix and the changes in this PR.

@aljoscha aljoscha added the release-blocker label (Critical issue that should block *any* release if not fixed) on Feb 7, 2025
@aljoscha (Contributor, Author) commented Feb 7, 2025

@ParkMyCar I pushed individual commits that implement your suggestions, ptal 🙏

@aljoscha aljoscha force-pushed the 0dt-notice-ddl-changes branch from f4ca720 to 89b747d on February 7, 2025 15:41
@def- (Contributor) commented Feb 7, 2025

> @def- what do you make of these remaining nightly failures?

Sorry, just saw your message. So far the new run is looking good. I haven't seen this SQLancer error before, though it's probably not related to this PR: https://buildkite.com/materialize/nightly/builds/11085

aljoscha and others added 7 commits February 10, 2025 11:40
Before, a hydrating read-only environment would read a catalog snapshot
on bootstrap and then not subscribe to further changes. This means that
we would not be hydrating collections/replicas that are created in the
old version that is still running, and those would then have to be
hydrated after cutting over to the new version.

With this change, we periodically check if new collections/replicas
were created, and we also check right before announcing as ready to
promote. When we _do_ notice there was relevant DDL, we halt. This will
make it so we're restarted in read-only mode again, and can now read an
up-to-date catalog snapshot.

It's important to note that any running clusters are not restarted, so
any work that has already gone into hydration will not be lost.
@aljoscha aljoscha force-pushed the 0dt-notice-ddl-changes branch from 89b747d to 1f69ebd on February 10, 2025 10:59
@aljoscha aljoscha merged commit 445cbeb into MaterializeInc:main Feb 10, 2025
80 checks passed
@aljoscha aljoscha deleted the 0dt-notice-ddl-changes branch February 10, 2025 13:31
def- added a commit to def-/materialize that referenced this pull request Feb 10, 2025
> platform-checks-mz_2-1              | environmentd: 2025-02-10T15:49:48.328178Z  WARN mz_environmentd::deployment::preflight: halting process: there have been DDL that we need to react to; rebooting in read-only mode

Seen in https://buildkite.com/materialize/nightly/builds/11099

This was part of MaterializeInc#31293, but somehow
got lost in main after merge
Labels
release-blocker Critical issue that should block *any* release if not fixed
4 participants