Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically check "invariants" #13652

Open
1 of 3 tasks
Tracked by #13648
alamb opened this issue Dec 5, 2024 · 9 comments
Open
1 of 3 tasks
Tracked by #13648

Automatically check "invariants" #13652

alamb opened this issue Dec 5, 2024 · 9 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Dec 5, 2024

Is your feature request related to a problem or challenge?

I extracted this from #13651 so it was more visible

During upgrade, downstream systems often experience issues due to implicit changes (not explicit API changes) of LogicalPlans that DataFusion code begins relying on, and which result in unintended consequences when upgrading to a new version of DataFusion (see #13525).

Describe the solution you'd like

The idea is to make the current implicit assumptions ("Invariants" in more formal language)( explict and automatically check them.

Examples of implicit assumptions:

  1. Schema column names can't be repeated (this is explicitly mentioned on [DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) #13525)
  2. Inputs to UnionExec must have the same schema
  3. ...

Describe alternatives you've considered

I like the approach @wiedld took in #13651 :

  • define the invariants
  • check the invariants for extensible interfaces (which may be user defined)
  • throw the error closer to the problem (rather than weird behavior later)

Additional context

Sub tasks:

@wiedld
Copy link
Contributor

wiedld commented Dec 5, 2024

take

@wiedld
Copy link
Contributor

wiedld commented Dec 18, 2024

Possible tasks

@wiedld
Copy link
Contributor

wiedld commented Dec 26, 2024

For the physical optimization invariants, we have that the output physical plan schema cannot change; meaning the output results cannot change, but how we get the results can.

What else should be included as a responsibility/check? Maintain input ordering if required?

(The idea is to have a check, perhaps run in debug mode, that would error if a user-defined physical plan or optimization pass fails to maintain the invariant. Throw error closer to the source when debugging.)

@alamb
Copy link
Contributor Author

alamb commented Dec 26, 2024

For the physical optimization invariants, we have that the output physical plan schema cannot change; meaning the output results cannot change, but how we get the results can.

What else should be included as a responsibility/check? Maintain input ordering if required?

Here are some ideas based on bugs we have hit / my memory of what has changed and caused us pain:

  1. Union inputs (can there be more than 2 inputs -- we use such plans in InfluxData but I am not sure what if anything else makes assumptions)
  2. Union input schemas (I think initially they need to be 'coercable' and to be executable they need to be identical)
  3. Projection_exec can't have zero exprs (is this an invariant?)

@wiedld
Copy link
Contributor

wiedld commented Dec 27, 2024

My impression was that the plan construction occurred with the LP (as we do), and not by constructing their own de novo physical plan. Is this correct?

If so, then I think the above list of invariants to check would most likely occur at the LP-level (not the physical plan). I can definitely put up a PR for those. Thank you!

Whereas for the physical plan invariants, (not LP), do we want any invariant checking there? Because I looked at the apache docs & physical plan APIs and from (my naive) understanding these are the only two invariants to check after physical plan mutations (a.k.a. PhysicalOptimizerRule applied). Is this correct? 🤔

@alamb
Copy link
Contributor Author

alamb commented Dec 28, 2024

My impression was that the plan construction occurred with the LP (as we do), and not by constructing their own de novo physical plan. Is this correct?

I am sorry -- I don't understand what you are asking (is LP LogicalPlan?) What does a de novo physical plan mean? You mean like creating a ExecutionPlan directly (not from a LogicalPlan)?

If so, then I think the above list of invariants to check would most likely occur at the LP-level (not the physical plan). I can definitely put up a PR for those. Thank you!

Whereas for the physical plan invariants, (not LP), do we want any invariant checking there? Because I looked at the apache docs & physical plan APIs and from (my naive) understanding these are the only two invariants to check after physical plan mutations (a.k.a. PhysicalOptimizerRule applied). Is this correct? 🤔

I am not sure what the actual invariants are (part of this project I think is to discover that information)

In my opinion we should be seeking to discover what the existing implicit assumptions are and encode them explicitly in the invariant check. Once we have all the existing assumptions encoded then we can move on to trying to add more assumptions

@alamb
Copy link
Contributor Author

alamb commented Jan 6, 2025

I just discovered that @houqp basically filed this same ticket 2 years ago:

@alamb
Copy link
Contributor Author

alamb commented Jan 6, 2025

I suggest we use this ticket to track the infrastructure for checking invariants (e.g. what @wiedld is doing in #13986) and then claim success.

We can introduce additional invariant checks as we discover them as their own issues / mini projects

@wiedld
Copy link
Contributor

wiedld commented Jan 6, 2025

I suggest we use this ticket to track the infrastructure for checking invariants

Agreed. Modifying this list above, we have infrastructure components of:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants