Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashloop Detection #200

Open
jfaltermeier opened this issue Jul 18, 2023 · 3 comments
Open

Crashloop Detection #200

jfaltermeier opened this issue Jul 18, 2023 · 3 comments
Assignees
Labels
enhancement New feature or request prio-medium

Comments

@jfaltermeier
Copy link
Contributor

Is your feature request related to a problem? Please describe.

The operator should be able to detect crash loops and skip over problematic custom resources.

Reproducer Session
theia-cloud branch osweek23-1
theia-cloud-helm branch osweek23-1
terraform test-configuration 2-01_try-now
kubectl apply -f osweek23/empty-session.yaml

Reproducer Workspace
theia-cloud branch osweek23-1
theia-cloud-helm branch osweek23-1
terraform test-configuration 2-01_try-now
kubectl apply -f osweek23/empty-workspace.yaml

Reproducer AppDefinition
theia-cloud branch osweek23-2
theia-cloud-helm branch osweek23-1
terraform test-configuration 2-01_try-now
kubectl apply -f osweek23/coffee-session.yaml

Describe the solution you'd like

We may use additional fields in the custom resources metadata to save the current step the operator way making and number of tries for this step.

If the operator sees that there were already multiple tries to read the container image from an app definition, it may assume that this will fail and skip over it.

Describe alternatives you've considered

Cluster provider

No response

Additional information

No response

@jfaltermeier jfaltermeier added the enhancement New feature or request label Jul 18, 2023
@jfaltermeier jfaltermeier added this to the OS Week 23 milestone Jul 18, 2023
@jfaltermeier jfaltermeier moved this to Todo in OS Week 23 Jul 18, 2023
@lucas-koehler lucas-koehler moved this from Todo to In Progress in OS Week 23 Jul 31, 2023
lucas-koehler added a commit that referenced this issue Aug 1, 2023
- Extend the resource clients for app definitions, sessions, and
workspaces to be able to handle the status subresource.
- Add basic status for our CRDs to track handling state of the operator
- Prevent crashloops by never handling resources that are in error or
handling state. The latter indicates an unexcepted crash during handling
(e.g. NPE).
- NOTE: The previous step was only dony for LAZY handlers. EAGER
handlers did not change
- Exemplary sub steps for volume claim an attachment for workspace
resources
- Increase resource versions for all CRs
- Move API, KIND, and CRD_NAME constants from Spec to resource classes
(e.g. from SessionSpec to Session) because they belong to the resource
itself. This became apparent when adding the Status to the resources

Note: Status handling can probably be refactored to be a bit cleaner and
make use of better code-reuse.

Part of #200
lucas-koehler added a commit that referenced this issue Aug 1, 2023
There are some issues with setting app definitions status that need to be fixed first
lucas-koehler added a commit to eclipse-theia/theia-cloud-helm that referenced this issue Aug 1, 2023
- Add status to all crds.
- Extend operator role to allow editing status for our crds

Part of eclipse-theia/theia-cloud#200
lucas-koehler added a commit that referenced this issue Aug 2, 2023
- Extend the resource clients for app definitions, sessions, and
workspaces to be able to handle the status subresource.
- Add basic status for our CRDs to track handling state of the operator
- Prevent crashloops by never handling resources that are in error or
handling state. The latter indicates an unexcepted crash during handling
(e.g. NPE).
- NOTE: The previous step was only dony for LAZY handlers. EAGER
handlers did not change
- Exemplary sub steps for volume claim an attachment for workspace
resources
- Increase resource versions for all CRs
- Move API, KIND, and CRD_NAME constants from Spec to resource classes
(e.g. from SessionSpec to Session) because they belong to the resource
itself. This became apparent when adding the Status to the resources

Note: Status handling can probably be refactored to be a bit cleaner and
make use of better code-reuse.

Part of #200
lucas-koehler added a commit that referenced this issue Aug 2, 2023
There are some issues with setting app definitions status that need to be fixed first
lucas-koehler added a commit that referenced this issue Aug 2, 2023
- AppDefinition resource version fix after rebase
- Catch all exceptions when adding session, workspace resources and add error state
- Do not handle resources in HANDLED state again.
- Smaller fixes
lucas-koehler added a commit to eclipse-theia/theia-cloud-helm that referenced this issue Aug 2, 2023
- Add status to all crds.
- Extend operator role to allow editing status for our crds

Part of eclipse-theia/theia-cloud#200
lucas-koehler added a commit to eclipse-theia/theia-cloud-helm that referenced this issue Aug 2, 2023
- Add status to all crds.
- Extend operator role to allow editing status for our crds

Part of eclipse-theia/theia-cloud#200
lucas-koehler added a commit to eclipse-theia/theia-cloud-helm that referenced this issue Aug 2, 2023
- Add status to all crds.
- Extend operator role to allow editing status for our crds
- Increase chart versions and use new app version

Part of eclipse-theia/theia-cloud#200
lucas-koehler added a commit that referenced this issue Aug 2, 2023
NOTE: Status handling for app definitions is commented out as there are still some issues

- Catch all exceptions when adding session, workspace, appdefinition resources and add error state
- Extend the resource clients for app definitions, sessions, and
workspaces to be able to handle the status subresource.
- Add basic status for our CRDs to track handling state of the operator
- Prevent crashloops by never handling resources that are in error or
handling state. The latter indicates an unexcepted crash during handling
(e.g. NPE).
- NOTE: The previous step was only dony for LAZY handlers. EAGER handlers did not change
- Exemplary sub steps for volume claim an attachment for workspace resources
- Increase resource versions for all CRs
- Move API, KIND, and CRD_NAME constants from Spec to resource classes
(e.g. from SessionSpec to Session) because they belong to the resource
itself. This became apparent when adding the Status to the resources

Part of #200
lucas-koehler added a commit that referenced this issue Aug 3, 2023
Enable checking and storing of AppDefinitions' handling in the operator
jfaltermeier pushed a commit that referenced this issue Aug 3, 2023
Enable checking and storing of AppDefinitions' handling in the operator
@jfaltermeier jfaltermeier moved this from In Progress to Done in OS Week 23 Aug 4, 2023
lucas-koehler added a commit that referenced this issue Aug 29, 2023
When handling of a resource is unexpectedly interrupted without throwing
an error, it currently remains in HANDLING state and is never handled
again. This could happen if the operator is shut down mid handling, e.g.
because of resource constraints.

This commit sets the resource to ERROR state and adds a message that
handling was unexpectedly interrupted before.

Contributed on behalf of STMicroelectronics

Signed-off-by: Lucas Koehler <[email protected]>
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
When handling of a resource is unexpectedly interrupted without throwing
an error, it currently remains in HANDLING state and is never handled
again. This could happen if the operator is shut down mid handling, e.g.
because of resource constraints.

This commit sets the resource to ERROR state and adds a message that
handling was unexpectedly interrupted before.

Contributed on behalf of STMicroelectronics

Signed-off-by: Lucas Koehler <[email protected]>
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
NOTE: Status handling for app definitions is commented out as there are still some issues

- Catch all exceptions when adding session, workspace, appdefinition resources and add error state
- Extend the resource clients for app definitions, sessions, and
workspaces to be able to handle the status subresource.
- Add basic status for our CRDs to track handling state of the operator
- Prevent crashloops by never handling resources that are in error or
handling state. The latter indicates an unexcepted crash during handling
(e.g. NPE).
- NOTE: The previous step was only dony for LAZY handlers. EAGER handlers did not change
- Exemplary sub steps for volume claim an attachment for workspace resources
- Increase resource versions for all CRs
- Move API, KIND, and CRD_NAME constants from Spec to resource classes
(e.g. from SessionSpec to Session) because they belong to the resource
itself. This became apparent when adding the Status to the resources

Part of #200
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
Enable checking and storing of AppDefinitions' handling in the operator
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
When handling of a resource is unexpectedly interrupted without throwing
an error, it currently remains in HANDLING state and is never handled
again. This could happen if the operator is shut down mid handling, e.g.
because of resource constraints.

This commit sets the resource to ERROR state and adds a message that
handling was unexpectedly interrupted before.

Contributed on behalf of STMicroelectronics

Signed-off-by: Lucas Koehler <[email protected]>
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
NOTE: Status handling for app definitions is commented out as there are still some issues

- Catch all exceptions when adding session, workspace, appdefinition resources and add error state
- Extend the resource clients for app definitions, sessions, and
workspaces to be able to handle the status subresource.
- Add basic status for our CRDs to track handling state of the operator
- Prevent crashloops by never handling resources that are in error or
handling state. The latter indicates an unexcepted crash during handling
(e.g. NPE).
- NOTE: The previous step was only dony for LAZY handlers. EAGER handlers did not change
- Exemplary sub steps for volume claim an attachment for workspace resources
- Increase resource versions for all CRs
- Move API, KIND, and CRD_NAME constants from Spec to resource classes
(e.g. from SessionSpec to Session) because they belong to the resource
itself. This became apparent when adding the Status to the resources

Part of #200
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
Enable checking and storing of AppDefinitions' handling in the operator
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
When handling of a resource is unexpectedly interrupted without throwing
an error, it currently remains in HANDLING state and is never handled
again. This could happen if the operator is shut down mid handling, e.g.
because of resource constraints.

This commit sets the resource to ERROR state and adds a message that
handling was unexpectedly interrupted before.

Contributed on behalf of STMicroelectronics

Signed-off-by: Lucas Koehler <[email protected]>
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
NOTE: Status handling for app definitions is commented out as there are still some issues

- Catch all exceptions when adding session, workspace, appdefinition resources and add error state
- Extend the resource clients for app definitions, sessions, and
workspaces to be able to handle the status subresource.
- Add basic status for our CRDs to track handling state of the operator
- Prevent crashloops by never handling resources that are in error or
handling state. The latter indicates an unexcepted crash during handling
(e.g. NPE).
- NOTE: The previous step was only dony for LAZY handlers. EAGER handlers did not change
- Exemplary sub steps for volume claim an attachment for workspace resources
- Increase resource versions for all CRs
- Move API, KIND, and CRD_NAME constants from Spec to resource classes
(e.g. from SessionSpec to Session) because they belong to the resource
itself. This became apparent when adding the Status to the resources

Part of #200
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
Enable checking and storing of AppDefinitions' handling in the operator
jfaltermeier pushed a commit that referenced this issue Aug 30, 2023
When handling of a resource is unexpectedly interrupted without throwing
an error, it currently remains in HANDLING state and is never handled
again. This could happen if the operator is shut down mid handling, e.g.
because of resource constraints.

This commit sets the resource to ERROR state and adds a message that
handling was unexpectedly interrupted before.

Contributed on behalf of STMicroelectronics

Signed-off-by: Lucas Koehler <[email protected]>
jfaltermeier pushed a commit to eclipse-theia/theia-cloud-helm that referenced this issue Aug 30, 2023
- Add status to all crds.
- Extend operator role to allow editing status for our crds
- Increase chart versions and use new app version

Part of eclipse-theia/theia-cloud#200
jfaltermeier pushed a commit to eclipse-theia/theia-cloud-helm that referenced this issue Aug 30, 2023
- Add status to all crds.
- Extend operator role to allow editing status for our crds
- Increase chart versions and use new app version

Part of eclipse-theia/theia-cloud#200
Copy link

This issue is stale because it has been open for 180 days with no activity.

@github-actions github-actions bot added the stale label Jan 28, 2024
@sgraband
Copy link
Contributor

Keep open

@github-actions github-actions bot removed the stale label Jan 29, 2024
Copy link

This issue is stale because it has been open for 180 days with no activity.

@github-actions github-actions bot added the stale label Jul 28, 2024
@jfaltermeier jfaltermeier removed this from the OS Week 23 milestone Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request prio-medium
Projects
None yet
Development

No branches or pull requests

3 participants