Improving installation of user MLA stack #133

csengerszabo · 2022-07-28T11:21:37Z

Reference: kubermatic/ps-team-flotilla#103

@stroebitzer commented on Wed Jul 06 2022

While working on the KKP Admin training, I stumbled from one issue to the next when installing the User MLA stack into my KKP installation.

The current installation method is effectively an alpha version. To provide a smooth experience for our customers, we should improve the installation process.

One option could be to move the installation from the hack/deploy-seed.sh script to our kubermatic-installer.

This ticket is about:

Collecting the issues we currently have with installing User MLA
Creating a proposal for how we can improve the installation experience

@talhalatiforakzai commented on Thu Jul 14 2022

Issues with installation of user mla

while deploying MLA stack through the helper script

This issue arises with yq version 4.25.2. To fix it, edit lines 31 and 35 in hack/fetch-chart-dependencies.sh:
line 31: change chartname=$(yq read "$chartYAML" name) to chartname=$(yq '.name' "$chartYAML")
line 35: change for url in $(yq r "$chartYAML" dependencies --tojson | jq -r .[].repository); do to for url in $(yq '.dependencies.[].repository' "$chartYAML"); do
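
If the script has to keep working with both yq generations, one option is to pick the syntax from the installed major version. This is only a sketch: `yq_major` and `chart_name` are hypothetical helpers that do not exist in hack/fetch-chart-dependencies.sh, and the `yq --version` output formats are assumed from the v3 and v4 releases.

```shell
#!/usr/bin/env bash
# Sketch only: yq_major and chart_name are hypothetical helpers, not part of
# hack/fetch-chart-dependencies.sh.

# Print the major version found in the output of `yq --version`.
# Assumed formats: "yq version 3.4.1", "yq (...) version 4.25.2",
# and "yq (...) version v4.30.1".
yq_major() {
  printf '%s\n' "$1" | sed -E 's/.*version v?([0-9]+)\..*/\1/'
}

# Read the chart name with the syntax that matches the installed yq.
chart_name() {
  chart_yaml="$1"
  major="$(yq_major "$(yq --version 2>&1)")"
  if [ "$major" -ge 4 ]; then
    yq '.name' "$chart_yaml"      # yq v4 expression syntax
  else
    yq read "$chart_yaml" name    # yq v3 command syntax
  fi
}
```

Usage would be `chartname=$(chart_name "$chartYAML")` in place of the current line 31. Pinning a single yq version for the script is the simpler alternative.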

fetching charts

Error: parsing expression: Lexer error: could not match text starting at 1:1 failing at 1:4.

unmatched text: "rea"

  

Installing Minio

Release "minio" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: minio

  

Installing Minio Bucket Lifecycle Manager

Release "minio-lifecycle-mgr" does not exist. Installing it now.

W0713 10:41:42.545257 29382 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob

W0713 10:41:43.212232 29382 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob

NAME: minio-lifecycle-mgr

LAST DEPLOYED: Wed Jul 13 10:41:40 2022

NAMESPACE: mla

STATUS: deployed

REVISION: 1

TEST SUITE: None

  

Installing Grafana

Release "grafana" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: grafana

  

Installing Grafana Dashboards

configmap/grafana-dashboards-kkp-kubernetes created

configmap/grafana-dashboards-kubernetes-overview created

  

Installing Consul for Cortex

Release "consul" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: consul

  

Installing Cortex

configmap/cortex-runtime-config created

Release "cortex" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: cortex, memcached, memcached, memcached, memcached, memcached, memcached, memcached

  

Installing Loki

Release "loki-distributed" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: loki-distributed

  

Installing Alertmanager Proxy

Release "alertmanager-proxy" does not exist. Installing it now.

walk.go:74: found symbolic link in path: /home/talha/kubermatic/user-mla-issues/mla/charts/alertmanager-proxy/test/test.sh resolves to /home/talha/kubermatic/user-mla-issues/mla/hack/test-chart-rendering.sh. Contents of linked file included and used

NAME: alertmanager-proxy

LAST DEPLOYED: Wed Jul 13 10:41:51 2022

NAMESPACE: mla

STATUS: deployed

REVISION: 1

TEST SUITE: None

Done (migrated to kubermatic-installer)

Partial installation of MLA stack in case of limited resources

The MLA stack partially fails when resources are limited, and components that depend on the failed ones then fail to start. Clean up the installation and provision more resources before retrying. Maybe we can update the deploy script to check resource availability before provisioning the MLA stack.
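
A rough idea of what such a pre-flight check could look like, as a sketch only. The helper names and the required-CPU threshold are hypothetical, the conversion handles only whole cores ("4") and millicores ("3920m"), and node allocatable CPU is an upper bound, not the capacity that is actually still free.

```shell
#!/usr/bin/env bash
# Sketch only: all function names below are hypothetical.

# Convert a Kubernetes CPU quantity ("4" or "3920m") to millicores.
# Fractional values such as "0.5" are not handled.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Sum CPU quantities (one per line on stdin) and print the total in millicores.
sum_millicores() {
  total=0
  while read -r qty; do
    [ -n "$qty" ] || continue
    total=$(( total + $(cpu_to_millicores "$qty") ))
  done
  echo "$total"
}

# Fail early if the cluster's allocatable CPU is below the required amount.
# $1: required millicores (the caller chooses the threshold).
check_allocatable_cpu() {
  required="$1"
  available="$(kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.cpu}{"\n"}{end}' \
    | sum_millicores)"
  if [ "$available" -lt "$required" ]; then
    echo "not enough allocatable CPU: ${available}m < ${required}m" >&2
    return 1
  fi
}
```

The deploy script could call `check_allocatable_cpu` before the first `helm upgrade --install`, so that it stops before anything is half-installed.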

Done (migrated to kubermatic-installer which handles it)

MLA stack causes other workloads to crash & restart

If the MLA stack is not installed on dedicated machine deployments, it causes other workloads to run out of memory/CPU. For this reason, users should be informed and asked to use a separate MD with minimum specs to avoid any issues.

The documentation contains the required resources for the MLA installation.

Pods are not scheduled on nodes provisioned specifically for user mla

I have created a machine deployment for user MLA so that all workloads related to user MLA are scheduled on these nodes. For some reason, all the workloads get scheduled fine except for:

cortex-memcached-blocks
cortex-memcached-blocks-index
cortex-memcached-blocks-metadata

MD Values

    spec:
      metadata:
        labels:
          machinepool: run-stables-mla-az-d
          machine: run-stables-mla-az-b
          workload: infra-mla
          node-role.kubernetes.io/infra-mla: ""
      taints:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra-mla

MLA Values

cortex:
  memcached-blocks-metadata:
    resources:
      requests:
        cpu: 5m
    serviceAccount:
      create: false
    tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/infra-mla
    nodeSelector:
      workload: infra-mla

  memcached-blocks-index:
    resources:
      requests:
        cpu: 5m
    serviceAccount:
      create: false
    tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/infra-mla
    nodeSelector:
      workload: infra-mla

  memcached-blocks:
    resources:
      requests:
        cpu: 5m
    serviceAccount:
      create: false
    tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/infra-mla
    nodeSelector:
      workload: infra-mla

A quick fix is to move the nodeSelector and tolerations for these components out of the cortex context:

cortex:
  .....
  .....

memcached-blocks:
  tolerations:
    - effect: NoSchedule
      operator: Exists
      key: node-role.kubernetes.io/infra-mla
  nodeSelector:
    workload: infra-mla

memcached-blocks-index:
  tolerations:
    - effect: NoSchedule
      operator: Exists
      key: node-role.kubernetes.io/infra-mla
  nodeSelector:
    workload: infra-mla

memcached-blocks-metadata:
  tolerations:
    - effect: NoSchedule
      operator: Exists
      key: node-role.kubernetes.io/infra-mla
  nodeSelector:
    workload: infra-mla

This is due to how Helm works; the docs should be improved.
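
Until the docs cover this, a small check on the values file can catch the misplacement before deploying. The sketch below assumes the memcached charts are direct dependencies of the wrapper chart (which the "missing in charts/ directory: cortex, memcached, ..." error earlier in this issue suggests), so their values have to start at column 0 rather than under `cortex:`. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: check_memcached_top_level is a hypothetical helper.

# Succeed only if every memcached key appears as a top-level key
# (starting at column 0) in the given values file.
check_memcached_top_level() {
  values_file="$1"
  for key in memcached-blocks memcached-blocks-index memcached-blocks-metadata; do
    if ! grep -q "^${key}:" "$values_file"; then
      echo "missing top-level key: ${key}" >&2
      return 1
    fi
  done
}
```

This is a plain text match, not a YAML parse, so it only tells you whether the keys are at the top level, not whether the tolerations below them are correct.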

Consul chart fails to install when no default storage class is set

The pods stay in the Pending state, and describing the PVC shows that no persistent volumes are available for the claim and no storage class is set. Basically, when no storage class is marked as default, the Consul chart rolls back the installation.

example solution

metadata:
  name: kubermatic-fast
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: 'true'

we can set it to use kubermatic-fast as default...
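
A pre-flight check along these lines could stop the install early when no class is marked as default. This is a sketch: both function names are hypothetical, and the check relies on `kubectl get storageclass` printing "(default)" next to the default class. The hint it prints uses the non-beta annotation `storageclass.kubernetes.io/is-default-class`, which is an assumption on my side; the beta form quoted above is kept as written.

```shell
#!/usr/bin/env bash
# Sketch only: both functions are hypothetical helpers.

# Reads `kubectl get storageclass` output on stdin and succeeds
# if any class is marked "(default)".
has_default_storage_class() {
  grep -q '(default)'
}

# Abort with a hint if the cluster has no default StorageClass.
require_default_storage_class() {
  if ! kubectl get storageclass | has_default_storage_class; then
    echo "no default StorageClass found; mark one as default, for example:" >&2
    echo "  kubectl patch storageclass kubermatic-fast -p '{\"metadata\":{\"annotations\":{\"storageclass.kubernetes.io/is-default-class\":\"true\"}}}'" >&2
    return 1
  fi
}
```

Calling `require_default_storage_class` before the Consul step would turn the silent rollback into an explicit error message.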


csengerszabo · 2022-08-19T13:20:02Z

Note: consider shipping user cluster MLA within our forthcoming Applications feature.

csengerszabo · 2022-08-30T09:40:03Z

Adding it to Applications is not possible, because user MLA is on seed level, and Applications can only touch user level.
New way to tackle this: we should probably put the User MLA and seed MLA installation into the KKP installation itself.
We should do it in the installer rather than in the operator, because parameters need to be collected interactively and because of upgrades.
Differentiate CE and EE, single MLA installation from CE
Check if it makes sense to use the Grafana monitoring stack as referred to in Can we expand the Grafana stack.... #126 before we move on to work on the installer

ewassef · 2022-09-02T17:38:25Z

This is a great issue, and I would be happy to help where possible. Another issue we ran into is the hard-coded Prometheus pod limits in the control plane. The pods get into a bad state and start failing when the WAL increases in size. 1Gi should be big enough, but we regularly see it failing and have to kill the pod to delete the WAL.

wurbanski · 2022-09-14T15:47:55Z

Check if it makes sense to use the Grafana monitoring stack as referred to in #126 before we move on to work on the installer

After the initial research, we have decided to focus on adding the MLA installation to the KKP installer first, and afterwards to test and replace Prometheus and Promtail in the user cluster with grafana-agent instances: kubermatic/kubermatic#10971

Research about Tempo will be taken care of later (probably in the next release): kubermatic/kubermatic#10974

csengerszabo changed the title ~~Create a list of issues concerning installing the User MLA stack~~ Improving installation of user MLA stack Jul 28, 2022

csengerszabo added the Epic label Aug 1, 2022

csengerszabo added the priority/high and sig/cluster-management labels Aug 12, 2022

csengerszabo added the sig/app-management label Aug 24, 2022

csengerszabo added the ee label Aug 30, 2022

wurbanski self-assigned this Nov 3, 2022

wurbanski closed this as completed Nov 10, 2022
