Improving installation of user MLA stack #133

Closed
5 tasks done
csengerszabo opened this issue Jul 28, 2022 · 4 comments
Labels: ee, Epic, priority/high, sig/app-management, sig/cluster-management

csengerszabo commented Jul 28, 2022

Reference: kubermatic/ps-team-flotilla#103

@stroebitzer commented on Wed Jul 06 2022

While working on the KKP Admin training, I stumbled from one issue to the next when installing the User MLA stack into my KKP installation.

The current installation method is effectively an alpha version. To provide a smooth experience for our customers, we should improve the installation process.

One option could be to move the installation from the hack/deploy-seed.sh script to our kubermatic-installer.

This ticket is about:

  • Collecting the issues we currently have with installing User MLA
  • Creating a proposal for how we can improve the installation experience

@talhalatiforakzai commented on Thu Jul 14 2022

Issues with installation of User MLA

Errors while deploying the MLA stack through the helper script

This issue arises with yq version 4.25.2, which no longer accepts the v3-style read (r) subcommand. To fix it, edit lines 31 and 35 in hack/fetch-chart-dependencies.sh:
line 31: change chartname=$(yq read "$chartYAML" name) to chartname=$(yq '.name' "$chartYAML")
line 35: change for url in $(yq r "$chartYAML" dependencies --tojson | jq -r .[].repository); do to for url in $(yq '.dependencies.[].repository' "$chartYAML"); do

fetching charts
Error: parsing expression: Lexer error: could not match text starting at 1:1 failing at 1:4.
unmatched text: "rea"

Installing Minio
Release "minio" does not exist. Installing it now.
Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: minio

Installing Minio Bucket Lifecycle Manager
Release "minio-lifecycle-mgr" does not exist. Installing it now.
W0713 10:41:42.545257 29382 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
W0713 10:41:43.212232 29382 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
NAME: minio-lifecycle-mgr
LAST DEPLOYED: Wed Jul 13 10:41:40 2022
NAMESPACE: mla
STATUS: deployed
REVISION: 1
TEST SUITE: None

Installing Grafana
Release "grafana" does not exist. Installing it now.
Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: grafana

Installing Grafana Dashboards
configmap/grafana-dashboards-kkp-kubernetes created
configmap/grafana-dashboards-kubernetes-overview created

Installing Consul for Cortex
Release "consul" does not exist. Installing it now.
Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: consul

Installing Cortex
configmap/cortex-runtime-config created
Release "cortex" does not exist. Installing it now.
Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: cortex, memcached, memcached, memcached, memcached, memcached, memcached, memcached

Installing Loki
Release "loki-distributed" does not exist. Installing it now.
Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: loki-distributed

Installing Alertmanager Proxy
Release "alertmanager-proxy" does not exist. Installing it now.
walk.go:74: found symbolic link in path: /home/talha/kubermatic/user-mla-issues/mla/charts/alertmanager-proxy/test/test.sh resolves to /home/talha/kubermatic/user-mla-issues/mla/hack/test-chart-rendering.sh. Contents of linked file included and used
NAME: alertmanager-proxy
LAST DEPLOYED: Wed Jul 13 10:41:51 2022
NAMESPACE: mla
STATUS: deployed
REVISION: 1
TEST SUITE: None
  • Done (migrated to kubermatic-installer)
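
For anyone still running the helper script by hand, the yq v3 versus v4 difference described above could be handled with a small version check instead of editing the script per machine. The following is only a sketch: the function names yq_major_version and chart_name are hypothetical and not part of the MLA repository, and the parsing assumes that the version number is the last word printed by `yq --version`.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical, not part of hack/fetch-chart-dependencies.sh.

# Extract the major version from the text printed by `yq --version`,
# e.g. "yq version 3.4.1" or "yq (https://github.com/mikefarah/yq/) version 4.25.2".
# Assumption: the version number is the last whitespace-separated field.
yq_major_version() {
  local version_output="$1"
  local version="${version_output##* }"  # keep the last field
  version="${version#v}"                 # drop an optional leading "v"
  echo "${version%%.*}"                  # keep everything before the first dot
}

# Print the chart name from a Chart.yaml, using the syntax that matches the installed yq.
chart_name() {
  local chart_yaml="$1"
  local major
  major="$(yq_major_version "$(yq --version 2>&1)")"
  if [ "$major" -ge 4 ]; then
    yq '.name' "$chart_yaml"      # yq v4 expression syntax
  else
    yq read "$chart_yaml" name    # yq v3 subcommand syntax
  fi
}
```

With these helpers sourced, `chart_name path/to/Chart.yaml` would print the chart name with either yq major version installed.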

Partial installation of the MLA stack in case of limited resources

The MLA stack partially fails when cluster resources are limited, and components that depend on the failed ones then fail to start as well. Clean up the installation and provision sufficient resources before retrying. We could also update the deploy script to check resource availability before provisioning the MLA stack.

  • Done (migrated to kubermatic-installer which handles it)
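
As an illustration of the kind of pre-flight check suggested above, a deploy script could compare the allocatable memory of the nodes against a minimum before installing anything. This is a rough sketch and not how kubermatic-installer does it: all function names are hypothetical, only the Ki, Mi and Gi suffixes and plain byte counts are handled, and allocatable memory is an upper bound that ignores what running pods already request.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for a memory pre-flight check.

# Convert a Kubernetes memory quantity (Ki, Mi, Gi or plain bytes) to MiB, rounded down.
quantity_to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo $(( $1 / 1024 / 1024 )) ;;  # assume plain bytes
  esac
}

# Sum the quantities passed as arguments and print the total in MiB.
total_allocatable_mib() {
  local total=0
  local quantity
  for quantity in "$@"; do
    total=$(( total + $(quantity_to_mib "$quantity") ))
  done
  echo "$total"
}

# Succeed when the summed quantities reach the required number of MiB.
# Usage: check_memory <required_mib> <quantity>...
check_memory() {
  local required_mib="$1"
  shift
  [ "$(total_allocatable_mib "$@")" -ge "$required_mib" ]
}
```

A script could then call, for example, `check_memory 16384 $(kubectl get nodes -o jsonpath='{.items[*].status.allocatable.memory}')` and abort when the check fails. The 16384 MiB threshold is a placeholder; the real minimum should come from the resource requirements listed in the documentation.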

MLA stack causes other workloads to crash and restart

If the MLA stack is not installed on dedicated machine deployments, it can cause other workloads to run out of memory or CPU. Users should therefore be informed and asked to use a separate machine deployment (MD) with minimum specs to avoid these issues.

  • documentation contains required resources for the MLA installation

Pods are not scheduled on nodes provisioned specifically for User MLA

I created a machine deployment for User MLA so that all User MLA workloads are scheduled on its nodes. All workloads are scheduled correctly except for:

  • cortex-memcached-blocks
  • cortex-memcached-blocks-index
  • cortex-memcached-blocks-metadata

MD Values

    spec:
      metadata:
        labels:
          machinepool: run-stables-mla-az-d
          machine: run-stables-mla-az-b
          workload: infra-mla
          node-role.kubernetes.io/infra-mla: ""
      taints:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra-mla

MLA Values

cortex:
  memcached-blocks-metadata:
    resources:
      requests:
        cpu: 5m
    serviceAccount:
      create: false
    tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/infra-mla
    nodeSelector:
      workload: infra-mla

  memcached-blocks-index:
    resources:
      requests:
        cpu: 5m
    serviceAccount:
      create: false
    tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/infra-mla
    nodeSelector:
      workload: infra-mla

  memcached-blocks:
    resources:
      requests:
        cpu: 5m
    serviceAccount:
      create: false
    tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/infra-mla
    nodeSelector:
      workload: infra-mla

A quick fix is to move the nodeSelector and tolerations for these components out of the cortex block to the top level of the values file:

cortex:
  .....
  .....

memcached-blocks:
  tolerations:
    - effect: NoSchedule
      operator: Exists
      key: node-role.kubernetes.io/infra-mla
  nodeSelector:
    workload: infra-mla

memcached-blocks-index:
  tolerations:
    - effect: NoSchedule
      operator: Exists
      key: node-role.kubernetes.io/infra-mla
  nodeSelector:
    workload: infra-mla

memcached-blocks-metadata:
  tolerations:
    - effect: NoSchedule
      operator: Exists
      key: node-role.kubernetes.io/infra-mla
  nodeSelector:
    workload: infra-mla
  • this is due to how Helm works - improve docs

Consul chart fails to install when no default storage class is set

The pods stay in the Pending state, and describing the PVC shows that no persistent volumes are available for the claim and no storage class is set. When no storage class is marked as default, the Consul chart rolls back the installation.

Example solution

metadata:
  name: kubermatic-fast
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: 'true'
  • we can set it to use kubermatic-fast as default...
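
Before installing, it may help to verify that some storage class is actually marked as default. The sketch below assumes the usual `kubectl get storageclass` table output, in which the default class is listed with "(default)" after its name. The helper name has_default_storageclass is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper that reads `kubectl get storageclass` output from stdin
# and succeeds when at least one class is marked "(default)".
has_default_storageclass() {
  grep -q '(default)'
}

# Usage (requires cluster access):
#   kubectl get storageclass | has_default_storageclass || echo "no default StorageClass set"
#
# Marking kubermatic-fast as the default, using the non-beta annotation name:
#   kubectl patch storageclass kubermatic-fast \
#     -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
```

The patch command uses storageclass.kubernetes.io/is-default-class rather than the beta annotation shown in the example above. Which annotation a given cluster expects is worth checking against its Kubernetes version.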

csengerszabo changed the title from "Create a list of issues concerning installing the User MLA stack" to "Improving installation of user MLA stack" on Jul 28, 2022
csengerszabo added the priority/high and sig/cluster-management labels on Aug 12, 2022
@csengerszabo

Note: consider shipping user cluster MLA within our forthcoming Applications feature.

csengerszabo added the sig/app-management label on Aug 24, 2022

csengerszabo commented Aug 30, 2022

  • Adding it to Applications is not possible because User MLA is installed at the seed level, and Applications can only touch the user cluster level.
  • New way to tackle this: we should probably put the User MLA and seed MLA installation into the KKP installation itself.
  • We should do this in the installer rather than in the operator, because the installer can collect parameters interactively and because of upgrades.
  • Differentiate between CE and EE, single MLA installation from CE
  • Check whether it makes sense to use the Grafana monitoring stack, as referred to in Can we expand the Grafana stack.... #126, before we move on to work on the installer.

ewassef commented Sep 2, 2022

This is a great issue, and I would be happy to help where possible. Another issue we ran into is the hard-coded Prometheus pod limits in the control plane. The pods get into a bad state and start failing when the WAL grows in size. 1Gi should be big enough, but we regularly see it failing and have to kill the pod to delete the WAL.

@wurbanski

Check if it makes sense to use the grafana monitoring stack as referred to in #126 before we move in to work on the installer

After the initial research, we decided to focus on adding the MLA installation to the KKP installer first, and afterwards to test and replace Prometheus and Promtail in the user cluster with grafana-agent instances: kubermatic/kubermatic#10971

Research about Tempo will be taken care of later (next release probably): kubermatic/kubermatic#10974

wurbanski self-assigned this on Nov 3, 2022