This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

fix: the log-cache instance number cannot be adjusted #1310

Open
JimmyMa wants to merge 1 commit into master

Conversation

@JimmyMa (Collaborator) commented Sep 10, 2020

Description

This change makes the log-cache instance number adjustable in the user's values.yml file.
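
For illustration, a user's values.yml under this change might look like the following (the exact key is an assumption here, following kubecf's sizing conventions):

sizing:
  log_cache:
    instances: 3   # assumed key; run three log-cache pods instead of one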

Motivation and Context

The log-cache instance number cannot be adjusted in the user's values.yml file, which makes it hard to run multiple log-cache instances.

How Has This Been Tested?

  1. Tested with a values.yml that does not define the log-cache instance count
  2. Tested with a values.yml that sets 3 log-cache instances

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code has security implications.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@jandubois (Member) commented Sep 10, 2020

Note that log-cache was intentionally removed from the doppler instance group so that it can be made a singleton: #619

There is some theory that log-cache has a memory leak when running with more than one instance due to the way the caches communicate with each other: cloudfoundry/log-cache-release#29

Later experiments seem to indicate that log-cache doesn't have a leak, but uses the total memory size of the node it is running on to determine the cache size, making it look like it has unbounded memory usage: cloudfoundry/gosigar#48

This issue still seems unresolved, so I think it is premature to allow the instance count to be increased right now.

Once the memory configuration question has been resolved, we should first decide if log-cache should be returned to the doppler instance group before deciding if log-cache needs to be independently scalable.

@fargozhu added the changelog (Issue must be present in the release notes.) label on Sep 10, 2020
@JimmyMa (Collaborator, Author) commented Sep 10, 2020

@jandubois thank you for your comment; I didn't know the history, and your comment helped me a lot.

I think having only one log-cache pod is not good for a production environment, and there seems to be no quick solution for cloudfoundry/gosigar#48 . A workaround for the memory issue could be to set a lower memory percentage for log-cache: https://github.com/cloudfoundry/log-cache-release/blob/develop/jobs/log-cache/spec#L35-L37

Another blocking issue for multiple log-cache pods is #1307 : log-cache will not be functional when there are multiple log-cache pods, because their NODE_INDEX is the same (0); please see: https://github.com/cloudfoundry/log-cache-release/blob/fa637cd7e2899d58930f893e52b9911cb8672755/src/internal/cache/log_cache.go#L181-L221

Adding the statements below to log-cache.yml could give each log-cache pod a different NODE_INDEX, but this only works without multi-AZ; when multi-AZ is enabled, log-cache pods in different AZs will end up with duplicate NODE_INDEX values.

- type: replace
  path: /instance_groups/name=log-cache/jobs/name=log-cache/properties/quarks?
  value:
    envs:
    # Take NODE_INDEX from the pod ordinal label that quarks-operator sets on each pod.
    - name: NODE_INDEX
      valueFrom:
        fieldRef:
          fieldPath: metadata.labels['quarks.cloudfoundry.org/pod-ordinal']

@jandubois added the Status: Blocked (Dependencies on other issues and/or pull requests) label on Sep 10, 2020
@jandubois (Member) commented

@JimmyMa We talked about this PR in our planning meeting today and created #1312 to create a PR to fix the issue upstream.

If we do decide to merge this PR, it should be changed so that even when high_availability is set, the count remains at 1, and the values.yaml entry should have a warning about increasing the instance count.
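
A sketch of what that guarded values.yaml entry could look like (key name and warning wording are assumptions, not the final change):

sizing:
  log_cache:
    # WARNING: keep this at 1. Multiple instances currently all get the same
    # NODE_INDEX (#1307) and may show unbounded memory growth
    # (cloudfoundry/gosigar#48). high_availability does not raise this count.
    instances: 1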

Here is a workaround we used before to limit the log-cache memory growth:

bosh:
  instance_groups:
  - name: log-cache
    jobs:
    - name: log-cache
      properties:
        # Cap the cache at 3% of the node's total memory.
        memory_limit_percent: 3

This of course requires that all your nodes have the same total memory size; otherwise you need to use taints and tolerations to limit your log-cache pods to nodes of a uniform size.
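
As an illustration of that taints-and-tolerations approach (the taint key, label, and values are made up for the example):

# Taint the uniformly-sized nodes first, e.g.:
#   kubectl taint nodes <node-name> dedicated=log-cache:NoSchedule
# then let only log-cache pods schedule there:
tolerations:
- key: dedicated
  operator: Equal
  value: log-cache
  effect: NoSchedule
nodeSelector:
  memory-class: uniform-16g   # assumed label marking nodes with identical memory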

I think I would prefer to not make any changes until the upstream issue has been fixed; it all feels like a hack otherwise.

Once upstream can determine the memory available inside the pod, we can set a memory limit, and everything should start working. Is there a reason we shouldn't then move log-cache back into the doppler role (like cf-deployment) instead of having it in its own instance group?
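
With that fixed, the remaining work should reduce to an ordinary container memory limit, something like (size illustrative):

resources:
  limits:
    memory: 2Gi   # once log-cache sizes its cache from the cgroup limit, this bounds it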

@jandubois (Member) left a comment

Just blocking the PR from getting accidentally merged. See the comment above about what we would like to see if we are to merge before the upstream fix is in place.

@fargozhu added this to the 2.6.0 milestone on Oct 11, 2020
@fargozhu removed this from the 2.6.0 milestone on Oct 21, 2020
…control_plane is enabled

  The multiple-cluster-mode-operations removes '/instance_groups/name=api', so any other operation applied after it gets an error if it tries to operate on '/instance_groups/name=api'.
  The multiple-cluster-mode-operations must therefore be at the bottom of chart/templates/bosh_deployment.yaml; see the illustration below.
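
To illustrate the ordering problem (op contents assumed; only the paths come from the commit message):

# multiple-cluster-mode operation, applied first:
- type: remove
  path: /instance_groups/name=api
# any later op that touches the removed group now fails:
- type: replace
  path: /instance_groups/name=api/instances
  value: 2   # error: instance group 'api' no longer exists
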
@jandubois (Member) commented

Since I was just looking at a BPM rendering issue, I wanted to leave this here, for whenever we get back to it:

The reason NODE_INDEX is wrong is that it is set in bpm.yml.erb from spec.id: https://github.com/cloudfoundry/log-cache-release/blob/3229f06/jobs/log-cache/templates/bpm.yml.erb#L9-L11

Quarks-operator renders bpm.yml.erb just once in the ig job and uses the rendered file to generate the pod spec for the StatefulSets. That means spec.address, spec.index, spec.id, etc. in BPM template rendering always contain the values for instance 0.
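
Paraphrasing the linked template (a rough sketch of the pattern, not an exact quote of those lines):

processes:
- name: log-cache
  env:
    # Interpolated when quarks-operator renders the template, which happens
    # only once, so every replica inherits instance 0's value.
    NODE_INDEX: "<%= spec.id %>"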

Labels
changelog (Issue must be present in the release notes.)
Status: Blocked (Dependencies on other issues and/or pull requests)

3 participants