This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

fix: the log-cache instance number cannot be adjusted #1310

Open
JimmyMa wants to merge 1 commit into master

Conversation

@JimmyMa (Collaborator) commented Sep 10, 2020

Description

This change makes the log-cache instance number adjustable in the user's values.yml file.
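
For illustration, a user's values.yml under this change might look like the following (the exact key is an assumption here, following kubecf's sizing conventions):

sizing:
  log_cache:
    instances: 3   # assumed key; run three log-cache pods instead of one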

Motivation and Context

The log-cache instance number cannot be adjusted in the user's values.yml file, which makes it hard to run multiple log-cache instances.

How Has This Been Tested?

  1. Tested with a values.yml that does not define the log-cache instance count
  2. Tested with a values.yml that sets 3 log-cache instances

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code has security implications.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@jandubois (Member) commented Sep 10, 2020

Note that log-cache was intentionally removed from the doppler instance group so that it can be made a singleton: #619

There is some theory that log-cache has a memory leak when running with more than one instance due to the way the caches communicate with each other: cloudfoundry/log-cache-release#29

Later experiments seem to indicate that log-cache doesn't have a leak, but uses the total memory size of the node it is running on to determine the cache size, making it look like it has unbounded memory usage: cloudfoundry/gosigar#48

This issue still seems unresolved, so I think it is premature to allow the instance count to be increased right now.

Once the memory configuration question has been resolved, we should first decide if log-cache should be returned to the doppler instance group before deciding if log-cache needs to be independently scalable.

@fargozhu added the changelog (Issue must be present in the release notes.) label on Sep 10, 2020
@JimmyMa (Collaborator, Author) commented Sep 10, 2020

@jandubois thank you for your comment; I didn't know the history, and your comment helped me a lot.

I think having only one log-cache pod is not good for a production environment, and there seems to be no quick solution for cloudfoundry/gosigar#48 . A workaround for the memory issue could be to set a lower memory percentage for log-cache: https://github.com/cloudfoundry/log-cache-release/blob/develop/jobs/log-cache/spec#L35-L37

Another blocking issue for multiple log-cache pods is #1307 : log-cache will not be functional when there are multiple log-cache pods, because their NODE_INDEX is the same (0); please see: https://github.com/cloudfoundry/log-cache-release/blob/fa637cd7e2899d58930f893e52b9911cb8672755/src/internal/cache/log_cache.go#L181-L221

Adding the statements below to log-cache.yml could give each log-cache pod a different NODE_INDEX, but this only works without multi-AZ; when multi-AZ is enabled, log-cache pods in different AZs will end up with duplicate NODE_INDEX values.

- type: replace
  path: /instance_groups/name=log-cache/jobs/name=log-cache/properties/quarks?
  value:
    envs:
    # Take NODE_INDEX from the pod ordinal label that quarks-operator sets on each pod.
    - name: NODE_INDEX
      valueFrom:
        fieldRef:
          fieldPath: metadata.labels['quarks.cloudfoundry.org/pod-ordinal']

@jandubois added the Status: Blocked (Dependencies on other issues and/or pull requests) label on Sep 10, 2020
@jandubois (Member) commented

@JimmyMa We talked about this PR in our planning meeting today and created #1312 to create a PR to fix the issue upstream.

If we do decide to merge this PR, it should be changed so that even when high_availability is set, the count remains at 1, and the values.yaml entry should have a warning about increasing the instance count.
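
A sketch of what that guarded values.yaml entry could look like (key name and warning wording are assumptions, not the final change):

sizing:
  log_cache:
    # WARNING: keep this at 1. Multiple instances currently all get the same
    # NODE_INDEX (#1307) and may show unbounded memory growth
    # (cloudfoundry/gosigar#48). high_availability does not raise this count.
    instances: 1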

Here is a workaround we used before to limit the log-cache memory growth:

bosh:
  instance_groups:
  - name: log-cache
    jobs:
    - name: log-cache
      properties:
        # Cap the cache at 3% of the node's total memory.
        memory_limit_percent: 3

This of course requires that all your nodes have the same total memory size; otherwise you need to use taints and tolerations to limit your log-cache pods to nodes of a uniform size.
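
As an illustration of that taints-and-tolerations approach (the taint key, label, and values are made up for the example):

# Taint the uniformly-sized nodes first, e.g.:
#   kubectl taint nodes <node-name> dedicated=log-cache:NoSchedule
# then let only log-cache pods schedule there:
tolerations:
- key: dedicated
  operator: Equal
  value: log-cache
  effect: NoSchedule
nodeSelector:
  memory-class: uniform-16g   # assumed label marking nodes with identical memory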

I think I would prefer to not make any changes until the upstream issue has been fixed; it all feels like a hack otherwise.

Once upstream can determine the memory available inside the pod, we can set a memory limit, and everything should start working. Is there a reason we shouldn't then move log-cache back into the doppler role (like cf-deployment) instead of having it in its own instance group?
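
With that fixed, the remaining work should reduce to an ordinary container memory limit, something like (size illustrative):

resources:
  limits:
    memory: 2Gi   # once log-cache sizes its cache from the cgroup limit, this bounds it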

@jandubois (Member) left a comment

Just blocking the PR from getting accidentally merged. See the comment above about what we would like to see if we are to merge before the upstream fix is in place.

@fargozhu added this to the 2.6.0 milestone on Oct 11, 2020
@fargozhu removed this from the 2.6.0 milestone on Oct 21, 2020
…control_plane is enabled

  The multiple-cluster-mode-operations removes '/instance_groups/name=api', so any other operation applied after it gets an error if it tries to operate on '/instance_groups/name=api'.
  The multiple-cluster-mode-operations must therefore be at the bottom of chart/templates/bosh_deployment.yaml; see the illustration below.
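
To illustrate the ordering problem (op contents assumed; only the paths come from the commit message):

# multiple-cluster-mode operation, applied first:
- type: remove
  path: /instance_groups/name=api
# any later op that touches the removed group now fails:
- type: replace
  path: /instance_groups/name=api/instances
  value: 2   # error: instance group 'api' no longer exists
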
@jandubois (Member) commented

Since I was just looking at a BPM rendering issue, I wanted to leave this here, for whenever we get back to it:

The reason NODE_INDEX is wrong is that it is set in bpm.yml.erb from spec.id: https://github.com/cloudfoundry/log-cache-release/blob/3229f06/jobs/log-cache/templates/bpm.yml.erb#L9-L11

Quarks-operator renders bpm.yml.erb just once in the ig job and uses the rendered file to generate the pod spec for the StatefulSets. That means spec.address, spec.index, spec.id, etc. in BPM template rendering always contain the values for instance 0.
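
Paraphrasing the linked template (a rough sketch of the pattern, not an exact quote of those lines):

processes:
- name: log-cache
  env:
    # Interpolated when quarks-operator renders the template, which happens
    # only once, so every replica inherits instance 0's value.
    NODE_INDEX: "<%= spec.id %>"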

Labels
changelog (Issue must be present in the release notes.)
Status: Blocked (Dependencies on other issues and/or pull requests)

3 participants