-
Notifications
You must be signed in to change notification settings - Fork 958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add liveness and readiness probe to aws-for-fluent-bit #947
add liveness and readiness probe to aws-for-fluent-bit #947
Conversation
2280f30
to
f7a96e4
Compare
@@ -59,6 +59,19 @@ spec: | |||
{{- end }} | |||
resources: | |||
{{- toYaml .Values.resources | nindent 12 }} | |||
|
|||
{{- if $.Values.serviceMonitor }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@razorsk8jz so the service monitor is the prometheus operator? I don't understand why that flag should also enable/disable the liveness probe. They are separate things right? Fluent Bit can have a health check even if you don't also run prom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree they can be done separately - we could enable a service and the liveness probe without the serviceMonitor - I think it was originally set up this way as it doesnt appear that aws-for-fluent-bit needed a service for anything other than the serviceMonitor. However we also now cant enable the healthcheck without the service.
What would you think about me changing the way this works to have the service always created unreleated to the serviceMonitor and having the healthcheck always on as well?
For simplicity if we go this route - i'd suggest we also default the values on the extraService at service.extraService with its default values as i have not seen any issues with this yet.
service:
extraService |
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
Health_Check On
HC_Errors_Count 5
HC_Retry_Failure_Count 5
HC_Period 5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I think that's a good idea, enable the monitoring endpoint by default with healthcheck. Then user chooses if they want to use it (service monitor or/and liveness probe).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I think that's a good idea, enable the monitoring endpoint by default with healthcheck. Then user chooses if they want to use it (service monitor or/and liveness probe).
are you ok with me keeping the liveness probe on by default set to the correct path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea.
Ok, I have made the requested adjustments - the k8s service will always be populated as well as the liveness check is now on by default. there was a conflict of naming in the values.yaml for service - updated to fix that similar to how fluent-bit handles their naming I also have been testing this by deploying it to our test environments to make sure i dont see any issues |
stable/aws-for-fluent-bit/README.md
Outdated
| `service.type`| Type of service to be created - options are ClusterIP, NodePort, LoadBalancer |`ClusterIP`| | ||
| `service.port`| TCP port of the serviceMonitor service. | 2020 | | ||
| `service.targetPort`| TCP targetPort for service to connect to fluent-bit. | 2020 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@razorsk8jz is there any way to avoid this backwards breaking change? I will have to check into how EKS Charts should handle that.
stable/aws-for-fluent-bit/README.md
Outdated
|`volumeMounts`| Volume mounts for the pods, provided as a list of volumeMount objects (see values.yaml) | volumes for /var/log and /var/lib/docker/containers are mounted, along with a fluentbit config volume | | ||
|`dnsPolicy`| Optional dnsPolicy |`ClusterFirst`| | ||
|`hostNetwork`| If true, use hostNetwork |`false` | | ||
| `s3.externalId`| Specify an external ID for the STS API, can be used w ith the role_arn parameter if your role requires an external ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you just edited this slightly here but there's a space between w ith
which would be ideal to fix
type: {{ .Values.serviceMonitor.service.type }} | ||
{{- end }} | ||
{{- end }} | ||
type: {{ .Values.service.type }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I do not understand this, why change from Values.serviceMonitor to the backwards breaking name Values.service??
e08db96
to
2ff57ec
Compare
If you let me know how to proceed i'd be happy to get this updated - should i make this more backwards compatible by leaving the |
@razorsk8jz I like this idea, I think we need to keep it backwards compatible:
|
Updated for backwards compatibility - and added k8sService |
Looking for updates - if any more changes are needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@razorsk8jz sorry I am very busy right now, please ping me if I forget about this PR again.
stable/aws-for-fluent-bit/README.md
Outdated
| `config.extraService` | Append to existing service with this value | HTTP_Server On <br> HTTP_Listen 0.0.0.0 <br> HTTP_PORT 2020 <br> Health_Check On <br> HC_Errors_Count 5 <br> HC_Retry_Failure_Count 5 <br> HC_Period 5 | | ||
| `config.parsersFiles` | List of available parser files | `/fluent-bit/parsers/parsers.conf` | | ||
| `config.extraParsers` | Adding more parsers with this value | `""` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@razorsk8jz I think this needs to be changed? We agreed we wouldn't break backwards compatibility right? I don't see these options changed in the yaml either.
stable/aws-for-fluent-bit/README.md
Outdated
| `updateStrategy`| Optional update strategy |`type: RollingUpdate`| | ||
| `affinity`| Map of node/pod affinities |`{}`| | ||
| `env`| Optional List of pod environment variables for the pods |`[]`| | ||
| `livenessProbe`| Optional yaml to define liveness probe |httpGet:<br> path: /api/v1/health <br> port: 2020 <br> scheme: HTTP <br> failureThreshold: 2 <br> initialDelaySeconds: 30 <br> timeoutSeconds: 10 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the descriptoin, can you add a link to here: https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
And note that the monitoring server must be enabled to use this default health check.
I am worried that even though its in the values.yaml, someone might remove that and then break the default healthcheck. What do you think?
All requested changes made - let me know if anything else |
stable/aws-for-fluent-bit/README.md
Outdated
| `k8sService.type`| Type of service to be created - options are ClusterIP, NodePort, LoadBalancer |`ClusterIP`| | ||
| `k8sService.port`| TCP port of the serviceMonitor service. | 2020 | | ||
| `k8sService.targetPort`| TCP targetPort for service to connect to fluent-bit. | 2020 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I am still very confused by this change. Apologies for how much time its taking for you to help me to understand, but I need to understand before I can merge it.
- these options should go down next to the other
serviceMonitor
options right? Since logically they are related. - There are two new namespaces here for options for monitoring FLB with the service monitor-
k8sService.*
andserviceMonitor.*
- why is that? Could it all just be one namespace,serviceMonitor.*
? Is there some logical distinction between these? From looking at the changes, it look like thek8sService
ones are ports exposed by FLB for use by the service monitoring? I will make a comment on those yaml changes. If my understanding is correct, should we use the sameserviceMonitor
options for these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@razorsk8jz Sorry, I am confused, so the k8sService is separate from serviceMonitor? The description here says serviceMonitor. Sorry, please explain exactly what these do to me, and I think the help text should be longer and more descriptive for these options.
the k8sService is not logically tied to the serviceMonitor as it is also needed for health check
I don't understand how its not tied to it. When I check how these configs affect the yaml, it looks like they only affect service monitor, not Fluent bit. I will make a comment on the yaml.
port: {{ .Values.serviceMonitor.service.port }} | ||
port: {{ .Values.k8sService.port }} | ||
protocol: TCP | ||
targetPort: {{ .Values.serviceMonitor.service.targetPort }} | ||
targetPort: {{ .Values.k8sService.targetPort }} | ||
selector: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically a backwards breaking change I think? I'm confused since they are not documented today: https://github.com/aws/eks-charts/tree/master/stable/aws-for-fluent-bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, please re-explain your thinking for adding these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@razorsk8jz this is backwards breaking change right?
type: {{ .Values.serviceMonitor.service.type }} | ||
{{- end }} | ||
{{- end }} | ||
type: {{ .Values.k8sService.type }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an option on FLB container or on the service monitor container?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@razorsk8jz Sorry... I need some more help understanding this change. Thank for working on this and taking the time to educate me.
no worries - i would like you comfortable in the changes being made and understand why they are needed |
@razorsk8jz can you please reply to my previous comments asking for clarification on how these things work. Thanks!! |
Which changes do you need clarification on - the k8sService is not logically tied to the serviceMonitor as it is also needed for health check - since health check is on by default i left creation of service on by default - now you can create a serviceMonitor and the service will already be there for it to use as the health check is also using it. The major change is that the k8sService will always be created now |
@@ -10,12 +8,10 @@ metadata: | |||
spec: | |||
ports: | |||
- name: monitor-agent | |||
port: {{ .Values.serviceMonitor.service.port }} | |||
port: {{ .Values.k8sService.port }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT, port and type for k8sService is not on FLB but is on the serviceMonitor, so to me it makes sense to keep it under the serviceMonitor
config namespace?
stable/aws-for-fluent-bit/README.md
Outdated
| `env`| Optional List of pod environment variables for the pods |`[]`| | ||
| `livenessProbe`| Optional yaml to define liveness probe, [details](https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit) |httpGet:<br> path: /api/v1/health <br> port: 2020 <br> scheme: HTTP <br> failureThreshold: 2 <br> initialDelaySeconds: 30 <br> timeoutSeconds: 10 | | ||
| `readinessProbe`| Optional yaml to define readiness probe |`{}`| | ||
| `serviceMonitor.service.type`| Type of service to be created - options are ClusterIP, NodePort, LoadBalancer |`ClusterIP`| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please be more descriptive here: "Type of service to be created for the prometheus service monitor"
stable/aws-for-fluent-bit/README.md
Outdated
| `livenessProbe`| Optional yaml to define liveness probe, [details](https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit) |httpGet:<br> path: /api/v1/health <br> port: 2020 <br> scheme: HTTP <br> failureThreshold: 2 <br> initialDelaySeconds: 30 <br> timeoutSeconds: 10 | | ||
| `readinessProbe`| Optional yaml to define readiness probe |`{}`| | ||
| `serviceMonitor.service.type`| Type of service to be created - options are ClusterIP, NodePort, LoadBalancer |`ClusterIP`| | ||
| `serviceMonitor.service.port`| TCP port of the serviceMonitor service. | 2020 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What port is this- the port that prometheus will expose the metrics it collected for scraping? Can you be more descriptive?
| `serviceMonitor.service.type`| Type of service to be created - options are ClusterIP, NodePort, LoadBalancer |`ClusterIP`| | ||
| `serviceMonitor.service.port`| TCP port of the serviceMonitor service. | 2020 | | ||
| `serviceMonitor.service.targetPort`| TCP targetPort for service to connect to fluent-bit. | 2020 | | ||
| `serviceMonitor.enabled`| Whether serviceMonitor should be enabled or not, [details](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/getting-started.md) |`false`| ✔ |`[]`| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the .enabled
option as the first serviceMonitor
option
stable/aws-for-fluent-bit/README.md
Outdated
| `updateStrategy`| Optional update strategy |`type: RollingUpdate`| | ||
| `affinity`| Map of node/pod affinities |`{}`| | ||
| `env`| Optional List of pod environment variables for the pods |`[]`| | ||
| `livenessProbe`| Optional yaml to define liveness probe, [details](https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit) |httpGet:<br> path: /api/v1/health <br> port: 2020 <br> scheme: HTTP <br> failureThreshold: 2 <br> initialDelaySeconds: 30 <br> timeoutSeconds: 10 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: may be, to be safe, we should add a sentence here with that link saying that the monitoring interface for Fluent Bit must be enabled in service.extraService
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @razorsk8jz I think this is nearly ready to go, just need a few readme improvements to make things very clear to novice users
@PettitWesley I have made your suggested changes to the README.md - if you see any other changes you would like please let me know |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@razorsk8jz approved! Only thing left to merge this is to rebase (there was a recent release and new chart version). If you could also attach the dry run output from generating the template or other testing evidence that'd be awesome. |
@razorsk8jz I aplogize, I was asked to do another chart release to fix the app version: aws/aws-for-fluent-bit#694 (comment) So you'll have to rebase this one more time, ping when you do, and then I can merge it. |
@PettitWesley I have synced with master and fixed conflicts |
Here are the values i'm using
Here are the results of a helm template. Let me know if you see any issues
|
@razorsk8jz Sorry, but GitHub still won't allow me to merge this PR due to conflicts. Please rebase:
|
@razorsk8jz if needed, I can pull your commits, rebase them, and then put up a PR myself. You will still get credit via your commits. Let me know. |
Please do, I did a rebase and if it still won’t let you merge please do whatever needs done to get this this in, Thanks! |
Closing as #975 was merged. I will release it today. |
Issue
#946
Description of changes
Added ability to configure livenessProbe and readinessProbe for aws-for-fluent-bit - also adjusted the defaults for .Values.service.extraService to turn on the healthcheck endpoint
Checklist
README.md
for modified charts)version
inChart.yaml
for the modified chart(s)Testing
using helm template i made sure the new livenessProbe showed correctly and that the readinessProbe did not show up - i allowed for a readinessProbe but i dont think we need here
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.