Update agg hc (#13)
* added per pod check

* updated output message

* Increased timeout to 10 secs for healths

* Increased healthchecks timeout to 12 secs

* Updated agg-hc

* Removed unused code

* Updated readme
davidbalazs93 authored Dec 4, 2017
1 parent efc1410 commit a6dd1ab
Showing 9 changed files with 44 additions and 200 deletions.
21 changes: 6 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
@@ -6,16 +6,10 @@ The purpose of this service is to aggregate the healthchecks from services and p
## Introduction
In this section, the aggregate-healthcheck functionalities are described.
### Get services health
A service is considered to be healthy if it has at least one pod that is able to serve requests. To determine which pods are able to serve requests,
there is a readinessProbe configured on the deployment, which checks the GoodToGo endpoint of the app running inside the pod. If the GoodToGo responds
with a 503 Service Unavailable status code, the pod will not serve requests anymore, until it will receive 200 OK status code on GoodToGo endpoint.
A service is considered healthy if all of its pods are healthy. To determine which pods are healthy, the aggregate-healthcheck service checks each pod's __health endpoint.

For a service, if there is at least one pod that can serve requests, the service will be considered healthy, but if there are pods that are unavailable,
a message will be displayed in the "Output" section of the corresponding service.
As an exception, if a service is marked as non-resilient (it has the __isResilient: "false"__ label), it will be considered unhealthy if there is at least one pod which is unhealthy.

Not that for services are grouped into categories, therefore there is the possibility to query the aggregate-healthcheck only for a certain list of categories.
If no category is provided, the healthchecks of all services will be displayed.
Note that services are grouped into categories, so it is possible to query the aggregate-healthcheck only for a certain list of categories.
If no category is provided, the health status of all services will be displayed.

### Get pods health for a service
The healths of the pods are evaluated by querying the __health endpoint of apps inside the pods. Given a pod, if there is at least one check that fails,
@@ -25,16 +19,16 @@ The purpose of this service is to aggregate the healthchecks from services and p
the general status of the aggregate-healthcheck will become healthy (it will also mention that there are 'n' services acknowledged).
### Sticky categories
Categories can be sticky, meaning that if one of the services becomes unhealthy, the category will be disabled (it will remain unhealthy)
until it is manually re-enabled. There is an endpoint for enabling a category.
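The sticky behaviour can be modelled as a simple latch. The sketch below illustrates the idea only — the type and method names are hypothetical, not the service's implementation:

```go
package main

import "fmt"

// stickyCategory latches to unhealthy once any member service fails,
// and stays disabled until Enable is called.
type stickyCategory struct {
	disabled bool
}

// Observe records one service's health; an unhealthy observation
// disables the category permanently until a manual re-enable.
func (c *stickyCategory) Observe(serviceHealthy bool) {
	if !serviceHealthy {
		c.disabled = true
	}
}

func (c *stickyCategory) Healthy() bool { return !c.disabled }

// Enable corresponds to the manual re-enable endpoint.
func (c *stickyCategory) Enable() { c.disabled = false }

func main() {
	var c stickyCategory
	c.Observe(false)
	fmt.Println(c.Healthy()) // false, even if the services later recover
	c.Enable()
	fmt.Println(c.Healthy()) // true
}
```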

## Running locally
To run the service locally, you will need to run the following commands first to get the vendored dependencies for this project:
`go get github.com/kardianos/govendor` and
`govendor sync`

There is a limited number of functionalities that can be used locally, because we are querying all the apps, inside the pods and there is no current
There is a limited amount of functionality that can be used locally, because the service queries the apps inside the pods, and there is currently no
solution for accessing them from outside the cluster without using port-forwarding.
The list of functionalities that can be used outside of the cluster are:
The functionality that can be used outside of the cluster is:
* Add/Remove acknowledge
* Enable/Disable sticky categories

@@ -45,9 +39,6 @@ To run the service locally, you will need to run the following commands first to
* The Kubernetes service should have __hasHealthcheck: "true"__ label.
* The container should have Kubernetes `readinessProbe` configured to check the `__gtg` endpoint of the app
* The app should have `__gtg` and `__health` endpoints.
* Optionally the Kubernetes service can have:
- `isResilient: "false"` label which will cause the service to be unhealthy if there is at least one pod that is unhealthy. Default value for `isResilient` flag is `true`
- `isDaemon: "true"` label which indicates that the pods are managed by a daemonSet instead of a deployment. Default value for `isDaemon` flag is `false`, meaning that pods are managed by a Deployment.

## How to configure categories for aggregate-healthcheck
Categories are stored in Kubernetes ConfigMaps.
61 changes: 13 additions & 48 deletions checkerService.go
@@ -19,62 +19,27 @@ type healthcheckResponse struct {
}

func (hs *k8sHealthcheckService) checkServiceHealth(service service) (string, error) {
var noOfAvailablePods, noOfUnavailablePods int32
var err error
if service.isDaemon {
noOfAvailablePods, noOfUnavailablePods, err = hs.getPodAvailabilityForDaemonSet(service)
} else {
noOfAvailablePods, noOfUnavailablePods, err = hs.getPodAvailabilityForDeployment(service)
}

pods, err := hs.getPodsForService(service.name)
if err != nil {
return "", err
return "", fmt.Errorf("Cannot retrieve pods for service with name %s to perform healthcheck, error was: %s", service.name, err)
}

return checkServiceHealthByResiliency(service, noOfAvailablePods, noOfUnavailablePods)
}

func (hs *k8sHealthcheckService) getPodAvailabilityForDeployment(service service) (int32, int32, error) {
hs.deployments.RLock()
k8sDeployment, ok := hs.deployments.m[service.name]
defer hs.deployments.RUnlock()

if !ok {
return 0, 0, fmt.Errorf("Error retrieving deployment with name %s", service.name)
}

noOfUnavailablePods := k8sDeployment.numberOfUnavailableReplicas
noOfAvailablePods := k8sDeployment.numberOfAvailableReplicas

return noOfAvailablePods, noOfUnavailablePods, nil
}

func (hs *k8sHealthcheckService) getPodAvailabilityForDaemonSet(service service) (int32, int32, error) {
daemonSet, err := hs.k8sClient.ExtensionsV1beta1().DaemonSets("default").Get(service.name)
if err != nil {
return 0, 0, fmt.Errorf("Error retrieving daemonset with name %s", service.name)
}

noOfAvailablePods := daemonSet.Status.NumberReady
noOfUnavailablePods := daemonSet.Status.DesiredNumberScheduled - noOfAvailablePods

return noOfAvailablePods, noOfUnavailablePods, nil
}

func checkServiceHealthByResiliency(service service, noOfAvailablePods int32, noOfUnavailablePods int32) (string, error) {
if noOfAvailablePods == 0 {
return "", errors.New("All pods are unavailable")
}

if !service.isResilient && noOfUnavailablePods != 0 {
return "", fmt.Errorf("There are %v pods unavailable", noOfUnavailablePods)
noOfUnavailablePods := 0
for _, pod := range pods {
err := hs.checkPodHealth(pod, service.appPort)
if err != nil {
noOfUnavailablePods++
}
}

if service.isResilient && noOfUnavailablePods != 0 {
return fmt.Sprintf("There are %v pods unavailable", noOfUnavailablePods), nil
totalNoOfPods := len(pods)
outputMsg := fmt.Sprintf("%v/%v pods available", totalNoOfPods - noOfUnavailablePods, totalNoOfPods)
if totalNoOfPods == 0 || noOfUnavailablePods != 0 {
return "", errors.New(outputMsg)
}

return "", nil
return outputMsg, nil
}

func (hs *k8sHealthcheckService) checkPodHealth(pod pod, appPort int32) error {
8 changes: 0 additions & 8 deletions controller_test.go
@@ -127,14 +127,6 @@ func (m *MockService) checkServiceHealth(service service) (string, error) {
return "", errors.New("Error reading healthcheck response: ")
}

func (m *MockService) getPodAvailabilityForDeployment(service service) (int32, int32, error) {
return 0, 0, errors.New("")
}

func (m *MockService) getPodAvailabilityForDaemonSet(service service) (int32, int32, error) {
return 0, 0, errors.New("")
}

func (m *MockService) checkPodHealth(pod, int32) error {
return errors.New("Error reading healthcheck response: ")
}
2 changes: 1 addition & 1 deletion helm/upp-aggregate-healthcheck/values.yaml
@@ -3,7 +3,7 @@
# Declare variables to be passed into your templates.
service:
name: "" # The name of the service, should be defined in the specific app-configs folder.
hasHealthcheck: "true"
hasHealthcheck: "false"
replicaCount: 1
image:
repository: coco/upp-aggregate-healthcheck
18 changes: 16 additions & 2 deletions html-templates/healthcheck-template.html
@@ -40,8 +40,8 @@ <h1>{{.PageTitle}}
<tr>
<th>Name</th>
<th>Health status</th>
<th>Last Updated</th>
<th>Output</th>
<th>Last Updated</th>
<th>Ack msg</th>
<th>Action</th>
</tr>
@@ -66,8 +66,22 @@ <h1>{{.PageTitle}}
{{end}}
{{end}}
</td>
<td>
{{if eq .Status "ok"}}
<span style='color: green;'>{{.Output}}</span>
{{else}}
{{if eq .Status "warning"}}
<span style='color: orange;'>{{.Output}}</span>
{{else}}
{{if eq .Status "critical"}}
<span style='color: red;'>{{.Output}}</span>
{{else}}
<span style='color: blue;'>{{.Output}}</span>
{{end}}
{{end}}
{{end}}
</td>
<td>&nbsp;{{.LastUpdated}}</td>
<td>{{.Output}}</td>
<td>&nbsp;<span style='color: blue;'><em>{{.AckMessage}}</em></span></td>
{{if ne .AddOrRemoveAckPath ""}}
<td><a href="{{.AddOrRemoveAckPath}}">{{.AddOrRemoveAckPathName}}</a></td>
5 changes: 0 additions & 5 deletions model.go
@@ -25,11 +25,6 @@ type deployment struct {
numberOfUnavailableReplicas int32
}

type deploymentsMap struct {
sync.RWMutex
m map[string]deployment
}

type service struct {
name string
ack string
48 changes: 1 addition & 47 deletions service.go
@@ -4,7 +4,6 @@ import (
"fmt"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/pkg/api/v1"
k8s "k8s.io/client-go/pkg/apis/extensions/v1beta1"
"k8s.io/client-go/pkg/watch"
"k8s.io/client-go/rest"
"net"
@@ -17,7 +16,6 @@
type k8sHealthcheckService struct {
k8sClient kubernetes.Interface
httpClient *http.Client
deployments deploymentsMap
services servicesMap
}

@@ -30,8 +28,6 @@ type healthcheckService interface {
getPodsForService(string) ([]pod, error)
getPodByName(string) (pod, error)
checkServiceHealth(service) (string, error)
getPodAvailabilityForDeployment(service) (int32, int32, error)
getPodAvailabilityForDaemonSet(service) (int32, int32, error)
checkPodHealth(pod, int32) error
getIndividualPodSeverity(pod, int32) (uint8, error)
getHealthChecksForPod(pod, int32) (healthcheckResponse, error)
@@ -121,48 +117,9 @@ func (hs *k8sHealthcheckService) watchServices() {
hs.watchServices()
}

func (hs *k8sHealthcheckService) watchDeployments() {
watcher, err := hs.k8sClient.ExtensionsV1beta1().Deployments("default").Watch(v1.ListOptions{})

if err != nil {
errorLogger.Printf("Error while starting to watch deployments: %s", err.Error())
}

infoLogger.Print("Started watching deployments")
resultChannel := watcher.ResultChan()
for msg := range resultChannel {
switch msg.Type {
case watch.Added, watch.Modified:
k8sDeployment := msg.Object.(*k8s.Deployment)
deployment := deployment{
numberOfAvailableReplicas: k8sDeployment.Status.AvailableReplicas,
numberOfUnavailableReplicas: k8sDeployment.Status.UnavailableReplicas,
}

hs.deployments.Lock()
hs.deployments.m[k8sDeployment.Name] = deployment
hs.deployments.Unlock()

infoLogger.Printf("Deployment %s has been added or updated: No of available replicas: %d, no of unavailable replicas: %d", k8sDeployment.Name, k8sDeployment.Status.AvailableReplicas, k8sDeployment.Status.UnavailableReplicas)

case watch.Deleted:
k8sDeployment := msg.Object.(*k8s.Deployment)
hs.deployments.Lock()
delete(hs.deployments.m, k8sDeployment.Name)
hs.deployments.Unlock()
infoLogger.Printf("Deployment %s has been removed", k8sDeployment.Name)
default:
errorLogger.Print("Error received on watch deployments. Channel may be full")
}
}

infoLogger.Print("Deployments watching terminated. Reconnecting...")
hs.watchDeployments()
}

func initializeHealthCheckService() *k8sHealthcheckService {
httpClient := &http.Client{
Timeout: 5 * time.Second,
Timeout: 12 * time.Second,
Transport: &http.Transport{
MaxIdleConnsPerHost: 100,
Dial: (&net.Dialer{
@@ -182,17 +139,14 @@
panic(fmt.Sprintf("Failed to create k8s client, error was: %v", err.Error()))
}

deployments := make(map[string]deployment)
services := make(map[string]service)

k8sService := &k8sHealthcheckService{
httpClient: httpClient,
k8sClient: k8sClient,
deployments: deploymentsMap{m: deployments},
services: servicesMap{m: services},
}

go k8sService.watchDeployments()
go k8sService.watchServices()
go k8sService.watchAcks()

67 changes: 0 additions & 67 deletions service_test.go
@@ -20,7 +20,6 @@ const (
validIP = "1.0.0.0"
validK8sServiceName = "validServiceName"
validK8sServiceNameWithAck = "validK8sServiceNameWithAck"
nonExistingK8sServiceName = "vnonExistingServiceName"
validSeverity = uint8(1)
ackMsg = "ack-msg"
validFailingHealthCheckResponseBody = `{
@@ -85,19 +84,6 @@ func initializeMockServiceWithK8sServices() *k8sHealthcheckService {
}
}

func initializeMockServiceWithDeployments() *k8sHealthcheckService {
deployments := make(map[string]deployment)
deployments[validK8sServiceName] = deployment{
numberOfUnavailableReplicas: 0,
numberOfAvailableReplicas: 2,
}
return &k8sHealthcheckService{
deployments: deploymentsMap{
m: deployments,
},
}
}

func initializeMockService(httpClient *http.Client) *k8sHealthcheckService {
mockK8sClient := fake.NewSimpleClientset()

@@ -194,59 +180,6 @@ func TestAddAckConfigMapNotFound(t *testing.T) {
assert.NotNil(t, err)
}

func TestCheckServiceHealthByResiliencyNoPodsAvailable(t *testing.T) {
_, err := checkServiceHealthByResiliency(service{}, 0, 3)
assert.NotNil(t, err)
}

func TestCheckServiceHealthByResiliencyWithNonResilientServiceAndUnvavailablePods(t *testing.T) {
s := service{
isResilient: false,
}
_, err := checkServiceHealthByResiliency(s, 1, 3)
assert.NotNil(t, err)
}

func TestCheckServiceHealthByResiliencyWithResilientServiceAndUnvavailablePods(t *testing.T) {
s := service{
isResilient: true,
}
msg, err := checkServiceHealthByResiliency(s, 1, 3)
assert.Nil(t, err)
assert.NotNil(t, msg)
}

func TestCheckServiceHealthByResiliencyHappyFlow(t *testing.T) {
s := service{
isResilient: false,
}
msg, err := checkServiceHealthByResiliency(s, 1, 0)
assert.Nil(t, err)
assert.Equal(t, "", msg)
}

func TestCheckServiceHealthWithDeploymentHappyFlow(t *testing.T) {
k8sHcService := initializeMockServiceWithDeployments()
s := service{
name: validK8sServiceName,
isResilient: false,
}

_, err := k8sHcService.checkServiceHealth(s)
assert.Nil(t, err)
}

func TestCheckServiceHealthWithDeploymentNonExistingServiceName(t *testing.T) {
k8sHcService := initializeMockServiceWithDeployments()
s := service{
name: nonExistingK8sServiceName,
isResilient: false,
}

_, err := k8sHcService.checkServiceHealth(s)
assert.NotNil(t, err)
}

func TestUpdateAcksForServicesEmptyAckList(t *testing.T) {
hcService := initializeMockServiceWithK8sServices()
acks := make(map[string]string)
