diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/404.html b/404.html new file mode 100644 index 0000000..084977f --- /dev/null +++ b/404.html @@ -0,0 +1,1343 @@ + + + + + + + + + + + + + + + + + + + Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ +

404 - Not found

+ +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/apis/application.html b/apis/application.html new file mode 100644 index 0000000..5691af1 --- /dev/null +++ b/apis/application.html @@ -0,0 +1,2018 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Application Management - Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + +

Application Management

+

Issue application operation command

+

POST /apis/v1/applications/operations

+

Request +

curl --location 'http://drove.local:7000/apis/v1/applications/operations' \
+--header 'Content-Type: application/json' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data '{
+    "type": "SCALE",
+    "appId": "TEST_APP-1",
+    "requiredInstances": 1,
+    "opSpec": {
+        "timeout": "1m",
+        "parallelism": 20,
+        "failureStrategy": "STOP"
+    }
+}'
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "appId": "TEST_APP-1"
+    },
+    "message": "success"
+}
+

+
+

Tip

+

Relevant payloads for application commands can be found in the application operations section.

+
+

Cancel currently running operation

+

POST /apis/v1/applications/operations/{appId}/cancel

+

Request +

curl --location --request POST 'http://drove.local:7000/apis/v1/applications/operations/TEST_APP-1/cancel' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+

+

Response +

{
+    "status": "SUCCESS",
+    "message": "success"
+}
+

+

Get list of applications

+

GET /apis/v1/applications

+

Request +

curl --location 'http://drove.local:7000/apis/v1/applications' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "TEST_APP-1": {
+            "id": "TEST_APP-1",
+            "name": "TEST_APP",
+            "requiredInstances": 0,
+            "healthyInstances": 0,
+            "totalCPUs": 0,
+            "totalMemory": 0,
+            "state": "MONITORING",
+            "created": 1719826995764,
+            "updated": 1719892126096
+        }
+    },
+    "message": "success"
+}
+

+

Get info for an app

+

GET /apis/v1/applications/{id}

+

Request +

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "id": "TEST_APP-1",
+        "name": "TEST_APP",
+        "requiredInstances": 1,
+        "healthyInstances": 1,
+        "totalCPUs": 1,
+        "totalMemory": 128,
+        "state": "RUNNING",
+        "created": 1719826995764,
+        "updated": 1719892279019
+    },
+    "message": "success"
+}
+

+

Get raw JSON specs

+

GET /apis/v1/applications/{id}/spec

+

Request

+
curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/spec' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+
+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "name": "TEST_APP",
+        "version": "1",
+        "executable": {
+            "type": "DOCKER",
+            "url": "ghcr.io/appform-io/perf-test-server-httplib",
+            "dockerPullTimeout": "100 seconds"
+        },
+        "exposedPorts": [
+            {
+                "name": "main",
+                "port": 8000,
+                "type": "HTTP"
+            }
+        ],
+        "volumes": [],
+        "configs": [
+            {
+                "type": "INLINE",
+                "localFilename": "/testfiles/drove.txt",
+                "data": ""
+            }
+        ],
+        "type": "SERVICE",
+        "resources": [
+            {
+                "type": "CPU",
+                "count": 1
+            },
+            {
+                "type": "MEMORY",
+                "sizeInMB": 128
+            }
+        ],
+        "placementPolicy": {
+            "type": "ANY"
+        },
+        "healthcheck": {
+            "mode": {
+                "type": "HTTP",
+                "protocol": "HTTP",
+                "portName": "main",
+                "path": "/",
+                "verb": "GET",
+                "successCodes": [
+                    200
+                ],
+                "payload": "",
+                "connectionTimeout": "1 second",
+                "insecure": false
+            },
+            "timeout": "1 second",
+            "interval": "5 seconds",
+            "attempts": 3,
+            "initialDelay": "0 seconds"
+        },
+        "readiness": {
+            "mode": {
+                "type": "HTTP",
+                "protocol": "HTTP",
+                "portName": "main",
+                "path": "/",
+                "verb": "GET",
+                "successCodes": [
+                    200
+                ],
+                "payload": "",
+                "connectionTimeout": "1 second",
+                "insecure": false
+            },
+            "timeout": "1 second",
+            "interval": "3 seconds",
+            "attempts": 3,
+            "initialDelay": "0 seconds"
+        },
+        "tags": {
+            "superSpecialApp": "yes_i_am",
+            "say_my_name": "heisenberg"
+        },
+        "env": {
+            "CORES": "8"
+        },
+        "exposureSpec": {
+            "vhost": "testapp.local",
+            "portName": "main",
+            "mode": "ALL"
+        },
+        "preShutdown": {
+            "hooks": [
+                {
+                    "type": "HTTP",
+                    "protocol": "HTTP",
+                    "portName": "main",
+                    "path": "/",
+                    "verb": "GET",
+                    "successCodes": [
+                        200
+                    ],
+                    "payload": "",
+                    "connectionTimeout": "1 second",
+                    "insecure": false
+                }
+            ],
+            "waitBeforeKill": "3 seconds"
+        }
+    },
+    "message": "success"
+}
+

+
+

Note

+

Data in the configs section will not be returned by any API call.

+
+

Get list of currently active instances

+

GET /apis/v1/applications/{id}/instances

+

Request +

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/instances' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": [
+        {
+            "appId": "TEST_APP-1",
+            "appName": "TEST_APP",
+            "instanceId": "AI-58eb1111-8c2c-4ea2-a159-8fc68010a146",
+            "executorId": "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d",
+            "localInfo": {
+                "hostname": "ppessdev",
+                "ports": {
+                    "main": {
+                        "containerPort": 8000,
+                        "hostPort": 33857,
+                        "portType": "HTTP"
+                    }
+                }
+            },
+            "resources": [
+                {
+                    "type": "CPU",
+                    "cores": {
+                        "0": [
+                            2
+                        ]
+                    }
+                },
+                {
+                    "type": "MEMORY",
+                    "memoryInMB": {
+                        "0": 128
+                    }
+                }
+            ],
+            "state": "HEALTHY",
+            "metadata": {},
+            "errorMessage": "",
+            "created": 1719892354194,
+            "updated": 1719893180105
+        }
+    ],
+    "message": "success"
+}
+

+

Get list of old instances

+

GET /apis/v1/applications/{id}/instances/old

+

Request +

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/instances/old' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": [
+        {
+            "appId": "TEST_APP-1",
+            "appName": "TEST_APP",
+            "instanceId": "AI-869e34ed-ebf3-4908-bf48-719475ca5640",
+            "executorId": "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d",
+            "resources": [
+                {
+                    "type": "CPU",
+                    "cores": {
+                        "0": [
+                            2
+                        ]
+                    }
+                },
+                {
+                    "type": "MEMORY",
+                    "memoryInMB": {
+                        "0": 128
+                    }
+                }
+            ],
+            "state": "STOPPED",
+            "metadata": {},
+            "errorMessage": "Error while pulling image ghcr.io/appform-io/perf-test-server-httplib: Status 500: {\"message\":\"Get \\\"https://ghcr.io/v2/\\\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\"}\n",
+            "created": 1719892279039,
+            "updated": 1719892354099
+        }
+    ],
+    "message": "success"
+}
+

+

Get info for an instance

+

GET /apis/v1/applications/{appId}/instances/{instanceId}

+

Request +

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/instances/AI-58eb1111-8c2c-4ea2-a159-8fc68010a146' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "appId": "TEST_APP-1",
+        "appName": "TEST_APP",
+        "instanceId": "AI-58eb1111-8c2c-4ea2-a159-8fc68010a146",
+        "executorId": "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d",
+        "localInfo": {
+            "hostname": "ppessdev",
+            "ports": {
+                "main": {
+                    "containerPort": 8000,
+                    "hostPort": 33857,
+                    "portType": "HTTP"
+                }
+            }
+        },
+        "resources": [
+            {
+                "type": "CPU",
+                "cores": {
+                    "0": [
+                        2
+                    ]
+                }
+            },
+            {
+                "type": "MEMORY",
+                "memoryInMB": {
+                    "0": 128
+                }
+            }
+        ],
+        "state": "HEALTHY",
+        "metadata": {},
+        "errorMessage": "",
+        "created": 1719892354194,
+        "updated": 1719893440105
+    },
+    "message": "success"
+}
+

+

Application Endpoints

+

GET /apis/v1/endpoints

+
+

Info

+

This API provides up-to-date host and port information for application instances running on the cluster. Service discovery systems can use it to keep their view in sync with changes in the topology of applications running on the cluster.

+
+
+

Tip

+

Any tag specified in the application specification is also exposed on the endpoint. This can be used to implement complex routing logic in the Nginx template on Drove Gateway if needed.

+
+
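For example, a service discovery agent could poll this API and extract the upstream host:port list for a given vhost. A minimal sketch, assuming the jq utility is available and using the testapp.local vhost from the sample response below:

# Hypothetical sketch: list "host:port" upstreams for one vhost
VHOST="testapp.local"
curl -s 'http://drove.local:7000/apis/v1/endpoints' \
--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
| jq -r --arg vhost "$VHOST" \
    '.data[] | select(.vhost == $vhost) | .hosts[] | "\(.host):\(.port)"'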

Request +

curl --location 'http://drove.local:7000/apis/v1/endpoints' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": [
+        {
+            "appId": "TEST_APP-1",
+            "vhost": "testapp.local",
+            "tags": {
+                "superSpecialApp": "yes_i_am",
+                "say_my_name": "heisenberg"
+            },
+            "hosts": [
+                {
+                    "host": "ppessdev",
+                    "port": 44315,
+                    "portType": "HTTP"
+                }
+            ]
+        },
+        {
+            "appId": "TEST_APP-2",
+            "vhost": "testapp.local",
+            "tags": {
+                "superSpecialApp": "yes_i_am",
+                "say_my_name": "heisenberg"
+            },
+            "hosts": [
+                {
+                    "host": "ppessdev",
+                    "port": 46623,
+                    "portType": "HTTP"
+                }
+            ]
+        }
+    ],
+    "message": "success"
+}
+

+ + + + + + + + + + + + + +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/apis/cluster.html b/apis/cluster.html new file mode 100644 index 0000000..d50bd29 --- /dev/null +++ b/apis/cluster.html @@ -0,0 +1,2106 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Cluster Management - Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + +

Cluster Management

+

Ping API

+

GET /apis/v1/ping

+

Request +

curl --location 'http://drove.local:7000/apis/v1/ping' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": "pong",
+    "message": "success"
+}
+

+
+

Tip

+

Use this API call to determine the leader in a cluster. It returns HTTP 200 only on the leader controller; all other controllers in the cluster return a 4xx for this call.

+
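For example, a client that knows all controller endpoints could find the leader as sketched below (the controller hostnames are placeholders; port 4000 matches the controller port seen in the sample responses):

# Hypothetical sketch: probe each known controller, pick the one answering 200
for host in controller1.local:4000 controller2.local:4000; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
        --header 'Authorization: Basic YWRtaW46YWRtaW4=' \
        "http://${host}/apis/v1/ping")
    [ "$code" = "200" ] && echo "Leader: ${host}"
done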
+

Cluster Management

+

Get current cluster state

+

GET /apis/v1/cluster

+

Request +

curl --location 'http://drove.local:7000/apis/v1/cluster' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "leader": "ppessdev:4000",
+        "state": "NORMAL",
+        "numExecutors": 1,
+        "numApplications": 1,
+        "numActiveApplications": 1,
+        "freeCores": 9,
+        "usedCores": 1,
+        "totalCores": 10,
+        "freeMemory": 18898,
+        "usedMemory": 128,
+        "totalMemory": 19026
+    },
+    "message": "success"
+}
+

+

Set maintenance mode on cluster

+

POST /apis/v1/cluster/maintenance/set

+

Request +

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/set' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "state": "MAINTENANCE",
+        "updated": 1719897526772
+    },
+    "message": "success"
+}
+

+

Remove maintenance mode from cluster

+

POST /apis/v1/cluster/maintenance/unset

+

Request +

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/unset' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "state": "NORMAL",
+        "updated": 1719897573226
+    },
+    "message": "success"
+}
+

+
+

Warning

+

The cluster will internally remain in maintenance mode for some time (about 2 minutes) even after maintenance mode is removed.

+
+

Executor Management

+

Get list of executors

+

GET /apis/v1/cluster/executors

+

Request +

curl --location 'http://drove.local:7000/apis/v1/cluster/executors' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": [
+        {
+            "executorId": "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d",
+            "hostname": "ppessdev",
+            "port": 3000,
+            "transportType": "HTTP",
+            "freeCores": 9,
+            "usedCores": 1,
+            "freeMemory": 18898,
+            "usedMemory": 128,
+            "tags": [
+                "ppessdev"
+            ],
+            "state": "ACTIVE"
+        }
+    ],
+    "message": "success"
+}
+

+

Get detailed info for one executor

+

GET /apis/v1/cluster/executors/{id}

+

Request +

curl --location 'http://drove.local:7000/apis/v1/cluster/executors/a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+
{
+    "status": "SUCCESS",
+    "data": {
+        "type": "EXECUTOR",
+        "hostname": "ppessdev",
+        "port": 3000,
+        "transportType": "HTTP",
+        "updated": 1719897100104,
+        "state": {
+            "executorId": "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d",
+            "cpus": {
+                "type": "CPU",
+                "freeCores": {
+                    "0": [
+                        3,
+                        4,
+                        5,
+                        6,
+                        7,
+                        8,
+                        9,
+                        10,
+                        11
+                    ]
+                },
+                "usedCores": {
+                    "0": [
+                        2
+                    ]
+                }
+            },
+            "memory": {
+                "type": "MEMORY",
+                "freeMemory": {
+                    "0": 18898
+                },
+                "usedMemory": {
+                    "0": 128
+                }
+            }
+        },
+        "instances": [
+            {
+                "appId": "TEST_APP-1",
+                "appName": "TEST_APP",
+                "instanceId": "AI-58eb1111-8c2c-4ea2-a159-8fc68010a146",
+                "executorId": "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d",
+                "localInfo": {
+                    "hostname": "ppessdev",
+                    "ports": {
+                        "main": {
+                            "containerPort": 8000,
+                            "hostPort": 33857,
+                            "portType": "HTTP"
+                        }
+                    }
+                },
+                "resources": [
+                    {
+                        "type": "CPU",
+                        "cores": {
+                            "0": [
+                                2
+                            ]
+                        }
+                    },
+                    {
+                        "type": "MEMORY",
+                        "memoryInMB": {
+                            "0": 128
+                        }
+                    }
+                ],
+                "state": "HEALTHY",
+                "metadata": {},
+                "errorMessage": "",
+                "created": 1719892354194,
+                "updated": 1719897100104
+            }
+        ],
+        "tasks": [],
+        "tags": [
+            "ppessdev"
+        ],
+        "blacklisted": false
+    },
+    "message": "success"
+}
+
+

Take executor out of rotation

+

POST /apis/v1/cluster/executors/blacklist

+

Request +

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/blacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+

+
+

Note

+

Unlike other POST APIs, the executors to be blacklisted are passed as the query parameter id. To blacklist multiple executors, pass .../blacklist?id=<id1>&id=<id2>...

+
+
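For example, to blacklist two executors in a single call (the second executor ID below is a made-up placeholder):

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/blacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d&id=b56553b2-e5e1-4580-bc0f-4fe1bb608e3e' \
--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
--data ''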

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "successful": [
+            "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d"
+        ],
+        "failed": []
+    },
+    "message": "success"
+}
+

+

Bring executor back into rotation

+

POST /apis/v1/cluster/executors/unblacklist

+

Request +

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/unblacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+

+
+

Note

+

Unlike other POST APIs, the executors to be un-blacklisted are passed as the query parameter id. To un-blacklist multiple executors, pass .../unblacklist?id=<id1>&id=<id2>...

+
+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "successful": [
+            "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d"
+        ],
+        "failed": []
+    },
+    "message": "success"
+}
+

+

Drove Cluster Events

+

The following APIs can be used to monitor events on Drove. If the event data itself needs to be consumed, use the /latest API. To simply know whether an event of a certain type has occurred, the /summary API is sufficient.

+
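For instance, a monitoring script might poll /summary and alert when events of interest have occurred. A minimal sketch, assuming jq is available (the event type key is taken from the sample /summary response further below):

# Hypothetical sketch: alert if any executor was blacklisted
count=$(curl -s 'http://drove.local:7000/apis/v1/cluster/events/summary?lastSyncTime=0' \
    --header 'Authorization: Basic YWRtaW46YWRtaW4=' \
    | jq -r '.data.eventsCount.EXECUTOR_BLACKLISTED // 0')
[ "$count" -gt 0 ] && echo "ALERT: ${count} executor(s) blacklisted"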

Event List

+

GET /apis/v1/cluster/events/latest

+

Request +

curl --location 'http://drove.local:7000/apis/v1/cluster/events/latest?size=1024&lastSyncTime=0' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "events": [
+            {
+                "metadata": {
+                    "CURRENT_INSTANCES": 0,
+                    "APP_ID": "TEST_APP-1",
+                    "PLACEMENT_POLICY": "ANY",
+                    "APP_VERSION": "1",
+                    "CPU_COUNT": 1,
+                    "CURRENT_STATE": "RUNNING",
+                    "PORTS": "main:8000:http",
+                    "MEMORY": 128,
+                    "EXECUTABLE": "ghcr.io/appform-io/perf-test-server-httplib",
+                    "VHOST": "testapp.local",
+                    "APP_NAME": "TEST_APP"
+                },
+                "type": "APP_STATE_CHANGE",
+                "id": "a2b7d673-2bc2-4084-8415-d8d37cafa63d",
+                "time": 1719977632050
+            },
+            {
+                "metadata": {
+                    "APP_NAME": "TEST_APP",
+                    "APP_ID": "TEST_APP-1",
+                    "PORTS": "main:44315:http",
+                    "EXECUTOR_ID": "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d",
+                    "EXECUTOR_HOST": "ppessdev",
+                    "CREATED": 1719977629042,
+                    "INSTANCE_ID": "AI-5efbb94f-835c-4c62-a073-a68437e60339",
+                    "CURRENT_STATE": "HEALTHY"
+                },
+                "type": "INSTANCE_STATE_CHANGE",
+                "id": "55d5876f-94ac-4c5d-a580-9c3b296add46",
+                "time": 1719977631534
+            }
+        ],
+        "lastSyncTime": 1719977632050//(1)!
+    },
+    "message": "success"
+}
+

+
    +
  1. Pass this as the lastSyncTime parameter in the next call to the events API to receive the latest events.
  2. +
+ + + + + + + + + + + + + + + + + + + + +
Query Parameter | Validation | Description
lastSyncTime | +ve long range | Time when the last sync call happened on the server. Defaults to 0 (initial sync).
size | 1-1024 | Number of latest events to return. Defaults to 1024. We recommend leaving this as is.
+

Event Summary

+

GET /apis/v1/cluster/events/summary

+

Request +

curl --location 'http://drove.local:7000/apis/v1/cluster/events/summary?lastSyncTime=0' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+
+Response +
{
+    "status": "SUCCESS",
+    "data": {
+        "eventsCount": {
+            "INSTANCE_STATE_CHANGE": 8,
+            "APP_STATE_CHANGE": 17,
+            "EXECUTOR_BLACKLISTED": 1,
+            "EXECUTOR_UN_BLACKLISTED": 1
+        },
+        "lastSyncTime": 1719977632050//(1)!
+    },
+    "message": "success"
+}
+

+
    +
  1. Pass this as the lastSyncTime parameter in the next call to the events API to receive the latest events.
  2. +
+

Continuous monitoring for events

+

This is applicable to both the APIs listed above.

+
    +
  • In the first call to the events API, pass lastSyncTime as zero.
  • +
  • In the response there will be a field lastSyncTime
  • +
  • Pass the last received lastSyncTime as the lastSyncTime param in the next call
  • +
  • This API is cheap; you can plan to call it every few seconds (see the sketch after this list)
  • +
+
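A minimal polling loop implementing the steps above might look like the following sketch (assumes jq; event handling is a placeholder):

# Hypothetical sketch: continuously poll the events API
last=0
while true; do
    resp=$(curl -s "http://drove.local:7000/apis/v1/cluster/events/latest?size=1024&lastSyncTime=${last}" \
        --header 'Authorization: Basic YWRtaW46YWRtaW4=')
    echo "$resp" | jq -c '.data.events[]'             # process each event (placeholder)
    last=$(echo "$resp" | jq -r '.data.lastSyncTime')  # carry forward into the next call
    sleep 5
done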
+

Info

+

Model for the events can be found here.

+
+
+

Tip

+

Java programs should consider using the event listener library +to listen to cluster events

+
+ + + + + + + + + + + + + +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/apis/index.html b/apis/index.html new file mode 100644 index 0000000..7c65346 --- /dev/null +++ b/apis/index.html @@ -0,0 +1,1562 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Introduction - Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + +

Introduction

+

This section lists all the APIs available to users.

+

Making an API call

+

Use a standard HTTP client in the language of your choice to make a call to the leader controller (the cluster virtual host exposed by drove-gateway-nginx).

+
+

Tip

+

In case you are using Java, we recommend using the drove-client library along with the http-transport.

+
+

If multiple controller endpoints are provided, the client will track the leader automatically. This reduces your dependency on drove-gateway.

+
+
+

Authentication

+

Drove uses basic auth for authentication. (It can be extended to use other auth schemes such as OAuth.) The basic auth credentials need to be sent in the standard format in the Authorization header.

+
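For example, the token YWRtaW46YWRtaW4= used throughout these examples is just base64("admin:admin"):

echo -n 'admin:admin' | base64   # prints YWRtaW46YWRtaW4=
# curl can also build the header itself:
curl -u admin:admin 'http://drove.local:7000/apis/v1/ping'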

Response format

+

The response format is standard for all API calls:

+
{
+    "status": "SUCCESS",//(1)!
+    "data": {//(2)!
+        "taskId": "T0012"
+    },
+    "message": "success"//(3)!
+}
+
+
    +
  1. SUCCESS or FAILURE as the case may be.
  2. +
  3. Content of this field is contextual to the response.
  4. +
  5. Will contain success if the call was successful or relevant error message.
  6. +
+
+

Warning

+

APIs will return relevant HTTP status codes in case of error (for example, 400 for validation errors and 401 for authentication failures). However, you must always verify that the status field is set to SUCCESS before assuming the API call was successful, even when the HTTP status code is 2xx.

+
+
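A defensive check might therefore look at both, as in this sketch (assumes jq is available):

# Hypothetical sketch: treat a call as successful only if status == SUCCESS
resp=$(curl -s 'http://drove.local:7000/apis/v1/applications' \
    --header 'Authorization: Basic YWRtaW46YWRtaW4=')
if [ "$(echo "$resp" | jq -r '.status')" = "SUCCESS" ]; then
    echo "ok"
else
    echo "call failed: $(echo "$resp" | jq -r '.message')" >&2
fi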

APIs in Drove belong to the following major classes:

+ +
+

Tip

+

Response models for these APIs can be found in drove-models

+
+
+

Note

+

There are no publicly accessible APIs exposed by individual executors.

+
+ + + + + + + + + + + + + +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/apis/logs.html b/apis/logs.html new file mode 100644 index 0000000..3579668 --- /dev/null +++ b/apis/logs.html @@ -0,0 +1,1623 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Log Related APIs - Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + +

Log Related APIs

+

Get list of log files

+

Application +GET /apis/v1/logfiles/applications/{appId}/{instanceId}/list

+

Task +GET /apis/v1/logfiles/tasks/{sourceAppName}/{taskId}/list

+

Request +

curl --location 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/list' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "files": [
+        "output.log-2024-07-04",
+        "output.log-2024-07-03",
+        "output.log"
+    ]
+}
+

+

Download Log Files

+

Application +GET /apis/v1/logfiles/applications/{appId}/{instanceId}/download/{fileName}

+

Task +GET /apis/v1/logfiles/tasks/{sourceAppName}/{taskId}/download/{fileName}

+

Request +

curl --location 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/download/output.log' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

+
+

Note

+

The Content-Disposition header is set properly to the actual filename. For the above example it would be set to attachment; filename=output.log.

+
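Because of this, curl can be told to save the file under the server-provided name using -O -J (remote name taken from Content-Disposition):

curl -s -O -J 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/download/output.log' \
--header 'Authorization: Basic YWRtaW46YWRtaW4='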
+

Read chunks from log

+

Application +GET /apis/v1/logfiles/applications/{appId}/{instanceId}/read/{fileName}

+

Task +GET /apis/v1/logfiles/tasks/{sourceAppName}/{taskId}/read/{fileName}

+ + + + + + + + + + + + + + + + + + + + +
Query Parameter | Validation | Description
offset | Defaults to -1; otherwise should be a non-negative number | The byte offset in the file to read from.
length | Should be a positive number | Number of bytes to read.
+

Request +

curl --location 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/read/output.log' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "data": "", //(1)!
+    "offset": 43318 //(2)!
+}
+

+
    +
  1. Will contain raw data or empty string (in case of first call)
  2. +
  3. Offset to be passed in the next call
  4. +
+

How to tail logs

+
    +
  1. Have a fixed buffer size in mind (1024, 4096, etc.)
  2. +
  3. Make a call to /read api with offset=-1, length = buffer size
  4. +
  5. The call will return no data, but will have a valid offset
  6. +
  7. Pass this offset in the next call; data will be returned if available (or empty). The response will also return the offset to pass in the next call.
  8. +
  9. The data returned might be empty or less than length depending on availability.
  10. +
  11. Keep repeating step 4 to keep tailing the log (see the sketch after this list)
  12. +
+
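The loop below sketches this tailing protocol (assumes jq; the buffer size and poll interval are arbitrary choices):

# Hypothetical sketch: tail an application log via the /read API
URL='http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/read/output.log'
AUTH='Authorization: Basic YWRtaW46YWRtaW4='
offset=-1                                         # first call must use -1
while true; do
    resp=$(curl -s "${URL}?offset=${offset}&length=4096" --header "$AUTH")
    printf '%s' "$(echo "$resp" | jq -r '.data')"  # may be empty
    offset=$(echo "$resp" | jq -r '.offset')       # pass this in the next call
    sleep 2
done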
+

Warning

+
    +
  • Offset = 0 means start of the file
  • +
  • The first call must use offset -1 for tail-type functionality
  • +
+
+ + + + + + + + + + + + + +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/apis/task.html b/apis/task.html new file mode 100644 index 0000000..b80c734 --- /dev/null +++ b/apis/task.html @@ -0,0 +1,1658 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Task Management - Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + +

Task Management

+

Issue task operation

+

POST /apis/v1/tasks/operations

+

Request +

curl --location 'http://drove.local:7000/apis/v1/tasks/operations' \
+--header 'Content-Type: application/json' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data '{
+    "type": "KILL",
+    "sourceAppName" : "TEST_APP",
+    "taskId" : "T0012",
+    "opSpec": {
+        "timeout": "5m",
+        "parallelism": 1,
+        "failureStrategy": "STOP"
+    }
+}'
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "taskId": "T0012"
+    },
+    "message": "success"
+}
+

+
+

Tip

+

Relevant payloads for task commands can be found in task operations section.

+
+

Search for task

+

POST /apis/v1/tasks/search

+

List all tasks

+

GET /apis/v1/tasks

+

Request +

curl --location 'http://drove.local:7000/apis/v1/tasks' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": [
+        {
+            "sourceAppName": "TEST_APP",
+            "taskId": "T0013",
+            "instanceId": "TI-c2140806-2bb5-4ed3-9bb9-0c0c5fd0d8d6",
+            "executorId": "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d",
+            "hostname": "ppessdev",
+            "executable": {
+                "type": "DOCKER",
+                "url": "ghcr.io/appform-io/test-task",
+                "dockerPullTimeout": "100 seconds"
+            },
+            "resources": [
+                {
+                    "type": "CPU",
+                    "cores": {
+                        "0": [
+                            2
+                        ]
+                    }
+                },
+                {
+                    "type": "MEMORY",
+                    "memoryInMB": {
+                        "0": 512
+                    }
+                }
+            ],
+            "volumes": [],
+            "env": {
+                "ITERATIONS": "10"
+            },
+            "state": "RUNNING",
+            "metadata": {},
+            "errorMessage": "",
+            "created": 1719827035480,
+            "updated": 1719827038414
+        }
+    ],
+    "message": "success"
+}
+

+

Get Task Instance Details

+

GET /apis/v1/tasks/{sourceAppName}/instances/{taskId}

+

Request +

curl --location 'http://drove.local:7000/apis/v1/tasks/TEST_APP/instances/T0012' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4='
+

+

Response +

{
+    "status": "SUCCESS",
+    "data": {
+        "sourceAppName": "TEST_APP",
+        "taskId": "T0012",
+        "instanceId": "TI-6cf36f5c-6480-4ed5-9e2d-f79d9648529a",
+        "executorId": "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d",
+        "hostname": "ppessdev",
+        "executable": {
+            "type": "DOCKER",
+            "url": "ghcr.io/appform-io/test-task",
+            "dockerPullTimeout": "100 seconds"
+        },
+        "resources": [
+            {
+                "type": "CPU",
+                "cores": {
+                    "0": [
+                        3
+                    ]
+                }
+            },
+            {
+                "type": "MEMORY",
+                "memoryInMB": {
+                    "0": 512
+                }
+            }
+        ],
+        "volumes": [],
+        "env": {
+            "ITERATIONS": "10"
+        },
+        "state": "STOPPED",
+        "metadata": {},
+        "taskResult": {
+            "status": "SUCCESSFUL",
+            "exitCode": 0
+        },
+        "errorMessage": "",
+        "created": 1719823470267,
+        "updated": 1719823483836
+    },
+    "message": "success"
+}
+

+ + + + + + + + + + + + + +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/applications/index.html b/applications/index.html new file mode 100644 index 0000000..4e5872e --- /dev/null +++ b/applications/index.html @@ -0,0 +1,1612 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Introduction - Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + +

Introduction

+

An application is a virtual representation of a running service in the system.

+

Running containers for an application are called application instances.

+

An application specification contains the following details about the application:

+
    +
  • Name - Name of the application
  • +
  • Version - Version of this specification
  • +
  • Executable - The container to deploy on the cluster
  • +
  • Ports - Ports to be exposed from the container
  • +
  • Resources - CPU and Memory required for the container
  • +
  • Placement Policy - How containers are to be placed in the cluster
  • +
  • Healthchecks - Healthcheck details
  • +
  • Readiness Checks - Readiness checks to pass before container is considered to be healthy
  • +
  • Pre Shutdown Hooks - Pre shutdown hooks to run on container before it is killed
  • +
  • Environment Variables - Environment variables and values
  • +
  • Exposure Information - Virtual host information
  • +
  • Volumes - Volumes to be mounted into the container
  • +
  • Configs - Configs/files to be mounted into the container
  • +
  • Logging details - Logging spec (for example rsyslog server)
  • +
  • Tags - A map of strings for additional metadata
  • +
+
+

Info

+

Once a spec is registered to the cluster, it cannot be changed

+
+

Application ID

+

Once an application is created on the cluster, an application ID is generated. The format of this ID is currently {name}-{version}; for example, an app named TEST_APP with version 1 gets the ID TEST_APP-1. All further operations on the application need to refer to it by this ID.

+

Application States and Operations

+

An application on a Drove cluster follows a fixed lifecycle modelled as a state machine. State transitions are triggered by operations. Operations can be issued externally using API calls or may be generated internally by the application monitoring system.

+

States

+

Applications on a Drove cluster can be one of the following states:

+
    +
  • INIT - This is an intermediate state during which the application is being initialized and the spec is being validated. This is the origination state of the application.
  • +
  • MONITORING - A stable state in which application is created or suspended and does not have any running instances
  • +
  • RUNNING - A stable state in which application has the expected non-zero number of healthy application instances running on the cluster
  • +
  • OUTAGE_DETECTED - An intermediate state when Drove has detected that the current number of application instances does not match the expected number of instances.
  • +
  • SCALING_REQUESTED - An intermediate state that signifies that application instances are being spun up or shut down to get the number of running instances to match the expected instances.
  • +
  • STOP_INSTANCES_REQUESTED - An intermediate state that signifies that specific instances of the application are being killed as requested by the user/system.
  • +
  • REPLACE_INSTANCES_REQUESTED - An intermediate state that signifies that instances of the application are being replaced with newer instances, as requested by the user. This signifies that the app is effectively being restarted.
  • +
  • DESTROY_REQUESTED - An intermediate state that signifies that the user has requested to destroy the application and remove it from the cluster.
  • +
  • DESTROYED - The terminal state of an application. Signifies that the app has been destroyed and metadata cleanup is underway.
  • +
+

Operations

+

The following application operations are recognized by Drove:

+
    +
  • CREATE - Create an application. Take the Application Specification. Fails if an app with the same application id (name + version) already exists on the cluster
  • +
  • DESTROY - Destroy an application. Takes app id as parameter. Deletes all metadata about the application from the cluster. Allowed only if the application is in Monitoring state (i.e. has zero running instances).
  • +
  • START_INSTANCES - Create new application instances. Takes the app id as well as the number of new instances to deploy. Allowed only if the application is in Monitoring or Running state.
  • +
  • STOP_INSTANCES - Stop running application instances. Takes the app id, list of instance ids to be stopped as well as flag to denote if replacement instances are to be started by Drove or not. Allowed only if the application is in Monitoring or Running state.
  • +
  • SCALE - Scale the application up or down to the specified number of instances. Drove will internally calculate whether to spin new containers up or spin old containers down as needed. Allowed if the app is in MONITORING or RUNNING state. It is better to use either the START_INSTANCES or STOP_INSTANCES commands above to be more explicit in behavior. The SCALE operation is mostly for internal use by Drove, but can be issued externally as well.
  • +
  • REPLACE_INSTANCES - Replace application instances with newer ones. Can be used to do rolling restarts on the cluster. Specific instances can be targeted as well by passing an optional list of instance ids to be replaced. Allowed only when the application is in Running state.
  • +
  • SUSPEND - A shortcut to set the expected instances for an application to zero. This gets translated into a SCALE operation and any running instances are gracefully shut down. Allowed only when the application is in RUNNING state.
  • +
  • RECOVER - Internal command used to restore application state on controller failover.
  • +
+
+

Tip

+

All operations can take an optional Cluster Operation Spec which can be used to control the timeout and parallelism of tasks generated by the operation.

+
+

Application State Machine

+

The following state machine signifies the states and transitions as affected by cluster state and operations issued.

+

Application State Machine

+ + + + + + + + + + + + + +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/applications/instances.html b/applications/instances.html new file mode 100644 index 0000000..f5b2552 --- /dev/null +++ b/applications/instances.html @@ -0,0 +1,1514 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Application Instances - Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + +

Application Instances

+

Application instances are running containers for an application. The state machine for instances is managed in a decentralised manner, locally on the cluster nodes and not by the controllers. This includes running health checks, readiness checks and shutdown hooks on the container, container loss detection, and container state recovery on executor service restart.

+

Regular updates about the instance state are provided by executors to the controllers and are used to keep the application state up-to-date or trigger application operations to bring the applications to stable states.

+

Application Instance States

+

An application instance can be in one of the following states at any point in time:

+
    +
  • PENDING - Container state machine start has been triggered.
  • +
  • PROVISIONING - Docker image is being downloaded
  • +
  • PROVISIONING_FAILED - Docker image download failed
  • +
  • STARTING - Docker run is being executed
  • +
  • START_FAILED - Docker run failed
  • +
  • UNREADY - Docker started, readiness check not yet started.
  • +
  • READINESS_CHECK_FAILED - Readiness check was run and has failed terminally
  • +
  • READY - Readiness checks have passed
  • +
  • HEALTHY - Health check has passed. Container is running properly and passing regular health checks
  • +
  • UNHEALTHY - Regular health check has failed. Container will stop.
  • +
  • STOPPING - Shutdown hooks are being called and docker kill will be issued
  • +
  • DEPROVISIONING - Docker image is being cleaned up
  • +
  • STOPPED - Docker stop has completed
  • +
  • LOST - Container has exited unexpectedly while executor service was down
  • +
  • UNKNOWN - All running containers are in this state when executor service is getting restarted and before startup recovery has kicked in
  • +
+

Application Instance State Machine

+

Instance state machine transitions might be triggered on receipt of commands issued by the controller, due to internal changes in the container (it might have died or started failing health checks), or by external factors like executor service restarts.

+

Application Instance State Machine

+
+

Note

+

No operations are allowed to be performed on application instances directly through the executor

+
+ + + + + + + + + + + + + +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/applications/operations.html b/applications/operations.html new file mode 100644 index 0000000..480a2a6 --- /dev/null +++ b/applications/operations.html @@ -0,0 +1,2074 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Application Operations - Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + +

Application Operations

+

This page discusses operations relevant to application management. Please go over the Application State Machine and Application Instance State Machine to understand the different states an application (and its instances) can be in, and how applied operations move an application from one state to another.

+
+

Note

+

Please go through Cluster Op Spec to understand the operation parameters being sent.

+
+
+

Note

+

Only one operation can be active on a particular {appName,version} combination.

+
+
+

Warning

+

Only the leader controller will accept and process operations. To avoid confusion, use the controller endpoint exposed by Drove Gateway to issue commands.

+
+

How to initiate an operation

+
+

Tip

+

Use the Drove CLI to perform all manual operations.

+
+

All operations for application lifecycle management need to be issued via an HTTP POST call to the leader controller endpoint on the path /apis/v1/applications/operations. The API will return HTTP OK/200 and a relevant JSON response as payload.

+

Sample api call:

+
curl --location 'http://drove.local:7000/apis/v1/applications/operations' \
+--header 'Content-Type: application/json' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data '{
+    "type": "START_INSTANCES",
+    "appId": "TEST_APP-3",
+    "instances": 1,
+    "opSpec": {
+        "timeout": "5m",
+        "parallelism": 32,
+        "failureStrategy": "STOP"
+    }
+}'
+
+
+

Note

+

In the above example, http://drove.local:7000 is the endpoint of the leader, TEST_APP-3 is the application ID, and authorization is basic auth.

+
+

Cluster Operation Specification

+

When an operation is submitted to the cluster, a cluster op spec needs to be specified. This controls different aspects of the operation, such as its parallelism and its timeout.

+

The following aspects of an operation can be configured:

+ + + + + + + + + + + + + + + + + + + + + + + + + +
Name | Option | Description
Timeout | timeout | The duration after which Drove considers the operation to have timed out.
Parallelism | parallelism | Parallelism of the task. (Range: 1-32)
Failure Strategy | failureStrategy | Set this to STOP.
+
+

Note

+

For internal recovery, Drove generates its own operations, to which it applies the following cluster operation spec:

+
    +
  • timeout - 300 seconds
  • +
  • parallelism - 1
  • +
  • failureStrategy - STOP
  • +
+
+

The default operation spec can be configured in the controller configuration file. It is recommended to set the parallelism there to something like 8 for faster recovery.

+
+
+

How to cancel an operation

+

Operations can be cancelled asynchronously. A POST call needs to be made to the leader controller endpoint on the API /apis/v1/applications/operations/{applicationId}/cancel (1) to achieve this.

+
    +
  1. applicationId is the Application ID for the application
  2. +
+
curl --location --request POST 'http://drove.local:7000/apis/v1/applications/operations/TEST_APP-3/cancel' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+
+
+

Warning

+

Operation cancellation is not instantaneous. Cancellation takes effect only after the current execution of the active operation is complete.

+
+

Create an application

+

Before deploying containers on the cluster, an application needs to be created.

+

Preconditions:

+
    +
  • App should not exist in the cluster
  • +
+

State Transition:

+
    +
  • none → MONITORING
  • +
+

To create an application, an Application Spec needs to be created first.

+

Once ready, CLI command needs to be issued or the following payload needs to be sent:

+
+
+
+
drove -c local apps create sample/test_app.json
+
+
+
+

Sample Request Payload +

{
+    "type": "CREATE",
+    "spec": {...}, //(1)!
+    "opSpec": { //(2)!
+        "timeout": "5m",
+        "parallelism": 1,
+        "failureStrategy": "STOP"
+    }
+}
+

+
    +
  1. Spec as mentioned in Application Specification
  2. +
  3. Operation spec as mentioned in Cluster Op Spec
  4. +
+

Sample response +

{
+    "data" : {
+        "appId" : "TEST_APP-1"
+    },
+    "message" : "success",
+    "status" : "SUCCESS"
+}
+

+
+
+
+

Starting new instances of an application

+

New instances can be started by issuing the START_INSTANCES command.

+

Preconditions +- Application must be in one of the following states: MONITORING, RUNNING

+

State Transition:

+
    +
  • {RUNNING, MONITORING} → RUNNING
  • +
+

The following command/payload will start 2 new instances of the application.

+
+
+
+
drove -c local apps deploy TEST_APP-1 2
+
+
+
+

Sample Request Payload +

{
+    "type": "START_INSTANCES",
+    "appId": "TEST_APP-1",//(1)!
+    "instances": 2,//(2)!
+    "opSpec": {//(3)!
+        "timeout": "5m",
+        "parallelism": 32,
+        "failureStrategy": "STOP"
+    }
+}
+

+
    +
  1. Application ID
  2. +
  3. Number of instances to be started
  4. +
  5. Operation spec as mentioned in Cluster Op Spec
  6. +
+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "appId": "TEST_APP-1"
+    },
+    "message": "success"
+}
+

+
+
+
+

Suspending an application

+

All instances of an application can be shut down by issuing the SUSPEND command.

+

Preconditions +- Application must be in one of the following states: MONITORING, RUNNING

+

State Transition:

+
    +
  • {RUNNING, MONITORING} → MONITORING
  • +
+

The following command/payload will suspend all instances of the application.

+
+
+
+
drove -c local apps suspend TEST_APP-1
+
+
+
+

Sample Request Payload +

{
+    "type": "SUSPEND",
+    "appId": "TEST_APP-1",//(1)!
+    "opSpec": {//(2)!
+        "timeout": "5m",
+        "parallelism": 32,
+        "failureStrategy": "STOP"
+    }
+}
+

+
    +
  1. Application ID
  2. +
  3. Operation spec as mentioned in Cluster Op Spec
  4. +
+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "appId": "TEST_APP-1"
+    },
+    "message": "success"
+}
+

+
+
+
+

Scaling the application up or down

+

Scaling the application to the required number of containers can be achieved using the SCALE command. The application can be scaled either up or down using this command.

+

Preconditions +- Application must be in one of the following states: MONITORING, RUNNING

+

State Transition:

+
    +
  • {RUNNING, MONITORING} → MONITORING if requiredInstances is set to 0
  • +
  • {RUNNING, MONITORING} → RUNNING if requiredInstances is non 0
  • +
+
+
+
+
drove -c local apps scale TEST_APP-1 2
+
+
+
+

Sample Request Payload +

{
+    "type": "SCALE",
+    "appId": "TEST_APP-1", //(3)!
+    "requiredInstances": 2, //(1)!
+    "opSpec": { //(2)!
+        "timeout": "1m",
+        "parallelism": 20,
+        "failureStrategy": "STOP"
+    }
+}
+

+
    +
  1. Absolute number of instances to be maintained on the cluster for the application
  2. +
  3. Operation spec as mentioned in Cluster Op Spec
  4. +
  5. Application ID
  6. +
+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "appId": "TEST_APP-1"
+    },
+    "message": "success"
+}
+

+
+
+
+
+

Note

+

During scale down, older instances are stopped first

+
+
+

Tip

+

If implementing automation on top of Drove APIs, just use the SCALE command to scale up or down instead of using START_INSTANCES or SUSPEND separately.

+
+
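As an illustration, an automation hook built on SCALE could be a thin wrapper like the sketch below (the scale_app helper name is made up; the payload fields are from the sample above):

# Hypothetical helper: scale an app to an absolute instance count
scale_app() {
    local app_id="$1" instances="$2"
    curl -s 'http://drove.local:7000/apis/v1/applications/operations' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Basic YWRtaW46YWRtaW4=' \
    --data "{
        \"type\": \"SCALE\",
        \"appId\": \"${app_id}\",
        \"requiredInstances\": ${instances},
        \"opSpec\": {\"timeout\": \"5m\", \"parallelism\": 8, \"failureStrategy\": \"STOP\"}
    }"
}
scale_app TEST_APP-1 2   # scale up to 2 instances
scale_app TEST_APP-1 0   # scale to zero (equivalent to suspend)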

Restarting an application

+

An application can be restarted by issuing the REPLACE_INSTANCES operation. In this case, clusterOpSpec.parallelism new containers are spun up first, and then an equivalent number of old containers are spun down. This ensures that enough capacity is maintained in the cluster to handle incoming traffic while the restart is underway.

+
+

Warning

+

If the cluster does not have sufficient capacity to spin up new containers, this operation will get stuck. So adjust your parallelism accordingly.

+
+

Preconditions +- Application must be in RUNNING state.

+

State Transition:

+
    +
  • RUNNINGREPLACE_INSTANCES_REQUESTEDRUNNING
  • +
+
+
+
+
drove -c local apps restart TEST_APP-1
+
+
+
+

Sample Request Payload +

{
+    "type": "REPLACE_INSTANCES",
+    "appId": "TEST_APP-1", //(1)!
+    "instanceIds": [], //(2)!
+    "opSpec": { //(3)!
+        "timeout": "1m",
+        "parallelism": 20,
+        "failureStrategy": "STOP"
+    }
+}
+

+
    +
  1. Application ID
  2. +
  3. Instances that need to be restarted. This is optional. If nothing is passed, all instances will be replaced.
  4. +
  5. Operation spec as mentioned in Cluster Op Spec
  6. +
+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "appId": "TEST_APP-1"
+    },
+    "message": "success"
+}
+

+
+
+
+
+

Tip

+

To replace specific instances, pass their application instance ids (starts with AI-...) in the instanceIds parameter in the JSON payload.

+
+

Stop or replace specific instances of an application

+

Application instances can be killed by issuing the STOP_INSTANCES operation. The default behaviour of Drove is to replace killed instances with new instances. Such new instances are always spun up before the specified (old) instances are stopped. If the skipRespawn parameter is set to true, the application instances are killed but no new instances are spun up to replace them.

+
+

Warning

+

If the cluster does not have sufficient capacity to spin up new containers, and skipRespawn is not set or set to false, this operation will get stuck.

+
+

Preconditions +- Application must be in RUNNING state.

+

State Transition:

+
    +
  • RUNNINGSTOP_INSTANCES_REQUESTEDRUNNING if final number of instances is non zero
  • +
  • RUNNINGSTOP_INSTANCES_REQUESTEDMONITORING if final number of instances is zero
  • +
+
+
+
+
drove -c local apps appinstances kill TEST_APP-1 AI-601d160e-c692-4ddd-8b7f-4c09b30ed02e
+
+
+
+

Sample Request Payload +

{
+    "type": "STOP_INSTANCES",
+    "appId" : "TEST_APP-1",//(1)!
+    "instanceIds" : [ "AI-601d160e-c692-4ddd-8b7f-4c09b30ed02e" ],//(2)!
+    "skipRespawn" : true,//(3)!
+    "opSpec": {//(4)!
+        "timeout": "5m",
+        "parallelism": 1,
+        "failureStrategy": "STOP"
+    }
+}
+

+
    +
  1. Application ID
  2. +
  3. Instance ids to be stopped
  4. +
  5. Do not spin up new containers to replace the stopped ones. This is set to false by default.
  6. +
  7. Operation spec as mentioned in Cluster Op Spec
  8. +
+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "appId": "TEST_APP-1"
+    },
+    "message": "success"
+}
+

+
+
+
+

Destroy an application

+

To remove an application deployment (an appName-version combo), the DESTROY command can be issued.

+

Preconditions:

+
    +
  • App must be in the MONITORING state (i.e. have zero running instances)
  • +
+

State Transition:

+
    +
  • MONITORINGDESTROY_REQUESTEDDESTROYED → none
  • +
+

To destroy an application, only its application ID is needed.

+

The CLI command needs to be issued or the following payload needs to be sent:

+
+
+
+
drove -c local apps destroy TEST_APP_1
+
+
+
+

Sample Request Payload +

{
+    "type": "DESTROY",
+    "appId" : "TEST_APP-1",//(1)!
+    "opSpec": {//(2)!
+        "timeout": "5m",
+        "parallelism": 2,
+        "failureStrategy": "STOP"
+    }
+}
+

+
    +
  1. Application ID
  2. +
  3. Operation spec as mentioned in Cluster Op Spec
  4. +
+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "appId": "TEST_APP-1"
+    },
+    "message": "success"
+}
+

+
+
+
+
+

Warning

+

All metadata for an app and its instances is completely obliterated from Drove's storage once the app is destroyed

+
+ + + + + + + + + + + + + +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/applications/outage.html b/applications/outage.html new file mode 100644 index 0000000..f47382e --- /dev/null +++ b/applications/outage.html @@ -0,0 +1,1559 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Outage Detection and Recovery - Drove Container Orchestrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + +

Outage Detection and Recovery

+

Drove tracks all instances for an app deployment in the cluster. It will ensure the required number of containers is always running on the cluster.

+

Instance health detection and tracking

+

The executor runs periodic checks on the container according to the check spec configuration. +- Runs readiness checks to ensure the container has started properly before declaring it healthy +- Runs health checks on the container at regular intervals to ensure it is in operating condition

+

Behavior for both is configured by setting the appropriate options in the application specification.

+

Result of such health checks (both success and failure) are reported to the controller. Appropriate action is taken to shut down containers that fail readiness or health checks.

+

Container crash

+

If container for an application crashes, Drove will automatically spin up a container in it's place.

+

Executor node hardware failure

+

If an executor node fails, instances running on that node will be lost. This is detected by the outage detector and new containers are spun up on other parts of the cluster.

+

Executor service temporary unavailability

+

On restart, executor service reads the metadata embedded in the container and registers them. It performs a reconciliation with the leader controller to kill any local containers if the unavailability was too long and controller has already spun up new alternatives.

+

Zombie (container) detection and cleanup

+

Executor service keeps track of all containers it is supposed to run by running periodic reconciliation with the leader controller. Any mismatch gets handled:

+
    +
  • if a container is found that is not supposed to be running, it is killed
  • +
  • If a container that is supposed to be running is not found, it is marked as lost and reported to the controller. This triggers the controller to spin up an alternative container on the cluster.
  • +

Application Specification

An application is defined using JSON. We use a sample configuration below to explain the options.

Sample Application Definition
{
    "name": "TEST_APP", // (1)!
    "version": "1", // (2)!
    "type": "SERVICE", // (3)!
    "executable": { //(4)!
        "type": "DOCKER", // (5)!
        "url": "ghcr.io/appform-io/perf-test-server-httplib",// (6)!
        "dockerPullTimeout": "100 seconds"// (7)!
    },
    "resources": [//(20)!
        {
            "type": "CPU",
            "count": 1//(21)!
        },
        {
            "type": "MEMORY",
            "sizeInMB": 128//(22)!
        }
    ],
    "volumes": [//(12)!
        {
            "pathInContainer": "/data",//(13)!
            "pathOnHost": "/mnt/datavol",//(14)!
            "mode" : "READ_WRITE"//(15)!
        }
    ],
    "configs" : [//(16)!
        {
            "type" : "INLINE",//(17)!
            "localFilename": "/testfiles/drove.txt",//(18)!
            "data" : "RHJvdmUgdGVzdA=="//(19)!
        }
    ],
    "placementPolicy": {//(23)!
        "type": "ANY"//(24)!
    },
    "exposedPorts": [//(8)!
        {
            "name": "main",//(9)!
            "port": 8000,//(10)!
            "type": "HTTP"//(11)!
        }
    ],
    "healthcheck": {//(25)!
        "mode": {//(26)!
            "type": "HTTP", //(27)!
            "protocol": "HTTP",//(28)!
            "portName": "main",//(29)!
            "path": "/",//(30)!
            "verb": "GET",//(31)!
            "successCodes": [//(32)!
                200
            ],
            "payload": "", //(33)!
            "connectionTimeout": "1 second" //(34)!
        },
        "timeout": "1 second",//(35)!
        "interval": "5 seconds",//(36)!
        "attempts": 3,//(37)!
        "initialDelay": "0 seconds"//(38)!
    },
    "readiness": {//(39)!
        "mode": {
            "type": "HTTP",
            "protocol": "HTTP",
            "portName": "main",
            "path": "/",
            "verb": "GET",
            "successCodes": [
                200
            ],
            "payload": "",
            "connectionTimeout": "1 second"
        },
        "timeout": "1 second",
        "interval": "3 seconds",
        "attempts": 3,
        "initialDelay": "0 seconds"
    },
    "exposureSpec": {//(42)!
        "vhost": "testapp.local", //(43)!
        "portName": "main", //(44)!
        "mode": "ALL"//(45)!
    },
    "env": {//(41)!
        "CORES": "8"
    },
    "args" : [//(54)!
        "./entrypoint.sh",
        "arg1",
        "arg2"
    ],
    "tags": { //(40)!
        "superSpecialApp": "yes_i_am",
        "say_my_name": "heisenberg"
    },
    "preShutdown": {//(46)!
        "hooks": [ //(47)!
            {
                "type": "HTTP",
                "protocol": "HTTP",
                "portName": "main",
                "path": "/",
                "verb": "GET",
                "successCodes": [
                    200
                ],
                "payload": "",
                "connectionTimeout": "1 second"
            }
        ],
        "waitBeforeKill": "3 seconds"//(48)!
    },
    "logging": {//(49)!
        "type": "LOCAL",//(50)!
        "maxSize": "100m",//(51)!
        "maxFiles": 3,//(52)!
        "compress": true//(53)!
    }
}
1. A human readable name for the application. This will remain constant for different versions of the app.
2. A version number. Drove does not enforce any format for this, but it is recommended to increment this for changes in the spec.
3. This should be fixed to SERVICE for an application/service.
4. Coordinates for the executable. Refer to Executable Specification for details.
5. Right now the only type supported is DOCKER.
6. Docker container address.
7. Timeout for container pull.
8. The ports to be exposed from the container.
9. A logical name for the port. This will be used to reference this port in other sections.
10. Actual port number as mentioned in the Dockerfile.
11. Type of port. Can be: HTTP, HTTPS, TCP, UDP.
12. Volumes to be mounted. Refer to Volume Specification for details.
13. Path that will be visible inside the container for this mount.
14. Actual path on the host machine for the mount.
15. Mount mode can be READ_WRITE or READ_ONLY.
16. Configuration to be injected as a file inside the container. Please refer to Config Specification for details.
17. Type of config. Can be INLINE, EXECUTOR_LOCAL_FILE, CONTROLLER_HTTP_FETCH or EXECUTOR_HTTP_FETCH. Specifies how Drove will get the contents to be injected.
18. File name for the config inside the container.
19. Serialized form of the data. This and other parameters will vary according to the type specified above.
20. List of resources required to run this application. Check Resource Requirements Specification for more details.
21. Number of CPU cores to be allocated.
22. Amount of memory to be allocated, expressed in megabytes.
23. Specifies how the container will be placed on the cluster. Check Placement Policy for details.
24. Type of placement. Can be ANY, ONE_PER_HOST, MAX_N_PER_HOST, MATCH_TAG, NO_TAG, RULE_BASED or COMPOSITE. The rest of the parameters in this section will depend on the type.
25. Health check to ensure the service is running fine. Refer to Check Specification for details.
26. Mode of health check; can be an API call or a command.
27. Type of this check spec. Type can be HTTP or CMD. The rest of the options in this example are HTTP specific.
28. API call protocol. Can be HTTP/HTTPS.
29. Port name as mentioned in the exposedPorts section.
30. HTTP path. Include query params here.
31. HTTP method. Can be GET, PUT or POST.
32. Set of HTTP status codes which can be considered as success.
33. Payload to be sent for POST and PUT calls.
34. Connection timeout for the port.
35. Timeout for the check run.
36. Interval between check runs.
37. Max attempts after which the overall check is considered to be a failure.
38. Time to wait before starting check runs.
39. Readiness check to pass for the container to be considered as ready. Refer to Check Specification for details.
40. Key-value metadata that can be used in external systems.
41. Custom environment variables. Additional variables are injected by Drove as well. See the Environment Variables section for details.
42. Specifies the virtual host on which this container is exposed.
43. FQDN for the virtual host.
44. Port name as specified in the exposedPorts section.
45. Mode for exposure. Set this to ALL for now.
46. Things to do before a container is shut down. Check Pre Shutdown Behaviour for more details.
47. Hooks (HTTP API call or shell command) to run before shutting down the container. Format is the same as health/readiness checks. Refer to HTTP Check Actions and Command Check Options for details.
48. Time to wait before killing the container. The container will be in UNREADY state during this time and hence won't have API calls routed to it via Drove Gateway.
49. Specifies how docker log files are configured. Refer to Logging Specification.
50. Log to a local file.
51. Maximum file size.
52. Number of latest log files to retain.
53. Log files will be compressed.
54. List of command line arguments. See Command Line Arguments for details.

Executable Specification

Right now, Drove supports only docker containers. However, as engines, both docker and podman are supported. Drove executors will fetch the executable directly from the registry based on the configuration provided.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set type to DOCKER. |
| URL | url | Docker container URL. |
| Timeout | dockerPullTimeout | Timeout for docker image pull. |
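A sample executable section, matching the application definition above:

{
    "type": "DOCKER",
    "url": "ghcr.io/appform-io/perf-test-server-httplib",
    "dockerPullTimeout": "100 seconds"
}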
Note

Drove supports docker registry authentication. This can be configured in the executor configuration file.

Resource Requirements Specification

This section specifies the hardware resources required to run the container. Right now, only CPU and MEMORY are supported as resource types that can be reserved for a container.

CPU Requirements

Specifies the number of cores to be assigned to the container.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set type to CPU for this. |
| Count | count | Number of cores to be assigned. |

Memory Requirements

Specifies the amount of memory to be allocated to a container.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set type to MEMORY for this. |
| Size | sizeInMB | Amount of memory (in megabytes) to be allocated. |

Sample

[
    {
        "type": "CPU",
        "count": 1
    },
    {
        "type": "MEMORY",
        "sizeInMB": 128
    }
]

Note

Both CPU and MEMORY configurations are mandatory.

Volume Specification

Files and directories can be mounted from the executor host into the container. The volumes section contains a list of volumes that need to be mounted.

| Name | Option | Description |
|------|--------|-------------|
| Path In Container | pathInContainer | Path that will be visible inside the container for this mount. |
| Path On Host | pathOnHost | Actual path on the host machine for the mount. |
| Mount Mode | mode | Mount mode can be READ_WRITE or READ_ONLY, to allow the containerized process to write or only read the volume. |
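A sample volumes entry, matching the application definition above:

{
    "pathInContainer": "/data",
    "pathOnHost": "/mnt/datavol",
    "mode" : "READ_WRITE"
}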
Info

We do not support mounting remote volumes as of now.

Config Specification

Drove supports injection of configuration files into containers. The specifications for the same are discussed below.

Inline config

Inline configuration can be added in the Application Specification itself. This will manifest as a file inside the container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to INLINE |
| Local Filename | localFilename | File name for the config inside the container. |
| Data | data | Base64 encoded string for the data. The value for this will be masked on the UI. |

Config file:

port: 8080
logLevel: DEBUG

Corresponding config specification:

{
    "type" : "INLINE",
    "localFilename" : "/config/service.yml",
    "data" : "cG9ydDogODA4MApsb2dMZXZlbDogREVCVUcK"
}

Warning

The full base64-encoded config data will get stored in Drove ZK and will be pushed to executors inline. It is not recommended to stream large config files to containers using this method. This will probably need additional configuration on your ZK cluster.

Locally loaded config

Config files can be loaded from a path on the executor host directly. Such files can be distributed to the executor host using existing configuration management systems such as OpenTofu, Salt etc.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to EXECUTOR_LOCAL_FILE |
| Local Filename | localFilename | File name for the config inside the container. |
| File path | filePathOnHost | Path to the config file on the executor host. |

Sample config specification:

{
    "type" : "EXECUTOR_LOCAL_FILE",
    "localFilename" : "/config/service.yml",
    "filePathOnHost" : "/mnt/configs/myservice/config.yml"
}

Controller fetched Config

Config files can be fetched from a remote server by the controller. Once fetched, these will be streamed to the executor as part of the instance specification for starting a container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to CONTROLLER_HTTP_FETCH |
| Local Filename | localFilename | File name for the config inside the container. |
| HTTP Call Details | http | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

{
    "type" : "CONTROLLER_HTTP_FETCH",
    "localFilename" : "/config/service.yml",
    "http" : {
        "protocol" : "HTTP",
        "hostname" : "configserver.internal.yourdomain.net",
        "port" : 8080,
        "path" : "/configs/myapp",
        "username" : "appuser",
        "password" : "secretpassword"
    }
}

Note

The controller will make an API call every single time it asks an executor to spin up a container. Please make sure to account for this in your configuration management system.

Executor fetched Config

Config files can be fetched from a remote server by the executor before spinning up a container. Once fetched, the payload will be injected as a config file into the container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to EXECUTOR_HTTP_FETCH |
| Local Filename | localFilename | File name for the config inside the container. |
| HTTP Call Details | http | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

{
    "type" : "EXECUTOR_HTTP_FETCH",
    "localFilename" : "/config/service.yml",
    "http" : {
        "protocol" : "HTTP",
        "hostname" : "configserver.internal.yourdomain.net",
        "port" : 8080,
        "path" : "/configs/myapp",
        "username" : "appuser",
        "password" : "secretpassword"
    }
}

Note

All executors will make an API call every single time they spin up a container for this application. Please make sure to account for this in your configuration management system.

HTTP Call Specification

This section details the options that can be set when making HTTP calls to a configuration management system from controllers or executors.

The following options are available for HTTP calls:

| Name | Option | Description |
|------|--------|-------------|
| Protocol | protocol | Protocol to use for the upstream call. Can be HTTP or HTTPS. |
| Hostname | hostname | Host to call. |
| Port | port | Provide a custom port. Defaults to 80 for HTTP and 443 for HTTPS. |
| API Path | path | Path component of the URL. Include query parameters here. Defaults to / |
| HTTP Method | verb | Type of call, use GET, POST or PUT. Defaults to GET. |
| Success Codes | successCodes | List of HTTP status codes which are considered as success. Defaults to [200] |
| Payload | payload | Data to be used for POST and PUT calls |
| Connection Timeout | connectionTimeout | Timeout for the upstream connection. |
| Operation timeout | operationTimeout | Timeout for the actual operation. |
| Username | username | Username to be used for basic auth. This field is masked out on the UI. |
| Password | password | Password to be used for basic auth. This field is masked on the UI. |
| Authorization Header | authHeader | Data to be passed in the HTTP Authorization header. This field is masked on the UI. |
| Additional Headers | headers | Any other headers to be passed to the upstream in the HTTP calls. This is a map of header names to values. |
| Skip SSL Checks | insecure | Skip hostname and certification checks during SSL handshake with the upstream. |
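A sample http section with basic auth, along the lines of the config-fetch examples above; the hostname and credentials are carried over from those examples, and the timeout values here are illustrative:

{
    "protocol" : "HTTP",
    "hostname" : "configserver.internal.yourdomain.net",
    "port" : 8080,
    "path" : "/configs/myapp",
    "verb" : "GET",
    "successCodes" : [ 200 ],
    "connectionTimeout" : "1 second",
    "operationTimeout" : "5 seconds",
    "username" : "appuser",
    "password" : "secretpassword"
}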

Placement Policy Specification

Placement policy governs how Drove deploys containers on the cluster. The following sections discuss the different placement policies available and how they can be configured to achieve optimal placement of containers.

Warning

All policies work only at a {appName, version} combination level. They will not enforce constraints at an appName level. This means that for something like a one-per-node placement, multiple containers for the same appName can run on the same host if multiple deployments with different versions are active in a cluster. The same applies to all policies like N per host and so on.
Important details about executor tagging

- All hosts have at least one tag: their own hostname.
- The MATCH_TAG policy will consider these as valid tags. This can be used to place containers on specific hosts if needed.
- This is handled specially in all other policy types; they will consider executors having only the hostname tag as untagged.
- A host with a tag (other than its hostname) will not have any containers running on it unless containers are placed on it specifically using the MATCH_TAG policy.

Any Placement

Containers for a {appName, version} combination can run on any un-tagged executor host.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put ANY as policy. |

Sample:

{
    "type" : "ANY"
}

Tip

For most use-cases, this is the placement policy to use.

One Per Host Placement

Ensures that only one container for a particular {appName, version} combination is running on an executor host at a time.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put ONE_PER_HOST as policy. |

Sample:

{
    "type" : "ONE_PER_HOST"
}

Max N Per Host Placement

Ensures that at most N containers for a {appName, version} combination are running on an executor host at a time.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put MAX_N_PER_HOST as policy. |
| Max count | max | The maximum number of containers that can run on an executor. Range: 1-64 |

Sample:

{
    "type" : "MAX_N_PER_HOST",
    "max": 3
}

Match Tag Placement

Ensures that containers for a {appName, version} combination run only on executor hosts that have the tag mentioned in the policy.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put MATCH_TAG as policy. |
| Tag | tag | The tag to match. |

Sample:

{
    "type" : "MATCH_TAG",
    "tag": "gpu_enabled"
}

No Tag Placement

Ensures that containers for a {appName, version} combination are running on executor hosts that have no tags.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put NO_TAG as policy. |

Sample:

{
    "type" : "NO_TAG"
}

Info

The NO_TAG policy is mostly for internal use, and does not need to be specified when deploying containers that do not need any special placement logic.

Composite Policy Based Placement

Composite policy can be used to combine policies together to create complicated placement requirements.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put COMPOSITE as policy. |
| Policies | policies | List of policies to combine. |
| Combiner | combiner | Can be AND or OR, signifying all-match and any-match logic respectively on the policies mentioned. |

Sample:

{
    "type" : "COMPOSITE",
    "policies": [
        {
            "type": "ONE_PER_HOST"
        },
        {
            "type": "MATCH_TAG",
            "tag": "gpu_enabled"
        }
    ],
    "combiner" : "AND"
}

The above policy will ensure that only one container of the relevant {appName, version} will run on GPU-enabled machines.

Tip

It is easy to get into situations where no executors match complicated placement policies. Internally, we tend to keep things rather simple: we use the ANY placement for most cases, and maybe tags in a few places with over-provisioning or for hosts having special hardware 🙂

Environment variables

This config can be used to inject custom environment variables into containers. The values are defined as part of the deployment specification, are the same across the cluster, and cannot be modified from inside the container (i.e. any overrides from inside the container will not be visible across the cluster).

Sample:

{
    "MY_VARIABLE_1": "fizz",
    "MY_VARIABLE_2": "buzz"
}

The following environment variables are injected by Drove into all containers:

| Variable Name | Value |
|---------------|-------|
| HOST | Hostname where the container is running. This is for Marathon compatibility. |
| PORT_{PORT_NUMBER} | A variable for every port specified in the exposedPorts section. The value is the actual host port that the specified port is mapped to. For example, if ports 8080 and 8081 are specified, two variables called PORT_8080 and PORT_8081 will be injected. |
| DROVE_EXECUTOR_HOST | Hostname where the container is running. |
| DROVE_CONTAINER_ID | Container that is deployed |
| DROVE_APP_NAME | App name as specified in the Application Specification |
| DROVE_INSTANCE_ID | Actual instance ID generated by Drove |
| DROVE_APP_ID | Application ID as generated by Drove |
| DROVE_APP_INSTANCE_AUTH_TOKEN | A JWT string generated by Drove that can be used by this container to call /apis/v1/internal/... APIs. |
Warning

Do not pass secrets using environment variables. These variables are all visible on the UI as-is. Please use Configs to inject secrets files and so on.

Command line arguments

A list of command line arguments that are sent to the container engine to execute inside the container. This provides a way for you to configure your container's behaviour based on such arguments. Please refer to the docker documentation for details.
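A sample args section, as used in the application definition above:

[
    "./entrypoint.sh",
    "arg1",
    "arg2"
]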
Danger

This might have security implications from a system point of view. As such, Drove provides administrators a way to disable passing of arguments at the cluster level by setting disableCmdlArgs to true in the controller configuration.

Check Specification

One of the cornerstones of managing applications on the cluster is keeping track of instance health and managing instance life cycle depending on health state. We need to define how to monitor health for containers accordingly. Checks are executed on application instances and a check result is generated. The result consists of the following:

- Status - Healthy, Unhealthy, or Stopped if the container is already in stopping state
- Message - Any error message as generated by a specific checker

Common Options

| Name | Option | Description |
|------|--------|-------------|
| Mode | mode | The definition of an HTTP call or a command to be executed in the container. See the following sections for details. |
| Timeout | timeout | Duration for which we wait before declaring a check as failed |
| Interval | interval | Interval at which the check will be retried |
| Attempts | attempts | Number of times a check is retried before it is declared as a failure |
| Initial Delay | initialDelay | Delay before executing the check for the first time. |
Note

initialDelay is ignored when readiness checks and health checks are run in the recovery path, as the container is already running at that point in time.

HTTP Check Options

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Fixed to HTTP for the HTTP checker |
| Protocol | protocol | HTTP or HTTPS call to be made |
| Port Name | portName | The name of the container port to make the HTTP call on, as specified in the exposedPorts section in the Application Spec |
| Path | path | The API path to call |
| HTTP method | verb | The HTTP verb/method to invoke. GET, PUT and POST are supported here |
| Success Codes | successCodes | A set of HTTP status codes that we should consider as a success from this API. |
| Payload | payload | A string payload that we can pass if the verb is POST or PUT |
| Connection Timeout | connectionTimeout | Maximum time for which the checker will wait for the connection to be set up with the container. |
| Insecure | insecure | Skip hostname and certificate checks for HTTPS ports during checks. |

Command Check Options

| Field | Option | Description |
|-------|--------|-------------|
| Type | type | Fixed to CMD for the command checker |
| Command | command | Command to execute in the container. (Equivalent to docker exec -it <container> <command>) |
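A minimal sketch of a CMD check spec built from the two fields above; the script path is illustrative and would have to exist inside your container image:

{
    "type" : "CMD",
    "command" : "/usr/local/bin/healthcheck.sh"
}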

Exposure Specification

The exposure spec is used to specify the virtual host Drove Gateway exposes to the outside world for communication with the containers.

The following information needs to be specified:

| Name | Option | Description |
|------|--------|-------------|
| Virtual Host | vhost | The virtual host to be exposed on NGinx. This should be a fully qualified domain name. |
| Port Name | portName | The port name to be exposed on the vhost. Port names are defined in the exposedPorts section. |
| Exposure Mode | mode | Use ALL here for now. Signifies that all healthy instances of the app are exposed to traffic. |

Sample:

{
    "vhost": "testapp.mydomain",
    "portName": "main",
    "mode": "ALL"
}

Note

Application instances in any state other than HEALTHY are not considered for exposure. Please check the Application Instance State Machine for an understanding of instance states.

Configuring Pre Shutdown Behaviour

Before a container is shut down, it is desirable to ensure things are spun down properly. This behaviour can be configured in the preShutdown section of the configuration.

| Name | Option | Description |
|------|--------|-------------|
| Hooks | hooks | List of API calls and commands to be run on the container before it is killed. Each hook is either an HTTP Call Spec or a Command Spec. |
| Wait Time | waitBeforeKill | Time to wait before killing the container. |

Sample

{
    "hooks": [
        {
            "type": "HTTP",
            "protocol": "HTTP",
            "portName": "main",
            "path": "/",
            "verb": "GET",
            "successCodes": [
                200
            ],
            "payload": "",
            "connectionTimeout": "1 second"
        }
    ],
    "waitBeforeKill": "3 seconds"
}

Note

The waitBeforeKill timed wait kicks in after all the hooks have been executed.

Logging Specification

Can be used to configure how container logs are managed on the system.

Note

This section affects the docker log driver. Drove will continue to stream logs to its own logger, which can be configured at the executor level through the executor configuration file.

Local Logger configuration

This is used to configure the json-file log driver.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to LOCAL |
| Max Size | maxSize | Maximum file size. Anything bigger than this will lead to rotation. |
| Max Files | maxFiles | Maximum number of log files to keep. Range: 1-100 |
| Compress | compress | Enable log file compression. |
Tip

If the logging section is omitted, the following configuration is applied by default:

- File size: 10m
- Number of files: 3
- Compression: on
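Expressed as a logging section using the options above, these defaults correspond to something like:

{
    "type": "LOCAL",
    "maxSize": "10m",
    "maxFiles": 3,
    "compress": true
}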

Rsyslog configuration

In case users want to stream logs to an rsyslog server, the logging configuration needs to be set to RSYSLOG mode.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to RSYSLOG |
| Server | server | URL for the rsyslog server. |
| Tag Prefix | tagPrefix | Prefix to add at the start of a tag |
| Tag Suffix | tagSuffix | Suffix to add at the end of a tag. |
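A sketch of an RSYSLOG logging section built from the fields above; the server address format and the tag affixes are illustrative assumptions, not values confirmed by this document:

{
    "type": "RSYSLOG",
    "server": "tcp://rsyslog.internal.yourdomain.net:514",
    "tagPrefix": "drove-",
    "tagSuffix": "-prod"
}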
Note

The default tag is the DROVE_INSTANCE_ID. The tagPrefix and tagSuffix will go before and after this, respectively.
b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOlCnqEu92Fr1MmWUlfCRc4EsA.woff2 new file mode 100644 index 0000000..660850e Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOlCnqEu92Fr1MmWUlfCRc4EsA.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/roboto/v32/KFOlCnqEu92Fr1MmWUlfChc4EsA.woff2 b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOlCnqEu92Fr1MmWUlfChc4EsA.woff2 new file mode 100644 index 0000000..327eb66 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOlCnqEu92Fr1MmWUlfChc4EsA.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/roboto/v32/KFOlCnqEu92Fr1MmWUlfCxc4EsA.woff2 b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOlCnqEu92Fr1MmWUlfCxc4EsA.woff2 new file mode 100644 index 0000000..c175453 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOlCnqEu92Fr1MmWUlfCxc4EsA.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu4WxKOzY.woff2 b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu4WxKOzY.woff2 new file mode 100644 index 0000000..a7f32b6 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu4WxKOzY.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu4mxK.woff2 b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu4mxK.woff2 new file mode 100644 index 0000000..2d7b215 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu4mxK.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu5mxKOzY.woff2 b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu5mxKOzY.woff2 new file mode 100644 index 0000000..a4962e9 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu5mxKOzY.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu72xKOzY.woff2 b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu72xKOzY.woff2 new file mode 100644 index 0000000..e3d708f Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu72xKOzY.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu7GxKOzY.woff2 b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu7GxKOzY.woff2 new file mode 100644 index 0000000..20c87e6 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu7GxKOzY.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu7WxKOzY.woff2 b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu7WxKOzY.woff2 new file mode 100644 index 0000000..cfd043d Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu7WxKOzY.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu7mxKOzY.woff2 b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu7mxKOzY.woff2 new file mode 100644 index 0000000..47ce460 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/roboto/v32/KFOmCnqEu92Fr1Mu7mxKOzY.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSV0mf0h.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSV0mf0h.woff2 new file mode 100644 index 0000000..022274d Binary files /dev/null and 
b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSV0mf0h.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSZ0mf0h.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSZ0mf0h.woff2 new file mode 100644 index 0000000..48edd1b Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSZ0mf0h.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSd0mf0h.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSd0mf0h.woff2 new file mode 100644 index 0000000..cb41535 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSd0mf0h.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSh0mQ.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSh0mQ.woff2 new file mode 100644 index 0000000..1d988a3 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSh0mQ.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSt0mf0h.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSt0mf0h.woff2 new file mode 100644 index 0000000..11e6a46 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSt0mf0h.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSx0mf0h.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSx0mf0h.woff2 new file mode 100644 index 0000000..50fb8e7 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xTDF4xlVMF-BfR8bXMIhJHg45mwgGEFl0_3vrtSM1J-gEPT5Ese6hmHSx0mf0h.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtElOUlYIw.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtElOUlYIw.woff2 new file mode 100644 index 0000000..1f1c97f Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtElOUlYIw.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEleUlYIw.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEleUlYIw.woff2 new file mode 100644 index 0000000..1623005 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEleUlYIw.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEluUlYIw.woff2 
b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEluUlYIw.woff2 new file mode 100644 index 0000000..6f232c3 Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEluUlYIw.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEm-Ul.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEm-Ul.woff2 new file mode 100644 index 0000000..a3e5aef Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEm-Ul.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEmOUlYIw.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEmOUlYIw.woff2 new file mode 100644 index 0000000..f73f27d Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEmOUlYIw.woff2 differ diff --git a/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEn-UlYIw.woff2 b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEn-UlYIw.woff2 new file mode 100644 index 0000000..135d06e Binary files /dev/null and b/assets/external/fonts.gstatic.com/s/robotomono/v23/L0xdDF4xlVMF-BfR8bXMIjhOsXG-q2oeuFoqFrlnAIe2Imhk1T8rbociImtEn-UlYIw.woff2 differ diff --git a/assets/external/unpkg.com/iframe-worker/shim.js b/assets/external/unpkg.com/iframe-worker/shim.js new file mode 100644 index 0000000..5f1e232 --- /dev/null +++ b/assets/external/unpkg.com/iframe-worker/shim.js @@ -0,0 +1 @@ +"use strict";(()=>{function c(s,n){parent.postMessage(s,n||"*")}function d(...s){return s.reduce((n,e)=>n.then(()=>new Promise(r=>{let t=document.createElement("script");t.src=e,t.onload=r,document.body.appendChild(t)})),Promise.resolve())}var o=class extends EventTarget{constructor(e){super();this.url=e;this.m=e=>{e.source===this.w&&(this.dispatchEvent(new MessageEvent("message",{data:e.data})),this.onmessage&&this.onmessage(e))};this.e=(e,r,t,i,m)=>{if(r===`${this.url}`){let a=new ErrorEvent("error",{message:e,filename:r,lineno:t,colno:i,error:m});this.dispatchEvent(a),this.onerror&&this.onerror(a)}};let r=document.createElement("iframe");r.hidden=!0,document.body.appendChild(this.iframe=r),this.w.document.open(),this.w.document.write(` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + +

Anatomy of a Drove Cluster

+

The following diagram provides a high-level overview of a typical Drove cluster.
(Figure: Drove Cluster)
The overall topology consists of the following components:

+
    +
  • An Apache ZooKeeper cluster for state persistence and coordination
  • +
  • A set of controller nodes, one of which (the leader) manages the cluster
  • +
  • A set of executor nodes on which the containers actually execute
  • +
  • NGinx + drove-gateway nodes that expose virtual hosts for the leader controller as well as for the vhosts defined for the various applications running on the cluster
  • +
+

Apache ZooKeeper

+

ZooKeeper is a central component of a Drove cluster. It is used in the following ways (a short inspection example follows the list):

+
    +
  • As a store through which cluster components like the controller and executor discover each other
  • +
  • For electing the leader controller in the cluster
  • +
  • As storage for Application and Task Specifications
  • +
  • As an asynchronous communication channel/transient store for real-time information about controller and executor state in the cluster
  • +
+
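For illustration, the data Drove keeps in ZooKeeper can be inspected with the standard ZooKeeper CLI. This is a minimal sketch, assuming the default drove namespace (configurable, as described in the setup sections later in this document) and one of the ZooKeeper addresses used in the samples below; the child znodes under the namespace vary by version and are not enumerated here.

# Connect to one of the ensemble members and list the Drove namespace
zkCli.sh -server 192.168.3.10:2181
ls /drove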

Controller

+

The controller service is the brains of a Drove cluster. The role of the controller consists of the following:

+
    +
  • Ensure it has reasonably up-to-date information about the cluster topology and free/used resources
  • +
  • Track executor status (blacklisted/online/offline etc.) and tagging, and take corrective action in case some executors become inaccessible for whatever reason
  • +
  • Manage container placement to ensure that application and task containers get placed according to the provided placement configuration/spec
  • +
  • Manage NUMA node and core affinity ensuring that instances get deployed optimally on cores and NUMA nodes without stepping on each other
  • +
  • Provide a UI for users to consume data about the cluster, applications and tasks
  • +
  • Provide APIs for systems to provision apps, tasks and manage the cluster
  • +
  • Provide event stream for other tools and services to follow what is happening on the cluster
  • +
  • Provide APIs to list container level logs and provide real-time offset based polling of log contents for application and task instances
  • +
  • Implement leader election based HA so that only one controller is active (the leader) at a time.
  • +
  • All decisions regarding scheduling, state machine management and recovery are taken only by the leader
  • +
  • Manage the lifecycle of all applications deployed on the cluster.
      +
    • Maintain the required number of application instances as specified during deployment. This means that the controller has to monitor all applications running on all nodes, replace any failed instances and kill any spurious ones to ensure a constant number of instances on the cluster. The required number of instances is maintained as the expected count; the current number of instances is maintained as the running count.
    • +
    • Provide a way to adjust the number of instances for this application. This lets users scale applications up and down as needed.
    • +
    • Provide a way to restart all instances of the application. This would mean the controller would have to orchestrate a continuous string of start-kill operations across instances running on the cluster.
    • +
    • Graceful shutdown/suspension of an application across the cluster. This comes as a natural extension of the above and is mostly a scale-down operation with the expected count set to zero.
    • +
    +
  • +
  • Manage task lifecycle
      +
    • Maintain task state-machine by scheduling it on executors and ensuring it reaches terminal state
    • +
    • Provide mechanisms to cancel tasks
    • +
    +
  • +
  • Reconcile stale and dead instances for applications and tasks and take corrective measures to ensure steady state if necessary
  • +
  • Migrate application instances away from blacklisted executors
  • +
  • Send command messages to executors to start and stop instances with retries and failure recovery
  • +
+

Executors

+

Executors are the agents running on the nodes where the containers are deployed. The role of the executors is the following:

+
    +
  • Publish hardware topology of the machine to the controller on startup.
  • +
  • Manage container lifecycle including:
      +
    • Pulling container images from the Docker repository, with optional authentication
    • +
    • Start containers with proper options for pinning containers to specific NUMA nodes and cores as specified by controller
        +
      • Data for an instance is stored as specific Docker label values on the containers themselves (see the inspection sketch after this list)
      • +
      +
    • +
    • Run HTTP call or shell command based readiness checks to ensure application container is ready to serve traffic based on readiness checks specification in start message
    • +
    • Monitor application container health by running periodic HTTP call or shell command based health checks as specified by controller in start message
    • +
    • Track normal (for tasks) or abnormal (for application instances) container exits.
        +
      • For tasks, the exit code is collected and used to deduce if task succeeded (exit code is 0) or failed (exit code is non-zero)
      • +
      • For application containers, the expectation is for the container to stop only when explicitly requested and hence all exits are considered as failures and handled accordingly
      • +
      +
    • +
    • Stop application containers on request from controller
    • +
    • Run any pre-shutdown hook calls as specified in the application specification before killing container
    • +
    • Cleanup container volumes etc
    • +
    • Cleanup docker images (if local image caching is turned off which is the default behaviour)
    • +
    +
  • +
  • Send regular local node status updates to ZooKeeper every 20 seconds
  • +
  • Send instant updates by making direct HTTP calls to the leader controller when anything changes for any running containers and for every step of the container state machine execution for both task and application instances to allow for faster updates
  • +
  • Recover containers on process restart based on the metadata stored as labels on the running container. This data is reconciled with a snapshot of the expected instances on the node as received from the leader controller at that point in time.
  • +
  • Find and kill any zombie containers that are not supposed to exist on that node. The check is done every 30 seconds.
  • +
  • Provide container log-file listing and offset based content delivery APIs for containers
  • +
+
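Since instance metadata lives on the containers themselves as labels, it can be inspected directly with the Docker CLI. A minimal sketch; the actual label keys Drove writes are not enumerated here and vary by version:

# Dump all labels on a running container to see the Drove-injected metadata
docker inspect --format '{{ json .Config.Labels }}' <container-id>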

NGinx and Drove-Gateway

+

Almost all of the traffic between service containers is routed via the internal Ranger based service discovery system at PhonePe. However, traffic from the edge, as well as traffic between different protected environments, is routed using the well-established virtual host (and additionally, in some unusual cases, header) based routing.

+
    +
  • All applications on Drove can specify a Vhost and a port name as endpoint for such routing.
  • +
  • Upstream information for such VHosts or endpoints is available via an API on the leader Drove controller.
  • +
  • This information can be used to configure any load-balancer, router or reverse proxy to expose applications running on Drove to the outside world.

    We modified an existing project called Nixy so that it gets the upstream information from Drove instead of Marathon. Nixy plays the following roles in a cluster:

  • Track the leader controller for a Drove cluster by making ping calls to all specified controllers
  • Provide a special data structure that can be used by a template to expose a vhost that points to the leader controller in a Drove cluster. This can be used by any tools that need to interact with a Drove cluster for deployments, monitoring, as well as callback endpoints for OAuth etc.
  • +
  • Listen to relevant events from the Drove cluster to trigger upstream refresh as necessary
  • +
  • Provide data structures that include the vhost, upstream endpoints (host:port) and metadata (application level tags) that can be used to build templates that generate NGinx configurations, enabling progressively more complicated routing of calls from downstream clients to upstreams hosted on Drove clusters. The data structure exposed to templates groups all upstream host:port tuples by vhost. This allows multiple deployments for the same vhost to exist, which is needed in a variety of situations, including online updates of services. (A rough sketch of such rendered output appears after this list.)
  • +
  • Support username/password based authentication as well as header based authentication (used internally) to Drove clusters.
  • +
  • Support for both NGinx Plus and OSS products. Drove-Nixy can make appropriate API calls to the corresponding NGinx Plus server to only refresh an existing vhost on topology change, and effect a full reload when new vhosts are detected. This ensures that there are no connection drops for critical path applications where NGinx Plus might be used. It also solves the issue of NGinx workers going into a hung state due to frequent reloads on busy clusters like our dev testing environment.
  • +
+
+
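To make this concrete, the NGinx configuration rendered by such a template might look roughly like the following. This is only a sketch: the vhost name, ports and upstream addresses are made up, and the actual output depends entirely on the Nixy template in use.

# Hypothetical rendered output for one vhost with two Drove-hosted upstreams
upstream users_internal {
    server 192.168.3.21:32001;
    server 192.168.3.22:32004;
}

server {
    listen 80;
    server_name users.internal;
    location / {
        proxy_pass http://users_internal;
    }
}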

Tip

+

The NGinx deployment is standard across all Drove clusters. However, for clusters that receive a lot of traffic through NGinx, the cluster exposing the vhost for Drove itself might be separated from the one exposing the application virtual hosts, to allow for easy scalability of the latter. The templates for the two are configured differently as needed.

+
+

Other components

+

There are a few more components that are used for operational management and observability.

+

Telegraf

+

PhonePe’s internal metric management system uses an HTTP based metric collector. Telegraf is installed on all Drove nodes to collect metrics from the metrics port (the admin connector on Dropwizard) and push that information to our metric ingestion system. This information is then used to build dashboards, as well as by our anomaly detection and alerting systems.

+
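As a rough sketch of what this collection could look like with stock Telegraf (the endpoint path, port and data format here are assumptions based on the Dropwizard admin connector shown in the setup sections later in this document, not a prescribed configuration):

# Poll the Dropwizard admin connector for metrics in JSON form
[[inputs.http]]
  urls = ["http://localhost:10001/metrics"]
  data_format = "json"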

Log Management

+

Drove provides a special logger called drove that can be configured to handle compression, rotation and archival of container logs. Such container logs are stored on specialised partitions, by application/application-instance-id for application instances or by source-app-name/task-id for task instances. PhonePe’s standardised log rotation tools are used to monitor and ship such logs to our central log management system. The same can be replaced or enhanced by running something like promtail on the Drove logs to ship them to tools like Grafana Loki.
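For the promtail option, a scrape configuration for the container logs might look roughly like this. A sketch only, assuming the /logs/applogs/ directory used by the drove appender in the executor configuration later in this document:

# Hypothetical promtail scrape config for Drove container logs
scrape_configs:
  - job_name: drove-containers
    static_configs:
      - targets: [localhost]
        labels:
          job: drove
          __path__: /logs/applogs/**/*.log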

diff --git a/cluster/setup/controller.html b/cluster/setup/controller.html
new file mode 100644

Setting up Controllers

+

Controllers are the brains of a Drove cluster. For HA, at least 2 controllers should be set up.

+

Please note the following behaviour about controllers:

+
    +
  • Only one controller is leader at a time. A leader controller does not relinquish control till the process is stopped or dies
  • +
  • A controller process will die when it loses connectivity to Zookeeper
  • +
  • The process/container for the controller should keep restarting till it gets connectivity
  • +
  • All decisions for the cluster are taken by the leader controller only
  • +
  • During maintenance and package upgrades etc, it is better to roll changes out on non-leaders first and then do the leader at the end
  • +
  • The controller process holds all metadata about the cluster, the current states and other information in memory.
      +
    • Some of this information is backed by Zookeeper based storage layer
    • +
    • The other information is recreated dynamically based on updates from executors
    • +
    +
  • +
  • Controllers being down does not affect running containers on executors
  • +
+

Controller Configuration File Reference

+

The Drove controller is built on the Dropwizard framework. The configuration for the service is set using a YAML file which needs to be injected into the container. A typical controller configuration file looks like the following:

+
server: #(1)!
+  applicationConnectors: #(2)!
+    - type: http
+      port: 4000
+  adminConnectors: #(3)!
+    - type: http
+      port: 4001
+  applicationContextPath: / #(4)!
+  requestLog: #(5)!
+    appenders:
+      - type: console
+        timeZone: ${DROVE_TIMEZONE}
+      - type: file
+        timeZone: ${DROVE_TIMEZONE}
+        currentLogFilename: /logs/drove-controller-access.log
+        archivedLogFilenamePattern: /logs/drove-controller-access.log-%d-%i
+        archivedFileCount: 3
+        maxFileSize: 100MiB
+
+
+logging: #(6)!
+  level: INFO
+  loggers:
+    com.phonepe.drove: ${DROVE_LOG_LEVEL}
+
+  appenders:
+    - type: console #(7)!
+      threshold: ALL
+      timeZone: ${DROVE_TIMEZONE}
+      logFormat: "%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n"
+    - type: file #(8)!
+      threshold: ALL
+      timeZone: ${DROVE_TIMEZONE}
+      currentLogFilename: /logs/drove-controller.log
+      archivedLogFilenamePattern: /logs/drove-controller.log-%d-%i
+      archivedFileCount: 3
+      maxFileSize: 100MiB
+      logFormat: "%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n"
+      archive: true
+
+
+zookeeper: #(9)!
+  connectionString: ${ZK_CONNECTION_STRING}
+
+clusterAuth: #(10)!
+  secrets:
+  - nodeType: CONTROLLER
+    secret: ${DROVE_CONTROLLER_SECRET}
+  - nodeType: EXECUTOR
+    secret: ${DROVE_EXECUTOR_SECRET}
+
+userAuth: #(11)!
+  enabled: true
+  users:
+    - username: admin
+      password: ${DROVE_ADMIN_PASSWORD}
+      role: EXTERNAL_READ_WRITE
+    - username: guest
+      password: ${DROVE_GUEST_PASSWORD}
+      role: EXTERNAL_READ_ONLY
+
+instanceAuth: #(12)!
+  secret: ${DROVE_INSTANCE_AUTH_SECRET}
+
+options: #(13)!
+  maxStaleInstancesCount: 3
+  staleCheckInterval: 1m
+  staleAppAge: 1d
+  staleInstanceAge: 18h
+  staleTaskAge: 1d
+  clusterOpParallelism: 4
+
+
    +
  1. Server listener configuration. See Dropwizard Server Configuration for the different options.
  2. +
  3. Main port configuration. This is where the UI and APIs will be exposed. Check connector configuration docs for details.
  4. +
  5. Admin port. You can take thread dumps, metrics, run healthchecks on the Drove controller on this port.
  6. +
  7. Base path for UI. Keep this as is.
  8. +
  9. Access logs configuration. See requestLog docs.
  10. +
  11. Main logging configuration. See logging docs.
  12. +
  13. Log to console. Useful in docker-compose.
  14. +
  15. Log to rotating files. Useful for running servers.
  16. +
  17. Configure how to connect to Zookeeper See Zookeeper Config for details.
  18. +
  19. Configuration for authentication between nodes in the cluster. Please check intra node auth config for details.
  20. +
  21. Configure user authentication to access the cluster. Please check User auth config for details.
  22. +
  23. Signing secret for JWT to be embedded in application and task instances. Check Instance auth config for details.
  24. +
  25. Special options to configure controller behaviour. See Controller Options for details.
  26. +
+
+

Tip

+

In case you do not want to expose the admin APIs outside the host, please set bindHost in the admin connectors section.

+
adminConnectors:
+  - type: http
+    port: 10001
+    bindHost: 127.0.0.1
+
+
+

Zookeeper Connection Configuration

+

The following details can be configured.

+ + + + + + + + + + + + + + + + + + + + +
Name | Option | Description
Connection String | connectionString | The connection string, of the form: zkserver:2181,zkserver2:2181...
Data namespace | namespace | The top level node inside which all Drove data will be scoped. Defaults to drove if not set.
+

Sample

+
zookeeper:
+  connectionString: "192.168.3.10:2181,192.168.3.11:2181,192.168.3.12:2181"
+  namespace: drovetest
+
+

Intra Node Authentication Configuration

+

Communication between controller and executor is protected by a shared-secret based authentication. The following configuration is meant to configure this. This section consists of a list of 2 members:

+
    +
  • Config for controller to talk to executors
  • +
  • Config for executors to talk to controller
  • +
+

Each section consists of the following:

+ + + + + + + + + + + + + + + + + + + + +
Name | Option | Description
Node Type | nodeType | Type of node in the cluster. Can be CONTROLLER or EXECUTOR
Secret | secret | The actual secret to be passed.
+

Sample +

clusterAuth:
+  secrets:
+  - nodeType: CONTROLLER
+    secret: ControllerSecretValue
+  - nodeType: EXECUTOR
+    secret: ExecutorSecret
+

+
+

Danger

+

The values are passed in the header as is. Please manage the config file ownership to ensure that the files are not world readable.

+
+
+

Tip

+

You can use pwgen -s 32 to generate secure random strings for usage as secrets.

+
+

User Authentication Configuration

+

This section is used to configure user details for humans and other systems that need to call Drove APIs or access the Drove UI. This is implemented using basic auth.

+

The configuration consists of:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Name | Option | Description
Enabled | enabled | Enable basic auth for the cluster
Encoding | encoding | The actual encoding of the password. Can be PLAIN or CRYPT
Caching | cachingPolicy | Caching policy for the authentication and authorization of the user. Please check CaffeineSpec docs for more details. Set to maximumSize=500, expireAfterAccess=30m by default
List of users | users | A list of users recognized by the system
+

Each entry in the user list consists of:

+ + + + + + + + + + + + + + + + + + + + + + + + + +
Name | Option | Description
User Name | username | The actual login username
Password | password | The password for the user. Needs to be set to the bcrypt string of the actual password if encoding is set to CRYPT in the parent section.
User Role | role | The role of the user in the cluster. Can be EXTERNAL_READ_WRITE for users who have both read and write permissions, or EXTERNAL_READ_ONLY for users with read-only permissions.
+

Sample +

userAuth:
+  enabled: true
+  encoding: CRYPT
+  users:
+    - username: admin
+      password: "$2y$10$pfGnPkYrJEGzasvVNPjRu.IJldV9TDa0Vh.u1UdimILWDuhvapc2O"
+      role: EXTERNAL_READ_WRITE
+    - username: guest
+      password: "$2y$10$uCJ7WxIvd13C.1oOTs28p.xpJShGiTWuDLY/sGH9JE8nrkSGBFkc6"
+      role: EXTERNAL_READ_ONLY
+    - username: noread
+      password: "$2y$10$8mr/zXL5rMW/s/jlBcgXHu0UvyzfdDDvyc.etfuoR.991sn9UOX/K"
+

+
+

No authentication

+

To configure a cluster without authentication, remove this section entirely.

+
+
+

Operator role

+

If role is not set, the user will be able to access the UI, but will not have access to application logs. This comes in handy for giving other teams access to explore your deployment topology without exposing logs that might contain sensitive information.

+
+
+

Password Hashing

+

We strongly recommend using bcrypt passwords for authentication. You can use the following command to generate hashed password strings:

+
htpasswd -nbBC 10 <username> <password>|cut -d ':' -f2
+
+
+
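For example, generating the hash for an admin user might look like the following. The password here is made up, and the output shown is a placeholder: bcrypt output is salted and differs on every run.

htpasswd -nbBC 10 admin s3cretPassw0rd | cut -d ':' -f2
# => $2y$10$<salted-hash>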

Instance Authentication Configuration

+

All application and task instances get access to a unique JWT that is injected into them by Drove as the environment variable DROVE_APP_INSTANCE_AUTH_TOKEN. This token is signed using a secret. This secret can be configured by setting the secret parameter in the instanceAuth section.

+

Sample +

instanceAuth:
+  secret: RandomSecret
+
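Inside a running container, the token is simply read from the environment. The sketch below is illustrative only: which API an instance calls, and the exact authorization scheme Drove expects for this token, are assumptions here and should be verified against the API documentation.

# Hypothetical: use the injected token to call an API on the Drove leader
curl -H "Authorization: Bearer ${DROVE_APP_INSTANCE_AUTH_TOKEN}" \
    http://drove.local:7000/apis/v1/ping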

+

Controller Options

+

The following options can be set to influence the behavior of the Drove cluster and the controller.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Name | Option | Description
Stale Check Interval | staleCheckInterval | Interval at which Drove checks for stale application and task metadata for cleanup. Defaults to 1 hour. Expressed in duration.
Stale App Age | staleAppAge | Apps in MONITORING state are cleaned up after some time by Drove. This variable can be used to control the max time for which such apps are maintained in the cluster. Defaults to 7 days. Expressed in duration.
Stale App Instances Count | maxStaleInstancesCount | Maximum amount of metadata for stopped or lost application instances to be maintained in the cluster. Defaults to 100.
Stale Instance Age | staleInstanceAge | Maximum age for a stale application instance to be retained. Defaults to 7 days. Expressed in duration.
Stale Task Age | staleTaskAge | Maximum time for which metadata for a finished task is retained on the cluster. Defaults to 2 days. Expressed in duration.
Event Storage Duration | maxEventsStorageDuration | Maximum time for which cluster events are retained on the cluster. Defaults to 1 hour. Expressed in duration.
Default Operation Timeout | clusterOpTimeout | Timeout for operations initiated by Drove itself, for example instance spin-up in case of executor failure, instance migrations etc. Defaults to 5 minutes. Expressed in duration.
Operation threads | clusterOpParallelism | Signifies the parallelism for operations internal to the cluster. Defaults to 1. Range: 1-32.
Audited Methods | auditedHttpMethods | Drove prints an audit log with user details when an API is called by a user. Defaults to ["POST", "PUT"].
Allowed mount directories | allowedMountDirs | If provided, Drove will ensure that application and task specs can mount only the directories mentioned in this set on the executor host.
Disable read-only auth | disableReadAuth | When userAuth is enabled, setting this option will enforce authorization only on write operations.
Disable command line arguments | disableCmdlArgs | When set to true, passing command line arguments will be disabled. Default: false (users can pass arguments).
+

Sample +

options:
+  staleCheckInterval: 5m
+  staleAppAge: 2d
+  maxStaleInstancesCount: 20
+  staleInstanceAge: 1d
+  staleTaskAge: 2d
+  maxEventsStorageDuration: 30m
+  clusterOpParallelism: 32
+  allowedMountDirs:
+   - /mnt/scratch
+

+

Stale data cleanup

+

In order to keep the internal memory footprint low, reduce the amount of data stored on Zookeeper, and provide a faster experience on the UI, Drove keeps cleaning up data for stale applications, application instances, task instances and cluster events.

+

The retention for such metadata can be controlled using the following config options:

+
    +
  • staleAppAge
  • +
  • maxStaleInstancesCount
  • +
  • staleInstanceAge
  • +
  • staleTaskAge
  • +
  • maxEventsStorageDuration
  • +
+
+

Warning

+

Configuration changes done to these parameters will have direct impact on memory usage by the controller and memory and disk utilization on the Zookeeper cluster.

+
+

Internal Operations

+

Drove may need to create and issue operations on applications and tasks to manage cluster stability, for maintenance and other reasons. The following parameters can be used to control the speed and parallelism of such operations:

+
    +
  • clusterOpTimeout
  • +
  • clusterOpParallelism
  • +
+
+

Tip

+

The default value of 1 for the clusterOpParallelism parameter is generally too low for most clusters. Unless there is a specific problem, it is advisable to set this to at least 4. If the number of instances per application is quite high (in the order of tens or hundreds), feel free to set this to 32.

+
+

Increasing clusterOpParallelism will make recovery faster in case of executor failures, but it will slightly increase CPU utilization on the controller.

+
+
+ +

The auditedHttpMethods parameter contains a list of all HTTP methods that need to be audited. This means that if auditedHttpMethods contains POST and PUT, any Drove HTTP POST or PUT API being called will lead to an audit entry in the controller logs with the details of the user that made the call.

+
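For example, to audit deletes as well, the option could be extended like this (a sketch; POST and PUT are the documented defaults):

options:
  auditedHttpMethods: ["POST", "PUT", "DELETE"]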
+

Warning

+

It is advisable not to add GET to the list. This is because the UI keeps making calls to GET APIs on Drove to fetch data to render. These calls are automated and happen every few seconds from the browser, and auditing them would blow up the controller log size.

+
+

The allowedMountDirs option whitelists only some directories to be mounted on containers. If this is not provided, containers will be able to mount any directory on the executors.

+
+

Danger

+

It is highly recommended to set allowedMountDirs to a designated directory that containers might want to use as scratch space if needed. Keeping this empty will almost definitely cause security issues in the long run.

+
+

Relevant directories

+

Locations for data and logs are as follows:

+
    +
  • /etc/drove/controller/ - Configuration files
  • +
  • /var/log/drove/controller/ - Logs
  • +
+

We shall be volume mounting the config and log directories with the same name.

+
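These directories can be created up front on each controller host. A sketch; adjust the ownership to whatever user the service will run as (for example, the drove user referenced in the systemd unit below):

mkdir -p /etc/drove/controller /var/log/drove/controller
chown -R drove: /var/log/drove/controller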
+

Prerequisite Setup

+

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

+
+

Setup the config file

+

Create a relevant configuration file in /etc/drove/controller/controller.yml.

+

Sample +

server:
+  applicationConnectors:
+    - type: http
+      port: 10000
+  adminConnectors:
+    - type: http
+      port: 10001
+  requestLog:
+    appenders:
+      - type: file
+        timeZone: IST
+        currentLogFilename: /var/log/drove/controller/drove-controller-access.log
+        archivedLogFilenamePattern: /var/log/drove/controller/drove-controller-access.log-%d-%i
+        archivedFileCount: 3
+        maxFileSize: 100MiB
+
+logging:
+  level: INFO
+  loggers:
+    com.phonepe.drove: INFO
+
+
+  appenders:
+    - type: file
+      threshold: ALL
+      timeZone: IST
+      currentLogFilename: /var/log/drove/controller/drove-controller.log
+      archivedLogFilenamePattern: /var/log/drove/controller/drove-controller.log-%d-%i
+      archivedFileCount: 3
+      maxFileSize: 100MiB
+      logFormat: "%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n"
+
+zookeeper:
+  connectionString: "192.168.56.10:2181"
+
+clusterAuth:
+  secrets:
+  - nodeType: CONTROLLER
+    secret: "0v8XvJrDc7r86ZY1QCByPTDPninI4Xii"
+  - nodeType: EXECUTOR
+    secret: "pOd9sIEXhv0wrGOVc7ebwNvR7twZqyTN"
+
+userAuth:
+  enabled: true
+  encoding: CRYPT
+  users:
+    - username: admin
+      password: "$2y$10$pfGnPkYrJEGzasvVNPjRu.IJldV9TDa0Vh.u1UdimILWDuhvapc2O"
+      role: EXTERNAL_READ_WRITE
+    - username: guest
+      password: "$2y$10$uCJ7WxIvd13C.1oOTs28p.xpJShGiTWuDLY/sGH9JE8nrkSGBFkc6"
+      role: EXTERNAL_READ_ONLY
+
+
+instanceAuth:
+  secret: "bd2SIgz9OMPG2L8wA6zxj21oLVLbuLFC"
+
+options:
+  maxStaleInstancesCount: 3
+  staleCheckInterval: 1m
+  staleAppAge: 2d
+  staleInstanceAge: 1d
+  staleTaskAge: 1d
+  clusterOpParallelism: 4
+  allowedMountDirs:
+   - /dev/null
+

+

Setup required environment variables

+

Environment variables needed to run the Drove controller are set up in /etc/drove/controller/controller.env.

+
CONFIG_FILE_PATH=/etc/drove/controller/controller.yml
+JAVA_PROCESS_MIN_HEAP=2g
+JAVA_PROCESS_MAX_HEAP=2g
+ZK_CONNECTION_STRING="192.168.3.10:2181"
+JAVA_OPTS="-Xlog:gc:/var/log/drove/controller/gc.log -Xlog:gc:::filecount=3,filesize=10M -Xlog:gc::time,level,tags -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -Dfile.encoding=utf-8 -Djute.maxbuffer=0x9fffff"
+
+

Create systemd file

+

Create a systemd file. Put the following in /etc/systemd/system/drove.controller.service:

+
[Unit]
+Description=Drove Controller Service
+After=docker.service
+Requires=docker.service
+
+[Service]
+User=drove
+TimeoutStartSec=0
+Restart=always
+ExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-controller:latest
+ExecStart=/usr/bin/docker run  \
+    --env-file /etc/drove/controller/controller.env \
+    --volume /etc/drove/controller:/etc/drove/controller:ro \
+    --volume /var/log/drove/controller:/var/log/drove/controller \
+    --publish 10000:10000  \
+    --publish 10001:10001 \
+    --hostname %H \
+    --rm \
+    --name drove.controller \
+    ghcr.io/phonepe/drove-controller:latest
+
+[Install]
+WantedBy=multi-user.target
+
+

Verify the file with the following command: +

systemd-analyze verify drove.controller.service
+

+

Set permissions +

chmod 664 /etc/systemd/system/drove.controller.service
+

+

Start the service on all servers

+

Use the following to start the service:

+
systemctl daemon-reload
+systemctl enable drove.controller
+systemctl start drove.controller
+
+

You can tail the logs at /var/log/drove/controller/drove-controller.log.

+

The console would be available at http://<ip>:10000 and admin functionality will be available on http://<ip>:10001 according to the above config.

+

Health checks can be performed by running a curl as follows:

+
curl http://localhost:10001/healthcheck
+
+
+

Note

+
    +
  • The healthcheck API is available on the admin port.
  • +
  • HTTP status is 200/OK if things are fine.
  • +
+
+

Once controllers are up, one of them will become the leader. You can check the leader by running the following command:

+
curl http://<ip>:10000/apis/v1/ping
+
+

You should get the following response, along with HTTP status 200/OK, only on the leader: +

{
+    "status":"SUCCESS",
+    "data":"pong",
+    "message":"success"
+}
+

diff --git a/cluster/setup/executor-setup.html b/cluster/setup/executor-setup.html
new file mode 100644

Setting up Executor Nodes

+

We shall set up the executor nodes by preparing the hardware and operating system first, and then setting up the executor service itself.

+

Considerations and tuning for hardware and operating system

+

In the following sections we discuss some aspects of scheduling, hardware and OS settings to ensure good performance.

+

CPU and Memory considerations

+

The executor nodes are the servers that host and run the actual Docker containers. Drove takes the NUMA topology of these machines into consideration to optimize container placement and extract maximum performance. Along with this, Drove will cpuset the containers to the allocated cores in a non-overlapping manner, so that the cores allocated to a container are dedicated to it. Memory allocated to a container is pinned as well, and is selected from the same NUMA node.

+

Needless to say, the minimum amount of CPU that can be given to an application or task is 1. Fractional CPU allocation can be achieved in a predictable manner by configuring over provisioning on executor nodes.

+

Over Provisioning of CPU and Memory

+

Drove does not do any kind of burst scaling or overcommitment, to ensure that application performance remains predictable even under load. Instead, Drove has a feature to make executors appear to have more cores (and memory) than they actually have. This can be used to get more utilization out of executor nodes in clusters that do not need guaranteed performance (for example, staging or dev testing clusters). This is achieved by enabling over provisioning.

+

Over provisioning needs to be configured in the executor configuration. It primarily consists of two configs:

+
    +
  • CPU Multiplier - an integral multiplier which will be used to multiply the number of available cores
  • +
  • Memory Multiplier - an integral multiplier which will be used to multiply available memory
  • +
+

VCores (virtual cores) are the internal representation of a CPU core on the executor. If over provisioning is disabled, a vcore corresponds to a physical core. If over provisioning is enabled, one physical CPU core will generate cpu multiplier number of vcores. Drove still does cpuset on containers running on nodes with over provisioning enabled; however, the physical cores that the containers get bound to are chosen at random, albeit from the same NUMA node. cpuset-mem is always done on the same NUMA node as well.

+
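A sketch of what this could look like in the executor's resources section is shown below. The key names used here (overProvisioning, cpuMultiplier, memoryMultiplier) are illustrative assumptions; please verify them against the executor options reference:

resources:
  overProvisioning:
    enabled: true        # hypothetical keys, for illustration only
    cpuMultiplier: 4     # each physical core is exposed as 4 vcores
    memoryMultiplier: 2  # exposed memory is double the physical memory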
+

Mixed clusters

+

In some production clusters you might have applications that are non-critical in terms of performance and are unable to utilize a full core. These can be tagged to be spun up on some nodes where over provisioning is enabled. Adopting such a cluster topology will ensure that containers that need high performance run on nodes without over provisioning, while the smaller apps (for example, operations consoles) run on separate nodes with over provisioning enabled. Just ensure the latter are tagged properly, and specify this tag in the application spec or task spec during deployment.

+
+

Disable NUMA Pinning

+

There is an option to disable memory and core pinning. In this situation, all cores from all NUMA nodes show up as being part of one node. cpuset-mems is not called if NUMA pinning is disabled, and therefore you will be leaving some memory performance on the table. We recommend not dabbling with this unless you have tasks and containers that need more than the number of cores available on a single NUMA node. This setting is enabled at the executor level by setting disableNUMAPinning: true.

+

Hyper-threading

+

Whether Hyper-Threading needs to be enabled is somewhat dependent on the applications deployed and how effectively they can utilize individual CPU cores. For mixed workloads, we recommend enabling Hyper-Threading on the executor nodes.

+

Isolating container and OS processes

+

Typically we would not want containers to share CPU resources with processes of the operating system, the Drove executor service, the Docker engine (if using Docker) and so on. While complete isolation would need creating a full scheduling setup (and passing isolcpus to the GRUB parameters), we can get a good middle ground by ensuring such processes utilize only a few CPU cores on the system, and letting the Drove executors deploy and pin containers to the rest.

+

This is achieved in two steps:

+
    +
  • Make changes to systemd to use only specific cores
  • +
  • Exclude these cores in the Drove executor configuration (a configuration sketch appears at the end of this section)
  • +
+

Let's say our server has 2 NUMA nodes, each with 40 hyper-threaded cores. We want to reserve the first 2 cores of each CPU for the OS processes. So we reserve cores [0,1,2,3] for the OS processes.

+

The following line in /etc/systemd/system.conf

+
#CPUAffinity=
+
+

needs to be changed to

+
CPUAffinity=0 1 2 3
+
+
+

Tip

+

Reboot the machine for this to take effect.

+
+

The changes can be validated post reboot by running the following command:

+
grep Cpus_allowed_list /proc/1/status
+
+

The expected output should be: +

Cpus_allowed_list:  0-3
+

+
+

Note

+

Refer to this for more details.

+
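The second step is then to exclude the same cores in the Drove executor configuration, using the osCores option described in the resource configuration section below. Matching the [0,1,2,3] reservation above:

resources:
  osCores: [ 0, 1, 2, 3 ]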
+

GPU Computation

+

Nvidia based GPU compute can be enabled at the executor level by installing the relevant drivers. Please follow the setup guide to enable this. Remember to tag these nodes to isolate them from the primary cluster, and use tags to deploy the apps and tasks that need GPUs.

+

Storage consideration

+

On executor nodes the disk might be under pressure if container (re)deployments are frequent or if the containers log very heavily. As such, we recommend that the logging directory for Drove be mounted on hardware that can handle this load. Similar consideration needs to be given to the log and package directories for Docker or Podman.

+

Executor Configuration Reference

+

The Drove executor is built on the Dropwizard framework. The configuration for the service is set using a YAML file which needs to be injected into the container. A typical executor configuration file looks like the following:

+
server: #(1)!
+  applicationConnectors: #(2)!
+    - type: http
+      port: 3000
+  adminConnectors: #(3)!
+    - type: http
+      port: 3001
+  applicationContextPath: /
+  requestLog:
+    appenders:
+      - type: console
+        timeZone: ${DROVE_TIMEZONE}
+      - type: file
+        timeZone: ${DROVE_TIMEZONE}
+        currentLogFilename: /logs/drove-executor-access.log
+        archivedLogFilenamePattern: /logs/drove-executor-access.log-%d-%i
+        archivedFileCount: 3
+        maxFileSize: 100MiB
+
+logging:
+  level: INFO
+  loggers:
+    com.phonepe.drove: ${DROVE_LOG_LEVEL}
+
+  appenders: #(4)!
+    - type: console #(5)!
+      threshold: ALL
+      timeZone: ${DROVE_TIMEZONE}
+      logFormat: "%(%-5level) [%date] [%logger{0} - %X{instanceLogId}] %message%n"
+    - type: file #(6)!
+      threshold: ALL
+      timeZone: ${DROVE_TIMEZONE}
+      currentLogFilename: /logs/drove-executor.log
+      archivedLogFilenamePattern: /logs/drove-executor.log-%d-%i
+      archivedFileCount: 3
+      maxFileSize: 100MiB
+      logFormat: "%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n"
+      archive: true
+
+    - type: drove #(7)!
+      logPath: "/logs/applogs/"
+      archivedLogFileSuffix: "%d"
+      archivedFileCount: 3
+      threshold: TRACE
+      timeZone: ${DROVE_TIMEZONE}
+      logFormat: "%(%-5level) | %-23date | %-30logger{0} | %message%n"
+      archive: true
+
+zookeeper: #(8)!
+  connectionString: ${ZK_CONNECTION_STRING}
+
+clusterAuth: #(9)!
+  secrets:
+  - nodeType: CONTROLLER
+    secret: ${DROVE_CONTROLLER_SECRET}
+  - nodeType: EXECUTOR
+    secret: ${DROVE_EXECUTOR_SECRET}
+
+resources: #(10)!
+  osCores: [ 0, 1 ]
+  exposedMemPercentage: 60
+  disableNUMAPinning: ${DROVE_DISABLE_NUMA_PINNING}
+  enableNvidiaGpu: ${DROVE_ENABLE_NVIDIA_GPU}
+
+options: #(11)!
+  cacheImages: true
+  maxOpenFiles: 10_000
+  logBufferSize: 5m
+  cacheFileSize: 10m
+  cacheFileCount: 3
+
+
    +
  1. Server listener configuration. See Dropwizard Server Configuration for the different options.
  2. +
  3. Main port configuration. This is where the UI and APIs will be exposed. Check connector configuration docs for details.
  4. +
  5. Admin port. You can take thread dumps, metrics, run healthchecks on the Drove executor on this port.
  6. +
  7. Logging configuration. See logging docs.
  8. +
  9. Log to console. Useful in docker-compose.
  10. +
  11. Log to rotating files. Useful for running servers.
  12. +
  13. Drove application logger configuration. See drove logger config for details.
  14. +
  15. Configure how to connect to Zookeeper See Zookeeper Config for details.
  16. +
  17. Configuration for authentication between nodes in the cluster. Please check intra node auth config for details.
  18. +
  19. Resource configuration for this node.
  20. +
  21. Options to configure executor behaviour. Check executor options section for details.
  22. +
+
+

Tip

+

In case you do not want to expose the admin APIs outside the host, please set bindHost in the admin connectors section.

+
adminConnectors:
+  - type: http
+    port: 10001
+    bindHost: 127.0.0.1
+
+
+

Zookeeper Connection Configuration

+

The following details can be configured.

Name              | Option           | Description
Connection String | connectionString | The connection string of the form: zkserver:2181,zkserver2:2181...
Data namespace    | namespace        | The top level node inside which all Drove data will be scoped. Defaults to drove if not set.
+

Sample

+
zookeeper:
+  connectionString: "192.168.3.10:2181,192.168.3.11:2181,192.168.3.12:2181"
+  namespace: drovetest
+
+
+

Note

+

This section is the same across the cluster, for both controllers and executors.

+
+

Intra Node Authentication Configuration

+

Communication between controllers and executors is protected by shared-secret based authentication, which is configured in this section. The section consists of a list of two members:

+
  • Config for the controller to talk to executors
  • Config for executors to talk to the controller
+

Each section consists of the following:

Name      | Option   | Description
Node Type | nodeType | Type of node in the cluster. Can be CONTROLLER or EXECUTOR.
Secret    | secret   | The actual secret to be passed.
+

Sample +

clusterAuth:
+  secrets:
+  - nodeType: CONTROLLER
+    secret: ControllerSecretValue
+  - nodeType: EXECUTOR
+    secret: ExecutorSecret
+

+
+

Note

+

This section is the same across the cluster, for both controllers and executors.

+
+

Drove Application Logger Configuration

+

Drove segregates application and task instance logs into a directory of your choice. The paths for such files are:
  • <application id>/<instance id> for application instances
  • <sourceAppName>/<task id> for task instances

+
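
For example, with logPath set to /logs/applogs/ (as in the sample later in this section), logs for an application instance and a task instance would land under directories like these (the app names shown are illustrative):

/logs/applogs/MY_APP-1/<instance id>/
+/logs/applogs/MY_TASK_SOURCE_APP/<task id>/
+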

The Drove log appender is based on Logback's Sifting Appender.

+

The following configuration options are supported:

Name                 | Option                | Description
Path                 | logPath               | Directory to host the logs.
Archive old logs     | archive               | Whether to enable log rotation.
Archived File Suffix | archivedLogFileSuffix | Suffix for archived log files.
Archived File Count  | archivedFileCount     | Count of archived log files. Older files are deleted.
File Size            | maxFileSize           | Size of the current log file after which it is archived and a new file is created. Unit: DataSize.
Total Size           | totalSizeCap          | Total size after which deletion takes place. Unit: DataSize.
Buffer Size          | bufferSize            | Buffer size for the logger (8KB by default). Used if immediateFlush is turned off.
Immediate Flush      | immediateFlush        | Flush logs immediately. Set to true by default (recommended).
+

Sample +

logging:
+  level: INFO
+  ...
+
+  appenders:
+    # Setup appenders for the executor process itself first
+    ...
+
+    - type: drove
+      logPath: "/logs/applogs/"
+      archivedLogFileSuffix: "%d"
+      archivedFileCount: 3
+      threshold: TRACE
+      timeZone: ${DROVE_TIMEZONE}
+      logFormat: "%(%-5level) | %-23date | %-30logger{0} | %message%n"
+      archive: true
+

+

Resource Configuration

+

This section configures how resources on an executor are exposed to the cluster. A few of the considerations that drive this configuration have been discussed before.

Name              | Option               | Description
OS Cores          | osCores              | A list of cores reserved for use by operating system processes. See the relevant section for details on the pre-steps needed to achieve this.
Exposed Memory    | exposedMemPercentage | What percentage of the system memory can be used collectively by the containers running on the host. Range: 50-100 (integer).
NUMA Pinning      | disableNUMAPinning   | Disable NUMA and CPU core pinning for containers. Pinning is on by default. (default: false)
Nvidia GPU        | enableNvidiaGpu      | Enable GPU support for containers. This makes all Nvidia GPUs on the executor machine available to any container running on it; individual GPU resources are not discovered, managed or rationed between containers. Use in conjunction with tagging (see tags below) to ensure only applications that need a GPU land on GPU-equipped executors.
Tags              | tags                 | A set of strings that can be used with the TAG placement policy to route application and task instances to this executor.
Over Provisioning | overProvisioning     | Set up over provisioning configuration.
+
+

Tagging

+

The current hostname is always added as a tag by default and is handled specially, so that non-tagged deployments can still be routed to this executor. If any tag is specified in the tags config, this node will receive containers only when MATCH_TAG placement is used. Please check the relevant sections to specify correct placement policies for applications and tasks.

+
+

Sample +

resources:
+  osCores: [0,1,2,3]
+  exposedMemPercentage: 90
+

+
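
If this executor should additionally be reserved for tagged workloads, the same section can carry tags (the tag value below is illustrative):

resources:
+  osCores: [0,1,2,3]
+  exposedMemPercentage: 90
+  tags:
+    - gpu_machine
+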

Over provisioning configuration

+

Drove strives to ensure that containers can run unencumbered on the CPU cores allocated to them. The minimum allocation unit is therefore one full core; fractional CPUs are not supported.

+

However, there are situations where we would want some non-critical applications to run on the cluster without wasting CPU. The overProvisioning configuration provides a way to turn off NUMA pinning on the executor and run more containers than it normally would.

+

To ensure predictability, we do not want pinned and non-pinned containers running on the same host. Hence, an executor host can either be running in pinned mode or in non-pinned mode.

+

To deploy more containers than usual, while still retaining some control over how small a container can go, multipliers are specified for CPU and memory.

+

Example:
  • Let's say your executor server has 40 cores available. If you set cpuMultiplier to 4, this node will show up to the controller as having 160 cores.
  • Let's say your server has 512GB of memory. Setting memoryMultiplier to 2 will make Drove see it as 1TB.

Name              | Option           | Description
Enabled           | enabled          | Set this to true to enable over provisioning. Default: false
CPU Multiplier    | cpuMultiplier    | Multiplier applied for CPU over provisioning. Default: 1. Range: 1-20
Memory Multiplier | memoryMultiplier | Multiplier applied for memory over provisioning. Default: 1. Range: 1-20
+

Sample +

resources:
+  exposedMemPercentage: 90
+  overProvisioning:
+    enabled: true
+    memoryMultiplier: 1
+    cpuMultiplier: 3
+

+
+

Tip

+

This feature was developed to let us run our development environments more cheaply. In such environments there is not much pressure on CPU or memory, but a large number of containers run, as developers spin up containers for the features they are working on. There was no point in wasting a full core on containers that get hit twice a minute or less. On production, as of the time of writing, we err on the side of caution and allocate at least one core even to the most trivial applications.

+
+

Executor Options

+

The following options can be set to influence the behaviour of the Drove executors.

Name                  | Option                  | Description
Hostname              | hostname                | Override the hostname that gets exposed to the controller. Make sure this is resolvable.
Cache Images          | cacheImages             | Cache container images. If this is not set, a container image is removed when a container dies and no other instance is using the image.
Command Timeout       | containerCommandTimeout | Timeout used by the container engine client when issuing container commands to Docker or Podman.
Container Socket Path | dockerSocketPath        | Path to the container engine socket. Comes in handy to configure the socket path when using podman etc.
Max Open Files        | maxOpenFiles            | Override the maximum number of file descriptors a container can open. Default: 470,000
Log Buffer Size       | logBufferSize           | The size of the buffer the executor uses to read logs from containers. Unit: DataSize. Range: 1-128MB. Default: 10MB
Cache File Size       | cacheFileSize           | To limit disk usage, configure a fixed size for container log cache files. Unit: DataSize. Range: 10MB-100GB. Default: 20MB. Compression is always enabled.
Cache File Count      | cacheFileCount          | To limit disk usage, configure a fixed count of container log cache files. Unit: integer. Max: 1024. Default: 3
+

Sample +

options:
+  logBufferSize: 20m
+  cacheFileSize: 30m
+  cacheFileCount: 3
+  cacheImages: true
+

+
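
If you need to override the advertised hostname or point the executor at a different engine socket (for example podman), the same section might look like the following sketch; the hostname is illustrative, and the socket path shown is podman's typical rootful location:

options:
+  hostname: executor1.internal.mydomain
+  dockerSocketPath: /run/podman/podman.sock
+  cacheImages: true
+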

Relevant directories

+

Locations for data and logs are as follows:

+
  • /etc/drove/executor/ - Configuration files
  • /var/log/drove/executor/ - Executor logs
  • /var/log/drove/executor/instance-logs - Application/task instance logs
+

We shall be volume mounting the config and log directories with the same name.

+
+

Prerequisite Setup

+

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

+
+

Setup the config file

+

Create a relevant configuration file in /etc/drove/executor/executor.yml.

+

Sample +

server:
+  applicationConnectors:
+    - type: http
+      port: 11000
+  adminConnectors:
+    - type: http
+      port: 11001
+  requestLog:
+    appenders:
+      - type: file
+        timeZone: IST
+        currentLogFilename: /var/log/drove/executor/drove-executor-access.log
+        archivedLogFilenamePattern: /var/log/drove/executor/drove-executor-access.log-%d-%i
+        archivedFileCount: 3
+        maxFileSize: 100MiB
+
+logging:
+  level: INFO
+  loggers:
+    com.phonepe.drove: INFO
+
+
+  appenders:
+    - type: file
+      threshold: ALL
+      timeZone: IST
+      currentLogFilename: /var/log/drove/executor/drove-executor.log
+      archivedLogFilenamePattern: /var/log/drove/executor/drove-executor.log-%d-%i
+      archivedFileCount: 3
+      maxFileSize: 100MiB
+      logFormat: "%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n"
+    - type: drove
+      logPath: "/var/log/drove/executor/instance-logs"
+      archivedLogFileSuffix: "%d-%i"
+      archivedFileCount: 0
+      maxFileSize: 1GiB
+      threshold: INFO
+      timeZone: IST
+      logFormat: "%(%-5level) | %-23date | %-30logger{0} | %message%n"
+      archive: true
+
+zookeeper:
+  connectionString: "192.168.56.10:2181"
+
+clusterAuth:
+  secrets:
+  - nodeType: CONTROLLER
+    secret: "0v8XvJrDc7r86ZY1QCByPTDPninI4Xii"
+  - nodeType: EXECUTOR
+    secret: "pOd9sIEXhv0wrGOVc7ebwNvR7twZqyTN"
+
+resources:
+  osCores: []
+  exposedMemPercentage: 90
+  disableNUMAPinning: true
+  overProvisioning:
+    enabled: true
+    memoryMultiplier: 10
+    cpuMultiplier: 10
+
+options:
+  cacheImages: true
+  logBufferSize: 20m
+  cacheFileSize: 30m
+  cacheFileCount: 3
+

+

Setup required environment variables

+

Environment variables needed to run the Drove executor are set up in /etc/drove/executor/executor.env.

+
CONFIG_FILE_PATH=/etc/drove/executor/executor.yml
+JAVA_PROCESS_MIN_HEAP=1g
+JAVA_PROCESS_MAX_HEAP=1g
+ZK_CONNECTION_STRING="192.168.56.10:2181"
+JAVA_OPTS="-Xlog:gc:/var/log/drove/executor/gc.log -Xlog:gc:::filecount=3,filesize=10M -Xlog:gc::time,level,tags -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -Dfile.encoding=utf-8 -Djute.maxbuffer=0x9fffff"
+
+

Create systemd file

+

Create a systemd file. Put the following in /etc/systemd/system/drove.executor.service:

+

[Unit]
+Description=Drove Executor Service
+After=docker.service
+Requires=docker.service
+
+[Service]
+User=drove
+TimeoutStartSec=0
+Restart=always
+ExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-executor:latest
+ExecStart=/usr/bin/docker run  \
+    --env-file /etc/drove/executor/executor.env \
+    --volume /etc/drove/executor:/etc/drove/executor:ro \
+    --volume /var/log/drove/executor:/var/log/drove/executor \
+    --volume /var/run/docker.sock:/var/run/docker.sock \
+    --publish 11000:11000  \
+    --publish 11001:11001 \
+    --hostname %H \
+    --rm \
+    --name drove.executor \
+    ghcr.io/phonepe/drove-executor:latest
+
+[Install]
+WantedBy=multi-user.target
+
Verify the file with the following command:
systemd-analyze verify drove.executor.service
+

+

Set permissions +

chmod 664 /etc/systemd/system/drove.executor.service
+

+

Start the service on all servers

+

Use the following to start the service:

+
systemctl daemon-reload
+systemctl enable drove.executor
+systemctl start drove.executor
+
+

You can tail the logs at /var/log/drove/executor/drove-executor.log.

+

The executor should now show up on the Drove Console.


Setting up Drove Gateway

+

The Drove Gateway exposes apps running on a Drove cluster to the rest of the world.

+

The Drove Gateway container uses NGinx and a modified version of Nixy to track Drove endpoints. More details about this can be found in the drove-gateway project.

+

Drove Gateway Nixy Configuration Reference

+

The nixy running inside the gateway container is configured using a custom TOML file. This section describes this file:

+
address = "127.0.0.1"# (1)!
+port = "6000"
+
+
+# Drove Options
+drove = [#(2)!
+  "http://controller1.mydomain:10000",
+   "http://controller1.mydomain:10000"
+   ]
+
+leader_vhost = "drove-staging.mydomain"#(3)!
+event_refresh_interval_sec = 5#(5)!
+user = ""#(6)!
+pass = ""
+access_token = ""#(7)!
+
+# Parameters to control which apps are exposed as VHost
+routing_tag = "externally_exposed"#(4)!
+realm = "api.mydomain,support.mydomain"#(8)!
+realm_suffix = "-external.mydomain"#(9)!
+
+# Nginx related config
+
+nginx_config = "/etc/nginx/nginx.conf"#(10)!
+nginx_template = "/etc/drove/gateway/nginx.tmpl"#(11)!
+nginx_cmd = "nginx"#(12)!
+nginx_ignore_check = true#(13)!
+
+# NGinx plus specific options
+nginxplusapiaddr="127.0.0.1"#(14)!
+nginx_reload_disabled=true#(15)!
+maxfailsupstream = 0#(16)!
+failtimeoutupstream = "1s"
+slowstartupstream = "0s"
+
+
  1. Nixy listener configuration. Endpoint for nixy itself.
  2. List of Drove controllers. Add all controller nodes here. Nixy will automatically determine and track the current leader. Auto detection is disabled if a single endpoint is specified.
  3. Helps create a vhost entry that tracks the leader on the cluster. Use this to expose the Drove endpoint to users. The value will be available to the template engine as the LeaderVHost variable.
  4. If some special routing behaviour needs to be implemented in the template based on tag metadata of the deployed apps, set the routing_tag option to the tag name to be used. The actual value is derived from app instances and exposed to the template engine as the RoutingTag variable. Optional. In this example, the RoutingTag variable will be set to the value of the routing_tag tag key specified when deploying the Drove application. For example, to expose an app we can set it to yes, and filter the vhosts exposed in the NGinx template with RoutingTag == "yes" (see the template sketch after this list).
  5. Drove Gateway/Nixy works by polling the controller for events; this is the polling interval. Consider increasing it if the number of NGinx nodes is high. Default is 2 seconds. Unless the cluster is really busy with a high rate of container churn, this strikes a good balance between apps becoming discoverable quickly and putting the leader controller under heavy load.
  6. user and pass are optional params that can be used to set basic auth credentials on the calls made to Drove controllers, if basic auth is enabled on the cluster. Leave empty if no basic auth is required.
  7. If the cluster uses some custom header-based auth, this can be used. The contents of this parameter are passed verbatim in the Authorization HTTP header. Leave empty if no token auth is enabled on the cluster.
  8. By default drove-gateway will expose all vhosts declared in the specs of all Drove apps on a cluster (caveat: filtering can be done using RoutingTag as well). If only specific vhosts need to be exposed, set the realm parameter to a comma separated list of realms. Optional.
  9. Besides exact vhost matching, Drove Gateway supports suffix based matches as well. A single suffix is supported. Optional.
  10. Path to the NGinx config.
  11. Path to the template file, based on which the NGinx config will be generated.
  12. NGinx command to use to reload the config. Optionally set this to openresty to use openresty.
  13. Skip invoking the NGinx command to test the config. Set this to false or delete this line on production. Default: false.
  14. If using NGinx plus, set the endpoint of the local server here. If left empty, NGinx plus API based vhost updates are disabled.
  15. If specific vhosts are exposed, auto-discovery and config updates (and the resulting NGinx reloads) might not be desired as they cause connection drops. Set this parameter to true to disable reloads; Nixy will then only update upstreams using the nplus APIs. Default: false.
  16. Connection parameters for NGinx plus.
+
+
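
As an illustration of the RoutingTag filtering described in callout 4 above, a fragment of the NGinx template could guard vhost generation as in the following sketch (this assumes the RoutingTag value is exposed on each app entry; yes is just the example value used above):

{{- range $id, $app := .Apps}}
+{{- if eq $app.RoutingTag "yes"}}
+    # upstream and server blocks for {{$app.Vhost}} go here
+{{- end}}
+{{- end}}
+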

NGinx plus

+

NGinx plus is not shipped with this Docker image. If you want to use NGinx plus, please build nixy from the source tree here and build your own container.

+
+

Relevant directories

+

Locations for data and logs are as follows:

+
  • /etc/drove/gateway/ - Configuration files
  • /var/log/drove/gateway/ - NGinx logs
+

We shall be volume mounting the config and log directories with the same name.

+
+

Prerequisite Setup

+

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

+
+

Go through the following steps to run drove-gateway as a service.

+

Create the TOML config for Nixy

+

Sample config file /etc/drove/gateway/gateway.toml:

+
address = "127.0.0.1"
+port = "6000"
+
+
+# Drove Options
+drove = [
+  "http://controller1.mydomain:10000",
+   "http://controller1.mydomain:10000"
+   ]
+
+leader_vhost = "drove-staging.mydomain"
+event_refresh_interval_sec = 5
+user = "guest"
+pass = "guest"
+
+
+# Nginx related config
+nginx_config = "/etc/nginx/nginx.conf"
+nginx_template = "/etc/drove/gateway/nginx.tmpl"
+nginx_cmd = "nginx"
+nginx_ignore_check = true
+
+
+

Replace domain names

+

Please remember to update mydomain to a valid domain you want to use.

+
+

Create template for NGinx

+

Create a NGinx template with the following config in /etc/drove/gateway/nginx.tmpl

+
# Generated by drove-gateway {{datetime}}
+
+user www-data;
+worker_processes auto;
+pid /run/nginx.pid;
+
+events {
+    use epoll;
+    worker_connections 2048;
+    multi_accept on;
+}
+http {
+    server_names_hash_bucket_size  128;
+    add_header X-Proxy {{ .Xproxy }} always;
+    access_log /var/log/nginx/access.log;
+    error_log /var/log/nginx/error.log warn;
+    server_tokens off;
+    client_max_body_size 128m;
+    proxy_buffer_size 128k;
+    proxy_buffers 4 256k;
+    proxy_busy_buffers_size 256k;
+    proxy_redirect off;
+    map $http_upgrade $connection_upgrade {
+        default upgrade;
+        ''      close;
+    }
+    # time out settings
+    proxy_send_timeout 120;
+    proxy_read_timeout 120;
+    send_timeout 120;
+    keepalive_timeout 10;
+
+    server {
+        listen       7000 default_server;
+        server_name  _;
+        # Everything is a 503
+        location / {
+            return 503;
+        }
+    }
+    {{if and .LeaderVHost .Leader.Endpoint}}
+    upstream {{.LeaderVHost}} {
+        server {{.Leader.Host}}:{{.Leader.Port}};
+    }
+    server {
+        listen 7000;
+        server_name {{.LeaderVHost}};
+        location / {
+            proxy_set_header HOST {{.Leader.Host}};
+            proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
+            proxy_connect_timeout 30;
+            proxy_http_version 1.1;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection $connection_upgrade;
+            proxy_pass http://{{.LeaderVHost}};
+        }
+    }
+    {{end}}
+    {{- range $id, $app := .Apps}}
+    upstream {{$app.Vhost}} {
+        {{- range $app.Hosts}}
+        server {{ .Host }}:{{ .Port }};
+        {{- end}}
+    }
+    server {
+        listen 7000;
+        server_name {{$app.Vhost}};
+        location / {
+            proxy_set_header HOST $host;
+            proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
+            proxy_connect_timeout 30;
+            proxy_http_version 1.1;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection $connection_upgrade;
+            proxy_pass http://{{$app.Vhost}};
+        }
+    }
+    {{- end}}
+}
+
+

The above template will do the following:

+
  • Sets the NGinx port to 7000. This is the port exposed on the Docker container for the gateway. Do not change this.
  • Sets up error and access logs in /var/log/nginx. Log rotation is already set up for this path.
  • Sets up a vhost drove-staging.mydomain that will get auto-updated with the current leader of the Drove cluster.
  • Sets up automatically updated virtual hosts for all apps on the cluster.
+

Create environment file

+

We want to configure the drove gateway container using the required environment variables. To do that, put the following in /etc/drove/gateway/gateway.env:

+
CONFIG_FILE_PATH=/etc/drove/gateway/gateway.toml
+TEMPLATE_FILE_PATH=/etc/drove/gateway/nginx.tmpl
+
+

Create systemd file

+

Create a systemd file. Put the following in /etc/systemd/system/drove.gateway.service:

+
[Unit]
+Description=Drove Gateway Service
+After=docker.service
+Requires=docker.service
+
+[Service]
+User=drove
+TimeoutStartSec=0
+Restart=always
+ExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-gateway:latest
+ExecStart=/usr/bin/docker run  \
+    --env-file /etc/drove/gateway/gateway.env \
+    --volume /etc/drove/gateway:/etc/drove/gateway:ro \
+    --volume /var/log/drove/gateway:/var/log/nginx \
+    --network host \
+    --hostname %H \
+    --rm \
+    --name drove.gateway \
+    ghcr.io/phonepe/drove-gateway:latest
+
+[Install]
+WantedBy=multi-user.target
+
+

Verify the file with the following command: +

systemd-analyze verify drove.gateway.service
+

+

Set permissions +

chmod 664 /etc/systemd/system/drove.gateway.service
+

+

Start the service on all servers

+

Use the following to start the service:

+
systemctl daemon-reload
+systemctl enable drove.gateway
+systemctl start drove.gateway
+
+

Checking Logs

+

You can check logs using: +

journalctl -u drove.gateway -f
+

+

NGinx logs would be available at /var/log/drove/gateway.

+

Log rotation for NGinx

+

The gateway sets up log rotation for the access and error logs with the following config:

/var/log/nginx/*.log {
+    rotate 5
+    size 10M
+    dateext
+    dateformat -%Y-%m-%d
+    missingok
+    compress
+    delaycompress
+    sharedscripts
+    notifempty
+    postrotate
+        test -r /var/run/nginx.pid && kill -USR1 `cat /var/run/nginx.pid`
+    endscript
+}
+

+
+

This will rotate both error and access logs when they hit 10MB and keep 5 logs.

+
+

Configure the above as needed, and volume mount your config to /etc/logrotate.d/nginx to use a different scheme as per your requirements.


Maintaining a Drove Cluster

+

There are a couple of constructs built into Drove to allow for easy maintenance.

+
  • Cluster maintenance mode
  • Executor node blacklisting
+

Maintenance mode

+

Drove supports a maintenance mode to allow for software updates without affecting the containers running on the cluster.

+
+

Danger

+

In maintenance mode, outage detection is turned off, and container failures for applications are not acted upon even if detected.

+
+

Engaging maintenance mode

+

Set cluster to maintenance mode.

+

Preconditions
  • Cluster must be in the following state: NORMAL

+
+
+
+
drove -c local cluster maintenance-on
+
+
+
+

Sample Request +

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/set' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+

+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "state": "MAINTENANCE",
+        "updated": 1721630351178
+    },
+    "message": "success"
+}
+

+
+
+
+

Disengaging maintenance mode

+

Set cluster to normal mode.

+

Preconditions
  • Cluster must be in the following state: MAINTENANCE

+
+
+
+
drove -c local cluster maintenance-off
+
+
+
+

Sample Request

+
curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/unset' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+
+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "state": "NORMAL",
+        "updated": 1721630491296
+    },
+    "message": "success"
+}
+

+
+
+
+

Updating drove version across the cluster quickly

+

We recommend the following sequence of steps:

+
  1. Find the leader controller for the cluster using drove ... cluster leader.
  2. Update the controller container on the nodes that are not the leader. If you are using the systemd file given here, you just need to restart the controller service using systemctl restart drove.controller.
  3. Set the cluster to maintenance mode using drove ... cluster maintenance-on.
  4. Update the leader controller. If you are using the systemd file given here, you just need to restart the leader controller service: systemctl restart drove.controller.
  5. Update the executors. If you are using the systemd file given here, you just need to restart all executors: systemctl restart drove.executor.
  6. Take the cluster out of maintenance mode: drove ... cluster maintenance-off.

The whole sequence is consolidated in the sketch below.
+
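
A consolidated sketch of the sequence above, assuming the drove CLI is configured with a cluster alias named local and the systemd units from this guide:

# 1. find the current leader
+drove -c local cluster leader
+
+# 2. on every non-leader controller node
+systemctl restart drove.controller
+
+# 3. engage maintenance mode
+drove -c local cluster maintenance-on
+
+# 4. on the leader controller node
+systemctl restart drove.controller
+
+# 5. on every executor node
+systemctl restart drove.executor
+
+# 6. disengage maintenance mode
+drove -c local cluster maintenance-off
+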

Executor blacklisting

+

In cases where we want to take an executor node out of the cluster for planned maintenance, we need to ensure application instances running on the node are replaced by containers on other nodes and the ones running here are shut down cleanly.

+

This is achieved by blacklisting the node.

+
+

Tip

+

Whenever blacklisting is done, it causes some flux in the application topology as containers migrate from blacklisted to normal nodes. To reduce the number of times this happens, plan to perform multiple operations together, blacklisting and un-blacklisting executors in batches.

+

Drove optimizes migrations during bulk blacklisting: containers for an app are migrated together only once, rather than once for every blacklisted node.

+
+
+

Danger

+

Task instances are not migrated out. This is because it is impossible for Drove to know if a task can be migrated or not (i.e. killed and spun up on a new node in any order).

+
+

To blacklist executors do the following:

+
+
+
+
drove -c local executor blacklist dd2cbe76-9f60-3607-b7c1-bfee91c15623 ex1 ex2 
+
+
+
+

Sample Request

+
curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/blacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d&id=ex1&id=ex2' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+
+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "failed": [
+            "ex2",
+            "ex1"
+        ],
+        "successful": [
+            "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d"
+        ]
+    },
+    "message": "success"
+}
+

+
+
+
+

To un-blacklist executors do the following:

+
+
+
+
drove -c local executor unblacklist dd2cbe76-9f60-3607-b7c1-bfee91c15623 ex1 ex2 
+
+
+
+

Sample Request

+
curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/unblacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d&id=ex1&id=ex2' \
+--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
+--data ''
+
+

Sample response +

{
+    "status": "SUCCESS",
+    "data": {
+        "failed": [
+            "ex2",
+            "ex1"
+        ],
+        "successful": [
+            "a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d"
+        ]
+    },
+    "message": "success"
+}
+

+
+
+
+
+

Note

+

Drove will not re-evaluate placement of existing Applications in RUNNING state once executors are brought back into rotation.

+

Planning your cluster

+

Running a Drove cluster in production for critical workloads involves planning and preparation around factors like availability, scale, security and access management. The following issues should be considered while planning your Drove cluster.

+

Criteria for planning

+

The simplest form of a Drove cluster runs the controller, zookeeper, executor and gateway services all on the same machine, while a highly available cluster separates out all components according to the following considerations:

+
  • Availability: A single node for the executor, controller, gateway or zookeeper role is a single point of failure. A typical highly available cluster consists of:
      • Separated controller and executor nodes
      • Multi-controller setup - typically a minimum of 3 controller nodes to provide adequate availability and a majority for leader election
      • Multiple gateway nodes to load balance traffic and provide fault tolerance
      • Sufficient executor nodes to satisfy placement policies suitable for high availability of application instances across executors
      • A multi-node zookeeper cluster to provide adequate availability and quorum. An even number of nodes is not useful; at least three zookeeper servers are required.
  • Scale: Size your components as per the expected amount of traffic to your instances, but also plan for expected demand growth and ways to scale your cluster accordingly.
  • Security and Access management: Use authentication for intra-cluster communication and add encryption for secure communications. Access management can be devolved by creating multiple users with either read or write roles.
+

Cluster configuration

+

Controllers

+

Controllers manage the cluster, with application instances spread across multiple executors as per different placement policies. Controllers use leader election to coordinate and act as a single entity, while each executor is a single entity that runs many different application instances.

+
  • Multiple controllers: For high availability, there should be at least three controllers. Drove uses leader election to coordinate across the controllers. If the leader fails, the remaining controllers elect a new leader and take over.
  • Separate Zookeeper service and snapshots: The zookeeper service used for state coordination by the controllers can either run on the same machines as the controllers or on separate machines. Zookeeper stores the configuration data for the whole cluster, and zookeeper snapshots can be used to back up this data.
  • Availability zones: You can split the controllers and/or zookeeper nodes across availability zones/data centers to improve resilience.
  • Transit encryption and certificate management: Controllers and executors can communicate over secure channels. TLS settings and certificates can be added by modifying the applicationConnectors and adminConnectors parameters as per the Dropwizard HTTPS connector configuration (see the sketch after this list).
  • Separate Gateways: The Drove gateways route and load balance application traffic across application instances. The gateway service can either run on the same machines as the controllers or on separate machines.
  • Resources: The required JVM maximum heap size for the Drove controller service increases with the number of executors, applications and application instances in the cluster. Review the JVM parameters as per your scale.
+
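
For instance, switching a controller's main listener to TLS might look like the following sketch (the port, keystore path and password are placeholders; see the Dropwizard HTTPS connector documentation for the full option set):

server:
+  applicationConnectors:
+    - type: https
+      port: 10000
+      keyStorePath: /etc/drove/controller/keystore.jks
+      keyStorePassword: changeit
+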

Zookeeper

+
  • Quorum: For replicated zookeeper, a minimum of three servers is required, and it is recommended that you have an odd number of servers. With only two servers, if one fails there are not enough machines to form a majority quorum; two servers are inherently less stable than a single server, because there are two single points of failure. Refer to Running Replicated ZooKeeper in the ZooKeeper Getting Started Guide.
  • zxid rollover: zxid is the ZooKeeper transaction id, a 64-bit number with two parts: the high order 32 bits hold the epoch and the low order 32 bits hold a counter. A zookeeper election is forced when the 32-bit counter rolls over, which happens more frequently as the scale of your cluster increases.
  • JVM parameters and resources: The required JVM maximum heap size for zookeeper increases with the number of executors, applications and application instances in the cluster. Review the JVM parameters as per your scale. Refer to the ZooKeeper Administrator's Guide.
+

Executors

+
  • Containerisation engine: Drove supports the Docker engine and has experimental support for Podman. Choose your engine as per your security, networking and performance considerations.
  • Container networking: The container engine and container networking should be configured as per your requirements. It is recommended to use port forwarding based container networking if you choose to use the Drove Gateway to route application traffic. Container engine settings can be modified to manage DNS and proxy parameters for containers.
  • Placement policies and availability: Drove supports placement policies to set criteria for replicating instances across executors and avoiding single points of failure. Drove tags can be assigned to executors, and placement policies can pin certain applications to specific executors if you have hardware or other considerations.
  • Scaling: As your cluster scale increases, you can keep adding executors to the cluster. Placement policies should be used to manage availability criteria. Controller and ZooKeeper resource requirements increase with executor count and should be reviewed accordingly.
+

Gateways

+
  • Container networking: It is recommended to use port forwarding based container networking if you choose to use Drove Gateways to route application traffic.
  • Load balancing: Gateways use NGinx as a web server, and many different approaches can be used to balance load among multiple gateway nodes. Some examples include:
      • DNS load balancing: Multiple gateway IPs can be added as A records to the virtual host domain to let clients use round-robin DNS and split load across gateway nodes.
      • Anycast/network load balancing: If any sort of anycast/network load balancing functionality is available in your network, it can be used to split traffic across gateway nodes.
  • High availability and scaling: Many different methods are available to achieve high availability and scale NGinx. Any method can be used by adequately modifying the template used by the gateway to render the NGinx configuration.

Setting up the prerequisites

+

On all machines in the Drove cluster, we want to use the same user and a consistent storage structure for configuration, logs etc.

+
+

Note

+

All commands are to be issued as root. To get to admin/root mode, issue the following command:

+
sudo su
+
+
+

Setting up user

+

We shall create a user called drove to run all services and containers, and assign file ownership to this user.

+

adduser --system --group "drove" --home /var/lib/misc --no-create-home > /dev/null
+
+We want the user to be able to run docker containers, so we add the user to the docker group:

+
groupadd docker
+usermod -aG docker drove
+
+

Create directories

+

We shall use the following locations to store configurations, logs etc:

+
  • /etc/drove/... - for configuration
  • /var/log/drove/... - for all logs
+

We go ahead and create these locations and setup the correct permissions:

+
mkdir -p /etc/drove
+chown -R drove.drove /etc/drove
+chmod 700 /etc/drove
+chmod g+s /etc/drove
+
+mkdir -p /var/lib/drove
+chown -R drove.drove /var/lib/drove
+chmod 700 /var/lib/drove
+
+mkdir -p /var/log/drove
+
+
+

Danger

+

Ensure you run the chmod commands to remove read access for everyone other than the owner.

+
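
A quick sanity check for the ownership and permissions set above (expected owner and group: drove, with no access for others):

ls -ld /etc/drove /var/lib/drove
+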

Units Reference

+

In the configuration files for Drove, we use the Duration and DataSize units to make configuration easier.

+

Data Size

+

Use the following suffixes to express sizes in human readable form, such as 2GB:

+
  • B - Bytes
  • byte - Bytes
  • Bytes - Bytes
  • K - Kilobytes
  • KB - Kilobytes
  • KiB - Kibibytes
  • Kilobyte - Kilobytes
  • kibibyte - Kibibytes
  • KiloBytes - Kilobytes
  • kibiBytes - Kibibytes
  • M - Megabytes
  • MB - Megabytes
  • MiB - Mebibytes
  • megabyte - Megabytes
  • mebibyte - Mebibytes
  • megaBytes - Megabytes
  • mebiBytes - Mebibytes
  • G - Gigabytes
  • GB - Gigabytes
  • GiB - Gibibytes
  • gigabyte - Gigabytes
  • gibibyte - Gibibytes
  • gigaBytes - Gigabytes
  • gibiBytes - Gibibytes
  • T - Terabytes
  • TB - Terabytes
  • TiB - Tebibytes
  • terabyte - Terabytes
  • tebibyte - Tebibytes
  • teraBytes - Terabytes
  • tebiBytes - Tebibytes
  • P - Petabytes
  • PB - Petabytes
  • PiB - Pebibytes
  • petabyte - Petabytes
  • pebibyte - Pebibytes
  • petaBytes - Petabytes
  • pebiBytes - Pebibytes
+

Duration

+

Time durations in Drove can be expressed in human readable form, for example: 3d can be used to signify 3 days and so on. The list of valid duration unit suffixes are:

+
  • ns - nanoseconds
  • nanosecond - nanoseconds
  • nanoseconds - nanoseconds
  • us - microseconds
  • microsecond - microseconds
  • microseconds - microseconds
  • ms - milliseconds
  • millisecond - milliseconds
  • milliseconds - milliseconds
  • s - seconds
  • second - seconds
  • seconds - seconds
  • m - minutes
  • min - minutes
  • mins - minutes
  • minute - minutes
  • minutes - minutes
  • h - hours
  • hour - hours
  • hours - hours
  • d - days
  • day - days
  • days - days
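
For illustration, here is how these units appear in configuration; the fields below are executor options described elsewhere in this guide, and the values are arbitrary:

options:
+  logBufferSize: 5m              # DataSize: 5 megabytes
+  cacheFileSize: 100MiB          # DataSize: 100 mebibytes
+  containerCommandTimeout: 30s   # Duration: 30 seconds
+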

Setting Up Zookeeper

+

We shall be running Zookeeper using the official Docker images. All data volumes etc will be mounted on the host machines.

+

The following ports will be exposed:

+
  • 2181 - The main port for ZK clients to connect to the server
  • 2888 - The port used by Zookeeper for in-cluster communication between peers
  • 3888 - The port used for internal leader election
  • 8080 - The admin server port. We are going to turn this off.
+
+

Danger

+

The ZK admin server does not always shut down cleanly and is not needed for anything related to Drove, so unless you need it, turn it off.

+
+

We assume the following to be the IPs of the 3 zookeeper nodes:

+
  • 192.168.3.10
  • 192.168.3.11
  • 192.168.3.12
+

Relevant directories

+

Locations for data and logs are as follows:

+
  • /etc/drove/zk - Configuration files
  • /var/lib/drove/zk/ - Data and data logs
  • /var/log/drove/zk - Logs
+

Important files

+

The zookeeper container stores snapshots, transaction logs and application logs in the /data, /datalog and /logs directories respectively. We shall be volume mounting the following:

+
  • /var/lib/drove/zk/data to /data on the container
  • /var/lib/drove/zk/datalog to /datalog on the container
  • /var/log/drove/zk to /logs on the container
+

Docker will create these directories when the container comes up for the first time.

+
+

Tip

+

The zk server id (set above using ZOO_MY_ID) can also be set by putting the server number in a file named myid in the /data directory.

+
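
For example, on the first server (this writes into the host directory that is mounted at /data in the container):

echo 1 > /var/lib/drove/zk/data/myid
+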
+
+

Prerequisite Setup

+

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

+
+

Setup configuration files

+

Let's create the config directory:

+
mkdir -p /etc/drove/zk
+
+

We shall be creating 3 different configuration files to configure zookeeper:

+
  • zk.env - Environment variables to be used by the zookeeper container
  • java.env - JVM related options
  • logback.xml - Logging configuration
+

Setup environment variables

+

Let us prepare the configuration. Put the following in a file: /etc/drove/zk/zk.env:

+
#(1)!
+ZOO_TICK_TIME=2000
+ZOO_INIT_LIMIT=10
+ZOO_SYNC_LIMIT=5
+ZOO_STANDALONE_ENABLED=false
+ZOO_ADMINSERVER_ENABLED=false
+
+#(2)!
+ZOO_AUTOPURGE_PURGEINTERVAL=12
+ZOO_AUTOPURGE_SNAPRETAINCOUNT=5
+
+#(3)!
+ZOO_MY_ID=1
+ZOO_SERVERS=server.1=192.168.3.10:2888:3888;2181 server.2=192.168.3.11:2888:3888;2181 server.3=192.168.3.12:2888:3888;2181
+
+
  1. Cluster level configuration to ensure the cluster topology remains stable through minor flaps.
  2. Controls how much data we retain.
  3. This section needs to change per server. Each server should have a different ZOO_MY_ID set, and the same numbers are referred to in the ZOO_SERVERS section.
+
+

Warning

+
  • The ZOO_MY_ID value needs to be different on every server. So it would be:
      • 192.168.3.10 - 1
      • 192.168.3.11 - 2
      • 192.168.3.12 - 3
  • The format for ZOO_SERVERS is server.id=<address1>:<port1>:<port2>[:role];[<client port address>:]<client port>.
+
+
+

Info

+

An exhaustive set of options can be found on the official Docker page.

+
+

Setup JVM parameters

+

Put the following in /etc/drove/zk/java.env

+
export SERVER_JVMFLAGS='-Djute.maxbuffer=0x9fffff -Xmx4g -Xms4g -Dfile.encoding=utf-8 -XX:+UseG1GC -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError'
+
+
+

Configuring Max Data Size

+

Drove data per node can get somewhat large from time to time, depending on your application configuration. To be on the safe side, we increase the maximum data size per node by setting the JVM option -Djute.maxbuffer=0x9fffff on all cluster nodes in Drove. This is approximately 10MB; the actual payload never gets anywhere close. We plan to pick up payload compression in a future version so that this variable no longer needs to be set.

+

For the Zookeeper Docker, the environment variable SERVER_JVMFLAGS needs to be set to -Djute.maxbuffer=0x9fffff.

+

Please refer to Zookeeper Advanced Configuration for further properties that can be tuned.

+
+
+

JVM Size

+

We set 4GB JVM heap size for ZK by adding appropriate options in SERVER_JVMFLAGS. Please make sure you have sized your machines to have 10-16GB of RAM at the very least. Tune the JVM size and machine size according to your needs.

+
+


+
+

JVMFLAGS environment variable

+

Do not set this variable in zk.env, for a couple of reasons:

+
  • It will affect both the zk server as well as the client.
  • There is an issue where the flag (as well as SERVER_JVMFLAGS) is not used properly by the startup scripts when set this way.
+
+

Configure logging

+

We want to have physical log files on disk for debugging and audits and want the container to be ephemeral to allow for easy updates etc. To achieve this, put the following in /etc/drove/zk/logback.xml:

+
<!--
+ Copyright 2022 The Apache Software Foundation
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+ Define some default values that can be overridden by system properties
+-->
+<configuration>
+  <!-- Uncomment this if you would like to expose Logback JMX beans -->
+  <!--jmxConfigurator /-->
+
+  <property name="zookeeper.console.threshold" value="INFO" />
+
+  <property name="zookeeper.log.dir" value="/logs" />
+  <property name="zookeeper.log.file" value="zookeeper.log" />
+  <property name="zookeeper.log.threshold" value="INFO" />
+  <property name="zookeeper.log.maxfilesize" value="256MB" />
+  <property name="zookeeper.log.maxbackupindex" value="20" />
+
+  <!--
+    console
+    Add "console" to root logger if you want to use this
+  -->
+  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
+    <encoder>
+      <pattern>%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n</pattern>
+    </encoder>
+    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
+      <level>${zookeeper.console.threshold}</level>
+    </filter>
+  </appender>
+
+  <!--
+    Add ROLLINGFILE to root logger to get log file output
+  -->
+  <appender name="ROLLINGFILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
+    <File>${zookeeper.log.dir}/${zookeeper.log.file}</File>
+    <encoder>
+      <pattern>%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n</pattern>
+    </encoder>
+    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
+      <level>${zookeeper.log.threshold}</level>
+    </filter>
+    <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
+      <maxIndex>${zookeeper.log.maxbackupindex}</maxIndex>
+      <FileNamePattern>${zookeeper.log.dir}/${zookeeper.log.file}.%i</FileNamePattern>
+    </rollingPolicy>
+    <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
+      <MaxFileSize>${zookeeper.log.maxfilesize}</MaxFileSize>
+    </triggeringPolicy>
+  </appender>
+
+  <!--
+    Add TRACEFILE to root logger to get log file output
+    Log TRACE level and above messages to a log file
+  -->
+  <!--property name="zookeeper.tracelog.dir" value="${zookeeper.log.dir}" />
+  <property name="zookeeper.tracelog.file" value="zookeeper_trace.log" />
+  <appender name="TRACEFILE" class="ch.qos.logback.core.FileAppender">
+    <File>${zookeeper.tracelog.dir}/${zookeeper.tracelog.file}</File>
+    <encoder>
+      <pattern>%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n</pattern>
+    </encoder>
+    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
+      <level>TRACE</level>
+    </filter>
+  </appender-->
+
+  <!--
+    zk audit logging
+  -->
+  <property name="zookeeper.auditlog.file" value="zookeeper_audit.log" />
+  <property name="zookeeper.auditlog.threshold" value="INFO" />
+  <property name="audit.logger" value="INFO, RFAAUDIT" />
+
+  <appender name="RFAAUDIT" class="ch.qos.logback.core.rolling.RollingFileAppender">
+    <File>${zookeeper.log.dir}/${zookeeper.auditlog.file}</File>
+    <encoder>
+      <pattern>%d{ISO8601} %p %c{2}: %m%n</pattern>
+    </encoder>
+    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
+      <level>${zookeeper.auditlog.threshold}</level>
+    </filter>
+    <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
+      <maxIndex>10</maxIndex>
+      <FileNamePattern>${zookeeper.log.dir}/${zookeeper.auditlog.file}.%i</FileNamePattern>
+    </rollingPolicy>
+    <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
+      <MaxFileSize>10MB</MaxFileSize>
+    </triggeringPolicy>
+  </appender>
+
+  <logger name="org.apache.zookeeper.audit.Slf4jAuditLogger" additivity="false" level="${audit.logger}">
+    <appender-ref ref="RFAAUDIT" />
+  </logger>
+
+  <root level="INFO">
+    <appender-ref ref="CONSOLE" />
+    <appender-ref ref="ROLLINGFILE" />
+  </root>
+</configuration>
+
+
+

Tip

+

This is a customization of the original file from the Zookeeper source tree. Please refer to the documentation to configure logging.

+
+

Create Systemd File

+

Create a systemd file. Put the following in /etc/systemd/system/drove.zookeeper.service:

+
[Unit]
+Description=Drove Zookeeper Service
+After=docker.service
+Requires=docker.service
+
+[Service]
+User=drove
+TimeoutStartSec=0
+Restart=always
+ExecStartPre=-/usr/bin/docker pull zookeeper:3.8
+ExecStart=/usr/bin/docker run \
+    --env-file /etc/drove/zk/zk.env \
+    --volume /var/lib/drove/zk/data:/data \
+    --volume /var/lib/drove/zk/datalog:/datalog \
+    --volume /var/log/drove/zk:/logs \
+    --volume /etc/drove/zk/logback.xml:/conf/logback.xml \
+    --volume /etc/drove/zk/java.env:/conf/java.env \
+    --publish 2181:2181 \
+    --publish 2888:2888 \
+    --publish 3888:3888 \
+    --rm \
+    --name drove.zookeeper \
+    zookeeper:3.8
+
+[Install]
+WantedBy=multi-user.target
+
+

Verify the file with the following command: +

systemd-analyze verify drove.zookeeper.service
+

+

Set permissions +

chmod 664 /etc/systemd/system/drove.zookeeper.service
+

+

Start the service on all servers

+

Use the following to start the service:

+
systemctl daemon-reload
+systemctl enable drove.zookeeper
+systemctl start drove.zookeeper
+
+

You can check server status using the following:

+
echo srvr | nc localhost 2181
+
+
+

Tip

+

Replace localhost in the above command with the actual ZK server IPs to test remote connectivity.

+
+
+

Note

+

You can access the ZK client from the container using the following command:

+
docker exec -it drove.zookeeper bin/zkCli.sh
+
+

To connect to a remote host, use the following:

docker exec -it drove.zookeeper bin/zkCli.sh -server <server name or ip>:2181
+

+

Drove CLI

+

Details for the Drove CLI, including installation and usage can be found in the cli repo.

+

Repo link: https://github.com/PhonePe/drove-cli.


Epoch

+

Epoch is a cron-like scheduler that spins up container jobs on Drove.

+

Details for using epoch can be found in the epoch repo.

+

Link for Epoch repo: https://github.com/PhonePe/epoch.

+

Epoch CLI

+

There is a CLI client for interacting with epoch. Details for installation and usage can be found in the epoch CLI repo.

+

Link for Epoch CLI repo: https://github.com/phonepe/epoch-cli.


Libraries

+

Drove is written in Java. We provide a few libraries that can be used to integrate with a Drove cluster.

+

Setup

+

Set up the Drove version property +

<properties>
+    <!--other properties-->
+    <drove.version>1.29</drove.version>
+</properties>
+

+
+

Checking the latest version

+

The latest version can be checked on the GitHub Packages page here.

+
+

All libraries are located in sub-packages of the top-level package com.phonepe.drove.

+
+

Java Version Compatibility

+

The Drove libraries require Java 17 or above.

+
+

Drove Model

+

The model library contains the classes used in requests and responses. It depends on jackson and dropwizard-validation.

+

Dependency

+
<dependency>
+    <groupId>com.phonepe.drove</groupId>
+    <artifactId>drove-models</artifactId>
+    <version>${drove.version}</version>
+</dependency>
+
+

Drove Client

+

We provide a client library that can be used to connect to a Drove cluster. The client accepts controller endpoints as a parameter (among other things) and automatically tracks the leader controller. If a single controller endpoint is provided, this functionality is turned off.

+

Please note that the client does not provide specific functions corresponding to the different API calls on the controller; it acts as a simple endpoint discovery mechanism for the Drove cluster. Please refer to the API section for details on individual APIs.

+
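
For instance, leader discovery can be sketched without the library by using the ping API, which returns HTTP 200 only on the current leader controller (the endpoints and the guest credentials below are placeholders):

for ep in http://controller1:4000 http://controller2:4000; do
    # -f makes curl fail on non-2xx responses, so only the leader prints
    if curl -fsS -u guest:guest "${ep}/apis/v1/ping" > /dev/null 2>&1; then
        echo "leader: ${ep}"
    fi
done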

Transport

+

The transport layer in the client is used to actually make HTTP calls to the Drove server. A new transport can be used by implementing the get(), post(), put() and delete() methods in the DroveHttpTransport interface.

+

By default, the Drove client uses the JDK's built-in HTTP client as a trivial transport implementation. We also provide an Apache HttpComponents based implementation.

+
+

Tip

+

Do not use the default transport in production. Please use the HttpComponents based transport or a custom one.

+
+

Dependencies

+
<dependency>
+    <groupId>com.phonepe.drove</groupId>
+    <artifactId>drove-client</artifactId>
+    <version>${drove.version}</version>
+</dependency>
+<dependency>
+    <groupId>com.phonepe.drove</groupId>
+    <artifactId>drove-client-httpcomponent-transport</artifactId>
+    <version>${drove.version}</version>
+</dependency>
+
+

Sample code

+
public class DroveCluster implements AutoCloseable {
+
+    @Getter
+    private final DroveClient droveClient;
+
+    public DroveCluster() {
+        final var config = new DroveConfig()
+            .setEndpoints(List.of("http://controller1:4000", "http://controller2:4000"));
+
+        this.droveClient = new DroveClient(config,
+                                           List.of(new BasicAuthDecorator("guest", "guest")),
+                                           new DroveHttpComponentsTransport(config.getCluster()));
+    }
+
+    @Override
+    public void close() throws Exception {
+        this.droveClient.close();
+    }
+}
+
+
+

RequestDecorator

+

This interface can be implemented to augment requests with special headers, for example Authorization, as well as for other purposes such as setting the content type.

+
+

Drove Event Listener

+

This library provides callbacks that can be used to listen and react to events happening on the Drove cluster.

+

Dependencies

+
<!--Include Drove client-->
+<dependency>
+    <groupId>com.phonepe.drove</groupId>
+    <artifactId>drove-events-client</artifactId>
+    <version>${drove.version}</version>
+</dependency>
+
+

Sample Code

+
final var droveClient = ... //build your java transport, client here
+
+//Create and setup your object mapper
+final var mapper = new ObjectMapper();
+mapper.registerModule(new ParameterNamesModule());
+mapper.setSerializationInclusion(JsonInclude.Include.NON_EMPTY);
+mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);
+mapper.disable(SerializationFeature.FAIL_ON_EMPTY_BEANS);
+mapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
+mapper.enable(MapperFeature.ACCEPT_CASE_INSENSITIVE_ENUMS);
+
+final var listener = new DroveRemoteEventListener(droveClient, //Create listener
+                                                    mapper,
+                                                    new DroveEventPollingOffsetInMemoryStore(),
+                                                    Duration.ofSeconds(1));
+
+listener.onEventReceived() //Connect signal handlers
+    .connect(events -> {
+        log.info("Remote Events: {}", events);
+    });
+
+listener.start(); //Start listening
+
+
+//Once done close the listener
+listener.close();
+
+
+

Event Types

+

Please check the com.phonepe.drove.models.events package for the different event types and classes.

+
+
+

Event Polling Offset Store

+

The event poller library uses polling to find new events based on an offset. The event polling offset store is used to store and retrieve this offset. The default store, DroveEventPollingOffsetInMemoryStore, keeps this information in memory. Implement DroveEventPollingOffsetStore on top of durable storage if you need the offset to survive restarts.

+
+

Drove Hazelcast Cluster Discovery

+

Drove provides an implementation of the Hazelcast discovery SPI so that containers deployed on a Drove cluster can discover each other. This client uses the token injected by Drove in the DROVE_APP_INSTANCE_AUTH_TOKEN environment variable to get sibling information from the controller.

+

Dependencies

+
<!--Include Drove client-->
+<!--Include Hazelcast-->
+<dependency>
+    <groupId>com.phonepe.drove</groupId>
+    <artifactId>drove-events-client</artifactId>
+    <version>${drove.version}</version>
+</dependency>
+
+

Sample Code

+
//Setup hazelcast
+Config config = new Config();
+
+// Enable discovery
+config.setProperty("hazelcast.discovery.enabled", "true");
+config.setProperty("hazelcast.discovery.public.ip.enabled", "true");
+config.setProperty("hazelcast.socket.client.bind.any", "true");
+config.setProperty("hazelcast.socket.bind.any", "false");
+
+//Setup networking
+NetworkConfig networkConfig = config.getNetworkConfig();
+networkConfig.getInterfaces().addInterface("0.0.0.0").setEnabled(true);
+networkConfig.setPort(port); //Port is the port exposed on the container for hazelcast clustering
+
+// Setup Drove discovery
+JoinConfig joinConfig = networkConfig.getJoin();
+
+DiscoveryConfig discoveryConfig = joinConfig.getDiscoveryConfig();
+DiscoveryStrategyConfig discoveryStrategyConfig =
+        new DiscoveryStrategyConfig(new DroveDiscoveryStrategyFactory());
+discoveryStrategyConfig.addProperty("drove-endpoint", "http://controller1:4000,http://controller2:4000"); //Controller endpoints
+discoveryStrategyConfig.addProperty("port-name", "hazelcast"); // Name of the hazelcast port defined in Application spec
+discoveryStrategyConfig.addProperty("transport", "com.phonepe.drove.client.transport.httpcomponent.DroveHttpComponentsTransport");
+discoveryStrategyConfig.addProperty("cluster-by-app-name", true); //Cluster containers across multiple app versions
+discoveryConfig.addDiscoveryStrategyConfig(discoveryStrategyConfig);
+
+//Create hazelcast node
+val node = Hazelcast.newHazelcastInstance(config);
+
+//Once connected, node.getCluster() will be non null
+
+
+
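
For reference, the port named by the port-name property must be declared under exposedPorts in the application spec. One way to double-check a deployed spec via the spec API (the endpoint, credentials, and app ID here are placeholders):

curl -u guest:guest http://drove.local:7000/apis/v1/applications/TEST_APP-1/spec
# Verify that exposedPorts contains an entry whose "name" matches port-name ("hazelcast" here)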

Peer discovery modes

+

By default, containers will only discover and connect to containers from the same application ID. If you need to connect to containers from all versions of the same application, set the cluster-by-app-name property to true, as in the above example.

+
diff --git a/extra/nvidia.html b/extra/nvidia.html

Setting up Nvidia GPU computation on executor

+

Prerequisite: Docker version 19.03+. Check Docker versions and Nvidia for details.

+

The steps below are primarily for Ubuntu; for other distros, check the associated links.

+

Install nvidia drivers on hosts

+

Ubuntu provides packaged drivers for Nvidia. +See the driver installation guide.

+

Recommended +

ubuntu-drivers list --gpgpu
+ubuntu-drivers install --gpgpu nvidia:535-server
+

+

Alternatively, apt can be used, though it may require additional steps; see the manual install guide. +

# Check for the latest stable version
+apt search "nvidia-driver.*server"
+apt install -y nvidia-driver-535-server nvidia-utils-535-server
+
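
A reboot is usually the simplest way to load the newly installed kernel module (assuming a host restart is acceptable):

# nvidia-smi should work once the host is back up
reboot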

+

For other distros, check the guide.

+

Install Nvidia-container-toolkit

+

Add the Nvidia repo

+

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg   && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+
+apt update
+apt install -y nvidia-container-toolkit
+
For other distros, check the guide here.

+

Configure Docker with the Nvidia toolkit

+
nvidia-ctk runtime configure --runtime=docker
+
+systemctl restart docker #Restart Docker
+
+
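
You can confirm the runtime registration by inspecting the Docker daemon config. The exact contents vary by toolkit version, but an entry along these lines is expected:

cat /etc/docker/daemon.json
# Expect a "runtimes" section similar to:
# "runtimes": {
#   "nvidia": {
#     "path": "nvidia-container-runtime",
#     "runtimeArgs": []
#   }
# }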

Verify installation

+

On the host:
+nvidia-smi -l
+
+In a Docker container:
+docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

+

+-----------------------------------------------------------------------------+
+| NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
+|-------------------------------+----------------------+----------------------+
+| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+|                               |                      |               MIG M. |
+|===============================+======================+======================|
+|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
+| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+|                               |                      |                  N/A |
++-------------------------------+----------------------+----------------------+
+
++-----------------------------------------------------------------------------+
+| Processes:                                                                  |
+|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
+|        ID   ID                                                   Usage      |
+|=============================================================================|
+|  No running processes found                                                 |
++-----------------------------------------------------------------------------+
+
+See the verification guide.

+

Enable nvidia support on drove

+

Enable Nvidia support in drove-executor.yml and restart drove-executor +

...
+resources:
+  ...
+  enableNvidiaGpu: true
+...
+
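
If the executor runs as a systemd unit (assumed here to be named drove.executor, mirroring the drove.zookeeper unit shown earlier; adjust to your actual unit name):

systemctl restart drove.executor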

diff --git a/getting-started.html b/getting-started.html

Getting Started

+

To get a trivial cluster up and running on a machine, the compose file can be used.

+

Update /etc/hosts to interact with nginx

+

Add the following lines to /etc/hosts +

127.0.0.1   drove.local
+127.0.0.1   testapp.local
+

+

Download the compose file

+
wget https://raw.githubusercontent.com/PhonePe/drove-orchestrator/master/compose/compose.yaml
+
+

Bringing up a demo cluster

+

cd compose
+docker-compose up
+
+This will start Zookeeper, the Drove controller, an executor, and nginx/drove-gateway. The following ports are used:

+
  • Zookeeper - 2181
  • Executor - 3000
  • Controller - 4000
  • Gateway - 7000
+

Drove credentials are admin/admin and guest/guest for read-write and read-only permissions respectively.

+

You should be able to access the UI at http://drove.local:7000

+

Install drove-cli

+

Install the CLI for drove +

pip install drove-cli
+

+

Create Client Configuration

+

Put the following in ${HOME}/.drove

+
[local]
+endpoint = http://drove.local:4000
+username = admin
+password = admin
+
+
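
Before deploying anything, you can sanity-check connectivity to the leader controller through the gateway (admin/admin is the demo read-write credential from above):

curl -u admin:admin http://drove.local:7000/apis/v1/ping
# Expected: {"status":"SUCCESS","data":"pong","message":"success"}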

Deploy an app

+

Get the sample app spec: +

wget https://raw.githubusercontent.com/PhonePe/drove-cli/master/sample/test_app.json
+

+

Now deploy the app. +

drove -c local apps create test_app.json
+

+

Scale the app

+

drove -c local apps scale TEST_APP-1 1 -w
+
+This would expose the app as testapp.local. The endpoint would be http://testapp.local:7000.

+

You can test the app by running the following commands:

+
curl http://testapp.local:7000/
+curl http://testapp.local:7000/files/drove.txt
+
+

Suspend and destroy the app

+
drove -c local apps scale TEST_APP-1 0 -w
+drove -c local apps destroy TEST_APP-1
+
+

Accessing the code

+

Code is hosted on GitHub.

+

Cloning everything:

+
git clone git@github.com:PhonePe/drove-orchestrator.git
+cd drove-orchestrator
+git submodule init
+git submodule update
+
diff --git a/images/app-instance-state-machine.png b/images/app-instance-state-machine.png
diff --git a/images/app-state-machine.png b/images/app-state-machine.png
diff --git a/images/cluster.png b/images/cluster.png
diff --git a/images/cluster.svg b/images/cluster.svg
diff --git a/images/drove-home.png b/images/drove-home.png
diff --git a/images/task-state-machine.png b/images/task-state-machine.png
diff --git a/index.html b/index.html

Introduction

+

Drove is a container orchestrator built at PhonePe. It is focused on simplicity, container performance, and easy operations.

+

Drove Home

+

Features

+

The following sections go over the features.

+

Functional

+
  • Application (service) and application container lifecycle management, including mandated readiness checks, health checks, and pre-shutdown hooks to enable operators to take containers out of rotation easily and shut them down gracefully if needed.
  • Ensures the required (specified) number of containers will always be present in the cluster. It will detect failures across the cluster and bring containers up/down to maintain the required instance count.
  • Provides endpoint information to be consumed by routers like drove-gateway+nginx/traefik, etc., to expose containers over vhost.
  • Supports short-lived container-based tasks. This helps folks build newer systems that can spin up containers as needed on the cluster. (See epoch).
  • Provides functionality for real-time log streaming and log download for all instances.
  • Log generation is handled by Drove in a file layout suitable for existing log shipping mechanisms as well as for streaming to rsyslog servers (if needed).
  • Provides a functional read-only web-based console for checking cluster, application, task, and instance states, log streaming, etc.
  • Provides APIs for both read and write operations.
  • Supports discovery for sibling containers to support dynamic cluster reconfiguration in frameworks like Hazelcast.
  • Supports extra metadata in the form of tags on instances. This can be used in external systems for routing or other use-cases, as this information is available at the endpoint as well.
  • CLI system for easy deployments and app/task lifecycle management.
  • NGinx-based router called drove-gateway for efficient communication with the cluster itself and containers deployed on it.
+

Operations

+
  • Only two components (controller and executor) to make a cluster (plus Zookeeper for coordination and drove-gateway for routing if needed).
  • All components are dockerised to allow for easy deployment as well as upgrades.
  • Simple single-file YAML based configuration for the controller and executor.
  • The cluster can be set to maintenance mode, where it pauses making changes to the cluster and turns off safeguards around ensuring the required number of containers get reported from the executor nodes. This allows the SRE team to do seamless software updates across the whole cluster in a few minutes, irrespective of its size.
  • Blacklisting an executor node will automatically move all running application containers to other nodes and prevent any further allocations to that node. This allows the node to be taken down for maintenance that needs longer periods of time to complete: OS/patch application, hardware maintenance, etc.
  • Detects and kills any zombie containers. On Mesos, the SRE team needs to be involved to manually kill such containers.
+

Performance

+
  • The scheduler is aware of the NUMA hardware topology of the node and prevents containers from being split across NUMA nodes.
  • The scheduler will pin containers to specific cores on a NUMA node to stop containers from stepping on each other's toes during peak hours, and to allow them to fully utilize the multi-level caches associated with the allocated CPU cores. Some balance is gained anyway by enabling hyper-threading on the executor nodes; this should be sufficient to provide a significant boost to application performance.
  • Allows for specialised nodes in the cluster. For example, there might be nodes with GPUs available. We would want to run ML models that can utilise such hardware rather than allocate generic service containers on such nodes. To this end, the scheduler supports tagging and allows containers to be explicitly mapped to tagged nodes.
  • Allows for different placement policies to provide some flexibility to users in where they want to place their container nodes. This sometimes helps developers deploy specific apps to specific nodes where they might have been granted special privileges to perform deeper-than-usual investigations of running service containers (for example, take heap-dumps to specific mounted volumes, etc.).
  • Allows for configuration injection at container startup. Such configuration can be streamed in as part of the deployment specification, mounted in from executor hosts, or fetched via API calls by the controllers or executors.
  • Provides provisions to allow for extension of the scheduler to implement different scheduling algorithms in the code later on.
  • Sometimes, NUMA localization and CPU pinning are overkill for clusters that don't need to extract the last bit of performance, for example testing/staging clusters. To this end, Drove supports the following features:
      • Allows turning off NUMA and core pinning at the executor level.
      • Allows specifying multipliers for available CPU/memory to accommodate overprovisioning on the cluster.
  • Because the above are set at an executor level, the cluster can have different types of nodes with different required performance characteristics appropriately tagged. Relevant apps can be deployed based on performance requirements.
+

Resilience

+
  • Small number of moving pieces. Keeps the minimal amount of dependencies in the system. This reduces the exposure to failures by effectively reducing the number of external dependencies and possible failure points.
  • The controller stores state in an external system (Zookeeper) for now.
  • The executor stores all container-specific state in the container metadata itself. No other state is maintained or needed by the executor.
  • Containers keep running even when most of the system is down. This means that even when the cluster coordinators, executors, state storage, etc. are down, the already-deployed containers keep running as is. If a service discovery mechanism is implemented properly, this effectively protects the system against service disruptions, even in the face of failure of critical cluster components. At PhonePe, we use Ranger for service discovery.
  • Container state reconciliation is part of the executor system, so that the executor service can be restarted easily without affecting application or task deployments. In other words, the executor recognises the containers it started across restarts and reports their state as usual to the controller.
  • Keeps things as simple as possible. Drove uses a few simple constructs (scale up/scale down) and implements all application/task features using those.
  • Multi-mode cluster messaging ensures that faster updates are sent to the controller via sync channels, while the controller(s) keep refreshing the cluster state periodically, irrespective of the last synced data. Drove assumes that communication failures will happen; even if new changes can't be propagated from executor to controller, it tries to keep the existing topology as updated as possible.
  • Built-in safeguards detect and kill any rogue (zombie) container instances that have remained behind for some reason (maybe some bug in the orchestrator, etc.).
  • The controller is highly available, with one leader active at a time. Any communication issues with Zookeeper will lead to quick death of the controller so that another controller can take its place as quickly as possible.
  • The leader can be tracked using the ping API and is used by components such as drove-gateway to provide a virtual host that can be used to interact with the cluster via the UI, the CLI, and other tools.
+

Security

+
  • Clearly designated roles for read and write operations. Write operations include cluster maintenance and app and task lifecycle maintenance.
  • The authentication system is easily extensible.
  • Supports basic auth as the minimal auth requirement. User credentials are stored in bcrypt format in controller config files.
  • Supports a no-auth mode for starter clusters.
  • Provides audit logs for events in the system. Such logs can get aggregated and/or shipped out independently by existing log aggregation systems like logrotate+rsync or (r)syslog, etc., by configuring the appropriate loggers in the controller configuration file.
  • Separate authentication systems for intra-cluster authentication and for the edge. This means that even if external auth is compromised (or vice versa), the system will keep working as is.
  • A shared secret is used for intra-cluster authentication.
  • Dynamically generated tokens are injected into container instances for seamless sibling discovery. This provides a way for developers to implement clustering mechanisms for frameworks like Hazelcast (provided already).
+

Observability

+
  • A real-time event stream from the controller can be used by any other event-driven system, like drove-gateway, etc., to refresh upstream topology.
  • Metrics are available on admin ports for both the controllers and executors. Something like Telegraf can be used to collect and send them to the centralised metrics management system of your choice. (At PhonePe, we use Telegraf, which pushes the metrics to our custom metrics collection service, backed by a modified version of OpenTSDB. We use Grafana to visualize the same metrics.)
  • Published metrics from controllers include system health metrics about the controllers themselves.
  • Published metrics from executors contain system health metrics as well as other metrics around the containers running on them. This includes, but is not limited to, CPU, memory, and network usage.
+

Unsupported Features

+
  • Auto-scaling of containers: At PhonePe, we have an extensive metrics ingestion system and an auto-scaler that works on a proprietary algorithm to scale containers up and down based on those metrics. This works independently of the orchestration system in play (we were on Drove and Mesos both at the same time during the transition period) and calls APIs on the deployment system, which handles scaling operations independently. Any implementation at the orchestration level would not work, as the contributors to the metrics might be running on different clusters, and scaling them independently would bring in more complexity rather than solving for simplicity.
  • Network-level traffic control: At PhonePe, network security is handled at the VRF level, and container-level access control is not needed. All services are already integrated with the OAuth2-compliant internal authentication and authorization system and perform security checks for the same at the application layer. As a matter of fact, we want containers to be as close to the raw network level as possible to ensure we can extract the highest level of network performance possible, other things being constant.
  • End-to-end configuration management: At this point in time, app/task configuration is maintained independently at PhonePe, subject to our approval workflows based on the compliance domain for the application, which can be static or dynamic and may be tied to deployments.
  • Multi-DC clusters: We have not tested a single Drove cluster spanning multiple data centers.
+

Terminology

+

Before we delve into the details, let's get acquainted with the required terminology:

+
  • Application - A service running on the cluster. Such a service can have an exposed port and will have an automatically configured virtual host on Drove Gateway.
  • Task - A transient container-based task.
  • Controller Nodes - The brains of the cluster. Only one controller is the leader, and hence the decision maker, in the system.
  • Executor Nodes - The workhorse nodes of the cluster where the actual containers are run.
  • Drove CLI - A command line client to interact with the cluster.
  • Drove Gateway - Used to provide ingress to the leader and containers running on the cluster.
  • Epoch - A cron-type scheduler to spin up tasks on a Drove cluster based on pre-defined schedules.
+

GitHub Repositories

  • Drove orchestrator: https://github.com/PhonePe/drove-orchestrator
  • Drove CLI: https://github.com/PhonePe/drove-cli
  • Epoch: https://github.com/PhonePe/epoch
  • Epoch CLI: https://github.com/phonepe/epoch-cli

License

+

Apache 2

+ + + + + + + + + + + + + + + + + + +
+
+ + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + \ No newline at end of file diff --git a/search/search_index.js b/search/search_index.js new file mode 100644 index 0000000..c959cef --- /dev/null +++ b/search/search_index.js @@ -0,0 +1 @@ +var __index = {"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Introduction","text":"

Drove is a container orchestrator built at PhonePe. It is focused on simplicity, container performance, and easy operations.

"},{"location":"index.html#features","title":"Features","text":"

The following sections go over the features.

"},{"location":"index.html#functional","title":"Functional","text":""},{"location":"index.html#operations","title":"Operations","text":""},{"location":"index.html#performance","title":"Performance","text":""},{"location":"index.html#resilience","title":"Resilience","text":""},{"location":"index.html#security","title":"Security","text":""},{"location":"index.html#observability","title":"Observability","text":""},{"location":"index.html#unsupported-features","title":"Unsupported Features","text":""},{"location":"index.html#terminology","title":"Terminology","text":"

Before we delve into the details, let's get acquainted with the required terminology:

"},{"location":"index.html#github-repositories","title":"Github Repositories","text":""},{"location":"index.html#license","title":"License","text":"

Apache 2

"},{"location":"getting-started.html","title":"Getting Started","text":"

To get a trivial cluster up and running on a machine, the compose file can be used.

"},{"location":"getting-started.html#update-etc-hosts-to-interact-wih-nginx","title":"Update etc hosts to interact wih nginx","text":"

Add the following lines to /etc/hosts

127.0.0.1   drove.local\n127.0.0.1   testapp.local\n

"},{"location":"getting-started.html#download-the-compose-file","title":"Download the compose file","text":"
wget https://raw.githubusercontent.com/PhonePe/drove-orchestrator/master/compose/compose.yaml\n
"},{"location":"getting-started.html#bringing-up-a-demo-cluster","title":"Bringing up a demo cluster","text":"

cd compose\ndocker-compose up\n
This will start zookeeper,drove controller, executor and nginx/drove-gateway. The following ports are used:

Drove credentials would be admin/admin and guest/guest for read-write and read-only permissions respectively.

You should be able to access the UI at http://drove.local:7000

"},{"location":"getting-started.html#install-drove-cli","title":"Install drove-cli","text":"

Install the CLI for drove

pip install drove-cli\n

"},{"location":"getting-started.html#create-client-configuration","title":"Create Client Configuration","text":"

Put the following in ${HOME}/.drove

[local]\nendpoint = http://drove.local:4000\nusername = admin\npassword = admin\n
"},{"location":"getting-started.html#deploy-an-app","title":"Deploy an app","text":"

Get the sample app spec:

wget https://raw.githubusercontent.com/PhonePe/drove-cli/master/sample/test_app.json\n

Now deploy the app.

drove -c local apps create test_app.json\n

"},{"location":"getting-started.html#scale-the-app","title":"Scale the app","text":"

drove -c local apps scale TEST_APP-1 1 -w\n
This would expose the app as testapp.local. Endpoint would be: http://testapp.local:7000.

You can test the app by running the following commands:

curl http://testapp.local:7000/\ncurl http://testapp.local:7000/files/drove.txt\n
"},{"location":"getting-started.html#suspend-and-destroy-the-app","title":"Suspend and destroy the app","text":"
drove -c local apps scale TEST_APP-1 0 -w\ndrove -c local apps destroy TEST_APP-1\n
"},{"location":"getting-started.html#accessing-the-code","title":"Accessing the code","text":"

Code is hosted on github.

Cloning everything:

git clone git@github.com:PhonePe/drove-orchestrator.git\ngit submodule init\ngit submodule update\n
"},{"location":"apis/index.html","title":"Introduction","text":"

This section lists all the APIs that a user can communicate with.

"},{"location":"apis/index.html#making-an-api-call","title":"Making an API call","text":"

Use a standard HTTP client in the language of your choice to make a call to the leader controller (the cluster virtual host exposed by drove-gateway-nginx).

Tip

In case you are using Java, we recommend using the drove-client library along with the http-transport.

If multiple controllers endpoints are provided, the client will track the leader automatically. This will reduce your dependency on drove-gateway.

"},{"location":"apis/index.html#authentication","title":"Authentication","text":"

Drove uses basic auth for authentication. (You can extend to use any other auth format like OAuth). The basic auth credentials need to be sent out in the standard format in the Authorization header.

"},{"location":"apis/index.html#response-format","title":"Response format","text":"

The response format is standard for all API calls:

{\n    \"status\": \"SUCCESS\",//(1)!\n    \"data\": {//(2)!\n        \"taskId\": \"T0012\"\n    },\n    \"message\": \"success\"//(3)!\n}\n
  1. SUCCESS or FAILURE as the case may be.
  2. Content of this field is contextual to the response.
  3. Will contain success if the call was successful or relevant error message.

Warning

APIs will return relevant HTTP status codes in case of error (for example 400 for validation errors, 401 for authentication failure). However, you must always ensure that the status field is set to SUCCESS for assuming the api call is succesful, even when HTTP status code is 2xx.

APIs in Drove belong to the following major classes:

Tip

Response models for these apis can be found in drove-models

Note

There are no publicly accessible APIs exposed by individual executors.

"},{"location":"apis/application.html","title":"Application Management","text":""},{"location":"apis/application.html#issue-application-operation-command","title":"Issue application operation command","text":"

POST /apis/v1/applications/operations

Request

curl --location 'http://drove.local:7000/apis/v1/operations' \\\n--header 'Content-Type: application/json' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data '{\n    \"type\": \"SCALE\",\n    \"appId\": \"TEST_APP-1\",\n    \"requiredInstances\": 1,\n    \"opSpec\": {\n        \"timeout\": \"1m\",\n        \"parallelism\": 20,\n        \"failureStrategy\": \"STOP\"\n    }\n}'\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

Tip

Relevant payloads for application commands can be found in application operations section.

"},{"location":"apis/application.html#cancel-currently-running-operation","title":"Cancel currently running operation","text":"

POST /apis/v1/applications/operations/{appId}/cancel

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/operations/TEST_APP/cancel' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-list-of-applications","title":"Get list of applications","text":"

GET /apis/v1/applications

Request

curl --location 'http://drove.local:7000/apis/v1/applications' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"TEST_APP-1\": {\n            \"id\": \"TEST_APP-1\",\n            \"name\": \"TEST_APP\",\n            \"requiredInstances\": 0,\n            \"healthyInstances\": 0,\n            \"totalCPUs\": 0,\n            \"totalMemory\": 0,\n            \"state\": \"MONITORING\",\n            \"created\": 1719826995764,\n            \"updated\": 1719892126096\n        }\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-info-for-an-app","title":"Get info for an app","text":"

GET /apis/v1/applications/{id}

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"id\": \"TEST_APP-1\",\n        \"name\": \"TEST_APP\",\n        \"requiredInstances\": 1,\n        \"healthyInstances\": 1,\n        \"totalCPUs\": 1,\n        \"totalMemory\": 128,\n        \"state\": \"RUNNING\",\n        \"created\": 1719826995764,\n        \"updated\": 1719892279019\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-raw-json-specs","title":"Get raw JSON specs","text":"

GET /apis/v1/applications/{id}/spec

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/spec' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"name\": \"TEST_APP\",\n        \"version\": \"1\",\n        \"executable\": {\n            \"type\": \"DOCKER\",\n            \"url\": \"ghcr.io/appform-io/perf-test-server-httplib\",\n            \"dockerPullTimeout\": \"100 seconds\"\n        },\n        \"exposedPorts\": [\n            {\n                \"name\": \"main\",\n                \"port\": 8000,\n                \"type\": \"HTTP\"\n            }\n        ],\n        \"volumes\": [],\n        \"configs\": [\n            {\n                \"type\": \"INLINE\",\n                \"localFilename\": \"/testfiles/drove.txt\",\n                \"data\": \"\"\n            }\n        ],\n        \"type\": \"SERVICE\",\n        \"resources\": [\n            {\n                \"type\": \"CPU\",\n                \"count\": 1\n            },\n            {\n                \"type\": \"MEMORY\",\n                \"sizeInMB\": 128\n            }\n        ],\n        \"placementPolicy\": {\n            \"type\": \"ANY\"\n        },\n        \"healthcheck\": {\n            \"mode\": {\n                \"type\": \"HTTP\",\n                \"protocol\": \"HTTP\",\n                \"portName\": \"main\",\n                \"path\": \"/\",\n                \"verb\": \"GET\",\n                \"successCodes\": [\n                    200\n                ],\n                \"payload\": \"\",\n                \"connectionTimeout\": \"1 second\",\n                \"insecure\": false\n            },\n            \"timeout\": \"1 second\",\n            \"interval\": \"5 seconds\",\n            \"attempts\": 3,\n            \"initialDelay\": \"0 seconds\"\n        },\n        \"readiness\": {\n            \"mode\": {\n                \"type\": \"HTTP\",\n                \"protocol\": \"HTTP\",\n                \"portName\": \"main\",\n                \"path\": \"/\",\n                \"verb\": \"GET\",\n                \"successCodes\": [\n                    200\n                ],\n                \"payload\": \"\",\n                \"connectionTimeout\": \"1 second\",\n                \"insecure\": false\n            },\n            \"timeout\": \"1 second\",\n            \"interval\": \"3 seconds\",\n            \"attempts\": 3,\n            \"initialDelay\": \"0 seconds\"\n        },\n        \"tags\": {\n            \"superSpecialApp\": \"yes_i_am\",\n            \"say_my_name\": \"heisenberg\"\n        },\n        \"env\": {\n            \"CORES\": \"8\"\n        },\n        \"exposureSpec\": {\n            \"vhost\": \"testapp.local\",\n            \"portName\": \"main\",\n            \"mode\": \"ALL\"\n        },\n        \"preShutdown\": {\n            \"hooks\": [\n                {\n                    \"type\": \"HTTP\",\n                    \"protocol\": \"HTTP\",\n                    \"portName\": \"main\",\n                    \"path\": \"/\",\n                    \"verb\": \"GET\",\n                    \"successCodes\": [\n                        200\n                    ],\n                    \"payload\": \"\",\n                    \"connectionTimeout\": \"1 second\",\n                    \"insecure\": false\n                }\n            ],\n            \"waitBeforeKill\": \"3 seconds\"\n        }\n    },\n    \"message\": \"success\"\n}\n

Note

configs section data will not be returned by any api calls

"},{"location":"apis/application.html#get-list-of-currently-active-instances","title":"Get list of currently active instances","text":"

GET /apis/v1/applications/{id}/instances

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/instances' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"appId\": \"TEST_APP-1\",\n            \"appName\": \"TEST_APP\",\n            \"instanceId\": \"AI-58eb1111-8c2c-4ea2-a159-8fc68010a146\",\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"localInfo\": {\n                \"hostname\": \"ppessdev\",\n                \"ports\": {\n                    \"main\": {\n                        \"containerPort\": 8000,\n                        \"hostPort\": 33857,\n                        \"portType\": \"HTTP\"\n                    }\n                }\n            },\n            \"resources\": [\n                {\n                    \"type\": \"CPU\",\n                    \"cores\": {\n                        \"0\": [\n                            2\n                        ]\n                    }\n                },\n                {\n                    \"type\": \"MEMORY\",\n                    \"memoryInMB\": {\n                        \"0\": 128\n                    }\n                }\n            ],\n            \"state\": \"HEALTHY\",\n            \"metadata\": {},\n            \"errorMessage\": \"\",\n            \"created\": 1719892354194,\n            \"updated\": 1719893180105\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-list-of-old-instances","title":"Get list of old instances","text":"

GET /apis/v1/applications/{id}/instances/old

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/instances/old' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"appId\": \"TEST_APP-1\",\n            \"appName\": \"TEST_APP\",\n            \"instanceId\": \"AI-869e34ed-ebf3-4908-bf48-719475ca5640\",\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"resources\": [\n                {\n                    \"type\": \"CPU\",\n                    \"cores\": {\n                        \"0\": [\n                            2\n                        ]\n                    }\n                },\n                {\n                    \"type\": \"MEMORY\",\n                    \"memoryInMB\": {\n                        \"0\": 128\n                    }\n                }\n            ],\n            \"state\": \"STOPPED\",\n            \"metadata\": {},\n            \"errorMessage\": \"Error while pulling image ghcr.io/appform-io/perf-test-server-httplib: Status 500: {\\\"message\\\":\\\"Get \\\\\\\"https://ghcr.io/v2/\\\\\\\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\\\"}\\n\",\n            \"created\": 1719892279039,\n            \"updated\": 1719892354099\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-info-for-an-instance","title":"Get info for an instance","text":"

GET /apis/v1/applications/{appId}/instances/{instanceId}

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/instances/AI-58eb1111-8c2c-4ea2-a159-8fc68010a146' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\",\n        \"appName\": \"TEST_APP\",\n        \"instanceId\": \"AI-58eb1111-8c2c-4ea2-a159-8fc68010a146\",\n        \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n        \"localInfo\": {\n            \"hostname\": \"ppessdev\",\n            \"ports\": {\n                \"main\": {\n                    \"containerPort\": 8000,\n                    \"hostPort\": 33857,\n                    \"portType\": \"HTTP\"\n                }\n            }\n        },\n        \"resources\": [\n            {\n                \"type\": \"CPU\",\n                \"cores\": {\n                    \"0\": [\n                        2\n                    ]\n                }\n            },\n            {\n                \"type\": \"MEMORY\",\n                \"memoryInMB\": {\n                    \"0\": 128\n                }\n            }\n        ],\n        \"state\": \"HEALTHY\",\n        \"metadata\": {},\n        \"errorMessage\": \"\",\n        \"created\": 1719892354194,\n        \"updated\": 1719893440105\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#application-endpoints","title":"Application Endpoints","text":"

GET /apis/v1/endpoints

Info

This API provides up-to-date information about the host and port information about application instances running on the cluster. This information can be used for Service Discovery systems to keep their information in sync with changes in the topology of applications running on the cluster.

Tip

Any tag specified in the application specification is also exposed on endpoint. This can be used to implement complicated routing logic if needed in the NGinx template on Drove Gateway.

Request

curl --location 'http://drove.local:7000/apis/v1/endpoints' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"appId\": \"TEST_APP-1\",\n            \"vhost\": \"testapp.local\",\n            \"tags\": {\n                \"superSpecialApp\": \"yes_i_am\",\n                \"say_my_name\": \"heisenberg\"\n            },\n            \"hosts\": [\n                {\n                    \"host\": \"ppessdev\",\n                    \"port\": 44315,\n                    \"portType\": \"HTTP\"\n                }\n            ]\n        },\n        {\n            \"appId\": \"TEST_APP-2\",\n            \"vhost\": \"testapp.local\",\n            \"tags\": {\n                \"superSpecialApp\": \"yes_i_am\",\n                \"say_my_name\": \"heisenberg\"\n            },\n            \"hosts\": [\n                {\n                    \"host\": \"ppessdev\",\n                    \"port\": 46623,\n                    \"portType\": \"HTTP\"\n                }\n            ]\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html","title":"Cluster Management","text":""},{"location":"apis/cluster.html#ping-api","title":"Ping API","text":"

GET /apis/v1/ping

Request

curl --location 'http://drove.local:7000/apis/v1/ping' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": \"pong\",\n    \"message\": \"success\"\n}\n

Tip

Use this api call to determine the leader in a cluster. This api will return a HTTP 200 only for the leader controller. All other controllers in the cluster will return 4xx for this api call.

"},{"location":"apis/cluster.html#cluster-management_1","title":"Cluster Management","text":""},{"location":"apis/cluster.html#get-current-cluster-state","title":"Get current cluster state","text":"

GET /apis/v1/cluster

Request

curl --location 'http://drove.local:7000/apis/v1/cluster' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"leader\": \"ppessdev:4000\",\n        \"state\": \"NORMAL\",\n        \"numExecutors\": 1,\n        \"numApplications\": 1,\n        \"numActiveApplications\": 1,\n        \"freeCores\": 9,\n        \"usedCores\": 1,\n        \"totalCores\": 10,\n        \"freeMemory\": 18898,\n        \"usedMemory\": 128,\n        \"totalMemory\": 19026\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#set-maintenance-mode-on-cluster","title":"Set maintenance mode on cluster","text":"

POST /apis/v1/cluster/maintenance/set

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/set' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"state\": \"MAINTENANCE\",\n        \"updated\": 1719897526772\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#remove-maintenance-mode-from-cluster","title":"Remove maintenance mode from cluster","text":"

POST /apis/v1/cluster/maintenance/unset

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/unset' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"state\": \"NORMAL\",\n        \"updated\": 1719897573226\n    },\n    \"message\": \"success\"\n}\n

Warning

Cluster will remain in maintenance mode for some time (about 2 minutes) internally even after maintenance mode is removed.

"},{"location":"apis/cluster.html#executor-management","title":"Executor Management","text":""},{"location":"apis/cluster.html#get-list-of-executors","title":"Get list of executors","text":"

GET /apis/v1/cluster/executors

Request

curl --location 'http://drove.local:7000/apis/v1/cluster/executors' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"hostname\": \"ppessdev\",\n            \"port\": 3000,\n            \"transportType\": \"HTTP\",\n            \"freeCores\": 9,\n            \"usedCores\": 1,\n            \"freeMemory\": 18898,\n            \"usedMemory\": 128,\n            \"tags\": [\n                \"ppessdev\"\n            ],\n            \"state\": \"ACTIVE\"\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#get-detailed-info-for-one-executor","title":"Get detailed info for one executor","text":"

GET /apis/v1/cluster/executors/{id}

Request

curl --location 'http://drove.local:7000/apis/v1/cluster/executors/a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"type\": \"EXECUTOR\",\n        \"hostname\": \"ppessdev\",\n        \"port\": 3000,\n        \"transportType\": \"HTTP\",\n        \"updated\": 1719897100104,\n        \"state\": {\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"cpus\": {\n                \"type\": \"CPU\",\n                \"freeCores\": {\n                    \"0\": [\n                        3,\n                        4,\n                        5,\n                        6,\n                        7,\n                        8,\n                        9,\n                        10,\n                        11\n                    ]\n                },\n                \"usedCores\": {\n                    \"0\": [\n                        2\n                    ]\n                }\n            },\n            \"memory\": {\n                \"type\": \"MEMORY\",\n                \"freeMemory\": {\n                    \"0\": 18898\n                },\n                \"usedMemory\": {\n                    \"0\": 128\n                }\n            }\n        },\n        \"instances\": [\n            {\n                \"appId\": \"TEST_APP-1\",\n                \"appName\": \"TEST_APP\",\n                \"instanceId\": \"AI-58eb1111-8c2c-4ea2-a159-8fc68010a146\",\n                \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n                \"localInfo\": {\n                    \"hostname\": \"ppessdev\",\n                    \"ports\": {\n                        \"main\": {\n                            \"containerPort\": 8000,\n                            \"hostPort\": 33857,\n                            \"portType\": \"HTTP\"\n                        }\n                    }\n                },\n                \"resources\": [\n                    {\n                        \"type\": \"CPU\",\n                        \"cores\": {\n                            \"0\": [\n                                2\n                            ]\n                        }\n                    },\n                    {\n                        \"type\": \"MEMORY\",\n                        \"memoryInMB\": {\n                            \"0\": 128\n                        }\n                    }\n                ],\n                \"state\": \"HEALTHY\",\n                \"metadata\": {},\n                \"errorMessage\": \"\",\n                \"created\": 1719892354194,\n                \"updated\": 1719897100104\n            }\n        ],\n        \"tasks\": [],\n        \"tags\": [\n            \"ppessdev\"\n        ],\n        \"blacklisted\": false\n    },\n    \"message\": \"success\"\n}\n
"},{"location":"apis/cluster.html#take-executor-out-of-rotation","title":"Take executor out of rotation","text":"

POST /apis/v1/cluster/executors/blacklist

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/blacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Note

Unlike other POST apis, the executors to be blacklisted are passed as query parameter id. To blacklist multiple executors, pass .../blacklist?id=<id1>&id=<id2>...

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"successful\": [\n            \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\"\n        ],\n        \"failed\": []\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#bring-executor-back-into-rotation","title":"Bring executor back into rotation","text":"

POST /apis/v1/cluster/executors/unblacklist

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/unblacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Note

Unlike other POST apis, the executors to be un-blacklisted are passed as query parameter id. To un-blacklist multiple executors, pass .../unblacklist?id=<id1>&id=<id2>...

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"successful\": [\n            \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\"\n        ],\n        \"failed\": []\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#drove-cluster-events","title":"Drove Cluster Events","text":"

The following APIs can be used to monitor events on Drove. If the data needs to be consumed, the /latest API should be used. For simply knowing if an event of a certain type has occurred or not, the /summary is sufficient.

"},{"location":"apis/cluster.html#event-list","title":"Event List","text":"

GET /apis/v1/cluster/events/latest

Request

curl --location 'http://drove.local:7000/apis/v1/cluster/events/latest?size=1024&lastSyncTime=0' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"events\": [\n            {\n                \"metadata\": {\n                    \"CURRENT_INSTANCES\": 0,\n                    \"APP_ID\": \"TEST_APP-1\",\n                    \"PLACEMENT_POLICY\": \"ANY\",\n                    \"APP_VERSION\": \"1\",\n                    \"CPU_COUNT\": 1,\n                    \"CURRENT_STATE\": \"RUNNING\",\n                    \"PORTS\": \"main:8000:http\",\n                    \"MEMORY\": 128,\n                    \"EXECUTABLE\": \"ghcr.io/appform-io/perf-test-server-httplib\",\n                    \"VHOST\": \"testapp.local\",\n                    \"APP_NAME\": \"TEST_APP\"\n                },\n                \"type\": \"APP_STATE_CHANGE\",\n                \"id\": \"a2b7d673-2bc2-4084-8415-d8d37cafa63d\",\n                \"time\": 1719977632050\n            },\n            {\n                \"metadata\": {\n                    \"APP_NAME\": \"TEST_APP\",\n                    \"APP_ID\": \"TEST_APP-1\",\n                    \"PORTS\": \"main:44315:http\",\n                    \"EXECUTOR_ID\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n                    \"EXECUTOR_HOST\": \"ppessdev\",\n                    \"CREATED\": 1719977629042,\n                    \"INSTANCE_ID\": \"AI-5efbb94f-835c-4c62-a073-a68437e60339\",\n                    \"CURRENT_STATE\": \"HEALTHY\"\n                },\n                \"type\": \"INSTANCE_STATE_CHANGE\",\n                \"id\": \"55d5876f-94ac-4c5d-a580-9c3b296add46\",\n                \"time\": 1719977631534\n            }\n        ],\n        \"lastSyncTime\": 1719977632050//(1)!\n    },\n    \"message\": \"success\"\n}\n

  1. Pass this as the parameter lastSyncTime in the next call to the events API to receive the latest events.

| Query Parameter | Validation | Description |
|-----------------|------------|-------------|
| lastSyncTime | +ve long range | Time when the last sync call happened on the server. Defaults to 0 (initial sync). |
| size | 1-1024 | Number of latest events to return. Defaults to 1024. We recommend leaving this as is. |

"},{"location":"apis/cluster.html#event-summary","title":"Event Summary","text":"

GET /apis/v1/cluster/events/summary

Request

curl --location 'http://drove.local:7000/apis/v1/cluster/events/summary?lastSyncTime=0' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n
Response
{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"eventsCount\": {\n            \"INSTANCE_STATE_CHANGE\": 8,\n            \"APP_STATE_CHANGE\": 17,\n            \"EXECUTOR_BLACKLISTED\": 1,\n            \"EXECUTOR_UN_BLACKLISTED\": 1\n        },\n        \"lastSyncTime\": 1719977632050//(1)!\n    },\n    \"message\": \"success\"\n}\n

  1. Pass this as the parameter lastSyncTime in the next call to the events API to receive the latest events.
"},{"location":"apis/cluster.html#continuous-monitoring-for-events","title":"Continuous monitoring for events","text":"

This is applicable to both the APIs listed above.

Info

Model for the events can be found here.

Tip

Java programs should definitely look at using the event listener library to listen to cluster events
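For non-Java clients, the polling contract is straightforward: call the events API with the last lastSyncTime you received and feed the returned value into the next call. A minimal Python sketch, assuming the requests library and the endpoint/credentials from the examples above:

```python
import time

import requests

BASE_URL = "http://drove.local:7000"
AUTH = ("admin", "admin")

last_sync_time = 0  # 0 means initial sync
while True:
    response = requests.get(
        f"{BASE_URL}/apis/v1/cluster/events/latest",
        params={"size": 1024, "lastSyncTime": last_sync_time},
        auth=AUTH,
    )
    response.raise_for_status()
    data = response.json()["data"]
    for event in data["events"]:
        print(event["type"], event["id"], event["time"])
    # The returned lastSyncTime becomes the parameter of the next call
    last_sync_time = data["lastSyncTime"]
    time.sleep(5)
```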

"},{"location":"apis/logs.html","title":"Log Related APIs","text":""},{"location":"apis/logs.html#get-list-if-log-files","title":"Get list if log files","text":"

Application GET /apis/v1/logfiles/applications/{appId}/{instanceId}/list

Task GET /apis/v1/logfiles/tasks/{sourceAppName}/{taskId}/list

Request

curl --location 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/list' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"files\": [\n        \"output.log-2024-07-04\",\n        \"output.log-2024-07-03\",\n        \"output.log\"\n    ]\n}\n

"},{"location":"apis/logs.html#download-log-files","title":"Download Log Files","text":"

Application GET /apis/v1/logfiles/applications/{appId}/{instanceId}/download/{fileName}

Task GET /apis/v1/logfiles/tasks/{sourceAppName}/{taskId}/download/{fileName}

Request

curl --location 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/download/output.log' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

Note

The Content-Disposition header is set properly to the actual filename. For the above example it would be set to attachment; filename=output.log.
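A small Python sketch for saving a downloaded log file, assuming the requests library; the Content-Disposition parsing here is deliberately naive:

```python
import requests

url = ("http://drove.local:7000/apis/v1/logfiles/applications/"
       "TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/download/output.log")
response = requests.get(url, auth=("admin", "admin"), stream=True)
response.raise_for_status()

# Naive parse of: attachment; filename=output.log
disposition = response.headers.get("Content-Disposition", "")
filename = disposition.split("filename=")[-1].strip() or "download.log"

with open(filename, "wb") as out:
    for chunk in response.iter_content(chunk_size=8192):
        out.write(chunk)
```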

"},{"location":"apis/logs.html#read-chunks-from-log","title":"Read chunks from log","text":"

Application GET /apis/v1/logfiles/applications/{appId}/{instanceId}/read/{fileName}

Task GET /apis/v1/logfiles/tasks/{sourceAppName}/{taskId}/read/{fileName}

| Query Parameter | Validation | Description |
|-----------------|------------|-------------|
| offset | Default -1, should be a positive number | The offset in the file to read from. |
| length | Should be a positive number | Number of bytes to read. |

Request

curl --location 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/read/output.log' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"data\": \"\", //(1)!\n    \"offset\": 43318 //(2)!\n}\n

  1. Will contain raw data or an empty string (in the case of the first call)
  2. Offset to be passed in the next call
"},{"location":"apis/logs.html#how-to-tail-logs","title":"How to tail logs","text":"
  1. Have a fixed buffer size in mind, e.g. 1024 or 4096 bytes.
  2. Make a call to the /read API with offset=-1 and length set to the buffer size.
  3. The call will return no data, but will have a valid offset.
  4. Pass this offset in the next call; data will be returned if available (or empty). The response will also return the offset to pass in the next call.
  5. The data returned might be empty or less than length depending on availability.
  6. Keep repeating step (4) to keep tailing the log, as shown in the sketch below.
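Putting the steps together, a minimal Python tailer might look like the following (a sketch assuming the requests library and the instance used in the examples above):

```python
import time

import requests

BASE_URL = "http://drove.local:7000"
AUTH = ("admin", "admin")
LOG_PATH = ("/apis/v1/logfiles/applications/TEST_APP-1/"
            "AI-5efbb94f-835c-4c62-a073-a68437e60339/read/output.log")
BUFFER_SIZE = 4096  # fixed buffer size, per step 1

offset = -1  # first call returns no data, only a valid offset (steps 2-3)
while True:
    response = requests.get(
        BASE_URL + LOG_PATH,
        params={"offset": offset, "length": BUFFER_SIZE},
        auth=AUTH,
    )
    response.raise_for_status()
    body = response.json()
    if body["data"]:  # may be empty or shorter than BUFFER_SIZE (step 5)
        print(body["data"], end="")
    offset = body["offset"]  # pass this offset in the next call (step 4)
    time.sleep(1)
```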

Warning

"},{"location":"apis/task.html","title":"Task Management","text":""},{"location":"apis/task.html#issue-task-operation","title":"Issue task operation","text":"

POST /apis/v1/tasks/operations

Request

curl --location 'http://drove.local:7000/apis/v1/tasks/operations' \\\n--header 'Content-Type: application/json' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data '{\n    \"type\": \"KILL\",\n    \"sourceAppName\" : \"TEST_APP\",\n    \"taskId\" : \"T0012\",\n    \"opSpec\": {\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}'\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"taskId\": \"T0012\"\n    },\n    \"message\": \"success\"\n}\n

Tip

Relevant payloads for task commands can be found in task operations section.

"},{"location":"apis/task.html#search-for-task","title":"Search for task","text":"

POST /apis/v1/tasks/search

"},{"location":"apis/task.html#list-all-tasks","title":"List all tasks","text":"

GET /apis/v1/tasks

Request

curl --location 'http://drove.local:7000/apis/v1/tasks' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"sourceAppName\": \"TEST_APP\",\n            \"taskId\": \"T0013\",\n            \"instanceId\": \"TI-c2140806-2bb5-4ed3-9bb9-0c0c5fd0d8d6\",\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"hostname\": \"ppessdev\",\n            \"executable\": {\n                \"type\": \"DOCKER\",\n                \"url\": \"ghcr.io/appform-io/test-task\",\n                \"dockerPullTimeout\": \"100 seconds\"\n            },\n            \"resources\": [\n                {\n                    \"type\": \"CPU\",\n                    \"cores\": {\n                        \"0\": [\n                            2\n                        ]\n                    }\n                },\n                {\n                    \"type\": \"MEMORY\",\n                    \"memoryInMB\": {\n                        \"0\": 512\n                    }\n                }\n            ],\n            \"volumes\": [],\n            \"env\": {\n                \"ITERATIONS\": \"10\"\n            },\n            \"state\": \"RUNNING\",\n            \"metadata\": {},\n            \"errorMessage\": \"\",\n            \"created\": 1719827035480,\n            \"updated\": 1719827038414\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/task.html#get-task-instance-details","title":"Get Task Instance Details","text":"

GET /apis/v1/tasks/{sourceAppName}/instances/{taskId}

Request

curl --location 'http://drove.local:7000/apis/v1/tasks/TEST_APP/instances/T0012' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"sourceAppName\": \"TEST_APP\",\n        \"taskId\": \"T0012\",\n        \"instanceId\": \"TI-6cf36f5c-6480-4ed5-9e2d-f79d9648529a\",\n        \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n        \"hostname\": \"ppessdev\",\n        \"executable\": {\n            \"type\": \"DOCKER\",\n            \"url\": \"ghcr.io/appform-io/test-task\",\n            \"dockerPullTimeout\": \"100 seconds\"\n        },\n        \"resources\": [\n            {\n                \"type\": \"CPU\",\n                \"cores\": {\n                    \"0\": [\n                        3\n                    ]\n                }\n            },\n            {\n                \"type\": \"MEMORY\",\n                \"memoryInMB\": {\n                    \"0\": 512\n                }\n            }\n        ],\n        \"volumes\": [],\n        \"env\": {\n            \"ITERATIONS\": \"10\"\n        },\n        \"state\": \"STOPPED\",\n        \"metadata\": {},\n        \"taskResult\": {\n            \"status\": \"SUCCESSFUL\",\n            \"exitCode\": 0\n        },\n        \"errorMessage\": \"\",\n        \"created\": 1719823470267,\n        \"updated\": 1719823483836\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"applications/index.html","title":"Introduction","text":"

An application is a virtual representation of a running service in the system.

Running containers for an application are called application instances.

An application specification contains the following details about the application:

Info

Once a spec is registered to the cluster, it cannot be changed

"},{"location":"applications/index.html#application-id","title":"Application ID","text":"

Once an application is created on the cluster, an Application id is generated. The format of this id currently is: {name}-{version}. All further operations to be done on the application will need to refer to it by this ID.

"},{"location":"applications/index.html#application-states-and-operations","title":"Application States and Operations","text":"

An application on a Drove cluster follows a fixed lifecycle modelled as a state machine. State transitions are triggered by operations. Operations can be issued externally using API calls or may be generated internally by the application monitoring system.

"},{"location":"applications/index.html#states","title":"States","text":"

Applications on a Drove cluster can be one of the following states:

"},{"location":"applications/index.html#operations","title":"Operations","text":"

The following application operations are recognized by Drove:

Tip

All operations can take an optional Cluster Operation Spec which can be used to control the timeout and parallelism of tasks generated by the operation.

"},{"location":"applications/index.html#application-state-machine","title":"Application State Machine","text":"

The following state machine signifies the states and transitions as affected by cluster state and operations issued.

"},{"location":"applications/instances.html","title":"Application Instances","text":"

Application instances are running containers for an application. The state machine for instances is managed in a decentralised manner on the cluster nodes locally, not by the controllers. This includes running health checks, readiness checks and shutdown hooks on the container, container loss detection, and container state recovery on executor service restart.

Regular updates about the instance state are provided by executors to the controllers and are used to keep the application state up-to-date or trigger application operations to bring the applications to stable states.

"},{"location":"applications/instances.html#application-instance-states","title":"Application Instance States","text":"

An application instance can be in one of the following states at one point in time:

"},{"location":"applications/instances.html#application-instance-state-machine","title":"Application Instance State Machine","text":"

Instance state machine transitions might be triggered on receipt of commands issued by the controller or due to internal changes in the container (might have died or started failing health checks) as well as external factors like executor service restarts.

Note

No operations are allowed to be performed on application instances directly through the executor

"},{"location":"applications/operations.html","title":"Application Operations","text":"

This page discusses operations relevant to Application management. Please go over the Application State Machine and Application Instance State Machine to understand the different states an application (and its instances) can be in and how applied operations move an application from one state to another.

Note

Please go through Cluster Op Spec to understand the operation parameters being sent.

Note

Only one operation can be active on a particular {appName,version} combination.

Warning

Only the leader controller will accept and process operations. To avoid confusion, use the controller endpoint exposed by Drove Gateway to issue commands.

"},{"location":"applications/operations.html#how-to-initiate-an-operation","title":"How to initiate an operation","text":"

Tip

Use the Drove CLI to perform all manual operations.

All operations for application lifecycle management need to be issued via a POST HTTP call to the leader controller endpoint on the path /apis/v1/applications/operations. The API will return HTTP 200/OK and a relevant JSON response as payload.

Sample API call:

curl --location 'http://drove.local:7000/apis/v1/applications/operations' \\\n--header 'Content-Type: application/json' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data '{\n    \"type\": \"START_INSTANCES\",\n    \"appId\": \"TEST_APP-3\",\n    \"instances\": 1,\n    \"opSpec\": {\n        \"timeout\": \"5m\",\n        \"parallelism\": 32,\n        \"failureStrategy\": \"STOP\"\n    }\n}'\n

Note

In the above examples, http://drove.local:7000 is the endpoint of the leader. TEST_APP-3 is the Application ID. Authorization is basic auth.

"},{"location":"applications/operations.html#cluster-operation-specification","title":"Cluster Operation Specification","text":"

When an operation is submitted to the cluster, a cluster op spec needs to be specified. This is needed to control different aspects of the operation, such as its parallelism, its timeout and so on.

The following aspects of an operation can be configured:

| Name | Option | Description |
|------|--------|-------------|
| Timeout | timeout | The duration after which Drove considers the operation to have timed out. |
| Parallelism | parallelism | Parallelism of the task. (Range: 1-32) |
| Failure Strategy | failureStrategy | Set this to STOP. |

Note

For internal recovery operations, Drove generates its own operations. For those, Drove applies the following cluster operation spec:

The default operation spec can be configured in the controller configuration file. It is recommended to set the parallelism to something like 8 for faster recovery.

"},{"location":"applications/operations.html#how-to-cancel-an-operation","title":"How to cancel an operation","text":"

Operations can be requested to be cancelled asynchronously. To achieve this, a POST call needs to be made to the leader controller endpoint on the API /apis/v1/operations/{applicationId}/cancel (1).

  1. applicationId is the Application ID for the application
curl --location --request POST 'http://drove.local:7000/apis/v1/operations/TEST_APP-3/cancel' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Warning

Operation cancellation is not instantaneous. Cancellation takes effect only after the current execution of the active operation is complete.

"},{"location":"applications/operations.html#create-an-application","title":"Create an application","text":"

Before deploying containers on the cluster, an application needs to be created.

Preconditions:

State Transition:

To create an application, an Application Spec needs to be created first.

Once ready, the CLI command needs to be issued, or the following payload needs to be sent:

Drove CLI / JSON
drove -c local apps create sample/test_app.json\n

Sample Request Payload

{\n    \"type\": \"CREATE\",\n    \"spec\": {...}, //(1)!\n    \"opSpec\": { //(2)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Spec as mentioned in Application Specification
  2. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"data\" : {\n        \"appId\" : \"TEST_APP-1\"\n    },\n    \"message\" : \"success\",\n    \"status\" : \"SUCCESS\"\n}\n

"},{"location":"applications/operations.html#starting-new-instances-of-an-application","title":"Starting new instances of an application","text":"

New instances can be started by issuing the START_INSTANCES command.

Preconditions - Application must be in one of the following states: MONITORING, RUNNING

State Transition:

The following command/payload will start 2 new instances of the application.

Drove CLI / JSON
drove -c local apps deploy TEST_APP-1 2\n

Sample Request Payload

{\n    \"type\": \"START_INSTANCES\",\n    \"appId\": \"TEST_APP-1\",//(1)!\n    \"instances\": 2,//(2)!\n    \"opSpec\": {//(3)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 32,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Number of instances to be started
  3. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"applications/operations.html#suspending-an-application","title":"Suspending an application","text":"

All instances of an application can be shut down by issuing the SUSPEND command.

Preconditions - Application must be in one of the following states: MONITORING, RUNNING

State Transition:

The following command/payload will suspend all instances of the application.

Drove CLI / JSON
drove -c local apps suspend TEST_APP-1\n

Sample Request Payload

{\n    \"type\": \"SUSPEND\",\n    \"appId\": \"TEST_APP-1\",//(1)!\n    \"opSpec\": {//(2)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 32,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"applications/operations.html#scaling-the-application-up-or-down","title":"Scaling the application up or down","text":"

Scaling the application to the required number of containers can be achieved using the SCALE command. The application can be scaled either up or down using this command.

Preconditions - Application must be in one of the following states: MONITORING, RUNNING

State Transition:

Drove CLI / JSON
drove -c local apps scale TEST_APP-1 2\n

Sample Request Payload

{\n    \"type\": \"SCALE\",\n    \"appId\": \"TEST_APP-1\", //(3)!\n    \"requiredInstances\": 2, //(1)!\n    \"opSpec\": { //(2)!\n        \"timeout\": \"1m\",\n        \"parallelism\": 20,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Absolute number of instances to be maintained on the cluster for the application
  2. Operation spec as mentioned in Cluster Op Spec
  3. Application ID

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

Note

During scale down, older instances are stopped first

Tip

If implementing automation on top of Drove APIs, just use the SCALE command to scale up or down instead of using START_INSTANCES or SUSPEND separately.
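For example, an autoscaler only ever needs to compute the desired instance count and submit a SCALE operation; Drove works out whether that means starting or stopping containers. A minimal Python sketch, assuming the requests library and the leader endpoint/credentials used throughout (the helper name is hypothetical):

```python
import requests

# Hypothetical helper: converge an app to the desired instance count.
def scale_app(base_url, app_id, required_instances, auth):
    payload = {
        "type": "SCALE",
        "appId": app_id,
        "requiredInstances": required_instances,
        "opSpec": {
            "timeout": "5m",
            "parallelism": 20,
            "failureStrategy": "STOP",
        },
    }
    response = requests.post(
        f"{base_url}/apis/v1/applications/operations",
        json=payload,
        auth=auth,
    )
    response.raise_for_status()
    return response.json()

scale_app("http://drove.local:7000", "TEST_APP-1", 2, auth=("admin", "admin"))
```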

"},{"location":"applications/operations.html#restarting-an-application","title":"Restarting an application","text":"

An application can be restarted by issuing the REPLACE_INSTANCES operation. In this case, clusterOpSpec.parallelism number of new containers are spun up first, and then an equivalent number of old ones are spun down. This ensures that the cluster maintains enough capacity to handle incoming traffic while the restart is underway.

Warning

If the cluster does not have sufficient capacity to spin up new containers, this operation will get stuck. So adjust your parallelism accordingly.

Preconditions - Application must be in RUNNING state.

State Transition:

Drove CLI / JSON
drove -c local apps restart TEST_APP-1\n

Sample Request Payload

{\n    \"type\": \"REPLACE_INSTANCES\",\n    \"appId\": \"TEST_APP-1\", //(1)!\n    \"instanceIds\": [], //(2)!\n    \"opSpec\": { //(3)!\n        \"timeout\": \"1m\",\n        \"parallelism\": 20,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Instances that need to be restarted. This is optional. If nothing is passed, all instances will be replaced.
  3. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

Tip

To replace specific instances, pass their application instance ids (starts with AI-...) in the instanceIds parameter in the JSON payload.

"},{"location":"applications/operations.html#stop-or-replace-specific-instances-of-an-application","title":"Stop or replace specific instances of an application","text":"

Application instances can be killed by issuing the STOP_INSTANCES operation. The default behaviour of Drove is to replace killed instances with new ones. Such new instances are always spun up before the specified (old) instances are stopped. If the skipRespawn parameter is set to true, the application instance is killed but no new instance is spun up to replace it.

Warning

If the cluster does not have sufficient capacity to spin up new containers, and skipRespawn is not set or set to false, this operation will get stuck.

Preconditions - Application must be in RUNNING state.

State Transition:

Drove CLI / JSON
drove -c local apps appinstances kill TEST_APP-1 AI-601d160e-c692-4ddd-8b7f-4c09b30ed02e\n

Sample Request Payload

{\n    \"type\": \"STOP_INSTANCES\",\n    \"appId\" : \"TEST_APP-1\",//(1)!\n    \"instanceIds\" : [ \"AI-601d160e-c692-4ddd-8b7f-4c09b30ed02e\" ],//(2)!\n    \"skipRespawn\" : true,//(3)!\n    \"opSpec\": {//(4)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Instance ids to be stopped
  3. Do not spin up new containers to replace the stopped ones. This is set to false by default.
  4. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"applications/operations.html#destroy-an-application","title":"Destroy an application","text":"

To remove an application deployment (appName-version combo), the DESTROY command can be issued.

Preconditions:

State Transition:

To destroy the application, the CLI command needs to be issued or the following payload needs to be sent:

Drove CLI / JSON
drove -c local apps destroy TEST_APP-1\n

Sample Request Payload

{\n    \"type\": \"DESTROY\",\n    \"appId\" : \"TEST_APP-1\",//(1)!\n    \"opSpec\": {//(2)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 2,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

Warning

All metadata for an app and its instances is completely obliterated from Drove's storage once an app is destroyed

"},{"location":"applications/outage.html","title":"Outage Detection and Recovery","text":"

Drove tracks all instances for an app deployment in the cluster. It will ensure the required number of containers is always running on the cluster.

"},{"location":"applications/outage.html#instance-health-detection-and-tracking","title":"Instance health detection and tracking","text":"

The executor runs periodic health checks on the container according to the check spec configuration:

- Runs readiness checks to ensure the container has started properly before declaring it healthy
- Runs health checks on the container at regular intervals to ensure it is in operating condition

Behavior for both is configured by setting the appropriate options in the application specification.

Results of such health checks (both successes and failures) are reported to the controller. Appropriate action is taken to shut down containers that fail readiness or health checks.

"},{"location":"applications/outage.html#container-crash","title":"Container crash","text":"

If a container for an application crashes, Drove will automatically spin up a container in its place.

"},{"location":"applications/outage.html#executor-node-hardware-failure","title":"Executor node hardware failure","text":"

If an executor node fails, instances running on that node will be lost. This is detected by the outage detector and new containers are spun up on other parts of the cluster.

"},{"location":"applications/outage.html#executor-service-temporary-unavailability","title":"Executor service temporary unavailability","text":"

On restart, the executor service reads the metadata embedded in the running containers and re-registers them. It performs a reconciliation with the leader controller and kills any local containers if the unavailability was too long and the controller has already spun up new alternatives.

"},{"location":"applications/outage.html#zombie-container-detection-and-cleanup","title":"Zombie (container) detection and cleanup","text":"

The executor service keeps track of all containers it is supposed to run by performing periodic reconciliation with the leader controller. Any mismatch gets handled:

"},{"location":"applications/specification.html","title":"Application Specification","text":"

An application is defined using JSON. We use a sample configuration below to explain the options.

"},{"location":"applications/specification.html#sample-application-definition","title":"Sample Application Definition","text":"
{\n    \"name\": \"TEST_APP\", // (1)!\n    \"version\": \"1\", // (2)!\n    \"type\": \"SERVICE\", // (3)!\n    \"executable\": { //(4)!\n        \"type\": \"DOCKER\", // (5)!\n        \"url\": \"ghcr.io/appform-io/perf-test-server-httplib\",// (6)!\n        \"dockerPullTimeout\": \"100 seconds\"// (7)!\n    },\n    \"resources\": [//(20)!\n        {\n            \"type\": \"CPU\",\n            \"count\": 1//(21)!\n        },\n        {\n            \"type\": \"MEMORY\",\n            \"sizeInMB\": 128//(22)!\n        }\n    ],\n    \"volumes\": [//(12)!\n        {\n            \"pathInContainer\": \"/data\",//(13)!\n            \"pathOnHost\": \"/mnt/datavol\",//(14)!\n            \"mode\" : \"READ_WRITE\"//(15)!\n        }\n    ],\n    \"configs\" : [//(16)!\n        {\n            \"type\" : \"INLINE\",//(17)!\n            \"localFilename\": \"/testfiles/drove.txt\",//(18)!\n            \"data\" : \"RHJvdmUgdGVzdA==\"//(19)!\n        }\n    ],\n    \"placementPolicy\": {//(23)!\n        \"type\": \"ANY\"//(24)!\n    },\n    \"exposedPorts\": [//(8)!\n        {\n            \"name\": \"main\",//(9)!\n            \"port\": 8000,//(10)!\n            \"type\": \"HTTP\"//(11)!\n        }\n    ],\n    \"healthcheck\": {//(25)!\n        \"mode\": {//(26)!\n            \"type\": \"HTTP\", //(27)!\n            \"protocol\": \"HTTP\",//(28)!\n            \"portName\": \"main\",//(29)!\n            \"path\": \"/\",//(30)!\n            \"verb\": \"GET\",//(31)!\n            \"successCodes\": [//(32)!\n                200\n            ],\n            \"payload\": \"\", //(33)!\n            \"connectionTimeout\": \"1 second\" //(34)!\n        },\n        \"timeout\": \"1 second\",//(35)!\n        \"interval\": \"5 seconds\",//(36)!\n        \"attempts\": 3,//(37)!\n        \"initialDelay\": \"0 seconds\"//(38)!\n    },\n    \"readiness\": {//(39)!\n        \"mode\": {\n            \"type\": \"HTTP\",\n            \"protocol\": \"HTTP\",\n            \"portName\": \"main\",\n            \"path\": \"/\",\n            \"verb\": \"GET\",\n            \"successCodes\": [\n                200\n            ],\n            \"payload\": \"\",\n            \"connectionTimeout\": \"1 second\"\n        },\n        \"timeout\": \"1 second\",\n        \"interval\": \"3 seconds\",\n        \"attempts\": 3,\n        \"initialDelay\": \"0 seconds\"\n    },\n    \"exposureSpec\": {//(42)!\n        \"vhost\": \"testapp.local\", //(43)!\n        \"portName\": \"main\", //(44)!\n        \"mode\": \"ALL\"//(45)!\n    },\n    \"env\": {//(41)!\n        \"CORES\": \"8\"\n    },\n    \"args\" : [//(54)!\n        \"./entrypoint.sh\",\n        \"arg1\",\n        \"arg2\"\n    ],\n    \"tags\": { //(40)!\n        \"superSpecialApp\": \"yes_i_am\",\n        \"say_my_name\": \"heisenberg\"\n    },\n    \"preShutdown\": {//(46)!\n        \"hooks\": [ //(47)!\n            {\n                \"type\": \"HTTP\",\n                \"protocol\": \"HTTP\",\n                \"portName\": \"main\",\n                \"path\": \"/\",\n                \"verb\": \"GET\",\n                \"successCodes\": [\n                    200\n                ],\n                \"payload\": \"\",\n                \"connectionTimeout\": \"1 second\"\n            }\n        ],\n        \"waitBeforeKill\": \"3 seconds\"//(48)!\n    },\n    \"logging\": {//(49)!\n        \"type\": \"LOCAL\",//(50)!\n        \"maxSize\": \"100m\",//(51)!\n        \"maxFiles\": 3,//(52)!\n        \"compress\": true//(53)!\n    }\n}\n
  1. A human readable name for the application. This will remain constant for different versions of the app.
  2. A version number. Drove does not enforce any format for this, but it is recommended to increment this for changes in spec.
  3. This should be fixed to SERVICE for an application/service.
  4. Coordinates for the executable. Refer to Executable Specification for details.
  5. Right now the only type supported is DOCKER.
  6. Docker container address
  7. Timeout for container pull.
  8. The ports to be exposed from the container.
  9. A logical name for the port. This will be used to reference this port in other sections.
  10. Actual port number as mentioned in Dockerfile.
  11. Type of port. Can be: HTTP, HTTPS, TCP, UDP.
  12. Volumes to be mounted. Refer to Volume Specification for details.
  13. Path that will be visible inside the container for this mount.
  14. Actual path on the host machine for the mount.
  15. Mount mode can be READ_WRITE and READ_ONLY
  16. Configuration to be injected as file inside the container. Please refer to Config Specification for details.
  17. Type of config. Can be INLINE, EXECUTOR_LOCAL_FILE, CONTROLLER_HTTP_FETCH and EXECUTOR_HTTP_FETCH. Specifies how Drove will get the contents to be injected.
  18. File name for the config inside the container.
  19. Serialized form of the data, this and other parameters will vary according to the type specified above.
  20. List of resources required to run this application. Check Resource Requirements Specification for more details.
  21. Number of CPU cores to be allocated.
  22. Amount of memory to be allocated expressed in Megabytes
  23. Specifies how the container will be placed on the cluster. Check Placement Policy for details.
  24. Type of placement. Can be ANY, ONE_PER_HOST, MAX_N_PER_HOST, MATCH_TAG, NO_TAG, RULE_BASED and COMPOSITE. The rest of the parameters in this section will depend on the type.
  25. Health check to ensure service is running fine. Refer to Check Specification for details.
  26. Mode of health check, can be api call or command.
  27. Type of this check spec. Type can be HTTP or CMD. Rest of the options in this example are HTTP specific.
  28. API call protocol. Can be HTTP/HTTPS
  29. Port name as mentioned in the exposedPorts section.
  30. HTTP path. Include query params here.
  31. HTTP method. Can be GET,PUT or POST.
  32. Set of HTTP status codes which can be considered as success.
  33. Payload to be sent for POST and PUT calls.
  34. Connection timeout for the port.
  35. Timeout for the check run.
  36. Interval between check runs.
  37. Max attempts after which the overall check is considered to be a failure.
  38. Time to wait before starting check runs.
  39. Readiness check to pass for the container to be considered as ready. Refer to Check Specification for details.
  40. Key value metadata that can be used in external systems.
  41. Custom environment variables. Additional variables are injected by Drove as well. See Environment Variables section for details.
  42. Specifies the virtual host on which this container is exposed.
  43. FQDN for the virtual host.
  44. Port name as specified in exposedPorts section.
  45. Mode for exposure. Set this to ALL for now.
  46. Things to do before a container is shutdown. Check Pre Shutdown Behavior for more details.
  47. Hooks (HTTP api call or shell command) to run before shutting down the container. Format is same as health/readiness checks. Refer to HTTP Check Actions and Command Check Options for details.
  48. Time to wait before killing the container. The container will be in UNREADY state during this time and hence won't have api calls routed to it via Drove Gateway.
  49. Specify how docker log files are configured. Refer to Logging Specification
  50. Log to local file
  51. Maximum File Size
  52. Number of latest log files to retain
  53. Log files will be compressed
  54. List of command line arguments. See Command Line Arguments for details.
"},{"location":"applications/specification.html#executable-specification","title":"Executable Specification","text":"

Right now Drove supports only Docker containers. However, as engines, both docker and podman are supported. Drove executors will fetch the executable directly from the registry based on the configuration provided.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set type to DOCKER. |
| URL | url | Docker container URL. |
| Timeout | dockerPullTimeout | Timeout for docker image pull. |

Note

Drove supports docker registry authentication. This can be configured in the executor configuration file.

"},{"location":"applications/specification.html#resource-requirements-specification","title":"Resource Requirements Specification","text":"

This section specifies the hardware resources required to run the container. Right now only CPU and MEMORY are supported as resource types that can be reserved for a container.

"},{"location":"applications/specification.html#cpu-requirements","title":"CPU Requirements","text":"

Specifies number of cores to be assigned to the container.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set type to CPU for this. |
| Count | count | Number of cores to be assigned. |

"},{"location":"applications/specification.html#memory-requirements","title":"Memory Requirements","text":"

Specifies amount of memory to be allocated to a container.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set type to MEMORY for this. |
| Size | sizeInMB | Amount of memory (in megabytes) to be allocated. |

Sample

[\n    {\n        \"type\": \"CPU\",\n        \"count\": 1\n    },\n    {\n        \"type\": \"MEMORY\",\n        \"sizeInMB\": 128\n    }\n]\n

Note

Both CPU and MEMORY configurations are mandatory.

"},{"location":"applications/specification.html#volume-specification","title":"Volume Specification","text":"

Files and directories can be mounted from the executor host into the container. The volumes section contains a list of volumes that need to be mounted.

| Name | Option | Description |
|------|--------|-------------|
| Path In Container | pathInContainer | Path that will be visible inside the container for this mount. |
| Path On Host | pathOnHost | Actual path on the host machine for the mount. |
| Mount Mode | mode | Mount mode can be READ_WRITE or READ_ONLY to allow the containerized process to write or read to the volume. |

Info

We do not support mounting remote volumes as of now.

"},{"location":"applications/specification.html#config-specification","title":"Config Specification","text":"

Drove supports injection of configuration files into containers. The specifications for the same are discussed below.

"},{"location":"applications/specification.html#inline-config","title":"Inline config","text":"

Inline configuration can be added in the Application Specification itself. This will manifest as a file inside the container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to INLINE |
| Local Filename | localFilename | File name for the config inside the container. |
| Data | data | Base64 encoded string for the data. The value for this will be masked on the UI. |

Config file:

port: 8080\nlogLevel: DEBUG\n
Corresponding config specification:
{\n    \"type\" : \"INLINE\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"data\" : \"cG9ydDogODA4MApsb2dMZXZlbDogREVCVUcK\"\n}\n

Warning

The full base64-encoded config data will get stored in Drove ZK and will be pushed to executors inline. It is not recommended to stream large config files to containers using this method; doing so will probably need additional configuration on your ZK cluster.

"},{"location":"applications/specification.html#locally-loaded-config","title":"Locally loaded config","text":"

The config file is read from a path on the executor host directly. Such files can be distributed to the executor host using existing configuration management systems such as OpenTofu, Salt etc.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to EXECUTOR_LOCAL_FILE |
| Local Filename | localFilename | File name for the config inside the container. |
| File path | filePathOnHost | Path to the config file on the executor host. |

Sample config specification:

{\n    \"type\" : \"EXECUTOR_LOCAL_FILE\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"data\" : \"/mnt/configs/myservice/config.yml\"\n}\n

"},{"location":"applications/specification.html#controller-fetched-config","title":"Controller fetched Config","text":"

Config file can be fetched from a remote server by the controller. Once fetched, these will be streamed to the executor as part of the instance specification for starting a container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to CONTROLLER_HTTP_FETCH |
| Local Filename | localFilename | File name for the config inside the container. |
| HTTP Call Details | http | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

{\n    \"type\" : \"CONTROLLER_HTTP_FETCH\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"http\" : {\n        \"protocol\" : \"HTTP\",\n        \"hostname\" : \"configserver.internal.yourdomain.net\",\n        \"port\" : 8080,\n        \"path\" : \"/configs/myapp\",\n        \"username\" : \"appuser\",\n        \"password\" : \"secretpassword\"\n    }\n}\n

Note

The controller will make an API call every single time it asks an executor to spin up a container. Please make sure to account for this in your configuration management system.

"},{"location":"applications/specification.html#executor-fetched-config","title":"Executor fetched Config","text":"

Config file can be fetched from a remote server by the executor before spinning up a container. Once fetched, the payload will be injected as a config file into the container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to EXECUTOR_HTTP_FETCH |
| Local Filename | localFilename | File name for the config inside the container. |
| HTTP Call Details | http | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

{\n    \"type\" : \"EXECUTOR_HTTP_FETCH\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"http\" : {\n        \"protocol\" : \"HTTP\",\n        \"hostname\" : \"configserver.internal.yourdomain.net\",\n        \"port\" : 8080,\n        \"path\" : \"/configs/myapp\",\n        \"username\" : \"appuser\",\n        \"password\" : \"secretpassword\"\n    }\n}\n

Note

All executors will make an API call every single time they spin up a container for this application. Please make sure to account for this in your configuration management system.

"},{"location":"applications/specification.html#http-call-specification","title":"HTTP Call Specification","text":"

This section details the options that can be set when making HTTP calls to a configuration management system from controllers or executors.

The following options are available for HTTP call:

| Name | Option | Description |
|------|--------|-------------|
| Protocol | protocol | Protocol to use for the upstream call. Can be HTTP or HTTPS. |
| Hostname | hostname | Host to call. |
| Port | port | Provide a custom port. Defaults to 80 for http and 443 for https. |
| API Path | path | Path component of the URL. Include query parameters here. Defaults to / |
| HTTP Method | verb | Type of call, use GET, POST or PUT. Defaults to GET. |
| Success Code | successCodes | List of HTTP status codes which are considered as success. Defaults to [200] |
| Payload | payload | Data to be used for POST and PUT calls |
| Connection Timeout | connectionTimeout | Timeout for the upstream connection. |
| Operation timeout | operationTimeout | Timeout for the actual operation. |
| Username | username | Username to be used for basic auth. This field is masked out on the UI. |
| Password | password | Password to be used for basic auth. This field is masked on the UI. |
| Authorization Header | authHeader | Data to be passed in the HTTP Authorization header. This field is masked on the UI. |
| Additional Headers | headers | Any other headers to be passed to the upstream in the HTTP calls. This is a map of header names to values. |
| Skip SSL Checks | insecure | Skip hostname and certificate checks during SSL handshake with the upstream. |

"},{"location":"applications/specification.html#placement-policy-specification","title":"Placement Policy Specification","text":"

Placement policy governs how Drove deploys containers on the cluster. The following sections discuss the different placement policies available and how they can be configured to achieve optimal placement of containers.

Warning

All policies work only at a {appName, version} combination level. They will not ensure constraints at an appName level. This means that for something like one-per-node placement, multiple containers for the same appName can run on the same host if multiple deployments with different versions are active in a cluster. The same applies to policies like N per host and so on.

Important details about executor tagging

"},{"location":"applications/specification.html#any-placement","title":"Any Placement","text":"

Containers for a {appName, version} combination can run on any un-tagged executor host.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put ANY as policy. |

Sample:

{\n    \"type\" : \"ANY\"\n}\n

Tip

For most use-cases this is the placement policy to use.

"},{"location":"applications/specification.html#one-per-host-placement","title":"One Per Host Placement","text":"

Ensures that only one container for a particular {appName, version} combination is running on an executor host at a time.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put ONE_PER_HOST as policy. |

Sample:

{\n    \"type\" : \"ONE_PER_HOST\"\n}\n

"},{"location":"applications/specification.html#max-n-per-host-placement","title":"Max N Per Host Placement","text":"

Ensures that at most N containers for a {appName, version} combination are running on an executor host at a time.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put MAX_N_PER_HOST as policy. |
| Max count | max | The maximum number of containers that can run on an executor. Range: 1-64 |

Sample:

{\n    \"type\" : \"MAX_N_PER_HOST\",\n    \"max\": 3\n}\n

"},{"location":"applications/specification.html#match-tag-placement","title":"Match Tag Placement","text":"

Ensures that containers for a {appName, version} combination are running on an executor host that has the tags as mentioned in the policy.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put MATCH_TAG as policy. |
| Tag | tag | The tag to match. |

Sample:

{\n    \"type\" : \"MATCH_TAG\",\n    \"tag\": \"gpu_enabled\"\n}\n

"},{"location":"applications/specification.html#no-tag-placement","title":"No Tag Placement","text":"

Ensures that containers for a {appName, version} combination are running on an executor host that has no tags.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put NO_TAG as policy. |

Sample:

{\n    \"type\" : \"NO_TAG\"\n}\n

Info

The NO_TAG policy is mostly for internal use, and does not need to be specified when deploying containers that do not need any special placement logic.

"},{"location":"applications/specification.html#composite-policy-based-placement","title":"Composite Policy Based Placement","text":"

Composite policy can be used to combine policies together to create complicated placement requirements.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put COMPOSITE as policy. |
| Policies | policies | List of policies to combine |
| Combiner | combiner | Can be AND or OR, signifying all-match and any-match logic on the policies mentioned. |

Sample:

{\n    \"type\" : \"COMPOSITE\",\n    \"policies\": [\n        {\n            \"type\": \"ONE_PER_HOST\"\n        },\n        {\n            \"type\": \"MATH_TAG\",\n            \"tag\": \"gpu_enabled\"\n        }\n    ],\n    \"combiner\" : \"AND\"\n}\n
The above policy will ensure that only one container of the relevant {appName,version} will run on GPU enabled machines.

Tip

It is easy to get into situations where no executors match complicated placement policies. Internally, we tend to keep things rather simple: we use ANY placement for most cases, and tags in a few places with over-provisioning or for hosts having special hardware.

"},{"location":"applications/specification.html#environment-variables","title":"Environment variables","text":"

This config can be used to inject custom environment variables into containers. The values are defined as part of the deployment specification, are the same across the cluster, and are immune to modifications from inside the container (i.e. any overrides from inside the container will not be visible across the cluster).

Sample:

{\n    \"MY_VARIABLE_1\": \"fizz\",\n    \"MY_VARIABLE_2\": \"buzz\"\n}\n

The following environment variables are injected by Drove to all containers:

| Variable Name | Value |
|---------------|-------|
| HOST | Hostname where the container is running. This is for Marathon compatibility. |
| PORT_{PORT_NUMBER} | A variable for every port specified in the exposedPorts section. The value is the actual port on the host that the specified port is mapped to. For example, if ports 8080 and 8081 are specified, two variables called PORT_8080 and PORT_8081 will be injected. |
| DROVE_EXECUTOR_HOST | Hostname where the container is running. |
| DROVE_CONTAINER_ID | Container that is deployed |
| DROVE_APP_NAME | App name as specified in the Application Specification |
| DROVE_INSTANCE_ID | Actual instance ID generated by Drove |
| DROVE_APP_ID | Application ID as generated by Drove |
| DROVE_APP_INSTANCE_AUTH_TOKEN | A JWT string generated by Drove that can be used by this container to call /apis/v1/internal/... APIs. |
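Inside the container, an application can use these variables to discover its host-mapped ports and identity. A small Python sketch, assuming container port 8000 was declared in exposedPorts as in the sample spec:

```python
import os

# PORT_8000 holds the host port that container port 8000 is mapped to
host_port = int(os.environ["PORT_8000"])
instance_id = os.environ.get("DROVE_INSTANCE_ID", "unknown")
print(f"instance {instance_id} is reachable on host port {host_port}")
```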

Warning

Do not pass secrets using environment variables. These variables are all visible on the UI as is. Please use Configs to inject secrets files and so on.

"},{"location":"applications/specification.html#command-line-arguments","title":"Command line arguments","text":"

A list of command line arguments that are sent to the container engine to execute inside the container. This provides a way to configure container behaviour based on such arguments. Please refer to the docker documentation for details.

Danger

This might have security implications from a system point of view. As such Drove provides administrators a way to disable passing arguments at the cluster level by setting disableCmdlArgs to true in the controller configuration.

"},{"location":"applications/specification.html#check-specification","title":"Check Specification","text":"

One of the cornerstones of managing applications on the cluster is keeping track of instance health and managing instance life cycle depending on health state. We need to define how to monitor container health accordingly. The checks are executed on application instances and a check result is generated. The result consists of the following:

"},{"location":"applications/specification.html#common-options","title":"Common Options","text":"Name Option Description Mode mode The definition of a HTTP call or a Command to be executed in the container. See following sections for details. Timeout timeout Duration for which we wait before declaring a check as failed Interval interval Interval at which check will be retried Attempts attempts Number of times a check is retried before it is declared as a failure Initial Delay initialDelay Delay before executing the check for the first time.

Note

initialDelay is ignored when readiness checks and health checks are run in the recovery path as the container is already running at that point in time.

"},{"location":"applications/specification.html#http-check-options","title":"HTTP Check Options","text":"Name Option Description Type type Fixed to HTTP for HTTP checker Protocol protocol HTTP or HTTPS call to be made Port Name portName The name of the container port to make the http call on as specified in the Exposed Ports section in Application Spec Path path The api path to call HTTP method verb The HTTP Verb/Method to invoke. GET/PUT and POST are supported here Success Codes successCodes A set of HTTP status codes that we should consider as a success from this API. Payload payload A string payload that we can pass if the Verb is POST or PUT Connection Timeout connectionTimeout Maximum time for which the checker will wait for the connection to be set up with the container. Insecure insecure Skip hostname and certificate checks for HTTPS ports during checks."},{"location":"applications/specification.html#command-check-options","title":"Command Check Options","text":"Field Option Description Type type Fixed to CMD for command checker Command command Command to execute in the container. (Equivalent to docker exec -it <container> command>)"},{"location":"applications/specification.html#exposure-specification","title":"Exposure Specification","text":"

The exposure spec is used to specify the virtual host that Drove Gateway exposes to the outside world for communication with the containers.

The following information needs to be specified:

| Name | Option | Description |
|------|--------|-------------|
| Virtual Host | vhost | The virtual host to be exposed on NGinx. This should be a fully qualified domain name. |
| Port Name | portName | The port name to be exposed on the vhost. Port names are defined in the exposedPorts section. |
| Exposure Mode | mode | Use ALL here for now. Signifies that all healthy instances of the app are exposed to traffic. |

Sample:

{\n    \"vhost\": \"teastapp.mydomain\",\n    \"port\": \"main\",\n    \"mode\": \"ALL\"\n}\n

Note

Application instances in any state other than HEALTHY are not considered for exposure. Please check Application Instance State Machine for an understanding of states of instances.

"},{"location":"applications/specification.html#configuring-pre-shutdown-behaviour","title":"Configuring Pre Shutdown Behaviour","text":"

Before a container is shut down, it is desirable to ensure things are spun down properly. This behaviour can be configured in the preShutdown section of the configuration.

| Name | Option | Description |
|------|--------|-------------|
| Hooks | hooks | List of API calls and commands to be run on the container before it is killed. Each hook is either an HTTP Call Spec or a Command Spec |
| Wait Time | waitBeforeKill | Time to wait before killing the container. |

Sample

{\n    \"hooks\": [\n        {\n            \"type\": \"HTTP\",\n            \"protocol\": \"HTTP\",\n            \"portName\": \"main\",\n            \"path\": \"/\",\n            \"verb\": \"GET\",\n            \"successCodes\": [\n                200\n            ],\n            \"payload\": \"\",\n            \"connectionTimeout\": \"1 second\"\n        }\n    ],\n    \"waitBeforeKill\": \"3 seconds\"//(48)!\n}\n

Note

The waitBeforeKill timed wait kicks in after all the hooks have been executed.

"},{"location":"applications/specification.html#logging-specification","title":"Logging Specification","text":"

Can be used to configure how container logs are managed on the system.

Note

This section affects the docker log driver. Drove will continue to stream logs to its own logger, which can be configured at the executor level through the executor configuration file.

"},{"location":"applications/specification.html#local-logger-configuration","title":"Local Logger configuration","text":"

This is used to configure the json-file log driver.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to LOCAL |
| Max Size | maxSize | Maximum file size. Anything bigger than this will lead to rotation. |
| Max Files | maxFiles | Maximum number of log files to keep. Range: 1-100 |
| Compress | compress | Enable log file compression. |

Tip

If the logging section is omitted, the following configuration is applied by default:

- File size: 10m
- Number of files: 3
- Compression: on

"},{"location":"applications/specification.html#rsyslog-configuration","title":"Rsyslog configuration","text":"

In case users want to stream logs to an rsyslog server, the logging configuration needs to be set to RSYSLOG mode.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to RSYSLOG |
| Server | server | URL for the rsyslog server. |
| Tag Prefix | tagPrefix | Prefix to add at the start of a tag |
| Tag Suffix | tagSuffix | Suffix to add at the end of a tag. |

Note

The default tag is the DROVE_INSTANCE_ID. The tagPrefix and tagSuffix will go before and after this tag respectively

"},{"location":"cluster/cluster.html","title":"Anatomy of a Drove Cluster","text":"

The following diagram provides a high level overview of a typical Drove cluster. The overall topology consists of the following components:

"},{"location":"cluster/cluster.html#apache-zookeeper","title":"Apache ZooKeeper","text":"

Zookeeper is a central component in a Drove cluster. It is used in the following manner:

"},{"location":"cluster/cluster.html#controller","title":"Controller","text":"

The controller service is the brains of a Drove cluster. The role of the controller consists of the following:

"},{"location":"cluster/cluster.html#executors","title":"Executors","text":"

Executors are the agents running on the nodes where the containers are deployed. Role of the executors is the following:

"},{"location":"cluster/cluster.html#nginx-and-drove-gateway","title":"NGinx and Drove-Gateway","text":"

Almost all of the traffic between service containers is routed via the internal Ranger based service discovery system at PhonePe. However, traffic from the edge, as well as between different protected environments, is routed using the well-established virtual host (and additionally, in some unusual cases, header) based routing.

Tip

The NGinx deployment is standard across all Drove clusters. However, for clusters that receive a lot of traffic, the NGinx cluster exposing the VHost for Drove itself might be separated from the one exposing the application virtual hosts, to allow for easy scalability of the latter. The templates for these are configured differently as needed.

"},{"location":"cluster/cluster.html#other-components","title":"Other components","text":"

There are a few more components that are used for operational management and observability.

"},{"location":"cluster/cluster.html#telegraf","title":"Telegraf","text":"

PhonePe's internal metric management system uses an HTTP based metric collector. Telegraf is installed on all Drove nodes to collect metrics from the metric port (admin connector on Dropwizard) and push that information to our metric ingestion system. This information is then used to build dashboards, as well as by our anomaly detection and alerting systems.

"},{"location":"cluster/cluster.html#log-management","title":"Log Management","text":"

Drove provides a special logger called drove that can be configured to handle compression, rotation and archival of container logs. Such container logs are stored on specialised partitions by application/application-instance-id, or by source app name/task id, for application and task instances respectively. PhonePe's standardised log rotation tools are used to monitor and ship such logs to our central log management system. The same can be replaced or enhanced by running something like promtail on Drove logs to ship logs to tools like Grafana Loki.

"},{"location":"cluster/setup/controller.html","title":"Setting up Controllers","text":"

Controllers are the brains of a Drove cluster. For HA, at least 2 controllers should be set up.

Please note the following behaviour about controllers:

"},{"location":"cluster/setup/controller.html#controller-configuration-file-reference","title":"Controller Configuration File Reference","text":"

The Drove Controller is written on the Dropwizard framework. The configuration for the service is set using a YAML file which needs to be injected into the container. A typical controller configuration file will look like the following:

server: #(1)!\n  applicationConnectors: #(2)!\n    - type: http\n      port: 4000\n  adminConnectors: #(3)!\n    - type: http\n      port: 4001\n  applicationContextPath: / #(4)!\n  requestLog: #(5)!\n    appenders:\n      - type: console\n        timeZone: ${DROVE_TIMEZONE}\n      - type: file\n        timeZone: ${DROVE_TIMEZONE}\n        currentLogFilename: /logs/drove-controller-access.log\n        archivedLogFilenamePattern: /logs/drove-controller-access.log-%d-%i\n        archivedFileCount: 3\n        maxFileSize: 100MiB\n\n\nlogging: #(6)!\n  level: INFO\n  loggers:\n    com.phonepe.drove: ${DROVE_LOG_LEVEL}\n\n  appenders:\n    - type: console #(7)!\n      threshold: ALL\n      timeZone: ${DROVE_TIMEZONE}\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n    - type: file #(8)!\n      threshold: ALL\n      timeZone: ${DROVE_TIMEZONE}\n      currentLogFilename: /logs/drove-controller.log\n      archivedLogFilenamePattern: /logs/drove-controller.log-%d-%i\n      archivedFileCount: 3\n      maxFileSize: 100MiB\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n      archive: true\n\n\nzookeeper: #(9)!\n  connectionString: ${ZK_CONNECTION_STRING}\n\nclusterAuth: #(10)!\n  secrets:\n  - nodeType: CONTROLLER\n    secret: ${DROVE_CONTROLLER_SECRET}\n  - nodeType: EXECUTOR\n    secret: ${DROVE_EXECUTOR_SECRET}\n\nuserAuth: #(11)!\n  enabled: true\n  users:\n    - username: admin\n      password: ${DROVE_ADMIN_PASSWORD}\n      role: EXTERNAL_READ_WRITE\n    - username: guest\n      password: ${DROVE_GUEST_PASSWORD}\n      role: EXTERNAL_READ_ONLY\n\ninstanceAuth: #(12)!\n  secret: ${DROVE_INSTANCE_AUTH_SECRET}\n\noptions: #(13)!\n  maxStaleInstancesCount: 3\n  staleCheckInterval: 1m\n  staleAppAge: 1d\n  staleInstanceAge: 18h\n  staleTaskAge: 1d\n  clusterOpParallelism: 4\n
  1. Server listener configuration. See Dropwizard Server Configuration for the different options.
  2. Main port configuration. This is where the UI and APIs will be exposed. Check connector configuration docs for details.
  3. Admin port. You can take thread dumps, view metrics, and run healthchecks on the Drove controller via this port.
  4. Base path for UI. Keep this as is.
  5. Access logs configuration. See requestLog docs.
  6. Main logging configuration. See logging docs.
  7. Log to console. Useful in docker-compose.
  8. Log to rotating files. Useful for running servers.
  9. Configure how to connect to Zookeeper. See Zookeeper Config for details.
  10. Configuration for authentication between nodes in the cluster. Please check intra node auth config for details.
  11. Configure user authentication to access the cluster. Please check User auth config for details.
  12. Signing secret for JWT to be embedded in application and task instances. Check Instance auth config for details.
  13. Special options to configure controller behaviour. See Controller Options for details.

Tip

In case you do not want to expose admin APIs outside the host, please set bindHost in the admin connectors section.

adminConnectors:\n  - type: http\n    port: 10001\n    bindHost: 127.0.0.1\n
"},{"location":"cluster/setup/controller.html#zookeeper-connection-configuration","title":"Zookeeper Connection Configuration","text":"

The following details can be configured.

| Name | Option | Description |
|------|--------|-------------|
| Connection String | connectionString | The connection string of the form: zkserver:2181,zkserver2:2181... |
| Data namespace | namespace | The top level node inside which all Drove data will be scoped. Defaults to drove if not set. |

Sample

zookeeper:\n  connectionString: \"192.168.3.10:2181,192.168.3.11:2181,192.168.3.12:2181\"\n  namespace: drovetest\n
"},{"location":"cluster/setup/controller.html#intra-node-authentication-configuration","title":"Intra Node Authentication Configuration","text":"

Communication between controllers and executors is protected by shared-secret based authentication. The following configuration is meant to configure this. This section consists of a list of two entries, one per node type:

Each entry consists of the following:

| Name | Option | Description |
|------|--------|-------------|
| Node Type | nodeType | Type of node in the cluster. Can be CONTROLLER or EXECUTOR |
| Secret | secret | The actual secret to be passed. |

Sample

clusterAuth:\n  secrets:\n  - nodeType: CONTROLLER\n    secret: ControllerSecretValue\n  - nodeType: EXECUTOR\n    secret: ExecutorSecret\n

Danger

The values are passed in the header as is. Please manage the config file ownership to ensure that the files are not world readable.

Tip

You can use pwgen -s 32 to generate secure random strings for use as secrets.

"},{"location":"cluster/setup/controller.html#user-authentication-configuration","title":"User Authentication Configuration","text":"

This section is used to configure user details for humans and other systems that need to call Drove APIs or access the Drove UI. This is implemented using basic auth.

The configuration consists of:

| Name | Option | Description |
|------|--------|-------------|
| Enabled | enabled | Enable basic auth for the cluster |
| Encoding | encoding | The encoding of the password. Can be PLAIN or CRYPT |
| Caching | cachingPolicy | Caching policy for the authentication and authorization of the user. Please check CaffeineSpec docs for more details. Set to maximumSize=500, expireAfterAccess=30m by default |
| List of users | users | A list of users recognized by the system |

Each entry in the user list consists of:

| Name | Option | Description |
|------|--------|-------------|
| User Name | username | The actual login username |
| Password | password | The password for the user. Needs to be set to the bcrypt string of the actual password if encoding is set to CRYPT in the parent section. |
| User Role | role | The role of the user in the cluster. Can be EXTERNAL_READ_WRITE for users who have both read and write permissions, or EXTERNAL_READ_ONLY for users with read-only permissions. |

Sample

userAuth:\n  enabled: true\n  encoding: CRYPT\n  users:\n    - username: admin\n      password: \"$2y$10$pfGnPkYrJEGzasvVNPjRu.IJldV9TDa0Vh.u1UdimILWDuhvapc2O\"\n      role: EXTERNAL_READ_WRITE\n    - username: guest\n      password: \"$2y$10$uCJ7WxIvd13C.1oOTs28p.xpJShGiTWuDLY/sGH9JE8nrkSGBFkc6\"\n      role: EXTERNAL_READ_ONLY\n    - username: noread\n      password: \"$2y$10$8mr/zXL5rMW/s/jlBcgXHu0UvyzfdDDvyc.etfuoR.991sn9UOX/K\"\n

No authentication

To configure a cluster without authentication, remove this section entirely.

Operator role

If role is not set, the user will be able to access the UI, but will not have access to application logs. This comes in handy to let other teams explore your deployment topology without getting access to your logs, which might contain sensitive information.

Password Hashing

We strongly recommend using bcrypt passwords for authentication. You can use the following command to generate hashed password strings:

htpasswd -nbBC 10 <username> <password>|cut -d ':' -f2\n
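For example, to generate the hash for a user (the username, password and resulting hash below are illustrative; your hash will differ):

$ htpasswd -nbBC 10 admin mysecretpassword | cut -d ':' -f2\n$2y$10$...\n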
"},{"location":"cluster/setup/controller.html#instance-authentication-configuration","title":"Instance Authentication Configuration","text":"

All application and task instances get access to a unique JWT that is injected into them by Drove as the environment variable DROVE_APP_INSTANCE_AUTH_TOKEN. This token is signed using a secret. This secret can be configured by setting the secret parameter in the instanceAuth section.

Sample

instanceAuth:\n  secret: RandomSecret\n
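As an illustrative sketch, a process inside a container can read this token from its environment and pass it when calling Drove APIs; the bearer scheme shown here is an assumption, so check how auth is actually configured on your cluster:

# Inside an application container (illustrative)\ncurl -H \"Authorization: Bearer ${DROVE_APP_INSTANCE_AUTH_TOKEN}\" \\\n    http://drove.local:7000/apis/v1/ping\n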

"},{"location":"cluster/setup/controller.html#controller-options","title":"Controller Options","text":"

The following options can be set to influence the behavior of the Drove cluster and the controller.

| Name | Option | Description |
|------|--------|-------------|
| Stale Check Interval | staleCheckInterval | Interval at which Drove checks for stale application and task metadata for cleanup. Defaults to 1 hour. Expressed in duration. |
| Stale App Age | staleAppAge | Apps in MONITORING state are cleaned up after some time by Drove. This variable can be used to control the max time for which such apps are maintained in the cluster. Defaults to 7 days. Expressed in duration. |
| Stale App Instances Count | maxStaleInstancesCount | Maximum number of application instance metadata entries for stopped or lost instances to be maintained in the cluster. Defaults to 100. |
| Stale Instance Age | staleInstanceAge | Maximum age for a stale application instance to be retained. Defaults to 7 days. Expressed in duration. |
| Stale Task Age | staleTaskAge | Maximum time for which metadata for a finished task is retained on the cluster. Defaults to 2 days. Expressed in duration. |
| Event Storage Duration | maxEventsStorageDuration | Maximum time for which cluster events are retained on the cluster. Defaults to 1 hour. Expressed in duration. |
| Default Operation Timeout | clusterOpTimeout | Timeout for operations that are initiated by Drove itself, for example instance spin-up in case of executor failure, instance migrations etc. Defaults to 5 minutes. Expressed in duration. |
| Operation threads | clusterOpParallelism | Signifies the parallelism for operations internal to the cluster. Defaults to 1. Range: 1-32. |
| Audited Methods | auditedHttpMethods | Drove prints an audit log with user details when an API is called by a user. Defaults to ["POST", "PUT"]. |
| Allowed mount directories | allowedMountDirs | If provided, Drove will ensure that application and task specs can mount only the directories mentioned in this set on the executor host. |
| Disable read-only auth | disableReadAuth | When userAuth is enabled, setting this option will enforce authorization only on write operations. |
| Disable command line arguments | disableCmdlArgs | When set to true, passing command line arguments will be disabled. Default: false (users can pass arguments). |

Sample

options:\n  staleCheckInterval: 5m\n  staleAppAge: 2d\n  maxStaleInstancesCount: 20\n  staleInstanceAge: 1d\n  staleTaskAge: 2d\n  maxEventsStorageDuration: 30m\n  clusterOpParallelism: 32\n  allowedMountDirs:\n   - /mnt/scratch\n

"},{"location":"cluster/setup/controller.html#stale-data-cleanup","title":"Stale data cleanup","text":"

In order to keep the internal memory footprint low, reduce the amount of data stored on Zookeeper, and provide a faster experience on the UI, Drove keeps cleaning up data for stale applications, application instances, task instances and cluster events.

The retention for such metadata can be controlled using the following config options:

Warning

Configuration changes done to these parameters will have a direct impact on memory usage by the controller, and on memory and disk utilization on the Zookeeper cluster.

"},{"location":"cluster/setup/controller.html#internal-operations","title":"Internal Operations","text":"

Drove may need to create and issue operations on applications and tasks to manage cluster stability, for maintenance and other reasons. The following parameters can be used to control the speed and parallelism of such operations:

Tip

The default value of 1 for the clusterOpParallelism parameter is generally too low for most clusters. Unless there is a specific problem, it is advisable to set this to at least 4. If the number of instances per application is quite high (of the order of tens or hundreds), feel free to set this to 32.

Increasing clusterOpParallelism will make recovery faster in case of executor failures, but will slightly increase CPU utilization on the controller.

"},{"location":"cluster/setup/controller.html#security-related-options","title":"Security related options","text":"

The auditedHttpMethods parameter contains a list of all HTTP methods that need to be audited. This means that if auditedHttpMethods contains POST and PUT, any call to a Drove HTTP POST or PUT API will lead to an audit entry in the controller logs with the details of the user that made the call.

Warning

It is advisable not to add GET to the list. The UI keeps making calls to GET APIs on Drove to fetch data to render; these calls are automated and happen every few seconds from the browser, so auditing them will blow up the controller log size.

The allowedMountDirs option whitelists the directories that can be mounted into containers. If this is not provided, containers will be able to mount any directory on the executors.

Danger

It is highly recommended to set allowedMountDirs to a designated directory that containers might want to use as scratch space if needed. Keeping this empty will almost definitely cause security issues in the long run.
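As an illustrative sketch, with allowedMountDirs restricted to /mnt/scratch, an application or task spec would only be allowed to mount volumes under that directory. The fragment below is hypothetical; please verify the exact field names against the application spec documentation:

\"volumes\": [\n  {\n    \"pathInContainer\": \"/data\",\n    \"pathOnHost\": \"/mnt/scratch/my-app\",\n    \"mode\": \"READ_WRITE\"\n  }\n]\n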

"},{"location":"cluster/setup/controller.html#relevant-directories","title":"Relevant directories","text":"

Locations for data and logs are as follows:

We shall be volume mounting the config and log directories with the same name.

Prerequisite Setup

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

"},{"location":"cluster/setup/controller.html#setup-the-config-file","title":"Setup the config file","text":"

Create a relevant configuration file in /etc/drove/controller/controller.yml.

Sample

server:\n  applicationConnectors:\n    - type: http\n      port: 10000\n  adminConnectors:\n    - type: http\n      port: 10001\n  requestLog:\n    appenders:\n      - type: file\n        timeZone: IST\n        currentLogFilename: /var/log/drove/controller/drove-controller-access.log\n        archivedLogFilenamePattern: /var/log/drove/controller/drove-controller-access.log-%d-%i\n        archivedFileCount: 3\n        maxFileSize: 100MiB\n\nlogging:\n  level: INFO\n  loggers:\n    com.phonepe.drove: INFO\n\n\n  appenders:\n    - type: file\n      threshold: ALL\n      timeZone: IST\n      currentLogFilename: /var/log/drove/controller/drove-controller.log\n      archivedLogFilenamePattern: /var/log/drove/controller/drove-controller.log-%d-%i\n      archivedFileCount: 3\n      maxFileSize: 100MiB\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n\nzookeeper:\n  connectionString: \"192.168.56.10:2181\"\n\nclusterAuth:\n  secrets:\n  - nodeType: CONTROLLER\n    secret: \"0v8XvJrDc7r86ZY1QCByPTDPninI4Xii\"\n  - nodeType: EXECUTOR\n    secret: \"pOd9sIEXhv0wrGOVc7ebwNvR7twZqyTN\"\n\nuserAuth:\n  enabled: true\n  encoding: CRYPT\n  users:\n    - username: admin\n      password: \"$2y$10$pfGnPkYrJEGzasvVNPjRu.IJldV9TDa0Vh.u1UdimILWDuhvapc2O\"\n      role: EXTERNAL_READ_WRITE\n    - username: guest\n      password: \"$2y$10$uCJ7WxIvd13C.1oOTs28p.xpJShGiTWuDLY/sGH9JE8nrkSGBFkc6\"\n      role: EXTERNAL_READ_ONLY\n\n\ninstanceAuth:\n  secret: \"bd2SIgz9OMPG2L8wA6zxj21oLVLbuLFC\"\n\noptions:\n  maxStaleInstancesCount: 3\n  staleCheckInterval: 1m\n  staleAppAge: 2d\n  staleInstanceAge: 1d\n  staleTaskAge: 1d\n  clusterOpParallelism: 4\n  allowedMountDirs:\n   - /dev/null\n

"},{"location":"cluster/setup/controller.html#setup-required-environment-variables","title":"Setup required environment variables","text":"

Environment variables needed to run the Drove controller are set up in /etc/drove/controller/controller.env.

CONFIG_FILE_PATH=/etc/drove/controller/controller.yml\nJAVA_PROCESS_MIN_HEAP=2g\nJAVA_PROCESS_MAX_HEAP=2g\nZK_CONNECTION_STRING=\"192.168.3.10:2181\"\nJAVA_OPTS=\"-Xlog:gc:/var/log/drove/controller/gc.log -Xlog:gc:::filecount=3,filesize=10M -Xlog:gc::time,level,tags -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -Dfile.encoding=utf-8 -Djute.maxbuffer=0x9fffff\"\n
"},{"location":"cluster/setup/controller.html#create-systemd-file","title":"Create systemd file","text":"

Create a systemd file. Put the following in /etc/systemd/system/drove.controller.service:

[Unit]\nDescription=Drove Controller Service\nAfter=docker.service\nRequires=docker.service\n\n[Service]\nUser=drove\nTimeoutStartSec=0\nRestart=always\nExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-controller:latest\nExecStart=/usr/bin/docker run  \\\n    --env-file /etc/drove/controller/controller.env \\\n    --volume /etc/drove/controller:/etc/drove/controller:ro \\\n    --volume /var/log/drove/controller:/var/log/drove/controller \\\n    --publish 10000:10000  \\\n    --publish 10001:10001 \\\n    --hostname %H \\\n    --rm \\\n    --name drove.controller \\\n    ghcr.io/phonepe/drove-controller:latest\n\n[Install]\nWantedBy=multi-user.target\n

Verify the file with the following command:

systemd-analyze verify drove.controller.service\n

Set permissions

chmod 664 /etc/systemd/system/drove.controller.service\n

"},{"location":"cluster/setup/controller.html#start-the-service-on-all-servers","title":"Start the service on all servers","text":"

Use the following to start the service:

systemctl daemon-reload\nsystemctl enable drove.controller\nsystemctl start drove.controller\n

You can tail the logs at /var/log/drove/controller/drove-controller.log.

The console will be available at http://<ip>:10000 and admin functionality at http://<ip>:10001 as per the above config.

Health checks can be performed by running a curl as follows:

curl http://localhost:10001/healthcheck\n

Note

Once controllers are up, one of them will become the leader. You can check the leader by running the following command:

curl http://<ip>:10000/apis/v1/ping\n

Only the leader should return the following response, along with HTTP status 200 OK:

{\n    \"status\":\"SUCCESS\",\n    \"data\":\"pong\",\n    \"message\":\"success\"\n}\n

"},{"location":"cluster/setup/executor-setup.html","title":"Setting up Executor Nodes","text":"

We shall set up the executor nodes by first preparing the hardware and operating system, and then setting up the executor service itself.

"},{"location":"cluster/setup/executor-setup.html#considerations-and-tuning-for-hardware-and-operating-system","title":"Considerations and tuning for hardware and operating system","text":"

In the following sections we discuss some aspects of scheduling, hardware, and OS settings to ensure good performance.

"},{"location":"cluster/setup/executor-setup.html#cpu-and-memory-considerations","title":"CPU and Memory considerations","text":"

The executor nodes are the servers that host and run the actual docker containers. Drove takes the NUMA topology of these machines into consideration to optimize container placement for maximum performance. Along with this, Drove will cpuset the containers to the allocated cores in a non-overlapping manner, so that the cores allocated to a container are dedicated to it. Memory allocated to a container is pinned as well, and is selected from the same NUMA node.

Needless to say, the minimum amount of CPU that can be given to an application or task is 1. Fractional CPU allocation can be achieved in a predictable manner by configuring over provisioning on executor nodes.

"},{"location":"cluster/setup/executor-setup.html#over-provisioning-of-cpu-and-memory","title":"Over Provisioning of CPU and Memory","text":"

Drove does not do any kind of burst scaling or overcommitment, to ensure that application performance remains predictable even under load. Instead, Drove has a feature to make executors appear to have more cores (and memory) than they actually have. This can be used to get more utilization out of executor nodes in clusters that do not need guaranteed performance (for example staging or dev testing clusters). This is achieved by enabling over provisioning.

Over provisioning needs to be configured in the executor configuration. It primarily consists of two configs:

VCores (virtual cores) are the internal representation of a CPU core on the executor. If over provisioning is disabled, a vcore corresponds to a physical core. If over provisioning is enabled, 1 CPU core will generate cpuMultiplier number of vcores. Drove still does cpuset on containers running on nodes that have over provisioning enabled; however, the physical cores that the containers get bound to are chosen at random, albeit from the same NUMA node. cpuset-mem is always done on the same NUMA node as well.

Mixed clusters

In some production clusters you might have applications that are non-critical in terms of performance and unable to utilize a full core. These can be tagged to be spun up on nodes where over provisioning is enabled. Adopting such a cluster topology ensures that containers that need high performance run on nodes without over provisioning, while smaller apps (for example operations consoles) run on separate nodes with over provisioning enabled. Just ensure the latter nodes are tagged properly, and specify this tag in the application spec or task spec during deployment.

"},{"location":"cluster/setup/executor-setup.html#disable-numa-pinning","title":"Disable NUMA Pinning","text":"

There is an option to disable memory and core pinning. In this situation, all cores from all NUMA nodes show up as being part of one node. cpuset-mems is not called if NUMA pinning is disabled, so you will be leaving some memory performance on the table. We recommend not dabbling with this unless you have tasks and containers that need more than the number of cores available on a single NUMA node. This setting is enabled at the executor level by setting disableNUMAPinning: true.

"},{"location":"cluster/setup/executor-setup.html#hyper-threading","title":"Hyper-threading","text":"

Whether Hyper-Threading should be enabled or not depends on the applications deployed and how effectively they can utilize individual CPU cores. For mixed workloads, we recommend keeping Hyper-Threading enabled on the executor nodes.

"},{"location":"cluster/setup/executor-setup.html#isolating-container-and-os-processes","title":"Isolating container and OS processes","text":"

Typically we do not want containers to share CPU resources with operating system processes, the Drove executor service, the Docker engine (if using docker) and so on. While complete isolation would need creating a full scheduler (and passing isolcpus in the GRUB parameters), we can get a good middle ground by ensuring such processes utilize only a few CPU cores on the system, and letting the Drove executors deploy and pin containers to the rest.

This is achieved in two steps:

Let's say our server has 2 NUMA nodes, each with 40 hyper-threaded cores. We want to reserve the first 2 cores of each CPU for the OS processes. So we reserve cores [0,1,2,3] for the OS.

The following line in /etc/systemd/system.conf

#CPUAffinity=\n

needs to be changed to

CPUAffinity=0 1 2 3\n

Tip

Reboot the machine for this to take effect.

The changes can be validated post reboot by running the following command:

grep Cpus_allowed_list /proc/1/status\n

The expected output should be:

Cpus_allowed_list:  0-3\n

Note

Refer to this for more details.

"},{"location":"cluster/setup/executor-setup.html#gpu-computation","title":"GPU Computation","text":"

Nvidia-based GPU compute can be enabled at the executor level by installing the relevant drivers. Please follow the setup guide to enable this. Remember to tag these nodes to isolate them from the primary cluster, and use tags to deploy apps and tasks that need GPUs.
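A minimal resources sketch for such a GPU node might look like the following; the gpu_node tag name is just an example:

resources:\n  exposedMemPercentage: 90\n  # Expose all Nvidia GPUs on this host to containers\n  enableNvidiaGpu: true\n  # Example tag; containers land here only via MATCH_TAG placement\n  tags:\n    - gpu_node\n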

"},{"location":"cluster/setup/executor-setup.html#storage-consideration","title":"Storage consideration","text":"

On executor nodes, the disk might be under pressure if container (re)deployments are frequent or the containers log very heavily. As such, we recommend mounting the Drove logging directory on hardware that can handle this load. Similar consideration should be given to the log and package directories for docker or podman.

"},{"location":"cluster/setup/executor-setup.html#executor-configuration-reference","title":"Executor Configuration Reference","text":"

The Drove executor is built on the Dropwizard framework. The service is configured using a YAML file which needs to be injected into the container. A typical executor configuration file will look like the following:

server: #(1)!\n  applicationConnectors: #(2)!\n    - type: http\n      port: 3000\n  adminConnectors: #(3)!\n    - type: http\n      port: 3001\n  applicationContextPath: /\n  requestLog:\n    appenders:\n      - type: console\n        timeZone: ${DROVE_TIMEZONE}\n      - type: file\n        timeZone: ${DROVE_TIMEZONE}\n        currentLogFilename: /logs/drove-executor-access.log\n        archivedLogFilenamePattern: /logs/drove-executor-access.log-%d-%i\n        archivedFileCount: 3\n        maxFileSize: 100MiB\n\nlogging:\n  level: INFO\n  loggers:\n    com.phonepe.drove: ${DROVE_LOG_LEVEL}\n\n  appenders: #(4)!\n    - type: console #(5)!\n      threshold: ALL\n      timeZone: ${DROVE_TIMEZONE}\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{instanceLogId}] %message%n\"\n    - type: file #(6)!\n      threshold: ALL\n      timeZone: ${DROVE_TIMEZONE}\n      currentLogFilename: /logs/drove-executor.log\n      archivedLogFilenamePattern: /logs/drove-executor.log-%d-%i\n      archivedFileCount: 3\n      maxFileSize: 100MiB\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n      archive: true\n\n    - type: drove #(7)!\n      logPath: \"/logs/applogs/\"\n      archivedLogFileSuffix: \"%d\"\n      archivedFileCount: 3\n      threshold: TRACE\n      timeZone: ${DROVE_TIMEZONE}\n      logFormat: \"%(%-5level) | %-23date | %-30logger{0} | %message%n\"\n      archive: true\n\nzookeeper: #(8)!\n  connectionString: ${ZK_CONNECTION_STRING}\n\nclusterAuth: #(9)!\n  secrets:\n  - nodeType: CONTROLLER\n    secret: ${DROVE_CONTROLLER_SECRET}\n  - nodeType: EXECUTOR\n    secret: ${DROVE_EXECUTOR_SECRET}\n\nresources: #(10)!\n  osCores: [ 0, 1 ]\n  exposedMemPercentage: 60\n  disableNUMAPinning: ${DROVE_DISABLE_NUMA_PINNING}\n  enableNvidiaGpu: ${DROVE_ENABLE_NVIDIA_GPU}\n\noptions: #(11)!\n  cacheImages: true\n  maxOpenFiles: 10_000\n  logBufferSize: 5m\n  cacheFileSize: 10m\n  cacheFileCount: 3\n
  1. Server listener configuration. See Dropwizard Server Configuration for the different options.
  2. Main port configuration. This is where the UI and APIs will be exposed. Check connector configuration docs for details.
  3. Admin port. You can take thread dumps, view metrics, and run healthchecks on the Drove executor via this port.
  4. Logging configuration. See logging docs.
  5. Log to console. Useful in docker-compose.
  6. Log to rotating files. Useful for running servers.
  7. Drove application logger configuration. See drove logger config for details.
  8. Configure how to connect to Zookeeper. See Zookeeper Config for details.
  9. Configuration for authentication between nodes in the cluster. Please check intra node auth config for details.
  10. Resource configuration for this node.
  11. Options to configure executor behaviour. Check executor options section for details.

Tip

In case you do not want to expose admin APIs outside the host, please set bindHost in the admin connectors section.

adminConnectors:\n  - type: http\n    port: 10001\n    bindHost: 127.0.0.1\n
"},{"location":"cluster/setup/executor-setup.html#zookeeper-connection-configuration","title":"Zookeeper Connection Configuration","text":"

The following details can be configured.

| Name | Option | Description |
|------|--------|-------------|
| Connection String | connectionString | The connection string of the form: zkserver:2181,zkserver2:2181... |
| Data namespace | namespace | The top level node inside which all Drove data will be scoped. Defaults to drove if not set. |

Sample

zookeeper:\n  connectionString: \"192.168.3.10:2181,192.168.3.11:2181,192.168.3.12:2181\"\n  namespace: drovetest\n

Note

This section is the same across the cluster, for both controllers and executors.

"},{"location":"cluster/setup/executor-setup.html#intra-node-authentication-configuration","title":"Intra Node Authentication Configuration","text":"

Communication between controllers and executors is protected by shared-secret based authentication. The following configuration is meant to configure this. This section consists of a list of two entries, one per node type:

Each entry consists of the following:

| Name | Option | Description |
|------|--------|-------------|
| Node Type | nodeType | Type of node in the cluster. Can be CONTROLLER or EXECUTOR |
| Secret | secret | The actual secret to be passed. |

Sample

clusterAuth:\n  secrets:\n  - nodeType: CONTROLLER\n    secret: ControllerSecretValue\n  - nodeType: EXECUTOR\n    secret: ExecutorSecret\n

Note

This section is the same across the cluster, for both controllers and executors.

"},{"location":"cluster/setup/executor-setup.html#drove-application-logger-configuration","title":"Drove Application Logger Configuration","text":"

Drove will segregate application and task instance logs in a directory of your choice. The path for such files is set as:

- <application id>/<instance id> for application instances
- <sourceAppName>/<task id> for task instances
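For example, with logPath set to /logs/applogs/, the resulting layout might look like the following (the IDs and file names shown here are hypothetical):

/logs/applogs/\n  TEST_APP-1/              # application id\n    AI-1234abcd/           # application instance id\n      output.log\n  MY_SOURCE_APP/           # source app name\n    TASK-5678/             # task id\n      output.log\n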

The Drove log appender is based on Logback's Sifting Appender.

The following configuration options are supported:

| Name | Option | Description |
|------|--------|-------------|
| Path | logPath | Directory to host the logs |
| Archive old logs | archive | Whether to enable log rotation |
| Archived File Suffix | archivedLogFileSuffix | Suffix for archived log files. |
| Archived File Count | archivedFileCount | Count of archived log files. Older files are deleted. |
| File Size | maxFileSize | Size of the current log file after which it is archived and a new file is created. Unit: DataSize. |
| Total Size | totalSizeCap | Total size after which deletion takes place. Unit: DataSize. |
| Buffer Size | bufferSize | Buffer size for the logger. (Set to 8KB by default). Used if immediateFlush is turned off. |
| Immediate Flush | immediateFlush | Flush logs immediately. Set to true by default (recommended). |

Sample

logging:\n  level: INFO\n  ...\n\n  appenders:\n    # Setup appenders for the executor process itself first\n    ...\n\n    - type: drove\n      logPath: \"/logs/applogs/\"\n      archivedLogFileSuffix: \"%d\"\n      archivedFileCount: 3\n      threshold: TRACE\n      timeZone: ${DROVE_TIMEZONE}\n      logFormat: \"%(%-5level) | %-23date | %-30logger{0} | %message%n\"\n      archive: true\n

"},{"location":"cluster/setup/executor-setup.html#resource-configuration","title":"Resource Configuration","text":"

This section can be used to configure how resources are exposed from an executor to the cluster. We have discussed above a few of the considerations that should drive this configuration.

| Name | Option | Description |
|------|--------|-------------|
| OS Cores | osCores | A list of cores reserved for use by operating system processes. See the relevant section for details on the pre-steps needed to achieve this. |
| Exposed Memory | exposedMemPercentage | What percentage of the system memory can be used collectively by the containers running on the host. Range: 50-100 (integer). |
| NUMA Pinning | disableNUMAPinning | Disable NUMA and CPU core pinning for containers. Pinning is on by default. (default: false) |
| Nvidia GPU | enableNvidiaGpu | Enable GPU support on containers. This setting makes all available Nvidia GPUs on the current executor machine available for any container running on this executor. GPU resources are not discovered, managed or rationed between containers by the executor. Needs to be used in conjunction with tagging (see tags below) to ensure that only the applications which require a GPU end up on the executor with GPUs. |
| Tags | tags | A set of strings that can be used in the TAG placement policy to route application and task instances to this executor. |
| Over Provisioning | overProvisioning | Set up the over provisioning configuration. |

Tagging

The current hostname is always added as a tag by default and is handled specially to allow non-tagged deployments to be routed to this executor. If any tag is specified in the tags config, this node will receive containers only when MATCH_TAG placement is used. Please check the relevant sections to specify correct placement policies for applications and tasks.
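As a hypothetical fragment, routing an application to an executor tagged gpu_node could look like the following in the application spec; please verify the exact schema against the placement policy documentation:

\"placementPolicy\": {\n  \"type\": \"MATCH_TAG\",\n  \"tag\": \"gpu_node\"\n}\n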

Sample

resources:\n  osCores: [0,1,2,3]\n  exposedMemPercentage: 90\n

"},{"location":"cluster/setup/executor-setup.html#over-provisioning-configuration","title":"Over provisioning configuration","text":"

Drove strives to ensure that containers can run unencumbered on CPU cores allocated to them. This means that the minimum allocation unit possible is 1 for cores. It does not support fractional CPU.

However, there are situations where we would want some non-critical applications to run on the cluster without wasting CPU. The overProvisioning configuration aims to give the user a way to turn off NUMA pinning on the executor and run more containers than it normally would.

To ensure predictability, we do not want pinned and non-pinned containers running on the same host. Hence, an executor host can either be running in pinned mode or in non-pinned mode.

To enable deploying more containers than usual, while still retaining some control over how small a container can go, we specify multipliers on CPU and memory.

Example:

- Let's say your executor server has 40 cores available. If you set cpuMultiplier to 4, this node will now show up to the controller as having 160 cores.
- Let's say your server has 512GB of memory; setting memoryMultiplier to 2 will make Drove see it as 1TB.

| Name | Option | Description |
|------|--------|-------------|
| Enabled | enabled | Set this to true to enable over provisioning. Default: false |
| CPU Multiplier | cpuMultiplier | Multiplier to be applied to enable CPU over provisioning. Default: 1. Range: 1-20 |
| Memory Multiplier | memoryMultiplier | Multiplier to be applied to enable memory over provisioning. Default: 1. Range: 1-20 |

Sample

resources:\n  exposedMemPercentage: 90\n  overProvisioning:\n    enabled: true\n    memoryMultiplier: 1\n    cpuMultiplier: 3\n

Tip

This feature was developed to allow us to run our development environments more cheaply. In such environments there is not much pressure on CPU or memory, but a large number of containers run, as developers can spin up containers for features they are working on. There was no point in wasting a full core on containers that get hit twice a minute or less. On production we tend to err on the side of caution and allocate at least one core even to the most trivial applications, as of the time of writing this.

"},{"location":"cluster/setup/executor-setup.html#executor-options","title":"Executor Options","text":"

The following options can be set to influence the behavior of the Drove executors.

| Name | Option | Description |
|------|--------|-------------|
| Hostname | hostname | Override the hostname that gets exposed to the controller. Make sure this is resolvable. |
| Cache Images | cacheImages | Cache container images. If this is not set, a container image is removed when a container dies and no other instance is using the image. |
| Command Timeout | containerCommandTimeout | Timeout used by the container engine client when issuing container commands to docker or podman. |
| Container Socket Path | dockerSocketPath | The path of the docker socket. Comes in handy to configure the socket path when using podman etc. |
| Max Open Files | maxOpenFiles | Override the maximum number of file descriptors a container can open. Default: 470,000 |
| Log Buffer Size | logBufferSize | The size of the buffer the executor uses to read logs from a container. Unit: DataSize. Range: 1-128MB. Default: 10MB |
| Cache File Size | cacheFileSize | To limit disk usage, configure a fixed size for each cached container log file. Unit: DataSize. Range: 10MB-100GB. Default: 20MB. Compression is always enabled. |
| Cache File Count | cacheFileCount | To limit disk usage, configure a fixed count of cached container log files. Unit: integer. Max: 1024. Default: 3 |

Sample

options:\n  logBufferSize: 20m\n  cacheFileSize: 30m\n  cacheFileCount: 3\n  cacheImages: true\n

"},{"location":"cluster/setup/executor-setup.html#relevant-directories","title":"Relevant directories","text":"

Locations for data and logs are as follows:

We shall be volume mounting the config and log directories with the same name.

Prerequisite Setup

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

"},{"location":"cluster/setup/executor-setup.html#setup-the-config-file","title":"Setup the config file","text":"

Create a relevant configuration file in /etc/drove/executor/executor.yml.

Sample

server:\n  applicationConnectors:\n    - type: http\n      port: 11000\n  adminConnectors:\n    - type: http\n      port: 11001\n  requestLog:\n    appenders:\n      - type: file\n        timeZone: IST\n        currentLogFilename: /var/log/drove/executor/drove-executor-access.log\n        archivedLogFilenamePattern: /var/log/drove/executor/drove-executor-access.log-%d-%i\n        archivedFileCount: 3\n        maxFileSize: 100MiB\n\nlogging:\n  level: INFO\n  loggers:\n    com.phonepe.drove: INFO\n\n\n  appenders:\n    - type: file\n      threshold: ALL\n      timeZone: IST\n      currentLogFilename: /var/log/drove/executor/drove-executor.log\n      archivedLogFilenamePattern: /var/log/drove/executor/drove-executor.log-%d-%i\n      archivedFileCount: 3\n      maxFileSize: 100MiB\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n    - type: drove\n      logPath: \"/var/log/drove/executor/instance-logs\"\n      archivedLogFileSuffix: \"%d-%i\"\n      archivedFileCount: 0\n      maxFileSize: 1GiB\n      threshold: INFO\n      timeZone: IST\n      logFormat: \"%(%-5level) | %-23date | %-30logger{0} | %message%n\"\n      archive: true\n\nzookeeper:\n  connectionString: \"192.168.56.10:2181\"\n\nclusterAuth:\n  secrets:\n  - nodeType: CONTROLLER\n    secret: \"0v8XvJrDc7r86ZY1QCByPTDPninI4Xii\"\n  - nodeType: EXECUTOR\n    secret: \"pOd9sIEXhv0wrGOVc7ebwNvR7twZqyTN\"\n\nresources:\n  osCores: []\n  exposedMemPercentage: 90\n  disableNUMAPinning: true\n  overProvisioning:\n    enabled: true\n    memoryMultiplier: 10\n    cpuMultiplier: 10\n\noptions:\n  cacheImages: true\n  logBufferSize: 20m\n  cacheFileSize: 30m\n  cacheFileCount: 3\n

"},{"location":"cluster/setup/executor-setup.html#setup-required-environment-variables","title":"Setup required environment variables","text":"

Environment variables needed to run the Drove executor are set up in /etc/drove/executor/executor.env.

CONFIG_FILE_PATH=/etc/drove/executor/executor.yml\nJAVA_PROCESS_MIN_HEAP=1g\nJAVA_PROCESS_MAX_HEAP=1g\nZK_CONNECTION_STRING=\"192.168.56.10:2181\"\nJAVA_OPTS=\"-Xlog:gc:/var/log/drove/executor/gc.log -Xlog:gc:::filecount=3,filesize=10M -Xlog:gc::time,level,tags -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -Dfile.encoding=utf-8 -Djute.maxbuffer=0x9fffff\"\n
"},{"location":"cluster/setup/executor-setup.html#create-systemd-file","title":"Create systemd file","text":"

Create a systemd file. Put the following in /etc/systemd/system/drove.executor.service:

[Unit]\nDescription=Drove Executor Service\nAfter=docker.service\nRequires=docker.service\n\n[Service]\nUser=drove\nTimeoutStartSec=0\nRestart=always\nExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-executor:latest\nExecStart=/usr/bin/docker run  \\\n    --env-file /etc/drove/executor/executor.env \\\n    --volume /etc/drove/executor:/etc/drove/executor:ro \\\n    --volume /var/log/drove/executor:/var/log/drove/executor \\\n    --volume /var/run/docker.sock:/var/run/docker.sock \\\n    --publish 11000:11000  \\\n    --publish 11001:11001 \\\n    --hostname %H \\\n    --rm \\\n    --name drove.executor \\\n    ghcr.io/phonepe/drove-executor:latest\n\n[Install]\nWantedBy=multi-user.target\n
Verify the file with the following command:
systemd-analyze verify drove.executor.service\n

Set permissions

chmod 664 /etc/systemd/system/drove.executor.service\n

"},{"location":"cluster/setup/executor-setup.html#start-the-service-on-all-servers","title":"Start the service on all servers","text":"

Use the following to start the service:

systemctl daemon-reload\nsystemctl enable drove.executor\nsystemctl start drove.executor\n

You can tail the logs at /var/log/drove/executor/drove-executor.log.

The executor should now show up on the Drove Console.

"},{"location":"cluster/setup/gateway.html","title":"Setting up Drove Gateway","text":"

The Drove Gateway works as a gateway to expose apps running on a Drove cluster to the rest of the world.

The Drove Gateway container uses NGinx and a modified version of Nixy to track Drove endpoints. More details about this can be found in the drove-gateway project.

"},{"location":"cluster/setup/gateway.html#drove-gateway-nixy-configuration-reference","title":"Drove Gateway Nixy Configuration Reference","text":"

The Nixy running inside the gateway container is configured using a custom TOML file. This section describes this file:

address = \"127.0.0.1\"# (1)!\nport = \"6000\"\n\n\n# Drove Options\ndrove = [#(2)!\n  \"http://controller1.mydomain:10000\",\n   \"http://controller2.mydomain:10000\"\n   ]\n\nleader_vhost = \"drove-staging.mydomain\"#(3)!\nevent_refresh_interval_sec = 5#(5)!\nuser = \"\"#(6)!\npass = \"\"\naccess_token = \"\"#(7)!\n\n# Parameters to control which apps are exposed as VHost\nrouting_tag = \"externally_exposed\"#(4)!\nrealm = \"api.mydomain,support.mydomain\"#(8)!\nrealm_suffix = \"-external.mydomain\"#(9)!\n\n# Nginx related config\n\nnginx_config = \"/etc/nginx/nginx.conf\"#(10)!\nnginx_template = \"/etc/drove/gateway/nginx.tmpl\"#(11)!\nnginx_cmd = \"nginx\"#(12)!\nnginx_ignore_check = true#(13)!\n\n# NGinx plus specific options\nnginxplusapiaddr=\"127.0.0.1\"#(14)!\nnginx_reload_disabled=true#(15)!\nmaxfailsupstream = 0#(16)!\nfailtimeoutupstream = \"1s\"\nslowstartupstream = \"0s\"\n
  1. Nixy listener configuration. Endpoint for nixy itself.

  2. List of Drove controllers. Add all controller nodes here. Nixy will automatically determine and track the current leader.

    Auto detection is disabled if a single endpoint is specified.

  3. Helps create a vhost entry that tracks the leader on the cluster. Use this to expose the Drove endpoint to users. The value for this will be available to the template engine as the LeaderVHost variable.

  4. If some special routing behaviour needs to be implemented in the template based on tag metadata of the deployed apps, set the routing_tag option to the tag name to be used. The actual value is derived from the app instances and exposed to the template engine as the variable RoutingTag. Optional. (See the template fragment sketch after this list.)

    In this example, the RoutingTag variable will be set to the value specified against the routing_tag tag key when deploying the Drove application. For example, if we want to expose the app, we can set it to yes, and filter the VHosts to be exposed in the NGinx template when RoutingTag == \"yes\".

  5. Drove Gateway/Nixy works by polling the controller for events; this is the polling interval. Tune this carefully, especially if the number of NGinx nodes is high. The default is 2 seconds. Unless the cluster is really busy with a high rate of container churn, the default strikes a good balance between apps becoming discoverable quickly and putting the leader controller under heavy load.

  6. user and pass are optional params that can be used to set basic auth credentials for the calls made to the Drove controllers, if basic auth is enabled on the cluster. Leave empty if no basic auth is required.

  7. If the cluster has some custom header-based auth, this can be used. The contents of this parameter are passed verbatim in the Authorization HTTP header. Leave empty if no token auth is enabled on the cluster.

  8. By default drove-gateway will expose all vhosts declared in the specs of all Drove apps on a cluster (caveat: filtering can be done using RoutingTag as well). If only specific vhosts need to be exposed, set the realm parameter to a comma separated list of vhosts. Optional.

  9. Besides exact vhost matching, Drove Gateway supports suffix-based matches as well. A single suffix is supported. Optional.

  10. Path to NGinx config.

  11. Path to the template file, based on which the NGinx config will be generated.

  12. NGinx command to use to reload the config. Set this to openresty optionally to use openresty.

  13. Skip calling the NGinx command to test the config. Set this to false or delete this line in production. Default: false.

  14. If using NGinx plus, set the endpoint to the local server here. If left empty, NGinx plus api based vhost update will be disabled.

  15. If specific vhosts are exposed, auto-discovery and updating of the config (and NGinx reloads) might not be desired, as reloads will cause connection drops. Set this parameter to true to disable reloads; Nixy will then only update upstreams using the NGinx plus APIs. Default: false.

  16. Connection parameters for NGinx plus.
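As referenced in the routing_tag note above, a hypothetical fragment of the NGinx template filtering apps on RoutingTag could look like the following; how the tag value is exposed per app is an assumption here, so adapt it to the actual template variables:

{{- range $id, $app := .Apps}}\n{{- /* Assumption: tag value exposed per app as RoutingTag */}}\n{{- if eq $app.RoutingTag \"yes\"}}\n# emit upstream and server blocks only for {{$app.Vhost}}\n{{- end}}\n{{- end}}\n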

NGinx plus

NGinx plus is not shipped with this docker image. If you want to use NGinx plus, please build nixy from the source tree here and build your own container.

"},{"location":"cluster/setup/gateway.html#relevant-directories","title":"Relevant directories","text":"

Locations for data and logs are as follows:

We shall be volume mounting the config and log directories with the same name.

Prerequisite Setup

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

Go through the following steps to run drove-gateway as a service.

"},{"location":"cluster/setup/gateway.html#create-the-toml-config-for-nixy","title":"Create the TOML config for Nixy","text":"

Sample config file /etc/drove/gateway/gateway.toml:

address = \"127.0.0.1\"\nport = \"6000\"\n\n\n# Drove Options\ndrove = [\n  \"http://controller1.mydomain:10000\",\n   \"http://controller2.mydomain:10000\"\n   ]\n\nleader_vhost = \"drove-staging.mydomain\"\nevent_refresh_interval_sec = 5\nuser = \"guest\"\npass = \"guest\"\n\n\n# Nginx related config\nnginx_config = \"/etc/nginx/nginx.conf\"\nnginx_template = \"/etc/drove/gateway/nginx.tmpl\"\nnginx_cmd = \"nginx\"\nnginx_ignore_check = true\n

Replace domain names

Please remember to update mydomain to a valid domain you want to use.

"},{"location":"cluster/setup/gateway.html#create-template-for-nginx","title":"Create template for NGinx","text":"

Create a NGinx template with the following config in /etc/drove/gateway/nginx.tmpl

# Generated by drove-gateway {{datetime}}\n\nuser www-data;\nworker_processes auto;\npid /run/nginx.pid;\n\nevents {\n    use epoll;\n    worker_connections 2048;\n    multi_accept on;\n}\nhttp {\n    server_names_hash_bucket_size  128;\n    add_header X-Proxy {{ .Xproxy }} always;\n    access_log /var/log/nginx/access.log;\n    error_log /var/log/nginx/error.log warn;\n    server_tokens off;\n    client_max_body_size 128m;\n    proxy_buffer_size 128k;\n    proxy_buffers 4 256k;\n    proxy_busy_buffers_size 256k;\n    proxy_redirect off;\n    map $http_upgrade $connection_upgrade {\n        default upgrade;\n        ''      close;\n    }\n    # time out settings\n    proxy_send_timeout 120;\n    proxy_read_timeout 120;\n    send_timeout 120;\n    keepalive_timeout 10;\n\n    server {\n        listen       7000 default_server;\n        server_name  _;\n        # Everything is a 503\n        location / {\n            return 503;\n        }\n    }\n    {{if and .LeaderVHost .Leader.Endpoint}}\n    upstream {{.LeaderVHost}} {\n        server {{.Leader.Host}}:{{.Leader.Port}};\n    }\n    server {\n        listen 7000;\n        server_name {{.LeaderVHost}};\n        location / {\n            proxy_set_header HOST {{.Leader.Host}};\n            proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;\n            proxy_connect_timeout 30;\n            proxy_http_version 1.1;\n            proxy_set_header Upgrade $http_upgrade;\n            proxy_set_header Connection $connection_upgrade;\n            proxy_pass http://{{.LeaderVHost}};\n        }\n    }\n    {{end}}\n    {{- range $id, $app := .Apps}}\n    upstream {{$app.Vhost}} {\n        {{- range $app.Hosts}}\n        server {{ .Host }}:{{ .Port }};\n        {{- end}}\n    }\n    server {\n        listen 7000;\n        server_name {{$app.Vhost}};\n        location / {\n            proxy_set_header HOST $host;\n            proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;\n            proxy_connect_timeout 30;\n            proxy_http_version 1.1;\n            proxy_set_header Upgrade $http_upgrade;\n            proxy_set_header Connection $connection_upgrade;\n            proxy_pass http://{{$app.Vhost}};\n        }\n    }\n    {{- end}}\n}\n

The above template will do the following:

"},{"location":"cluster/setup/gateway.html#create-environment-file","title":"Create environment file","text":"

We want to configure the drove gateway container using the required environment variables. To do that, put the following in /etc/drove/gateway/gateway.env:

CONFIG_FILE_PATH=/etc/drove/gateway/gateway.toml\nTEMPLATE_FILE_PATH=/etc/drove/gateway/nginx.tmpl\n
"},{"location":"cluster/setup/gateway.html#create-systemd-file","title":"Create systemd file","text":"

Create a systemd file. Put the following in /etc/systemd/system/drove.gateway.service:

[Unit]\nDescription=Drove Gateway Service\nAfter=docker.service\nRequires=docker.service\n\n[Service]\nUser=drove\nTimeoutStartSec=0\nRestart=always\nExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-gateway:latest\nExecStart=/usr/bin/docker run  \\\n    --env-file /etc/drove/gateway/gateway.env \\\n    --volume /etc/drove/gateway:/etc/drove/gateway:ro \\\n    --volume /var/log/drove/gateway:/var/log/nginx \\\n    --network host \\\n    --hostname %H \\\n    --rm \\\n    --name drove.gateway \\\n    ghcr.io/phonepe/drove-gateway:latest\n\n[Install]\nWantedBy=multi-user.target\n

Verify the file with the following command:

systemd-analyze verify drove.gateway.service\n

Set permissions

chmod 664 /etc/systemd/system/drove.gateway.service\n

"},{"location":"cluster/setup/gateway.html#start-the-service-on-all-servers","title":"Start the service on all servers","text":"

Use the following to start the service:

systemctl daemon-reload\nsystemctl enable drove.gateway\nsystemctl start drove.gateway\n
"},{"location":"cluster/setup/gateway.html#checking-logs","title":"Checking Logs","text":"

You can check logs using:

journalctl -u drove.gateway -f\n

NGinx logs would be available at /var/log/drove/gateway.

"},{"location":"cluster/setup/gateway.html#log-rotation-for-nginx","title":"Log rotation for NGinx","text":"

The gateway sets up log rotation for the access and error logs with the following config:

/var/log/nginx/*.log {\n    rotate 5\n    size 10M\n    dateext\n    dateformat -%Y-%m-%d\n    missingok\n    compress\n    delaycompress\n    sharedscripts\n    notifempty\n    postrotate\n        test -r /var/run/nginx.pid && kill -USR1 `cat /var/run/nginx.pid`\n    endscript\n}\n

This will rotate both error and access logs when they hit 10MB and keep 5 logs.

Configure the above as needed, and volume mount your config to /etc/logrotate.d/nginx to use a different scheme as per your requirements.
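For example, a custom scheme could be wired in by adding a volume mount like the following to the ExecStart in the systemd unit above (the host path shown is an example):

--volume /etc/drove/gateway/nginx-logrotate.conf:/etc/logrotate.d/nginx:ro \\\n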

"},{"location":"cluster/setup/maintenance.html","title":"Maintaining a Drove Cluster","text":"

There are a couple of constructs built into Drove to allow for easy maintenance.

"},{"location":"cluster/setup/maintenance.html#maintenance-mode","title":"Maintenance mode","text":"

Drove supports a maintenance mode to allow for software updates without affecting the containers running on the cluster.

Danger

In maintenance mode, outage detection is turned off, and container failures for applications are not acted upon even if detected.

"},{"location":"cluster/setup/maintenance.html#engaging-maintenance-mode","title":"Engaging maintenance mode","text":"

Set cluster to maintenance mode.

Preconditions - Cluster must be in the following state: NORMAL

Drove CLI
drove -c local cluster maintenance-on\n

Sample Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/set' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"state\": \"MAINTENANCE\",\n        \"updated\": 1721630351178\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"cluster/setup/maintenance.html#disengaging-maintenance-mode","title":"Disengaging maintenance mode","text":"

Set cluster to normal mode.

Preconditions - Cluster must be in the following state: MAINTENANCE

Drove CLI
drove -c local cluster maintenance-off\n

Sample Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/unset' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"state\": \"NORMAL\",\n        \"updated\": 1721630491296\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"cluster/setup/maintenance.html#updating-drove-version-across-the-cluster-quickly","title":"Updating drove version across the cluster quickly","text":"

We recommend the following sequence of steps:

  1. Find the leader controller for the cluster using drove ... cluster leader.
  2. Update the controller container on the nodes that are not the leader.

    If you are using the systemd file given here, you just need to restart the controller service using systemctl restart drove.controller

  3. Set cluster to maintenance mode using drove ... cluster maintenance-on.

  4. Update the leader controller.

    If you are using the systemd file given here, you just need to restart the leader controller service: systemctl restart drove.controller

  5. Update the executors.

    If you are using the systemd file given here, you just need to restart all executors: systemctl restart drove.executor

  6. Take cluster out of maintenance mode: drove ... cluster maintenance-off
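Assuming the systemd units from this guide and a Drove CLI config named local, the sequence above could be run roughly as follows (a sketch; run each systemctl command on the appropriate nodes):

# 1. Find the leader\ndrove -c local cluster leader\n\n# 2. On each non-leader controller node\nsystemctl restart drove.controller\n\n# 3. Engage maintenance mode\ndrove -c local cluster maintenance-on\n\n# 4. On the leader controller node\nsystemctl restart drove.controller\n\n# 5. On each executor node\nsystemctl restart drove.executor\n\n# 6. Disengage maintenance mode\ndrove -c local cluster maintenance-off\n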

"},{"location":"cluster/setup/maintenance.html#executor-blacklisting","title":"Executor blacklisting","text":"

In cases where we want to take an executor node out of the cluster for planned maintenance, we need to ensure that application instances running on the node are replaced by containers on other nodes, and that the ones running here are shut down cleanly.

This is achieved by blacklisting the node.

Tip

Whenever blacklisting is done, it causes some flux in the application topology due to container migration from blacklisted to normal nodes. To reduce the number of times this happens, plan to perform multiple operations together, and blacklist and un-blacklist executors together.

Drove will optimize bulk blacklisting related app migrations and will migrate containers together for an app only once rather than once for every node.

Danger

Task instances are not migrated out. This is because it is impossible for Drove to know if a task can be migrated or not (i.e. killed and spun up on a new node in any order).

To blacklist executors do the following:

Drove CLI
drove -c local executor blacklist dd2cbe76-9f60-3607-b7c1-bfee91c15623 ex1 ex2 \n

Sample Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/blacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d&id=ex1&id=ex2' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"failed\": [\n            \"ex2\",\n            \"ex1\"\n        ],\n        \"successful\": [\n            \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\"\n        ]\n    },\n    \"message\": \"success\"\n}\n

To un-blacklist executors do the following:

Drove CLI
drove -c local executor unblacklist dd2cbe76-9f60-3607-b7c1-bfee91c15623 ex1 ex2 \n

Sample Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/unblacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d&id=ex1&id=ex2' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"failed\": [\n            \"ex2\",\n            \"ex1\"\n        ],\n        \"successful\": [\n            \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\"\n        ]\n    },\n    \"message\": \"success\"\n}\n

Note

Drove will not re-evaluate placement of existing Applications in RUNNING state once executors are brought back into rotation.

"},{"location":"cluster/setup/planning.html","title":"Planning your cluster","text":"

Running a Drove cluster in production for critical workloads involves planning and preparation around factors like availability, scale, security and access management. The following issues should be considered while planning your Drove cluster.

"},{"location":"cluster/setup/planning.html#criteria-for-planning","title":"Criteria for planning","text":"

The simplest form of a Drove cluster would run the controller, Zookeeper, executor and gateway services all on the same machine, while a highly available one would separate out all components according to the following considerations:

"},{"location":"cluster/setup/planning.html#cluster-configuration","title":"Cluster configuration","text":""},{"location":"cluster/setup/planning.html#controllers","title":"Controllers","text":"

Controllers will manage the cluster, with application instances spread across multiple executors as per different placement policies. Controllers use leader election to coordinate and act as a single entity, while each executor acts independently and runs many different application instances.

"},{"location":"cluster/setup/planning.html#zookeeper","title":"Zookeeper","text":""},{"location":"cluster/setup/planning.html#executors","title":"Executors","text":""},{"location":"cluster/setup/planning.html#gateways","title":"Gateways","text":""},{"location":"cluster/setup/prerequisites.html","title":"Setting up the prerequisites","text":"

On all machines in the Drove cluster, we want to use the same user and have a consistent storage structure for configuration, logs etc.

Note

All commands are to be issued as root. To get to admin/root mode, issue the following command:

sudo su\n
"},{"location":"cluster/setup/prerequisites.html#setting-up-user","title":"Setting up user","text":"

We shall create a user called drove to run all services and containers, and assign file ownership to this user.

adduser --system --group \"drove\" --home /var/lib/misc --no-create-home > /dev/null\n
We want the user to be able to run docker containers, so we add the user to the docker group:

groupadd docker\nusermod -aG docker drove\n
"},{"location":"cluster/setup/prerequisites.html#create-directories","title":"Create directories","text":"

We shall use the following locations to store configurations, logs etc:

We go ahead and create these locations and set up the correct permissions:

mkdir -p /etc/drove\nchown -R drove.drove /etc/drove\nchmod 700 /etc/drove\nchmod g+s /etc/drove\n\nmkdir -p /var/lib/drove\nchown -R drove.drove /var/lib/drove\nchmod 700 /var/lib/drove\n\nmkdir -p /var/log/drove\n

Danger

Ensure you run the chmod commands to remove read access for everyone other than the owner.

"},{"location":"cluster/setup/units.html","title":"Units Reference","text":"

In the configuration files for Drove, we use the Duration and DataSize units to make configuration easier.

"},{"location":"cluster/setup/units.html#data-size","title":"Data Size","text":"

Use the following shortcuts to express sizes in human readable form such as 2GB etc:

"},{"location":"cluster/setup/units.html#duration","title":"Duration","text":"

Time durations in Drove can be expressed in human readable form; for example, 3d can be used to signify 3 days and so on. The list of valid duration unit suffixes is:
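For example, specs elsewhere in this documentation use these units as plain strings (this snippet combines fields from different specs purely for illustration):

{\n    \"dockerPullTimeout\": \"100 seconds\",\n    \"timeout\": \"5m\",\n    \"maxSize\": \"100m\"\n}\n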

"},{"location":"cluster/setup/zookeeper.html","title":"Setting Up Zookeeper","text":"

We shall be running Zookeeper using the official Docker images. All data volumes etc will be mounted on the host machines.

The following ports will be exposed:
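- 2181: client port, used by Drove controllers and executors to connect to Zookeeper
- 2888: peer port, used for follower-to-leader connections
- 3888: leader election port

(These match the ports published in the systemd unit below.)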

Danger

The ZK admin server does not always shut down cleanly and is not needed for anything related to Drove. If you do not need it, turn it off.

We assume the following to be the IP for the 3 zookeeper nodes:
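- 192.168.3.10
- 192.168.3.11
- 192.168.3.12

These are the IPs referenced in the ZOO_SERVERS configuration below; replace them with your own.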

"},{"location":"cluster/setup/zookeeper.html#relevant-directories","title":"Relevant directories","text":"

Location for data and logs are as follows:

"},{"location":"cluster/setup/zookeeper.html#important-files","title":"Important files","text":"

The zookeeper container stores snapshots, transaction logs and application logs in the /data, /datalog and /logs directories respectively. We shall be volume mounting the following:
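- /var/lib/drove/zk/data on the host, mounted to /data in the container
- /var/lib/drove/zk/datalog on the host, mounted to /datalog in the container
- /var/log/drove/zk on the host, mounted to /logs in the container

(These mappings match the volume mounts in the systemd unit below.)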

Docker will create these directories when the container comes up for the first time.

Tip

The zk server id (as set above using the ZOO_MY_ID) can also be set by putting the server number in a file named myid in the /data directory.

Prerequisite Setup

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

"},{"location":"cluster/setup/zookeeper.html#setup-configuration-files","title":"Setup configuration files","text":"

Let's create the config directory:

mkdir -p /etc/drove/zk\n

We shall be creating 3 different configuration files to configure zookeeper:
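- zk.env: environment variables for the Zookeeper container
- java.env: JVM parameters
- logback.xml: logging configuration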

"},{"location":"cluster/setup/zookeeper.html#setup-environment-variables","title":"Setup environment variables","text":"

Let us prepare the configuration. Put the following in a file: /etc/drove/zk/zk.env:

#(1)!\nZOO_TICK_TIME=2000\nZOO_INIT_LIMIT=10\nZOO_SYNC_LIMIT=5\nZOO_STANDALONE_ENABLED=false\nZOO_ADMINSERVER_ENABLED=false\n\n#(2)!\nZOO_AUTOPURGE_PURGEINTERVAL=12\nZOO_AUTOPURGE_SNAPRETAINCOUNT=5\n\n#(3)!\nZOO_MY_ID=1\nZOO_SERVERS=server.1=192.168.3.10:2888:3888;2181 server.2=192.168.3.11:2888:3888;2181 server.3=192.168.3.12:2888:3888;2181\n
  1. This is cluster-level configuration to ensure the cluster topology remains stable through minor flaps.
  2. This will control how much data we retain.
  3. This section needs to change per server. Each server should have a different ZOO_MY_ID set, and the same numbers are referenced in the ZOO_SERVERS section.

Warning

Info

Exhaustive set of options can be found on the Official Docker Page.

"},{"location":"cluster/setup/zookeeper.html#setup-jvm-parameters","title":"Setup JVM parameters","text":"

Put the following in /etc/drove/zk/java.env

export SERVER_JVMFLAGS='-Djute.maxbuffer=0x9fffff -Xmx4g -Xms4g -Dfile.encoding=utf-8 -XX:+UseG1GC -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError'\n

Configuring Max Data Size

Drove data per node can get a bit large from time to time, depending on your application configuration. To be on the safe side, we need to increase the maximum data size per node. This is achieved by setting the JVM option -Djute.maxbuffer=0x9fffff on all cluster nodes in Drove. This is approximately 10MB; the actual payload doesn't reach anywhere close. However, we shall be picking up payload compression in a future version so that this variable no longer needs to be set.

For the Zookeeper Docker, the environment variable SERVER_JVMFLAGS needs to be set to -Djute.maxbuffer=0x9fffff.

Please refer to Zookeeper Advanced Configuration for further properties that can be tuned.

JVM Size

We set 4GB JVM heap size for ZK by adding appropriate options in SERVER_JVMFLAGS. Please make sure you have sized your machines to have 10-16GB of RAM at the very least. Tune the JVM size and machine size according to your needs.


JVMFLAGS environment variable

Do not set this variable in zk.env, for a couple of reasons:

"},{"location":"cluster/setup/zookeeper.html#configure-logging","title":"Configure logging","text":"

We want to have physical log files on disk for debugging and audits and want the container to be ephemeral to allow for easy updates etc. To achieve this, put the following in /etc/drove/zk/logback.xml:

<!--\n Copyright 2022 The Apache Software Foundation\n\n Licensed to the Apache Software Foundation (ASF) under one\n or more contributor license agreements.  See the NOTICE file\n distributed with this work for additional information\n regarding copyright ownership.  The ASF licenses this file\n to you under the Apache License, Version 2.0 (the\n \"License\"); you may not use this file except in compliance\n with the License.  You may obtain a copy of the License at\n\n     http://www.apache.org/licenses/LICENSE-2.0\n\n Unless required by applicable law or agreed to in writing, software\n distributed under the License is distributed on an \"AS IS\" BASIS,\n WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n See the License for the specific language governing permissions and\n limitations under the License.\n\n Define some default values that can be overridden by system properties\n-->\n<configuration>\n  <!-- Uncomment this if you would like to expose Logback JMX beans -->\n  <!--jmxConfigurator /-->\n\n  <property name=\"zookeeper.console.threshold\" value=\"INFO\" />\n\n  <property name=\"zookeeper.log.dir\" value=\"/logs\" />\n  <property name=\"zookeeper.log.file\" value=\"zookeeper.log\" />\n  <property name=\"zookeeper.log.threshold\" value=\"INFO\" />\n  <property name=\"zookeeper.log.maxfilesize\" value=\"256MB\" />\n  <property name=\"zookeeper.log.maxbackupindex\" value=\"20\" />\n\n  <!--\n    console\n    Add \"console\" to root logger if you want to use this\n  -->\n  <appender name=\"CONSOLE\" class=\"ch.qos.logback.core.ConsoleAppender\">\n    <encoder>\n      <pattern>%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n</pattern>\n    </encoder>\n    <filter class=\"ch.qos.logback.classic.filter.ThresholdFilter\">\n      <level>${zookeeper.console.threshold}</level>\n    </filter>\n  </appender>\n\n  <!--\n    Add ROLLINGFILE to root logger to get log file output\n  -->\n  <appender name=\"ROLLINGFILE\" class=\"ch.qos.logback.core.rolling.RollingFileAppender\">\n    <File>${zookeeper.log.dir}/${zookeeper.log.file}</File>\n    <encoder>\n      <pattern>%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n</pattern>\n    </encoder>\n    <filter class=\"ch.qos.logback.classic.filter.ThresholdFilter\">\n      <level>${zookeeper.log.threshold}</level>\n    </filter>\n    <rollingPolicy class=\"ch.qos.logback.core.rolling.FixedWindowRollingPolicy\">\n      <maxIndex>${zookeeper.log.maxbackupindex}</maxIndex>\n      <FileNamePattern>${zookeeper.log.dir}/${zookeeper.log.file}.%i</FileNamePattern>\n    </rollingPolicy>\n    <triggeringPolicy class=\"ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy\">\n      <MaxFileSize>${zookeeper.log.maxfilesize}</MaxFileSize>\n    </triggeringPolicy>\n  </appender>\n\n  <!--\n    Add TRACEFILE to root logger to get log file output\n    Log TRACE level and above messages to a log file\n  -->\n  <!--property name=\"zookeeper.tracelog.dir\" value=\"${zookeeper.log.dir}\" />\n  <property name=\"zookeeper.tracelog.file\" value=\"zookeeper_trace.log\" />\n  <appender name=\"TRACEFILE\" class=\"ch.qos.logback.core.FileAppender\">\n    <File>${zookeeper.tracelog.dir}/${zookeeper.tracelog.file}</File>\n    <encoder>\n      <pattern>%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n</pattern>\n    </encoder>\n    <filter class=\"ch.qos.logback.classic.filter.ThresholdFilter\">\n      <level>TRACE</level>\n    </filter>\n  </appender-->\n\n  <!--\n    zk audit logging\n  -->\n  <property 
name=\"zookeeper.auditlog.file\" value=\"zookeeper_audit.log\" />\n  <property name=\"zookeeper.auditlog.threshold\" value=\"INFO\" />\n  <property name=\"audit.logger\" value=\"INFO, RFAAUDIT\" />\n\n  <appender name=\"RFAAUDIT\" class=\"ch.qos.logback.core.rolling.RollingFileAppender\">\n    <File>${zookeeper.log.dir}/${zookeeper.auditlog.file}</File>\n    <encoder>\n      <pattern>%d{ISO8601} %p %c{2}: %m%n</pattern>\n    </encoder>\n    <filter class=\"ch.qos.logback.classic.filter.ThresholdFilter\">\n      <level>${zookeeper.auditlog.threshold}</level>\n    </filter>\n    <rollingPolicy class=\"ch.qos.logback.core.rolling.FixedWindowRollingPolicy\">\n      <maxIndex>10</maxIndex>\n      <FileNamePattern>${zookeeper.log.dir}/${zookeeper.auditlog.file}.%i</FileNamePattern>\n    </rollingPolicy>\n    <triggeringPolicy class=\"ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy\">\n      <MaxFileSize>10MB</MaxFileSize>\n    </triggeringPolicy>\n  </appender>\n\n  <logger name=\"org.apache.zookeeper.audit.Slf4jAuditLogger\" additivity=\"false\" level=\"${audit.logger}\">\n    <appender-ref ref=\"RFAAUDIT\" />\n  </logger>\n\n  <root level=\"INFO\">\n    <appender-ref ref=\"CONSOLE\" />\n    <appender-ref ref=\"ROLLINGFILE\" />\n  </root>\n</configuration>\n

Tip

This is a customization of the original file from the Zookeeper source tree. Please refer to the documentation to configure logging.

"},{"location":"cluster/setup/zookeeper.html#create-systemd-file","title":"Create Systemd File","text":"

Create a systemd file. Put the following in /etc/systemd/system/drove.zookeeper.service:

[Unit]\nDescription=Drove Zookeeper Service\nAfter=docker.service\nRequires=docker.service\n\n[Service]\nUser=drove\nTimeoutStartSec=0\nRestart=always\nExecStartPre=-/usr/bin/docker pull zookeeper:3.8\nExecStart=/usr/bin/docker run \\\n    --env-file /etc/drove/zk/zk.env \\\n    --volume /var/lib/drove/zk/data:/data \\\n    --volume /var/lib/drove/zk/datalog:/datalog \\\n    --volume /var/log/drove/zk:/logs \\\n    --volume /etc/drove/zk/logback.xml:/conf/logback.xml \\\n    --volume /etc/drove/zk/java.env:/conf/java.env \\\n    --publish 2181:2181 \\\n    --publish 2888:2888 \\\n    --publish 3888:3888 \\\n    --rm \\\n    --name drove.zookeeper \\\n    zookeeper:3.8\n\n[Install]\nWantedBy=multi-user.target\n

Verify the file with the following command:

systemd-analyze verify drove.zookeeper.service\n

Set permissions

chmod 664 /etc/systemd/system/drove.zookeeper.service\n

"},{"location":"cluster/setup/zookeeper.html#start-the-service-on-all-servers","title":"Start the service on all servers","text":"

Use the following to start the service:

systemctl daemon-reload\nsystemctl enable drove.zookeeper\nsystemctl start drove.zookeeper\n

You can check server status using the following:

echo srvr | nc localhost 2181\n

Tip

Replace localhost in the above command with the actual ZK server IPs to test remote connectivity.

Note

You can access the ZK client from the container using the following command:

docker exec -it drove.zookeeper bin/zkCli.sh\n

To connect to a remote host you can use the following:

docker exec -it drove.zookeeper bin/zkCli.sh -server <server name or ip>:2181\n

"},{"location":"extra/cli.html","title":"Drove CLI","text":"

Details for the Drove CLI, including installation and usage can be found in the cli repo.

Repo link: https://github.com/PhonePe/drove-cli.

"},{"location":"extra/epoch.html","title":"Epoch","text":"

Epoch is a cron type scheduler that spins up container jobs on Drove.

Details for using epoch can be found in the epoch repo.

Link for Epoch repo: https://github.com/PhonePe/epoch.

"},{"location":"extra/epoch.html#epoch-cli","title":"Epoch CLI","text":"

There is a cli client for interaction with epoch. Details for installation and usage can be found in the epoch CLI repo.

Link for Epoch CLI repo: https://github.com/phonepe/epoch-cli.

"},{"location":"extra/libraries.html","title":"Libraries","text":"

Drove is written in Java. We provide a few libraries that can be used to integrate with a Drove cluster.

"},{"location":"extra/libraries.html#setup","title":"Setup","text":"

Setup the drove version

<properties>\n    <!--other properties-->\n    <drove.version>1.29</drove.version>\n</properties>\n

Checking the latest version

Latest version can be checked at the github packages page here

All libraries are located in sub packages of the top level package com.phonepe.drove.

Java Version Compatibility

Using the Drove libraries requires Java 17+.

"},{"location":"extra/libraries.html#drove-model","title":"Drove Model","text":"

The model library contains the classes used in requests and responses. It has dependencies on jackson and dropwizard-validation.

"},{"location":"extra/libraries.html#dependency","title":"Dependency","text":"
<dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-models</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n
"},{"location":"extra/libraries.html#drove-client","title":"Drove Client","text":"

We provide a client library that can be used to connect to a Drove cluster. The client accepts controller endpoints as a parameter (among other things) and automatically tracks the leader controller. If a single controller endpoint is provided, this functionality is turned off.

Please note that the client does not provide specific functions corresponding to the different API calls on the controller; it acts as a simple endpoint discovery mechanism for the Drove cluster. Please refer to the API section for details on individual APIs.

"},{"location":"extra/libraries.html#transport","title":"Transport","text":"

The transport layer in the client is used to actually make HTTP calls to the Drove server. A new transport can be used by implementing the get(), post(), put() and delete() methods in the DroveHttpTransport interface.

By default, the Drove client uses the Java built-in HTTP client as a trivial transport implementation. We also provide an Apache HttpComponents based implementation.

Tip

Do not use the default transport in production. Please use the HTTP Components based transport or your custom ones.

"},{"location":"extra/libraries.html#dependencies","title":"Dependencies","text":"
 <dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-client</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n<dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-client-httpcomponent-transport</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n
"},{"location":"extra/libraries.html#sample-code","title":"Sample code","text":"
public class DroveCluster implements AutoCloseable {\n\n    @Getter\n    private final DroveClient droveClient;\n\n    public DroveCluster() {\n        final var config = new DroveConfig()\n            .setEndpoints(List.of(\"http://controller1:4000\", \"http://controller2:4000\"));\n\n        this.droveClient = new DroveClient(config,\n                                      List.of(new BasicAuthDecorator(\"guest\", \"guest\")),\n                                           new DroveHttpComponentsTransport(config.getCluster()));\n    }\n\n    @Override\n    public void close() throws Exception {\n        this.droveClient.close();\n    }\n}\n

RequestDecorator

This interface can be implemented to augment requests with special headers, for example Authorization, as well as for other things like setting the content type.

"},{"location":"extra/libraries.html#drove-event-listener","title":"Drove Event Listener","text":"

This library provides callbacks that can be used to listen and react to events happening on the Drove cluster.

"},{"location":"extra/libraries.html#dependencies_1","title":"Dependencies","text":"
<!--Include Drove client-->\n<dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-events-client</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n
"},{"location":"extra/libraries.html#sample-code_1","title":"Sample Code","text":"
final var droveClient = ... //build your java transport, client here\n\n//Create and setup your object mapper\nfinal var mapper = new ObjectMapper();\nmapper.registerModule(new ParameterNamesModule());\nmapper.setSerializationInclusion(JsonInclude.Include.NON_EMPTY);\nmapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);\nmapper.disable(SerializationFeature.FAIL_ON_EMPTY_BEANS);\nmapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);\nmapper.enable(MapperFeature.ACCEPT_CASE_INSENSITIVE_ENUMS);\n\nfinal var listener = new DroveRemoteEventListener(droveClient, //Create listener\n                                                    mapper,\n                                                    new DroveEventPollingOffsetInMemoryStore(),\n                                                    Duration.ofSeconds(1));\n\nlistener.onEventReceived() //Connect signal handlers\n    .connect(events -> {\n        log.info(\"Remote Events: {}\", events);\n    });\n\nlistener.start(); //Start listening\n\n\n//Once done close the listener\nlistener.close();\n

Event Types

Please check the com.phonepe.drove.models.events package for the different event types and classes.

Event Polling Offset Store

The event poller library uses polling to find new events based on an offset. The event polling offset store is used to store and retrieve this offset. The default store, DroveEventPollingOffsetInMemoryStore, keeps this information in memory. Implement DroveEventPollingOffsetStore over more permanent storage if you want the offset to survive restarts.

"},{"location":"extra/libraries.html#drove-hazelcast-cluster-discovery","title":"Drove Hazelcast Cluster Discovery","text":"

Drove provides an implementation of the Hazelcast discovery SPI so that containers deployed on a drove cluster can discover each other. This client uses the token injected by drove in the DROVE_APP_INSTANCE_AUTH_TOKEN environment variable to get sibling information from the controller.

"},{"location":"extra/libraries.html#dependencies_2","title":"Dependencies","text":"
<!--Include Drove client-->\n<!--Include Hazelcast-->\n<dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-events-client</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n
"},{"location":"extra/libraries.html#sample-code_2","title":"Sample Code","text":"
//Setup hazelcast\nConfig config = new Config();\n\n// Enable discovery\nconfig.setProperty(\"hazelcast.discovery.enabled\", \"true\");\nconfig.setProperty(\"hazelcast.discovery.public.ip.enabled\", \"true\");\nconfig.setProperty(\"hazelcast.socket.client.bind.any\", \"true\");\nconfig.setProperty(\"hazelcast.socket.bind.any\", \"false\");\n\n//Setup networking\nNetworkConfig networkConfig = config.getNetworkConfig();\nnetworkConfig.getInterfaces().addInterface(\"0.0.0.0\").setEnabled(true);\nnetworkConfig.setPort(port); //Port is the port exposed on the container for hazelcast clustering\n\n// Setup Drove discovery\nJoinConfig joinConfig = networkConfig.getJoin();\n\nDiscoveryConfig discoveryConfig = joinConfig.getDiscoveryConfig();\nDiscoveryStrategyConfig discoveryStrategyConfig =\n        new DiscoveryStrategyConfig(new DroveDiscoveryStrategyFactory());\ndiscoveryStrategyConfig.addProperty(\"drove-endpoint\", \"http://controller1:4000,http://controller2:4000\"); //Controller endpoints\ndiscoveryStrategyConfig.addProperty(\"port-name\", \"hazelcast\"); // Name of the hazelcast port defined in Application spec\ndiscoveryStrategyConfig.addProperty(\"transport\", \"com.phonepe.drove.client.transport.httpcomponent.DroveHttpComponentsTransport\");\ndiscoveryStrategyConfig.addProperty(\"cluster-by-app-name\", true); //Cluster container across multiple app versions\ndiscoveryConfig.addDiscoveryStrategyConfig(discoveryStrategyConfig);\n\n//Create hazelcast node\nval node = Hazelcast.newHazelcastInstance(config);\n\n//Once connected, node.getCluster() will be non null\n

Peer discovery modes

By default, the containers will only discover and connect to containers from the same application ID. If you need to connect to containers from all versions of the same application, please set the cluster-by-app-name property to true as in the above example.

"},{"location":"extra/nvidia.html","title":"Setting up Nvidia GPU computation on executor","text":"

Prerequisite: Docker version 19.03+. Check Docker versions and nvidia for details.

The steps below are primarily for Ubuntu; for other distros, check the associated links.

"},{"location":"extra/nvidia.html#install-nvidia-drivers-on-hosts","title":"Install nvidia drivers on hosts","text":"

Ubuntu provides packaged drivers for nvidia. Driver installation Guide

Recommended

ubuntu-drivers list --gpgpu\nubuntu-drivers install --gpgpu nvidia:535-server\n

Alternatively, apt can be used, but it may require additional steps. Manual install

# Check for the latest stable version \napt search nvidia-driver.*server\napt install -y nvidia-driver-535-server  nvidia-utils-535-server \n

For other distros check Guide

"},{"location":"extra/nvidia.html#install-nvidia-container-toolkit","title":"Install Nvidia-container-toolkit","text":"

Add nvidia repo

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg   && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list\n\napt install -y nvidia-container-toolkit\n
For other distros check guide here

Configure docker with nvidia toolkit

nvidia-ctk runtime configure --runtime=docker\n\nsystemctl restart docker #Restart Docker\n
"},{"location":"extra/nvidia.html#verify-installation","title":"Verify installation","text":"

On the host: nvidia-smi -l
In a docker container: docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

+-----------------------------------------------------------------------------+\n| NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |\n|-------------------------------+----------------------+----------------------+\n| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n|                               |                      |               MIG M. |\n|===============================+======================+======================|\n|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |\n| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |\n|                               |                      |                  N/A |\n+-------------------------------+----------------------+----------------------+\n\n+-----------------------------------------------------------------------------+\n| Processes:                                                                  |\n|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n|        ID   ID                                                   Usage      |\n|=============================================================================|\n|  No running processes found                                                 |\n+-----------------------------------------------------------------------------+\n
Verification guide

"},{"location":"extra/nvidia.html#enable-nvidia-support-on-drove","title":"Enable nvidia support on drove","text":"

Enable Nvidia support in drove-executor.yml and restart drove-executor

...\nresources:\n  ...\n  enableNvidiaGpu: true\n...\n

"},{"location":"tasks/index.html","title":"Introduction","text":"

A task is a representation of a transient containerized workload on the cluster. A task instance is supposed to have a much shorter lifetime than an application instance. Use tasks to spin up things like automation scripts etc.

"},{"location":"tasks/index.html#primary-differences-with-an-application","title":"Primary differences with an application","text":"

Please note the following important differences between a task instance and application instances:

Tip

Use epoch to spin up tasks in a periodic manner

A task specification contains the following sections:

"},{"location":"tasks/index.html#task-id","title":"Task ID","text":"

Identification of a task is a bit more complicated on Drove. There is a Task ID ({sourceAppName}-{taskId}) which is used internally in Drove. This is returned to the client when a task is created.

However, clients are supposed to use the {sourceAppName,taskId} combo they have sent in the task spec to address and send commands to their tasks.
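For example, with sourceAppName TEST_APP and taskId T0012 (the values used in the samples below), the internal Task ID generated by Drove is TEST_APP-T0012, but kill commands are addressed using {TEST_APP, T0012}.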

"},{"location":"tasks/index.html#task-states-and-operations","title":"Task States and operations","text":"

Tasks on Drove have their own life cycle modelled as a state machine. State transitions can be triggered by issuing operations using the APIs.

"},{"location":"tasks/index.html#states","title":"States","text":"

Tasks on a Drove cluster can be in one of the following states:

"},{"location":"tasks/index.html#operations","title":"Operations","text":"

The following task operations are recognized by Drove:

Tip

All operations need Cluster Operation Spec which can be used to control the timeout and parallelism of tasks generated by the operation.

"},{"location":"tasks/index.html#task-state-machine","title":"Task State Machine","text":"

The following state machine signifies the states and transitions as affected by cluster state and operations issued.

"},{"location":"tasks/operations.html","title":"Task Operations","text":"

This page discusses operations relevant to Task management. Please go over the Task State Machine to understand the different states a task can be in and how operations applied (and external changes) move a task from one state to another.

Note

Please go through Cluster Op Spec to understand the operation parameters being sent.

For tasks only the timeout parameter is relevant.

Note

Only one operation can be active on a particular task identified by a {sourceAppName,taskId} at a time.

Warning

Only the leader controller will accept and process operations. To avoid confusion, use the controller endpoint exposed by Drove Gateway to issue commands.

"},{"location":"tasks/operations.html#cluster-operation-specification","title":"Cluster Operation Specification","text":"

When an operation is submitted to the cluster, a cluster op spec needs to be specified. This is needed to control different aspects of the operation, such as its parallelism or its timeout.

The following aspects of an operation can be configured:

| Name | Option | Description |
|------|--------|-------------|
| Timeout | timeout | The duration after which Drove considers the operation to have timed out. |
| Parallelism | parallelism | Parallelism of the task. (Range: 1-32) |
| Failure Strategy | failureStrategy | Set this to STOP. |
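A sample op spec, as used in the operation payloads below:

{\n    \"timeout\": \"5m\",\n    \"parallelism\": 1,\n    \"failureStrategy\": \"STOP\"\n}\n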

Note

For internal recovery operations, Drove generates its own operations. For these, Drove applies the following cluster operation spec:

The default operation spec can be configured in the controller configuration file. It is recommended to set the parallelism to something like 8 for faster recovery.

"},{"location":"tasks/operations.html#how-to-initiate-an-operation","title":"How to initiate an operation","text":"

Tip

Use the Drove CLI to perform all manual operations.

All operations for task lifecycle management need to be issued via a POST HTTP call to the leader controller endpoint on the path /apis/v1/tasks/operations. The API will return HTTP 200/OK and a relevant JSON response as payload.

Sample api call:

curl --location 'http://drove.local:7000/apis/v1/tasks/operations' \\\n--header 'Content-Type: application/json' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data '{\n    \"type\": \"KILL\",\n    \"sourceAppName\" : \"TEST_APP\",\n    \"taskId\" : \"T0012\",\n    \"opSpec\": {\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}'\n

Note

In the above examples, http://drove.local:7000 is the endpoint of the leader. TEST_APP is the name of the application that started this task, and taskId is a unique client-generated ID. Authorization is basic auth.

Warning

Task operations are not cancellable.

"},{"location":"tasks/operations.html#create-a-task","title":"Create a task","text":"

A task can be created by issuing the following command.

Preconditions: - A task with the same {sourceAppName,taskId} should not exist on the cluster.

State Transition:

To create a task, a Task Spec needs to be created first.

Once ready, the CLI command needs to be issued, or the following payload needs to be sent:

Drove CLIJSON
drove -c local tasks create sample/test_task.json\n

Sample Request Payload

{\n    \"type\": \"CREATE\",\n    \"spec\": {...}, //(1)!\n    \"opSpec\": { //(2)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Spec as mentioned in Task Specification
  2. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"taskId\": \"TEST_APP-T0012\"\n    },\n    \"message\": \"success\"\n}\n

Warning

There are no separate create/run steps in a task. Creation will start execution automatically and immediately.

"},{"location":"tasks/operations.html#kill-a-task","title":"Kill a task","text":"

A task can be killed by issuing the following command.

Preconditions: - A task with the same {sourceAppName,taskId} needs to exist on the cluster.

State Transition:

The CLI command needs to be issued, or the following payload needs to be sent:

Drove CLIJSON
drove -c local tasks kill TEST_APP T0012\n

Sample Request Payload

{\n    \"type\": \"KILL\",\n    \"sourceAppName\" : \"TEST_APP\",//(1)!\n    \"taskId\" : \"T0012\",//(2)!\n    \"opSpec\": {//(3)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Source app name as mentioned in spec during task creation
  2. Task ID as mentioned in the spec
  3. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"taskId\": \"T0012\"\n    },\n    \"message\": \"success\"\n}\n

Note

Task metadata will remain on the cluster for some time. Metadata cleanup for tasks is automatic and can be configured in the controller configuration.

"},{"location":"tasks/specification.html","title":"Task Specification","text":"

A task is defined using JSON. We use a sample configuration below to explain the options.

"},{"location":"tasks/specification.html#sample-task-definition","title":"Sample Task Definition","text":"
{\n    \"sourceAppName\": \"TEST_APP\",//(1)!\n    \"taskId\": \"T0012\",//(2)!\n    \"executable\": {//(3)!\n        \"type\": \"DOCKER\", // (4)!\n        \"url\": \"ghcr.io/appform-io/test-task\",//(5)!\n        \"dockerPullTimeout\": \"100 seconds\"//(6)!\n    },\n     \"resources\": [//(7)!\n        {\n            \"type\": \"CPU\",\n            \"count\": 1//(8)!\n        },\n        {\n            \"type\": \"MEMORY\",\n            \"sizeInMB\": 128//(9)!\n        }\n    ],\n    \"volumes\": [//(10)!\n        {\n            \"pathInContainer\": \"/data\",//(11)!\n            \"pathOnHost\": \"/mnt/datavol\",//(12)!\n            \"mode\" : \"READ_WRITE\"//(13)!\n        }\n    ],\n    \"configs\" : [//(14)!\n        {\n            \"type\" : \"INLINE\",//(15)!\n            \"localFilename\": \"/testfiles/drove.txt\",//(16)!\n            \"data\" : \"RHJvdmUgdGVzdA==\"//(17)!\n        }\n    ],\n    \"placementPolicy\": {//(18)!\n        \"type\": \"ANY\"//(19)!\n    },\n    \"env\": {//(20)!\n        \"CORES\": \"8\"\n    },\n    \"args\" : [] //(27)!\n    \"tags\": { //(21)!\n        \"superSpecialApp\": \"yes_i_am\",\n        \"say_my_name\": \"heisenberg\"\n    },\n    \"logging\": {//(22)!\n        \"type\": \"LOCAL\",//(23)!\n        \"maxSize\": \"100m\",//(24)!\n        \"maxFiles\": 3,//(25)!\n        \"compress\": true//(26)!\n    }\n}\n
  1. Name of the application that has started the task. Make sure this is a valid application on the cluster.
  2. A unique ID for this task. Uniqueness is up to the user; Drove will scope it in the sourceAppName namespace.
  3. Coordinates for the executable. Refer to Executable Specification for details.
  4. Right now the only type supported is DOCKER.
  5. Docker container address
  6. Timeout for container pull.
  7. List of resources required to run this task. Check Resource Requirements Specification for more details.
  8. Number of CPU cores to be allocated.
  9. Amount of memory to be allocated, expressed in Megabytes.
  10. Volumes to be mounted. Refer to Volume Specification for details.
  11. Path that will be visible inside the container for this mount.
  12. Actual path on the host machine for the mount.
  13. Mount mode can be READ_WRITE and READ_ONLY.
  14. Configuration to be injected as a file inside the container. Please refer to Config Specification for details.
  15. Type of config. Can be INLINE, EXECUTOR_LOCAL_FILE, CONTROLLER_HTTP_FETCH and EXECUTOR_HTTP_FETCH. Specifies how Drove will get the contents to be injected.
  16. File name for the config inside the container.
  17. Serialized form of the data; this and other parameters will vary according to the type specified above.
  18. Specifies how the container will be placed on the cluster. Check Placement Policy for details.
  19. Type of placement. Can be ANY, ONE_PER_HOST, MAX_N_PER_HOST, MATCH_TAG, NO_TAG, RULE_BASED and COMPOSITE. The rest of the parameters in this section will depend on the type.
  20. Custom environment variables. Additional variables are injected by Drove as well. See the Environment Variables section for details.
  21. Key value metadata that can be used in external systems.
  22. Specify how docker log files are configured. Refer to Logging Specification
  23. Log to local file
  24. Maximum File Size
  25. Number of latest log files to retain
  26. Log files will be compressed
  27. List of command line arguments. See Command Line Arguments for details.

Warning

Please make sure sourceAppName is set to a correct application name as specified in the name parameter of a running application on the cluster.

If this is not done, stale task metadata will not be cleaned up and your metadata store performance will get affected over time.

"},{"location":"tasks/specification.html#executable-specification","title":"Executable Specification","text":"

Right now, Drove supports only docker containers. However, as engines, both docker and podman are supported. Drove executors will fetch the executable directly from the registry, based on the configuration provided.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set type to DOCKER. |
| URL | url | Docker container URL. |
| Timeout | dockerPullTimeout | Timeout for docker image pull. |
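Sample (mirroring the executable section of the task definition above):

{\n    \"type\": \"DOCKER\",\n    \"url\": \"ghcr.io/appform-io/test-task\",\n    \"dockerPullTimeout\": \"100 seconds\"\n}\n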

Note

Drove supports docker registry authentication. This can be configured in the executor configuration file.

"},{"location":"tasks/specification.html#resource-requirements-specification","title":"Resource Requirements Specification","text":"

This section specifies the hardware resources required to run the container. Right now only CPU and MEMORY are supported as resource types that can be reserved for a container.

"},{"location":"tasks/specification.html#cpu-requirements","title":"CPU Requirements","text":"

Specifies the number of cores to be assigned to the container.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set type to CPU for this. |
| Count | count | Number of cores to be assigned. |
"},{"location":"tasks/specification.html#memory-requirements","title":"Memory Requirements","text":"

Specifies the amount of memory to be allocated to a container.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set type to MEMORY for this. |
| Size | sizeInMB | Amount of memory (in Mega Bytes) to be allocated. |

Sample

[\n    {\n        \"type\": \"CPU\",\n        \"count\": 1\n    },\n    {\n        \"type\": \"MEMORY\",\n        \"sizeInMB\": 128\n    }\n]\n

Note

Both CPU and MEMORY configurations are mandatory.

"},{"location":"tasks/specification.html#volume-specification","title":"Volume Specification","text":"

Files and directories can be mounted from the executor host into the container. The volumes section contains a list of volumes that need to be mounted.

| Name | Option | Description |
|------|--------|-------------|
| Path In Container | pathInContainer | Path that will be visible inside the container for this mount. |
| Path On Host | pathOnHost | Actual path on the host machine for the mount. |
| Mount Mode | mode | Mount mode can be READ_WRITE or READ_ONLY, allowing the containerized process to write to, or only read from, the volume. |
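Sample (mirroring the volumes section of the task definition above):

[\n    {\n        \"pathInContainer\": \"/data\",\n        \"pathOnHost\": \"/mnt/datavol\",\n        \"mode\" : \"READ_WRITE\"\n    }\n]\n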

Info

We do not support mounting remote volumes as of now.

"},{"location":"tasks/specification.html#config-specification","title":"Config Specification","text":"

Drove supports injection of configuration files into containers. The specifications for the same are discussed below.

"},{"location":"tasks/specification.html#inline-config","title":"Inline config","text":"

Inline configuration can be added in the task specification itself. This will manifest as a file inside the container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to INLINE |
| Local Filename | localFilename | File name for the config inside the container. |
| Data | data | Base64 encoded string for the data. The value for this will be masked on the UI. |

Config file:

port: 8080\nlogLevel: DEBUG\n
Corresponding config specification:
{\n    \"type\" : \"INLINE\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"data\" : \"cG9ydDogODA4MApsb2dMZXZlbDogREVCVUcK\"\n}\n

Warning

The full base64 encoded config data will get stored in Drove ZK and will be pushed to executors inline. It is not recommended to stream large config files to containers using this method; doing so will probably need additional configuration on your ZK cluster.

"},{"location":"tasks/specification.html#locally-loaded-config","title":"Locally loaded config","text":"

The config file is loaded directly from a path on the executor. Such files can be distributed to the executor host using existing configuration management systems such as OpenTofu, Salt etc.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to EXECUTOR_LOCAL_FILE |
| Local Filename | localFilename | File name for the config inside the container. |
| File path | filePathOnHost | Path to the config file on the executor host. |

Sample config specification:

{\n    \"type\" : \"EXECUTOR_LOCAL_FILE\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"data\" : \"/mnt/configs/myservice/config.yml\"\n}\n

"},{"location":"tasks/specification.html#controller-fetched-config","title":"Controller fetched Config","text":"

Config file can be fetched from a remote server by the controller. Once fetched, these will be streamed to the executor as part of the instance specification for starting a container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to CONTROLLER_HTTP_FETCH |
| Local Filename | localFilename | File name for the config inside the container. |
| HTTP Call Details | http | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

{\n    \"type\" : \"CONTROLLER_HTTP_FETCH\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"http\" : {\n        \"protocol\" : \"HTTP\",\n        \"hostname\" : \"configserver.internal.yourdomain.net\",\n        \"port\" : 8080,\n        \"path\" : \"/configs/myapp\",\n        \"username\" : \"appuser\",\n        \"password\" : \"secretpassword\"\n    }\n}\n

Note

The controller will make an API call every single time it asks an executor to spin up a container. Please make sure to account for this in your configuration management system.

"},{"location":"tasks/specification.html#executor-fetched-config","title":"Executor fetched Config","text":"

Config file can be fetched from a remote server by the executor before spinning up a container. Once fetched, the payload will be injected as a config file into the container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to EXECUTOR_HTTP_FETCH |
| Local Filename | localFilename | File name for the config inside the container. |
| HTTP Call Details | http | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

{\n    \"type\" : \"EXECUTOR_HTTP_FETCH\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"http\" : {\n        \"protocol\" : \"HTTP\",\n        \"hostname\" : \"configserver.internal.yourdomain.net\",\n        \"port\" : 8080,\n        \"path\" : \"/configs/myapp\",\n        \"username\" : \"appuser\",\n        \"password\" : \"secretpassword\"\n    }\n}\n

Note

All executors will make an API call every single time they spin up a container for this application. Please make sure to account for this in your configuration management system.

"},{"location":"tasks/specification.html#http-call-specification","title":"HTTP Call Specification","text":"

This section details the options that can be set when making HTTP calls to a configuration management system from controllers or executors.

The following options are available for HTTP call:

| Name | Option | Description |
|------|--------|-------------|
| Protocol | protocol | Protocol to use for the upstream call. Can be HTTP or HTTPS. |
| Hostname | hostname | Host to call. |
| Port | port | Provide a custom port. Defaults to 80 for http and 443 for https. |
| API Path | path | Path component of the URL. Include query parameters here. Defaults to / |
| HTTP Method | verb | Type of call, use GET, POST or PUT. Defaults to GET. |
| Success Code | successCodes | List of HTTP status codes which are considered as success. Defaults to [200] |
| Payload | payload | Data to be used for POST and PUT calls. |
| Connection Timeout | connectionTimeout | Timeout for the upstream connection. |
| Operation timeout | operationTimeout | Timeout for the actual operation. |
| Username | username | Username to be used for basic auth. This field is masked out on the UI. |
| Password | password | Password to be used for basic auth. This field is masked on the UI. |
| Authorization Header | authHeader | Data to be passed in the HTTP Authorization header. This field is masked on the UI. |
| Additional Headers | headers | Any other headers to be passed to the upstream in the HTTP calls. This is a map of header names to values. |
| Skip SSL Checks | insecure | Skip hostname and certificate checks during the SSL handshake with the upstream. |
"},{"location":"tasks/specification.html#placement-policy-specification","title":"Placement Policy Specification","text":"

Placement policy governs how Drove deploys containers on the cluster. The following sections discuss the different placement policies available and how they can be configured to achieve optimal placement of containers.

Warning

All policies will work only at a {appName, version} combination level. They will not ensure constraints at an appName level. This means that for something like one-per-host placement, multiple containers for the same appName can run on the same host if multiple deployments with different versions are active in a cluster. The same applies for all policies like N per host and so on.

Important details about executor tagging

"},{"location":"tasks/specification.html#any-placement","title":"Any Placement","text":"

Containers for a {appName, version} combination can run on any un-tagged executor host.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put ANY as policy. |

Sample:

{\n    \"type\" : \"ANY\"\n}\n

Tip

For most use-cases this is the placement policy to use.

"},{"location":"tasks/specification.html#one-per-host-placement","title":"One Per Host Placement","text":"

Ensures that only one container for a particular {appName, version} combination is running on an executor host at a time.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put ONE_PER_HOST as policy. |

Sample:

{\n    \"type\" : \"ONE_PER_HOST\"\n}\n

"},{"location":"tasks/specification.html#max-n-per-host-placement","title":"Max N Per Host Placement","text":"

Ensures that at most N containers for a {appName, version} combination are running on an executor host at a time.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put MAX_N_PER_HOST as policy. |
| Max count | max | The maximum number of containers that can run on an executor. Range: 1-64 |

Sample:

{\n    \"type\" : \"MAX_N_PER_HOST\",\n    \"max\": 3\n}\n

"},{"location":"tasks/specification.html#match-tag-placement","title":"Match Tag Placement","text":"

Ensures that containers for a {appName, version} combination are running on an executor host that has the tags as mentioned in the policy.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put MATCH_TAG as policy. |
| Tag | tag | The tag to match. |

Sample:

{\n    \"type\" : \"MATCH_TAG\",\n    \"tag\": \"gpu_enabled\"\n}\n

"},{"location":"tasks/specification.html#no-tag-placement","title":"No Tag Placement","text":"

Ensures that containers for a {appName, version} combination are running on an executor host that has no tags.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put NO_TAG as policy. |

Sample:

{\n    \"type\" : \"NO_TAG\"\n}\n

Info

The NO_TAG policy is mostly for internal use, and does not need to be specified when deploying containers that do not need any special placement logic.

"},{"location":"tasks/specification.html#composite-policy-based-placement","title":"Composite Policy Based Placement","text":"

Composite policy can be used to combine policies together to create complicated placement requirements.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put COMPOSITE as policy. |
| Policies | policies | List of policies to combine. |
| Combiner | combiner | Can be AND or OR, signifying all-match and any-match logic on the policies mentioned. |

Sample:

{\n    \"type\" : \"COMPOSITE\",\n    \"policies\": [\n        {\n            \"type\": \"ONE_PER_HOST\"\n        },\n        {\n            \"type\": \"MATH_TAG\",\n            \"tag\": \"gpu_enabled\"\n        }\n    ],\n    \"combiner\" : \"AND\"\n}\n
The above policy will ensure that only one container of the relevant {appName,version} will run on GPU enabled machines.

Tip

It is easy to get into situations where no executors match complicated placement policies. Internally, we tend to keep things rather simple and use the ANY placement for most cases, and maybe tags in a few places with over-provisioning, or for hosts having special hardware.

"},{"location":"tasks/specification.html#environment-variables","title":"Environment variables","text":"

This config can be used to inject custom environment variables into containers. The values are defined as part of the deployment specification, are the same across the cluster, and are immune to modifications from inside the container (i.e. any overrides from inside the container will not be visible across the cluster).

Sample:

{\n    \"MY_VARIABLE_1\": \"fizz\",\n    \"MY_VARIABLE_2\": \"buzz\"\n}\n

The following environment variables are injected by Drove to all containers:

| Variable Name | Value |
|---------------|-------|
| HOST | Hostname where the container is running. This is for marathon compatibility. |
| PORT_{PORT_NUMBER} | A variable for every port specified in the exposedPorts section. The value is the actual port on the host that the specified port is mapped to. For example, if ports 8080 and 8081 are specified, two variables called PORT_8080 and PORT_8081 will be injected. |
| DROVE_EXECUTOR_HOST | Hostname where the container is running. |
| DROVE_CONTAINER_ID | Container that is deployed. |
| DROVE_APP_NAME | App name as specified in the Application Specification. |
| DROVE_INSTANCE_ID | Actual instance ID generated by Drove. |
| DROVE_APP_ID | Application ID as generated by Drove. |
| DROVE_APP_INSTANCE_AUTH_TOKEN | A JWT string generated by Drove that can be used by this container to call /apis/v1/internal/... apis. |

Warning

Do not pass secrets using environment variables. These variables are all visible on the UI as is. Please use Configs to inject secret files and so on.

"},{"location":"tasks/specification.html#command-line-arguments","title":"Command line arguments","text":"

A list of command line arguments that are sent to the container engine to execute inside the container. This provides a way for you to configure your container's behaviour based on such arguments. Please refer to the docker documentation for details.
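For example, arguments could be passed in the task spec as follows (the argument values here are purely illustrative):

\"args\" : [\"--config\", \"/config/service.yml\"]\n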

Danger

This might have security implications from a system point of view. As such, Drove provides administrators a way to disable passing of arguments at the cluster level by setting disableCmdlArgs to true in the controller configuration.

"},{"location":"tasks/specification.html#logging-specification","title":"Logging Specification","text":"

Can be used to configure how container logs are managed on the system.

Note

This section affects the docker log driver. Drove will continue to stream logs to its own logger, which can be configured at the executor level through the executor configuration file.

"},{"location":"tasks/specification.html#local-logger-configuration","title":"Local Logger configuration","text":"

This is used to configure the json-file log driver.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to LOCAL |
| Max Size | maxSize | Maximum file size. Anything bigger than this will lead to rotation. |
| Max Files | maxFiles | Maximum number of log files to keep. Range: 1-100 |
| Compress | compress | Enable log file compression. |
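Sample (mirroring the logging section of the task definition above):

{\n    \"type\": \"LOCAL\",\n    \"maxSize\": \"100m\",\n    \"maxFiles\": 3,\n    \"compress\": true\n}\n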

Tip

If the logging section is omitted, the following configuration is applied by default:
- File size: 10m
- Number of files: 3
- Compression: on

"},{"location":"tasks/specification.html#rsyslog-configuration","title":"Rsyslog configuration","text":"

In case users want to stream logs to an rsyslog server, the logging configuration needs to be set to RSYSLOG mode.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to RSYSLOG |
| Server | server | URL for the rsyslog server. |
| Tag Prefix | tagPrefix | Prefix to add at the start of a tag. |
| Tag Suffix | tagSuffix | Suffix to add at the end of a tag. |
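A minimal sketch (the server URL and tag values here are illustrative, not defaults):

{\n    \"type\": \"RSYSLOG\",\n    \"server\": \"udp://rsyslog.internal.yourdomain.net:514\",\n    \"tagPrefix\": \"drove-\",\n    \"tagSuffix\": \"-prod\"\n}\n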

Note

The default tag is the DROVE_INSTANCE_ID. The tagPrefix and tagSuffix will be added before and after this tag respectively.

"}]} \ No newline at end of file diff --git a/search/search_index.json b/search/search_index.json new file mode 100644 index 0000000..deacbc1 --- /dev/null +++ b/search/search_index.json @@ -0,0 +1 @@ +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Introduction","text":"

Drove is a container orchestrator built at PhonePe. It is focused on simplicity, container performance, and easy operations.

"},{"location":"index.html#features","title":"Features","text":"

The following sections go over the features.

"},{"location":"index.html#functional","title":"Functional","text":""},{"location":"index.html#operations","title":"Operations","text":""},{"location":"index.html#performance","title":"Performance","text":""},{"location":"index.html#resilience","title":"Resilience","text":""},{"location":"index.html#security","title":"Security","text":""},{"location":"index.html#observability","title":"Observability","text":""},{"location":"index.html#unsupported-features","title":"Unsupported Features","text":""},{"location":"index.html#terminology","title":"Terminology","text":"

Before we delve into the details, let's get acquainted with the required terminology:

"},{"location":"index.html#github-repositories","title":"Github Repositories","text":""},{"location":"index.html#license","title":"License","text":"

Apache 2

"},{"location":"getting-started.html","title":"Getting Started","text":"

To get a trivial cluster up and running on a machine, the compose file can be used.

"},{"location":"getting-started.html#update-etc-hosts-to-interact-wih-nginx","title":"Update etc hosts to interact wih nginx","text":"

Add the following lines to /etc/hosts

127.0.0.1   drove.local\n127.0.0.1   testapp.local\n

"},{"location":"getting-started.html#download-the-compose-file","title":"Download the compose file","text":"
wget https://raw.githubusercontent.com/PhonePe/drove-orchestrator/master/compose/compose.yaml\n
"},{"location":"getting-started.html#bringing-up-a-demo-cluster","title":"Bringing up a demo cluster","text":"

cd compose\ndocker-compose up\n
This will start zookeeper, drove controller, executor and nginx/drove-gateway. The following ports are used:

Drove credentials would be admin/admin and guest/guest for read-write and read-only permissions respectively.

You should be able to access the UI at http://drove.local:7000

"},{"location":"getting-started.html#install-drove-cli","title":"Install drove-cli","text":"

Install the CLI for drove

pip install drove-cli\n

"},{"location":"getting-started.html#create-client-configuration","title":"Create Client Configuration","text":"

Put the following in ${HOME}/.drove

[local]\nendpoint = http://drove.local:4000\nusername = admin\npassword = admin\n
"},{"location":"getting-started.html#deploy-an-app","title":"Deploy an app","text":"

Get the sample app spec:

wget https://raw.githubusercontent.com/PhonePe/drove-cli/master/sample/test_app.json\n

Now deploy the app.

drove -c local apps create test_app.json\n

"},{"location":"getting-started.html#scale-the-app","title":"Scale the app","text":"

drove -c local apps scale TEST_APP-1 1 -w\n
This would expose the app as testapp.local. Endpoint would be: http://testapp.local:7000.

You can test the app by running the following commands:

curl http://testapp.local:7000/\ncurl http://testapp.local:7000/files/drove.txt\n
"},{"location":"getting-started.html#suspend-and-destroy-the-app","title":"Suspend and destroy the app","text":"
drove -c local apps scale TEST_APP-1 0 -w\ndrove -c local apps destroy TEST_APP-1\n
"},{"location":"getting-started.html#accessing-the-code","title":"Accessing the code","text":"

Code is hosted on github.

Cloning everything:

git clone git@github.com:PhonePe/drove-orchestrator.git\ngit submodule init\ngit submodule update\n
"},{"location":"apis/index.html","title":"Introduction","text":"

This section lists all the APIs that a user can call.

"},{"location":"apis/index.html#making-an-api-call","title":"Making an API call","text":"

Use a standard HTTP client in the language of your choice to make a call to the leader controller (the cluster virtual host exposed by drove-gateway-nginx).

Tip

In case you are using Java, we recommend using the drove-client library along with the http-transport.

If multiple controller endpoints are provided, the client will track the leader automatically. This will reduce your dependency on drove-gateway.

"},{"location":"apis/index.html#authentication","title":"Authentication","text":"

Drove uses basic auth for authentication. (You can extend Drove to use any other auth format like OAuth.) The basic auth credentials need to be sent in the standard format in the Authorization header.
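For example, the Basic token used in the samples here is just the base64 encoding of username:password; for the demo cluster credentials:

echo -n 'admin:admin' | base64\n# prints: YWRtaW46YWRtaW4=\n

This value is then sent as Authorization: Basic YWRtaW46YWRtaW4=.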

"},{"location":"apis/index.html#response-format","title":"Response format","text":"

The response format is standard for all API calls:

{\n    \"status\": \"SUCCESS\",//(1)!\n    \"data\": {//(2)!\n        \"taskId\": \"T0012\"\n    },\n    \"message\": \"success\"//(3)!\n}\n
  1. SUCCESS or FAILURE as the case may be.
  2. Content of this field is contextual to the response.
  3. Will contain success if the call was successful or relevant error message.

Warning

APIs will return relevant HTTP status codes in case of error (for example 400 for validation errors, 401 for authentication failure). However, you must always check that the status field is set to SUCCESS before assuming the API call was successful, even when the HTTP status code is 2xx.

APIs in Drove belong to the following major classes:

Tip

Response models for these apis can be found in drove-models

Note

There are no publicly accessible APIs exposed by individual executors.

"},{"location":"apis/application.html","title":"Application Management","text":""},{"location":"apis/application.html#issue-application-operation-command","title":"Issue application operation command","text":"

POST /apis/v1/applications/operations

Request

curl --location 'http://drove.local:7000/apis/v1/applications/operations' \\\n--header 'Content-Type: application/json' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data '{\n    \"type\": \"SCALE\",\n    \"appId\": \"TEST_APP-1\",\n    \"requiredInstances\": 1,\n    \"opSpec\": {\n        \"timeout\": \"1m\",\n        \"parallelism\": 20,\n        \"failureStrategy\": \"STOP\"\n    }\n}'\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

Tip

Relevant payloads for application commands can be found in the application operations section.

"},{"location":"apis/application.html#cancel-currently-running-operation","title":"Cancel currently running operation","text":"

POST /apis/v1/applications/operations/{appId}/cancel

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/applications/operations/TEST_APP-1/cancel' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-list-of-applications","title":"Get list of applications","text":"

GET /apis/v1/applications

Request

curl --location 'http://drove.local:7000/apis/v1/applications' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"TEST_APP-1\": {\n            \"id\": \"TEST_APP-1\",\n            \"name\": \"TEST_APP\",\n            \"requiredInstances\": 0,\n            \"healthyInstances\": 0,\n            \"totalCPUs\": 0,\n            \"totalMemory\": 0,\n            \"state\": \"MONITORING\",\n            \"created\": 1719826995764,\n            \"updated\": 1719892126096\n        }\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-info-for-an-app","title":"Get info for an app","text":"

GET /apis/v1/applications/{id}

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"id\": \"TEST_APP-1\",\n        \"name\": \"TEST_APP\",\n        \"requiredInstances\": 1,\n        \"healthyInstances\": 1,\n        \"totalCPUs\": 1,\n        \"totalMemory\": 128,\n        \"state\": \"RUNNING\",\n        \"created\": 1719826995764,\n        \"updated\": 1719892279019\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-raw-json-specs","title":"Get raw JSON specs","text":"

GET /apis/v1/applications/{id}/spec

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/spec' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"name\": \"TEST_APP\",\n        \"version\": \"1\",\n        \"executable\": {\n            \"type\": \"DOCKER\",\n            \"url\": \"ghcr.io/appform-io/perf-test-server-httplib\",\n            \"dockerPullTimeout\": \"100 seconds\"\n        },\n        \"exposedPorts\": [\n            {\n                \"name\": \"main\",\n                \"port\": 8000,\n                \"type\": \"HTTP\"\n            }\n        ],\n        \"volumes\": [],\n        \"configs\": [\n            {\n                \"type\": \"INLINE\",\n                \"localFilename\": \"/testfiles/drove.txt\",\n                \"data\": \"\"\n            }\n        ],\n        \"type\": \"SERVICE\",\n        \"resources\": [\n            {\n                \"type\": \"CPU\",\n                \"count\": 1\n            },\n            {\n                \"type\": \"MEMORY\",\n                \"sizeInMB\": 128\n            }\n        ],\n        \"placementPolicy\": {\n            \"type\": \"ANY\"\n        },\n        \"healthcheck\": {\n            \"mode\": {\n                \"type\": \"HTTP\",\n                \"protocol\": \"HTTP\",\n                \"portName\": \"main\",\n                \"path\": \"/\",\n                \"verb\": \"GET\",\n                \"successCodes\": [\n                    200\n                ],\n                \"payload\": \"\",\n                \"connectionTimeout\": \"1 second\",\n                \"insecure\": false\n            },\n            \"timeout\": \"1 second\",\n            \"interval\": \"5 seconds\",\n            \"attempts\": 3,\n            \"initialDelay\": \"0 seconds\"\n        },\n        \"readiness\": {\n            \"mode\": {\n                \"type\": \"HTTP\",\n                \"protocol\": \"HTTP\",\n                \"portName\": \"main\",\n                \"path\": \"/\",\n                \"verb\": \"GET\",\n                \"successCodes\": [\n                    200\n                ],\n                \"payload\": \"\",\n                \"connectionTimeout\": \"1 second\",\n                \"insecure\": false\n            },\n            \"timeout\": \"1 second\",\n            \"interval\": \"3 seconds\",\n            \"attempts\": 3,\n            \"initialDelay\": \"0 seconds\"\n        },\n        \"tags\": {\n            \"superSpecialApp\": \"yes_i_am\",\n            \"say_my_name\": \"heisenberg\"\n        },\n        \"env\": {\n            \"CORES\": \"8\"\n        },\n        \"exposureSpec\": {\n            \"vhost\": \"testapp.local\",\n            \"portName\": \"main\",\n            \"mode\": \"ALL\"\n        },\n        \"preShutdown\": {\n            \"hooks\": [\n                {\n                    \"type\": \"HTTP\",\n                    \"protocol\": \"HTTP\",\n                    \"portName\": \"main\",\n                    \"path\": \"/\",\n                    \"verb\": \"GET\",\n                    \"successCodes\": [\n                        200\n                    ],\n                    \"payload\": \"\",\n                    \"connectionTimeout\": \"1 second\",\n                    \"insecure\": false\n                }\n            ],\n            \"waitBeforeKill\": \"3 seconds\"\n        }\n    },\n    \"message\": \"success\"\n}\n

Note

The data field in the configs section will not be returned by any API calls.

"},{"location":"apis/application.html#get-list-of-currently-active-instances","title":"Get list of currently active instances","text":"

GET /apis/v1/applications/{id}/instances

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/instances' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"appId\": \"TEST_APP-1\",\n            \"appName\": \"TEST_APP\",\n            \"instanceId\": \"AI-58eb1111-8c2c-4ea2-a159-8fc68010a146\",\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"localInfo\": {\n                \"hostname\": \"ppessdev\",\n                \"ports\": {\n                    \"main\": {\n                        \"containerPort\": 8000,\n                        \"hostPort\": 33857,\n                        \"portType\": \"HTTP\"\n                    }\n                }\n            },\n            \"resources\": [\n                {\n                    \"type\": \"CPU\",\n                    \"cores\": {\n                        \"0\": [\n                            2\n                        ]\n                    }\n                },\n                {\n                    \"type\": \"MEMORY\",\n                    \"memoryInMB\": {\n                        \"0\": 128\n                    }\n                }\n            ],\n            \"state\": \"HEALTHY\",\n            \"metadata\": {},\n            \"errorMessage\": \"\",\n            \"created\": 1719892354194,\n            \"updated\": 1719893180105\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-list-of-old-instances","title":"Get list of old instances","text":"

GET /apis/v1/applications/{id}/instances/old

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/instances/old' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"appId\": \"TEST_APP-1\",\n            \"appName\": \"TEST_APP\",\n            \"instanceId\": \"AI-869e34ed-ebf3-4908-bf48-719475ca5640\",\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"resources\": [\n                {\n                    \"type\": \"CPU\",\n                    \"cores\": {\n                        \"0\": [\n                            2\n                        ]\n                    }\n                },\n                {\n                    \"type\": \"MEMORY\",\n                    \"memoryInMB\": {\n                        \"0\": 128\n                    }\n                }\n            ],\n            \"state\": \"STOPPED\",\n            \"metadata\": {},\n            \"errorMessage\": \"Error while pulling image ghcr.io/appform-io/perf-test-server-httplib: Status 500: {\\\"message\\\":\\\"Get \\\\\\\"https://ghcr.io/v2/\\\\\\\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\\\"}\\n\",\n            \"created\": 1719892279039,\n            \"updated\": 1719892354099\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#get-info-for-an-instance","title":"Get info for an instance","text":"

GET /apis/v1/applications/{appId}/instances/{instanceId}

Request

curl --location 'http://drove.local:7000/apis/v1/applications/TEST_APP-1/instances/AI-58eb1111-8c2c-4ea2-a159-8fc68010a146' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\",\n        \"appName\": \"TEST_APP\",\n        \"instanceId\": \"AI-58eb1111-8c2c-4ea2-a159-8fc68010a146\",\n        \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n        \"localInfo\": {\n            \"hostname\": \"ppessdev\",\n            \"ports\": {\n                \"main\": {\n                    \"containerPort\": 8000,\n                    \"hostPort\": 33857,\n                    \"portType\": \"HTTP\"\n                }\n            }\n        },\n        \"resources\": [\n            {\n                \"type\": \"CPU\",\n                \"cores\": {\n                    \"0\": [\n                        2\n                    ]\n                }\n            },\n            {\n                \"type\": \"MEMORY\",\n                \"memoryInMB\": {\n                    \"0\": 128\n                }\n            }\n        ],\n        \"state\": \"HEALTHY\",\n        \"metadata\": {},\n        \"errorMessage\": \"\",\n        \"created\": 1719892354194,\n        \"updated\": 1719893440105\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/application.html#application-endpoints","title":"Application Endpoints","text":"

GET /apis/v1/endpoints

Info

This API provides up-to-date host and port information for application instances running on the cluster. Service discovery systems can use this information to keep themselves in sync with changes in the topology of applications running on the cluster.

Tip

Any tags specified in the application specification are also exposed on the endpoint. These can be used to implement complicated routing logic in the NGinx template on Drove Gateway if needed.

Request

curl --location 'http://drove.local:7000/apis/v1/endpoints' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"appId\": \"TEST_APP-1\",\n            \"vhost\": \"testapp.local\",\n            \"tags\": {\n                \"superSpecialApp\": \"yes_i_am\",\n                \"say_my_name\": \"heisenberg\"\n            },\n            \"hosts\": [\n                {\n                    \"host\": \"ppessdev\",\n                    \"port\": 44315,\n                    \"portType\": \"HTTP\"\n                }\n            ]\n        },\n        {\n            \"appId\": \"TEST_APP-2\",\n            \"vhost\": \"testapp.local\",\n            \"tags\": {\n                \"superSpecialApp\": \"yes_i_am\",\n                \"say_my_name\": \"heisenberg\"\n            },\n            \"hosts\": [\n                {\n                    \"host\": \"ppessdev\",\n                    \"port\": 46623,\n                    \"portType\": \"HTTP\"\n                }\n            ]\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html","title":"Cluster Management","text":""},{"location":"apis/cluster.html#ping-api","title":"Ping API","text":"

GET /apis/v1/ping

Request

curl --location 'http://drove.local:7000/apis/v1/ping' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": \"pong\",\n    \"message\": \"success\"\n}\n

Tip

Use this API call to determine the leader in a cluster. It will return HTTP 200 only on the leader controller. All other controllers in the cluster will return 4xx for this API call.
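A minimal leader-discovery sketch based on this behaviour (the controller hostnames and port below are placeholders for your own setup):

for host in controller1.local controller2.local; do\n    code=$(curl -s -o /dev/null -w '%{http_code}' -u admin:admin \"http://${host}:4000/apis/v1/ping\")\n    [ \"${code}\" = \"200\" ] && echo \"leader: ${host}\"\ndone\n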

"},{"location":"apis/cluster.html#cluster-management_1","title":"Cluster Management","text":""},{"location":"apis/cluster.html#get-current-cluster-state","title":"Get current cluster state","text":"

GET /apis/v1/cluster

Request

curl --location 'http://drove.local:7000/apis/v1/cluster' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"leader\": \"ppessdev:4000\",\n        \"state\": \"NORMAL\",\n        \"numExecutors\": 1,\n        \"numApplications\": 1,\n        \"numActiveApplications\": 1,\n        \"freeCores\": 9,\n        \"usedCores\": 1,\n        \"totalCores\": 10,\n        \"freeMemory\": 18898,\n        \"usedMemory\": 128,\n        \"totalMemory\": 19026\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#set-maintenance-mode-on-cluster","title":"Set maintenance mode on cluster","text":"

POST /apis/v1/cluster/maintenance/set

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/set' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"state\": \"MAINTENANCE\",\n        \"updated\": 1719897526772\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#remove-maintenance-mode-from-cluster","title":"Remove maintenance mode from cluster","text":"

POST /apis/v1/cluster/maintenance/unset

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/unset' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"state\": \"NORMAL\",\n        \"updated\": 1719897573226\n    },\n    \"message\": \"success\"\n}\n

Warning

The cluster will internally remain in maintenance mode for some time (about 2 minutes) even after maintenance mode is removed.

"},{"location":"apis/cluster.html#executor-management","title":"Executor Management","text":""},{"location":"apis/cluster.html#get-list-of-executors","title":"Get list of executors","text":"

GET /apis/v1/cluster/executors

Request

curl --location 'http://drove.local:7000/apis/v1/cluster/executors' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"hostname\": \"ppessdev\",\n            \"port\": 3000,\n            \"transportType\": \"HTTP\",\n            \"freeCores\": 9,\n            \"usedCores\": 1,\n            \"freeMemory\": 18898,\n            \"usedMemory\": 128,\n            \"tags\": [\n                \"ppessdev\"\n            ],\n            \"state\": \"ACTIVE\"\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#get-detailed-info-for-one-executor","title":"Get detailed info for one executor","text":"

GET /apis/v1/cluster/executors/{id}

Request

curl --location 'http://drove.local:7000/apis/v1/cluster/executors/a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"type\": \"EXECUTOR\",\n        \"hostname\": \"ppessdev\",\n        \"port\": 3000,\n        \"transportType\": \"HTTP\",\n        \"updated\": 1719897100104,\n        \"state\": {\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"cpus\": {\n                \"type\": \"CPU\",\n                \"freeCores\": {\n                    \"0\": [\n                        3,\n                        4,\n                        5,\n                        6,\n                        7,\n                        8,\n                        9,\n                        10,\n                        11\n                    ]\n                },\n                \"usedCores\": {\n                    \"0\": [\n                        2\n                    ]\n                }\n            },\n            \"memory\": {\n                \"type\": \"MEMORY\",\n                \"freeMemory\": {\n                    \"0\": 18898\n                },\n                \"usedMemory\": {\n                    \"0\": 128\n                }\n            }\n        },\n        \"instances\": [\n            {\n                \"appId\": \"TEST_APP-1\",\n                \"appName\": \"TEST_APP\",\n                \"instanceId\": \"AI-58eb1111-8c2c-4ea2-a159-8fc68010a146\",\n                \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n                \"localInfo\": {\n                    \"hostname\": \"ppessdev\",\n                    \"ports\": {\n                        \"main\": {\n                            \"containerPort\": 8000,\n                            \"hostPort\": 33857,\n                            \"portType\": \"HTTP\"\n                        }\n                    }\n                },\n                \"resources\": [\n                    {\n                        \"type\": \"CPU\",\n                        \"cores\": {\n                            \"0\": [\n                                2\n                            ]\n                        }\n                    },\n                    {\n                        \"type\": \"MEMORY\",\n                        \"memoryInMB\": {\n                            \"0\": 128\n                        }\n                    }\n                ],\n                \"state\": \"HEALTHY\",\n                \"metadata\": {},\n                \"errorMessage\": \"\",\n                \"created\": 1719892354194,\n                \"updated\": 1719897100104\n            }\n        ],\n        \"tasks\": [],\n        \"tags\": [\n            \"ppessdev\"\n        ],\n        \"blacklisted\": false\n    },\n    \"message\": \"success\"\n}\n
"},{"location":"apis/cluster.html#take-executor-out-of-rotation","title":"Take executor out of rotation","text":"

POST /apis/v1/cluster/executors/blacklist

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/blacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Note

Unlike other POST APIs, the executors to be blacklisted are passed as the query parameter id. To blacklist multiple executors, pass .../blacklist?id=<id1>&id=<id2>...

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"successful\": [\n            \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\"\n        ],\n        \"failed\": []\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#bring-executor-back-into-rotation","title":"Bring executor back into rotation","text":"

POST /apis/v1/cluster/executors/unblacklist

Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/unblacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Note

Unlike other POST APIs, the executors to be un-blacklisted are passed as the query parameter id. To un-blacklist multiple executors, pass .../unblacklist?id=<id1>&id=<id2>...

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"successful\": [\n            \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\"\n        ],\n        \"failed\": []\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"apis/cluster.html#drove-cluster-events","title":"Drove Cluster Events","text":"

The following APIs can be used to monitor events on Drove. If the event data needs to be consumed, the /latest API should be used. For simply knowing whether an event of a certain type has occurred, the /summary API is sufficient.

"},{"location":"apis/cluster.html#event-list","title":"Event List","text":"

GET /apis/v1/cluster/events/latest

Request

curl --location 'http://drove.local:7000/apis/v1/cluster/events/latest?size=1024&lastSyncTime=0' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"events\": [\n            {\n                \"metadata\": {\n                    \"CURRENT_INSTANCES\": 0,\n                    \"APP_ID\": \"TEST_APP-1\",\n                    \"PLACEMENT_POLICY\": \"ANY\",\n                    \"APP_VERSION\": \"1\",\n                    \"CPU_COUNT\": 1,\n                    \"CURRENT_STATE\": \"RUNNING\",\n                    \"PORTS\": \"main:8000:http\",\n                    \"MEMORY\": 128,\n                    \"EXECUTABLE\": \"ghcr.io/appform-io/perf-test-server-httplib\",\n                    \"VHOST\": \"testapp.local\",\n                    \"APP_NAME\": \"TEST_APP\"\n                },\n                \"type\": \"APP_STATE_CHANGE\",\n                \"id\": \"a2b7d673-2bc2-4084-8415-d8d37cafa63d\",\n                \"time\": 1719977632050\n            },\n            {\n                \"metadata\": {\n                    \"APP_NAME\": \"TEST_APP\",\n                    \"APP_ID\": \"TEST_APP-1\",\n                    \"PORTS\": \"main:44315:http\",\n                    \"EXECUTOR_ID\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n                    \"EXECUTOR_HOST\": \"ppessdev\",\n                    \"CREATED\": 1719977629042,\n                    \"INSTANCE_ID\": \"AI-5efbb94f-835c-4c62-a073-a68437e60339\",\n                    \"CURRENT_STATE\": \"HEALTHY\"\n                },\n                \"type\": \"INSTANCE_STATE_CHANGE\",\n                \"id\": \"55d5876f-94ac-4c5d-a580-9c3b296add46\",\n                \"time\": 1719977631534\n            }\n        ],\n        \"lastSyncTime\": 1719977632050//(1)!\n    },\n    \"message\": \"success\"\n}\n

  1. Pass this as the parameter lastSyncTime in the next call to the events API to receive the latest events.
Query parameters:
lastSyncTime (validation: +ve long range): Time when the last sync call happened on the server. Defaults to 0 (initial sync).
size (validation: 1-1024): Number of latest events to return. Defaults to 1024. We recommend leaving this as is.
"},{"location":"apis/cluster.html#event-summary","title":"Event Summary","text":"

GET /apis/v1/cluster/events/summary

Request

curl --location 'http://drove.local:7000/apis/v1/cluster/events/summary?lastSyncTime=0' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n
Response
{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"eventsCount\": {\n            \"INSTANCE_STATE_CHANGE\": 8,\n            \"APP_STATE_CHANGE\": 17,\n            \"EXECUTOR_BLACKLISTED\": 1,\n            \"EXECUTOR_UN_BLACKLISTED\": 1\n        },\n        \"lastSyncTime\": 1719977632050//(1)!\n    },\n    \"message\": \"success\"\n}\n

  1. Pass this as the parameter lastSyncTime in the next call to the events API to receive the latest events.
"},{"location":"apis/cluster.html#continuous-monitoring-for-events","title":"Continuous monitoring for events","text":"

This is applicable to both of the APIs listed above.

Info

Model for the events can be found here.

Tip

Java programs should definitely look at using the event listener library to listen to cluster events
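For other consumers, a minimal polling sketch (curl and jq are assumptions here): fetch /latest, emit the events, then feed the returned lastSyncTime into the next call.

last=0\nwhile true; do\n    resp=$(curl -s -u admin:admin \"http://drove.local:7000/apis/v1/cluster/events/latest?size=1024&lastSyncTime=${last}\")\n    echo \"${resp}\" | jq -c '.data.events[]'\n    last=$(echo \"${resp}\" | jq -r '.data.lastSyncTime')\n    sleep 5\ndone\n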

"},{"location":"apis/logs.html","title":"Log Related APIs","text":""},{"location":"apis/logs.html#get-list-if-log-files","title":"Get list if log files","text":"

Application GET /apis/v1/logfiles/applications/{appId}/{instanceId}/list

Task GET /apis/v1/logfiles/tasks/{sourceAppName}/{taskId}/list

Request

curl --location 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/list' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"files\": [\n        \"output.log-2024-07-04\",\n        \"output.log-2024-07-03\",\n        \"output.log\"\n    ]\n}\n

"},{"location":"apis/logs.html#download-log-files","title":"Download Log Files","text":"

Application GET /apis/v1/logfiles/applications/{appId}/{instanceId}/download/{fileName}

Task GET /apis/v1/logfiles/tasks/{sourceAppName}/{taskId}/download/{fileName}

Request

curl --location 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/download/output.log' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

Note

The Content-Disposition header is set properly to the actual filename. For the above example it would be set to attachment; filename=output.log.
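Because of this, curl can save the file under its server-provided name using the -O and -J (--remote-header-name) flags:

curl -OJ -u admin:admin 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/download/output.log'\n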

"},{"location":"apis/logs.html#read-chunks-from-log","title":"Read chunks from log","text":"

Application GET /apis/v1/logfiles/applications/{appId}/{instanceId}/read/{fileName}

Task GET /apis/v1/logfiles/tasks/{sourceAppName}/{taskId}/read/{fileName}

Query parameters:
offset (validation: default -1, should be a positive number): The offset of the file to read from.
length (validation: should be a positive number): Number of bytes to read.

Request

curl --location 'http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/read/output.log' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"data\": \"\", //(1)!\n    \"offset\": 43318 //(2)!\n}\n

  1. Will contain raw data or empty string (in case of first call)
  2. Offset to be passed in the next call
"},{"location":"apis/logs.html#how-to-tail-logs","title":"How to tail logs","text":"
  1. Have a fixed buffer size in mind: 1024/4096 etc.
  2. Make a call to the /read api with offset=-1, length = buffer size.
  3. The call will return no data, but will have a valid offset.
  4. Pass this offset in the next call; data will be returned if available (or empty). The response will also return the offset to pass in the next call.
  5. The data returned might be empty or less than length depending on availability.
  6. Keep repeating step (4) to keep tailing the log (see the sketch below).
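A minimal sketch of this tailing loop, assuming curl and jq are available:

offset=-1\nwhile true; do\n    resp=$(curl -s -u admin:admin \"http://drove.local:7000/apis/v1/logfiles/applications/TEST_APP-1/AI-5efbb94f-835c-4c62-a073-a68437e60339/read/output.log?offset=${offset}&length=4096\")\n    echo \"${resp}\" | jq -j '.data'\n    offset=$(echo \"${resp}\" | jq -r '.offset')\n    sleep 1\ndone\n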

Warning

"},{"location":"apis/task.html","title":"Task Management","text":""},{"location":"apis/task.html#issue-task-operation","title":"Issue task operation","text":"

POST /apis/v1/tasks/operations

Request

curl --location 'http://drove.local:7000/apis/v1/tasks/operations' \\\n--header 'Content-Type: application/json' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data '{\n    \"type\": \"KILL\",\n    \"sourceAppName\" : \"TEST_APP\",\n    \"taskId\" : \"T0012\",\n    \"opSpec\": {\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}'\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"taskId\": \"T0012\"\n    },\n    \"message\": \"success\"\n}\n

Tip

Relevant payloads for task commands can be found in the task operations section.

"},{"location":"apis/task.html#search-for-task","title":"Search for task","text":"

POST /apis/v1/tasks/search

"},{"location":"apis/task.html#list-all-tasks","title":"List all tasks","text":"

GET /apis/v1/tasks

Request

curl --location 'http://drove.local:7000/apis/v1/tasks' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": [\n        {\n            \"sourceAppName\": \"TEST_APP\",\n            \"taskId\": \"T0013\",\n            \"instanceId\": \"TI-c2140806-2bb5-4ed3-9bb9-0c0c5fd0d8d6\",\n            \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n            \"hostname\": \"ppessdev\",\n            \"executable\": {\n                \"type\": \"DOCKER\",\n                \"url\": \"ghcr.io/appform-io/test-task\",\n                \"dockerPullTimeout\": \"100 seconds\"\n            },\n            \"resources\": [\n                {\n                    \"type\": \"CPU\",\n                    \"cores\": {\n                        \"0\": [\n                            2\n                        ]\n                    }\n                },\n                {\n                    \"type\": \"MEMORY\",\n                    \"memoryInMB\": {\n                        \"0\": 512\n                    }\n                }\n            ],\n            \"volumes\": [],\n            \"env\": {\n                \"ITERATIONS\": \"10\"\n            },\n            \"state\": \"RUNNING\",\n            \"metadata\": {},\n            \"errorMessage\": \"\",\n            \"created\": 1719827035480,\n            \"updated\": 1719827038414\n        }\n    ],\n    \"message\": \"success\"\n}\n

"},{"location":"apis/task.html#get-task-instance-details","title":"Get Task Instance Details","text":"

GET /apis/v1/tasks/{sourceAppName}/instances/{taskId}

Request

curl --location 'http://drove.local:7000/apis/v1/tasks/TEST_APP/instances/T0012' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4='\n

Response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"sourceAppName\": \"TEST_APP\",\n        \"taskId\": \"T0012\",\n        \"instanceId\": \"TI-6cf36f5c-6480-4ed5-9e2d-f79d9648529a\",\n        \"executorId\": \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\",\n        \"hostname\": \"ppessdev\",\n        \"executable\": {\n            \"type\": \"DOCKER\",\n            \"url\": \"ghcr.io/appform-io/test-task\",\n            \"dockerPullTimeout\": \"100 seconds\"\n        },\n        \"resources\": [\n            {\n                \"type\": \"CPU\",\n                \"cores\": {\n                    \"0\": [\n                        3\n                    ]\n                }\n            },\n            {\n                \"type\": \"MEMORY\",\n                \"memoryInMB\": {\n                    \"0\": 512\n                }\n            }\n        ],\n        \"volumes\": [],\n        \"env\": {\n            \"ITERATIONS\": \"10\"\n        },\n        \"state\": \"STOPPED\",\n        \"metadata\": {},\n        \"taskResult\": {\n            \"status\": \"SUCCESSFUL\",\n            \"exitCode\": 0\n        },\n        \"errorMessage\": \"\",\n        \"created\": 1719823470267,\n        \"updated\": 1719823483836\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"applications/index.html","title":"Introduction","text":"

An application is a virtual representation of a running service in the system.

Running containers for an application are called application instances.

An application specification contains the following details about the application:

Info

Once a spec is registered to the cluster, it cannot be changed.

"},{"location":"applications/index.html#application-id","title":"Application ID","text":"

Once an application is created on the cluster, an application ID is generated. The format of this ID is currently {name}-{version} (for example, TEST_APP-1 for name TEST_APP and version 1). All further operations on the application will need to refer to it by this ID.

"},{"location":"applications/index.html#application-states-and-operations","title":"Application States and Operations","text":"

An application on a Drove cluster follows a fixed lifecycle modelled as a state machine. State transitions are triggered by operations. Operations can be issued externally using API calls or may be generated internally by the application monitoring system.

"},{"location":"applications/index.html#states","title":"States","text":"

Applications on a Drove cluster can be in one of the following states:

"},{"location":"applications/index.html#operations","title":"Operations","text":"

The following application operations are recognized by Drove:

Tip

All operations can take an optional Cluster Operation Spec which can be used to control the timeout and parallelism of tasks generated by the operation.

"},{"location":"applications/index.html#application-state-machine","title":"Application State Machine","text":"

The following state machine signifies the states and transitions as affected by cluster state and operations issued.

"},{"location":"applications/instances.html","title":"Application Instances","text":"

Application instances are running containers for an application. The state machine for instances is managed in a decentralised manner, locally on the cluster nodes and not by the controllers. This includes running health checks, readiness checks and shutdown hooks on the container, container loss detection and container state recovery on executor service restart.

Regular updates about the instance state are provided by executors to the controllers and are used to keep the application state up-to-date or trigger application operations to bring the applications to stable states.

"},{"location":"applications/instances.html#application-instance-states","title":"Application Instance States","text":"

An application instance can be in one of the following states at one point in time:

"},{"location":"applications/instances.html#application-instance-state-machine","title":"Application Instance State Machine","text":"

Instance state machine transitions might be triggered on receipt of commands issued by the controller or due to internal changes in the container (might have died or started failing health checks) as well as external factors like executor service restarts.

Note

No operations are allowed to be performed on application instances directly through the executor

"},{"location":"applications/operations.html","title":"Application Operations","text":"

This page discusses operations relevant to application management. Please go over the Application State Machine and Application Instance State Machine to understand the different states an application (and its instances) can be in, and how operations move an application from one state to another.

Note

Please go through Cluster Op Spec to understand the operation parameters being sent.

Note

Only one operation can be active on a particular {appName,version} combination.

Warning

Only the leader controller will accept and process operations. To avoid confusion, use the controller endpoint exposed by Drove Gateway to issue commands.

"},{"location":"applications/operations.html#how-to-initiate-an-operation","title":"How to initiate an operation","text":"

Tip

Use the Drove CLI to perform all manual operations.

All operations for application lifecycle management need to be issued via a POST HTTP call to the leader controller endpoint on the path /apis/v1/applications/operations. The API will return HTTP 200/OK and a relevant JSON response as payload.

Sample api call:

curl --location 'http://drove.local:7000/apis/v1/applications/operations' \\\n--header 'Content-Type: application/json' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data '{\n    \"type\": \"START_INSTANCES\",\n    \"appId\": \"TEST_APP-3\",\n    \"instances\": 1,\n    \"opSpec\": {\n        \"timeout\": \"5m\",\n        \"parallelism\": 32,\n        \"failureStrategy\": \"STOP\"\n    }\n}'\n

Note

In the above examples, http://drove.local:7000 is the endpoint of the leader. TEST_APP-3 is the Application ID. Authorization is basic auth.

"},{"location":"applications/operations.html#cluster-operation-specification","title":"Cluster Operation Specification","text":"

When an operation is submitted to the cluster, a cluster op spec needs to be specified. This is needed to control different aspects of the operation, such as its parallelism, its timeout and so on.

The following aspects of an operation can be configured:

Timeout (timeout): The duration after which Drove considers the operation to have timed out.
Parallelism (parallelism): Parallelism of the task (range: 1-32).
Failure Strategy (failureStrategy): Set this to STOP.

Note

For internal recovery operations, Drove generates its own operations. For these, Drove applies the following cluster operation spec:

The default operation spec can be configured in the controller configuration file. It is recommended to set the parallelism there to something like 8 for faster recovery.

"},{"location":"applications/operations.html#how-to-cancel-an-operation","title":"How to cancel an operation","text":"

Operations can be requested to be cancelled asynchronously. A POST call needs to be made to the leader controller endpoint on the API /apis/v1/applications/operations/{applicationId}/cancel (1) to achieve this.

  1. applicationId is the Application ID for the application
curl --location --request POST 'http://drove.local:7000/apis/v1/applications/operations/TEST_APP-3/cancel' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Warning

Operation cancellation is not instantaneous. Cancellation will take effect only after the current execution of the active operation is complete.

"},{"location":"applications/operations.html#create-an-application","title":"Create an application","text":"

Before deploying containers on the cluster, an application needs to be created.

Preconditions:

State Transition:

To create an application, an Application Spec needs to be created first.

Once ready, the CLI command needs to be issued or the following payload needs to be sent:

Drove CLI
drove -c local apps create sample/test_app.json\n

Sample Request Payload

{\n    \"type\": \"CREATE\",\n    \"spec\": {...}, //(1)!\n    \"opSpec\": { //(2)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Spec as mentioned in Application Specification
  2. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"data\" : {\n        \"appId\" : \"TEST_APP-1\"\n    },\n    \"message\" : \"success\",\n    \"status\" : \"SUCCESS\"\n}\n

"},{"location":"applications/operations.html#starting-new-instances-of-an-application","title":"Starting new instances of an application","text":"

New instances can be started by issuing the START_INSTANCES command.

Preconditions - Application must be in one of the following states: MONITORING, RUNNING

State Transition:

The following command/payload will start 2 new instances of the application.

Drove CLI
drove -c local apps deploy TEST_APP-1 2\n

Sample Request Payload

{\n    \"type\": \"START_INSTANCES\",\n    \"appId\": \"TEST_APP-1\",//(1)!\n    \"instances\": 2,//(2)!\n    \"opSpec\": {//(3)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 32,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Number of instances to be started
  3. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"applications/operations.html#suspending-an-application","title":"Suspending an application","text":"

All instances of an application can be shut down by issuing the SUSPEND command.

Preconditions - Application must be in one of the following states: MONITORING, RUNNING

State Transition:

The following command/payload will suspend all instances of the application.

Drove CLI
drove -c local apps suspend TEST_APP-1\n

Sample Request Payload

{\n    \"type\": \"SUSPEND\",\n    \"appId\": \"TEST_APP-1\",//(1)!\n    \"opSpec\": {//(2)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 32,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"applications/operations.html#scaling-the-application-up-or-down","title":"Scaling the application up or down","text":"

Scaling the application to the required number of containers can be achieved using the SCALE command. The application can be either scaled up or down using this command.

Preconditions - Application must be in one of the following states: MONITORING, RUNNING

State Transition:

Drove CLI
drove -c local apps scale TEST_APP-1 2\n

Sample Request Payload

{\n    \"type\": \"SCALE\",\n    \"appId\": \"TEST_APP-1\", //(3)!\n    \"requiredInstances\": 2, //(1)!\n    \"opSpec\": { //(2)!\n        \"timeout\": \"1m\",\n        \"parallelism\": 20,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Absolute number of instances to be maintained on the cluster for the application
  2. Operation spec as mentioned in Cluster Op Spec
  3. Application ID

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

Note

During scale down, older instances are stopped first

Tip

If implementing automation on top of Drove APIs, just use the SCALE command to scale up or down instead of using START_INSTANCES or SUSPEND separately.

"},{"location":"applications/operations.html#restarting-an-application","title":"Restarting an application","text":"

An application can be restarted by issuing the REPLACE_INSTANCES operation. In this case, clusterOpSpec.parallelism number of new containers are spun up first, and then an equivalent number of old containers are spun down. This ensures that enough capacity is maintained in the cluster to handle incoming traffic while the restart is underway.

Warning

If the cluster does not have sufficient capacity to spin up new containers, this operation will get stuck. So adjust your parallelism accordingly.

Preconditions - Application must be in RUNNING state.

State Transition:

Drove CLI
drove -c local apps restart TEST_APP-1\n

Sample Request Payload

{\n    \"type\": \"REPLACE_INSTANCES\",\n    \"appId\": \"TEST_APP-1\", //(1)!\n    \"instanceIds\": [], //(2)!\n    \"opSpec\": { //(3)!\n        \"timeout\": \"1m\",\n        \"parallelism\": 20,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Instances that need to be restarted. This is optional. If nothing is passed, all instances will be replaced.
  3. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

Tip

To replace specific instances, pass their application instance IDs (starting with AI-...) in the instanceIds parameter in the JSON payload.
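For example, a payload that replaces one specific instance might look like this (the instance ID below is illustrative):

{\n    \"type\": \"REPLACE_INSTANCES\",\n    \"appId\": \"TEST_APP-1\",\n    \"instanceIds\": [ \"AI-601d160e-c692-4ddd-8b7f-4c09b30ed02e\" ],\n    \"opSpec\": {\n        \"timeout\": \"1m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n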

"},{"location":"applications/operations.html#stop-or-replace-specific-instances-of-an-application","title":"Stop or replace specific instances of an application","text":"

Application instances can be killed by issuing the STOP_INSTANCES operation. The default behaviour of Drove is to replace killed instances with new instances. Such new instances are always spun up before the specified (old) instances are stopped. If the skipRespawn parameter is set to true, the application instance is killed but no new instance is spun up to replace it.

Warning

If the cluster does not have sufficient capacity to spin up new containers, and skipRespawn is not set or set to false, this operation will get stuck.

Preconditions - Application must be in RUNNING state.

State Transition:

Drove CLI
drove -c local apps appinstances kill TEST_APP-1 AI-601d160e-c692-4ddd-8b7f-4c09b30ed02e\n

Sample Request Payload

{\n    \"type\": \"STOP_INSTANCES\",\n    \"appId\" : \"TEST_APP-1\",//(1)!\n    \"instanceIds\" : [ \"AI-601d160e-c692-4ddd-8b7f-4c09b30ed02e\" ],//(2)!\n    \"skipRespawn\" : true,//(3)!\n    \"opSpec\": {//(4)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Instance ids to be stopped
  3. Do not spin up new containers to replace the stopped ones. This is set to false by default.
  4. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"applications/operations.html#destroy-an-application","title":"Destroy an application","text":"

To remove an application deployment (appName-version combo), the DESTROY command can be issued.

Preconditions:

State Transition:

To destroy the application, the CLI command needs to be issued or the following payload needs to be sent:

Drove CLI
drove -c local apps destroy TEST_APP-1\n

Sample Request Payload

{\n    \"type\": \"DESTROY\",\n    \"appId\" : \"TEST_APP-1\",//(1)!\n    \"opSpec\": {//(2)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 2,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Application ID
  2. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"appId\": \"TEST_APP-1\"\n    },\n    \"message\": \"success\"\n}\n

Warning

All metadata for an app and its instances is completely obliterated from Drove's storage once an app is destroyed.

"},{"location":"applications/outage.html","title":"Outage Detection and Recovery","text":"

Drove tracks all instances for an app deployment in the cluster. It will ensure the required number of containers is always running on the cluster.

"},{"location":"applications/outage.html#instance-health-detection-and-tracking","title":"Instance health detection and tracking","text":"

The executor runs periodic checks on the container according to the check spec configuration:
- Runs readiness checks to ensure the container has started properly before declaring it healthy
- Runs health checks on the container at regular intervals to ensure it is in operating condition

Behavior for both is configured by setting the appropriate options in the application specification.

Results of such health checks (both success and failure) are reported to the controller. Appropriate action is taken to shut down containers that fail readiness or health checks.

"},{"location":"applications/outage.html#container-crash","title":"Container crash","text":"

If a container for an application crashes, Drove will automatically spin up a new container in its place.

"},{"location":"applications/outage.html#executor-node-hardware-failure","title":"Executor node hardware failure","text":"

If an executor node fails, instances running on that node will be lost. This is detected by the outage detector and new containers are spun up on other parts of the cluster.

"},{"location":"applications/outage.html#executor-service-temporary-unavailability","title":"Executor service temporary unavailability","text":"

On restart, the executor service reads the metadata embedded in the containers and registers them. It performs a reconciliation with the leader controller to kill any local containers if the unavailability was too long and the controller has already spun up new alternatives.

"},{"location":"applications/outage.html#zombie-container-detection-and-cleanup","title":"Zombie (container) detection and cleanup","text":"

The executor service keeps track of all containers it is supposed to run by performing periodic reconciliation with the leader controller. Any mismatch gets handled:

"},{"location":"applications/specification.html","title":"Application Specification","text":"

An application is defined using JSON. We use a sample configuration below to explain the options.

"},{"location":"applications/specification.html#sample-application-definition","title":"Sample Application Definition","text":"
{\n    \"name\": \"TEST_APP\", // (1)!\n    \"version\": \"1\", // (2)!\n    \"type\": \"SERVICE\", // (3)!\n    \"executable\": { //(4)!\n        \"type\": \"DOCKER\", // (5)!\n        \"url\": \"ghcr.io/appform-io/perf-test-server-httplib\",// (6)!\n        \"dockerPullTimeout\": \"100 seconds\"// (7)!\n    },\n    \"resources\": [//(20)!\n        {\n            \"type\": \"CPU\",\n            \"count\": 1//(21)!\n        },\n        {\n            \"type\": \"MEMORY\",\n            \"sizeInMB\": 128//(22)!\n        }\n    ],\n    \"volumes\": [//(12)!\n        {\n            \"pathInContainer\": \"/data\",//(13)!\n            \"pathOnHost\": \"/mnt/datavol\",//(14)!\n            \"mode\" : \"READ_WRITE\"//(15)!\n        }\n    ],\n    \"configs\" : [//(16)!\n        {\n            \"type\" : \"INLINE\",//(17)!\n            \"localFilename\": \"/testfiles/drove.txt\",//(18)!\n            \"data\" : \"RHJvdmUgdGVzdA==\"//(19)!\n        }\n    ],\n    \"placementPolicy\": {//(23)!\n        \"type\": \"ANY\"//(24)!\n    },\n    \"exposedPorts\": [//(8)!\n        {\n            \"name\": \"main\",//(9)!\n            \"port\": 8000,//(10)!\n            \"type\": \"HTTP\"//(11)!\n        }\n    ],\n    \"healthcheck\": {//(25)!\n        \"mode\": {//(26)!\n            \"type\": \"HTTP\", //(27)!\n            \"protocol\": \"HTTP\",//(28)!\n            \"portName\": \"main\",//(29)!\n            \"path\": \"/\",//(30)!\n            \"verb\": \"GET\",//(31)!\n            \"successCodes\": [//(32)!\n                200\n            ],\n            \"payload\": \"\", //(33)!\n            \"connectionTimeout\": \"1 second\" //(34)!\n        },\n        \"timeout\": \"1 second\",//(35)!\n        \"interval\": \"5 seconds\",//(36)!\n        \"attempts\": 3,//(37)!\n        \"initialDelay\": \"0 seconds\"//(38)!\n    },\n    \"readiness\": {//(39)!\n        \"mode\": {\n            \"type\": \"HTTP\",\n            \"protocol\": \"HTTP\",\n            \"portName\": \"main\",\n            \"path\": \"/\",\n            \"verb\": \"GET\",\n            \"successCodes\": [\n                200\n            ],\n            \"payload\": \"\",\n            \"connectionTimeout\": \"1 second\"\n        },\n        \"timeout\": \"1 second\",\n        \"interval\": \"3 seconds\",\n        \"attempts\": 3,\n        \"initialDelay\": \"0 seconds\"\n    },\n    \"exposureSpec\": {//(42)!\n        \"vhost\": \"testapp.local\", //(43)!\n        \"portName\": \"main\", //(44)!\n        \"mode\": \"ALL\"//(45)!\n    },\n    \"env\": {//(41)!\n        \"CORES\": \"8\"\n    },\n    \"args\" : [//(54)!\n        \"./entrypoint.sh\",\n        \"arg1\",\n        \"arg2\"\n    ],\n    \"tags\": { //(40)!\n        \"superSpecialApp\": \"yes_i_am\",\n        \"say_my_name\": \"heisenberg\"\n    },\n    \"preShutdown\": {//(46)!\n        \"hooks\": [ //(47)!\n            {\n                \"type\": \"HTTP\",\n                \"protocol\": \"HTTP\",\n                \"portName\": \"main\",\n                \"path\": \"/\",\n                \"verb\": \"GET\",\n                \"successCodes\": [\n                    200\n                ],\n                \"payload\": \"\",\n                \"connectionTimeout\": \"1 second\"\n            }\n        ],\n        \"waitBeforeKill\": \"3 seconds\"//(48)!\n    },\n    \"logging\": {//(49)!\n        \"type\": \"LOCAL\",//(50)!\n        \"maxSize\": \"100m\",//(51)!\n        \"maxFiles\": 3,//(52)!\n        \"compress\": true//(53)!\n    }\n}\n
  1. A human readable name for the application. This will remain constant for different versions of the app.
  2. A version number. Drove does not enforce any format for this, but it is recommended to increment this for changes in spec.
  3. This should be fixed to SERVICE for an application/service.
  4. Coordinates for the executable. Refer to Executable Specification for details.
  5. Right now the only type supported is DOCKER.
  6. Docker container address
  7. Timeout for container pull.
  8. The ports to be exposed from the container.
  9. A logical name for the port. This will be used to reference this port in other sections.
  10. Actual port number as mentioned in Dockerfile.
  11. Type of port. Can be: HTTP, HTTPS, TCP, UDP.
  12. Volumes to be mounted. Refer to Volume Specification for details.
  13. Path that will be visible inside the container for this mount.
  14. Actual path on the host machine for the mount.
  15. Mount mode can be READ_WRITE or READ_ONLY
  16. Configuration to be injected as file inside the container. Please refer to Config Specification for details.
  17. Type of config. Can be INLINE, EXECUTOR_LOCAL_FILE, CONTROLLER_HTTP_FETCH and EXECUTOR_HTTP_FETCH. Specifies how Drove will get the contents to be injected.
  18. File name for the config inside the container.
  19. Serialized form of the data; this and other parameters will vary according to the type specified above.
  20. List of resources required to run this application. Check Resource Requirements Specification for more details.
  21. Number of CPU cores to be allocated.
  22. Amount of memory to be allocated, expressed in megabytes
  23. Specifies how the container will be placed on the cluster. Check Placement Policy for details.
  24. Type of placement. Can be ANY, ONE_PER_HOST, MATCH_TAG, NO_TAG, RULE_BASED and COMPOSITE. The rest of the parameters in this section will depend on the type.
  25. Health check to ensure service is running fine. Refer to Check Specification for details.
  26. Mode of health check; can be an API call or a command.
  27. Type of this check spec. Type can be HTTP or CMD. Rest of the options in this example are HTTP specific.
  28. API call protocol. Can be HTTP/HTTPS
  29. Port name as mentioned in the exposedPorts section.
  30. HTTP path. Include query params here.
  31. HTTP method. Can be GET, PUT or POST.
  32. Set of HTTP status codes which can be considered as success.
  33. Payload to be sent for POST and PUT calls.
  34. Connection timeout for the port.
  35. Timeout for the check run.
  36. Interval between check runs.
  37. Max attempts after which the overall check is considered to be a failure.
  38. Time to wait before starting check runs.
  39. Readiness check to pass for the container to be considered as ready. Refer to Check Specification for details.
  40. Key value metadata that can be used in external systems.
  41. Custom environment variables. Additional variables are injected by Drove as well. See Environment Variables section for details.
  42. Specifies the virtual host on which this container is exposed.
  43. FQDN for the virtual host.
  44. Port name as specified in exposedPorts section.
  45. Mode for exposure. Set this to ALL for now.
  46. Things to do before a container is shutdown. Check Pre Shutdown Behavior for more details.
  47. Hooks (HTTP api call or shell command) to run before shutting down the container. Format is same as health/readiness checks. Refer to HTTP Check Actions and Command Check Options for details.
  48. Time to wait before killing the container. The container will be in UNREADY state during this time and hence won't have api calls routed to it via Drove Gateway.
  49. Specify how docker log files are configured. Refer to Logging Specification
  50. Log to local file
  51. Maximum File Size
  52. Number of latest log files to retain
  53. Log files will be compressed
  54. List of command line arguments. See Command Line Arguments for details.
"},{"location":"applications/specification.html#executable-specification","title":"Executable Specification","text":"

Right now Drove supports only docker containers. However, as engines, both docker and podman are supported. Drove executors will fetch the executable directly from the registry based on the configuration provided.

Type (type): Set type to DOCKER.
URL (url): Docker container URL.
Timeout (dockerPullTimeout): Timeout for docker image pull.

Note

Drove supports docker registry authentication. This can be configured in the executor configuration file.

"},{"location":"applications/specification.html#resource-requirements-specification","title":"Resource Requirements Specification","text":"

This section specifies the hardware resources required to run the container. Right now only CPU and MEMORY are supported as resource types that can be reserved for a container.

"},{"location":"applications/specification.html#cpu-requirements","title":"CPU Requirements","text":"

Specifies the number of cores to be assigned to the container.

Type (type): Set type to CPU for this.
Count (count): Number of cores to be assigned.
"},{"location":"applications/specification.html#memory-requirements","title":"Memory Requirements","text":"

Specifies the amount of memory to be allocated to a container.

Type (type): Set type to MEMORY for this.
Size (sizeInMB): Amount of memory (in megabytes) to be allocated.

Sample

[\n    {\n        \"type\": \"CPU\",\n        \"count\": 1\n    },\n    {\n        \"type\": \"MEMORY\",\n        \"sizeInMB\": 128\n    }\n]\n

Note

Both CPU and MEMORY configurations are mandatory.

"},{"location":"applications/specification.html#volume-specification","title":"Volume Specification","text":"

Files and directories can be mounted from the executor host into the container. The volumes section contains a list of volumes that need to be mounted.

| Name | Option | Description |
|------|--------|-------------|
| Path In Container | pathInContainer | Path that will be visible inside the container for this mount. |
| Path On Host | pathOnHost | Actual path on the host machine for the mount. |
| Mount Mode | mode | Mount mode can be READ_WRITE or READ_ONLY, allowing the containerized process to write to or only read from the volume. |
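A sketch of a volume entry based on the options above (the paths are illustrative):

{\n    \"pathInContainer\": \"/data\",\n    \"pathOnHost\": \"/mnt/datavol\",\n    \"mode\": \"READ_WRITE\"\n}\n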

Info

We do not support mounting remote volumes as of now.

"},{"location":"applications/specification.html#config-specification","title":"Config Specification","text":"

Drove supports injection of configuration files into containers. The specifications for the same are discussed below.

"},{"location":"applications/specification.html#inline-config","title":"Inline config","text":"

Inline configuration can be added in the Application Specification itself. This will manifest as a file inside the container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to INLINE |
| Local Filename | localFilename | File name for the config inside the container. |
| Data | data | Base64 encoded string for the data. The value for this will be masked on the UI. |

Config file:

port: 8080\nlogLevel: DEBUG\n
Corresponding config specification:
{\n    \"type\" : \"INLINE\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"data\" : \"cG9ydDogODA4MApsb2dMZXZlbDogREVCVUcK\"\n}\n

Warning

The full base64 encoded config data will get stored in Drove ZK and will be pushed to executors inline. It is not recommended to stream large config files to containers using this method; doing so will probably need additional configuration on your ZK cluster.

"},{"location":"applications/specification.html#locally-loaded-config","title":"Locally loaded config","text":"

Loads a config file directly from a path on the executor. Such files can be distributed to the executor host using existing configuration management systems such as OpenTofu, Salt etc.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to EXECUTOR_LOCAL_FILE |
| Local Filename | localFilename | File name for the config inside the container. |
| File path | filePathOnHost | Path to the config file on the executor host. |

Sample config specification:

{\n    \"type\" : \"EXECUTOR_LOCAL_FILE\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"data\" : \"/mnt/configs/myservice/config.yml\"\n}\n

"},{"location":"applications/specification.html#controller-fetched-config","title":"Controller fetched Config","text":"

Config file can be fetched from a remote server by the controller. Once fetched, these will be streamed to the executor as part of the instance specification for starting a container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to CONTROLLER_HTTP_FETCH |
| Local Filename | localFilename | File name for the config inside the container. |
| HTTP Call Details | http | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

{\n    \"type\" : \"CONTROLLER_HTTP_FETCH\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"http\" : {\n        \"protocol\" : \"HTTP\",\n        \"hostname\" : \"configserver.internal.yourdomain.net\",\n        \"port\" : 8080,\n        \"path\" : \"/configs/myapp\",\n        \"username\" : \"appuser\",\n        \"password\" : \"secretpassword\"\n    }\n}\n

Note

The controller will make an API call every time it asks an executor to spin up a container. Please make sure to account for this in your configuration management system.

"},{"location":"applications/specification.html#executor-fetched-config","title":"Executor fetched Config","text":"

Config file can be fetched from a remote server by the executor before spinning up a container. Once fetched, the payload will be injected as a config file into the container.

The following details are needed for this:

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to EXECUTOR_HTTP_FETCH |
| Local Filename | localFilename | File name for the config inside the container. |
| HTTP Call Details | http | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

{\n    \"type\" : \"EXECUTOR_HTTP_FETCH\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"http\" : {\n        \"protocol\" : \"HTTP\",\n        \"hostname\" : \"configserver.internal.yourdomain.net\",\n        \"port\" : 8080,\n        \"path\" : \"/configs/myapp\",\n        \"username\" : \"appuser\",\n        \"password\" : \"secretpassword\"\n    }\n}\n

Note

All executors will make an API call every time they spin up a container for this application. Please make sure to account for this in your configuration management system.

"},{"location":"applications/specification.html#http-call-specification","title":"HTTP Call Specification","text":"

This section details the options that can be set when making HTTP calls to a configuration management system from controllers or executors.

The following options are available for an HTTP call:

| Name | Option | Description |
|------|--------|-------------|
| Protocol | protocol | Protocol to use for the upstream call. Can be HTTP or HTTPS. |
| Hostname | hostname | Host to call. |
| Port | port | Provide custom port. Defaults to 80 for http and 443 for https. |
| API Path | path | Path component of the URL. Include query parameters here. Defaults to / |
| HTTP Method | verb | Type of call, use GET, POST or PUT. Defaults to GET. |
| Success Code | successCodes | List of HTTP status codes which are considered as success. Defaults to [200] |
| Payload | payload | Data to be used for POST and PUT calls. |
| Connection Timeout | connectionTimeout | Timeout for upstream connection. |
| Operation timeout | operationTimeout | Timeout for the actual operation. |
| Username | username | Username to be used for basic auth. This field is masked out on the UI. |
| Password | password | Password to be used for basic auth. This field is masked on the UI. |
| Authorization Header | authHeader | Data to be passed in the HTTP Authorization header. This field is masked on the UI. |
| Additional Headers | headers | Any other headers to be passed to the upstream in the HTTP calls. This is a map of header names to values. |
| Skip SSL Checks | insecure | Skip hostname and certificate checks during SSL handshake with the upstream. |
"},{"location":"applications/specification.html#placement-policy-specification","title":"Placement Policy Specification","text":"

Placement policy governs how Drove deploys containers on the cluster. The following sections discuss the different placement policies available and how they can be configured to achieve optimal placement of containers.

Warning

All policies will work only at a {appName, version} combination level. They will not ensure constraints at an appName level. This means that for something like one-per-host placement, multiple containers for the same appName can run on the same host if multiple deployments with different versions are active in a cluster. The same applies to all policies, like N per host and so on.

Important details about executor tagging

"},{"location":"applications/specification.html#any-placement","title":"Any Placement","text":"

Containers for a {appName, version} combination can run on any untagged executor host.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put ANY as policy. |

Sample:

{\n    \"type\" : \"ANY\"\n}\n

Tip

For most use-cases this is the placement policy to use.

"},{"location":"applications/specification.html#one-per-host-placement","title":"One Per Host Placement","text":"

Ensures that only one container for a particular {appName, version} combination is running on an executor host at a time.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put ONE_PER_HOST as policy. |

Sample:

{\n    \"type\" : \"ONE_PER_HOST\"\n}\n

"},{"location":"applications/specification.html#max-n-per-host-placement","title":"Max N Per Host Placement","text":"

Ensures that at most N containers for a {appName, version} combination are running on an executor host at a time.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put MAX_N_PER_HOST as policy. |
| Max count | max | The maximum number of containers that can run on an executor. Range: 1-64 |

Sample:

{\n    \"type\" : \"MAX_N_PER_HOST\",\n    \"max\": 3\n}\n

"},{"location":"applications/specification.html#match-tag-placement","title":"Match Tag Placement","text":"

Ensures that containers for a {appName, version} combination run only on executor hosts that have the tag mentioned in the policy.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put MATCH_TAG as policy. |
| Tag | tag | The tag to match. |

Sample:

{\n    \"type\" : \"MATCH_TAG\",\n    \"tag\": \"gpu_enabled\"\n}\n

"},{"location":"applications/specification.html#no-tag-placement","title":"No Tag Placement","text":"

Ensures that containers for a {appName, version} combination are running on an executor host that has no tags.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put NO_TAG as policy. |

Sample:

{\n    \"type\" : \"NO_TAG\"\n}\n

Info

The NO_TAG policy is mostly for internal use, and does not need to be specified when deploying containers that do not need any special placement logic.

"},{"location":"applications/specification.html#composite-policy-based-placement","title":"Composite Policy Based Placement","text":"

Composite policy can be used to combine policies together to create complicated placement requirements.

| Name | Option | Description |
|------|--------|-------------|
| Policy Type | type | Put COMPOSITE as policy. |
| Policies | policies | List of policies to combine. |
| Combiner | combiner | Can be AND or OR, signifying all-match and any-match logic respectively on the policies mentioned. |

Sample:

{\n    \"type\" : \"COMPOSITE\",\n    \"policies\": [\n        {\n            \"type\": \"ONE_PER_HOST\"\n        },\n        {\n            \"type\": \"MATH_TAG\",\n            \"tag\": \"gpu_enabled\"\n        }\n    ],\n    \"combiner\" : \"AND\"\n}\n
The above policy will ensure that only one container of the relevant {appName,version} will run on GPU enabled machines.

Tip

It is easy to get into situations where no executors match complicated placement policies. Internally, we tend to keep things rather simple and use ANY placement for most cases, with tags in a few places for over-provisioned hosts or hosts having special hardware.

"},{"location":"applications/specification.html#environment-variables","title":"Environment variables","text":"

This config can be used to inject custom environment variables into containers. The values are defined as part of the deployment specification, are the same across the cluster, and are immune to modifications from inside the container (i.e., any overrides from inside the container will not be visible across the cluster).

Sample:

{\n    \"MY_VARIABLE_1\": \"fizz\",\n    \"MY_VARIABLE_2\": \"buzz\"\n}\n

The following environment variables are injected by Drove to all containers:

| Variable Name | Value |
|---------------|-------|
| HOST | Hostname where the container is running. This is for marathon compatibility. |
| PORT_PORT_NUMBER | A variable for every port specified in the exposedPorts section. The value is the actual host port that the specified port is mapped to. For example, if ports 8080 and 8081 are specified, two variables called PORT_8080 and PORT_8081 will be injected. |
| DROVE_EXECUTOR_HOST | Hostname where the container is running. |
| DROVE_CONTAINER_ID | ID of the deployed container. |
| DROVE_APP_NAME | App name as specified in the Application Specification. |
| DROVE_INSTANCE_ID | Actual instance ID generated by Drove. |
| DROVE_APP_ID | Application ID as generated by Drove. |
| DROVE_APP_INSTANCE_AUTH_TOKEN | A JWT string generated by Drove that can be used by this container to call /apis/v1/internal/... apis. |
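For instance, a container entrypoint could use the injected variables as in the following sketch (my-server is a hypothetical binary, and PORT_8080 assumes a port 8080 was declared in exposedPorts):

#!/bin/sh\n# Hypothetical entrypoint snippet using Drove-injected variables\necho \"Instance ${DROVE_INSTANCE_ID} of ${DROVE_APP_NAME} is reachable at ${HOST}:${PORT_8080}\"\nexec ./my-server --port 8080\n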

Warning

Do not pass secrets using environment variables. These variables are all visible on the UI as-is. Please use Configs to inject secrets as files and so on.

"},{"location":"applications/specification.html#command-line-arguments","title":"Command line arguments","text":"

A list of command line arguments that are sent to the container engine to execute inside the container. This provides a way for you to configure your container's behaviour based on such arguments. Please refer to the docker documentation for details.
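For illustration, such an args list might look like the following (the flags themselves are hypothetical and specific to your application):

[\n    \"--config\",\n    \"/config/service.yml\",\n    \"--verbose\"\n]\n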

Danger

This might have security implications from a system point of view. As such, Drove provides administrators a way to disable passing of arguments at the cluster level by setting disableCmdlArgs to true in the controller configuration.

"},{"location":"applications/specification.html#check-specification","title":"Check Specification","text":"

One of the cornerstones of managing applications on the cluster is keeping track of instance health and managing instance life cycle depending on health state. Accordingly, we need to define how container health is monitored. The checks are executed on application instances and a check result is generated for each run.

"},{"location":"applications/specification.html#common-options","title":"Common Options","text":"Name Option Description Mode mode The definition of a HTTP call or a Command to be executed in the container. See following sections for details. Timeout timeout Duration for which we wait before declaring a check as failed Interval interval Interval at which check will be retried Attempts attempts Number of times a check is retried before it is declared as a failure Initial Delay initialDelay Delay before executing the check for the first time.

Note

initialDelay is ignored when readiness checks and health checks are run in the recovery path as the container is already running at that point in time.

"},{"location":"applications/specification.html#http-check-options","title":"HTTP Check Options","text":"Name Option Description Type type Fixed to HTTP for HTTP checker Protocol protocol HTTP or HTTPS call to be made Port Name portName The name of the container port to make the http call on as specified in the Exposed Ports section in Application Spec Path path The api path to call HTTP method verb The HTTP Verb/Method to invoke. GET/PUT and POST are supported here Success Codes successCodes A set of HTTP status codes that we should consider as a success from this API. Payload payload A string payload that we can pass if the Verb is POST or PUT Connection Timeout connectionTimeout Maximum time for which the checker will wait for the connection to be set up with the container. Insecure insecure Skip hostname and certificate checks for HTTPS ports during checks."},{"location":"applications/specification.html#command-check-options","title":"Command Check Options","text":"Field Option Description Type type Fixed to CMD for command checker Command command Command to execute in the container. (Equivalent to docker exec -it <container> command>)"},{"location":"applications/specification.html#exposure-specification","title":"Exposure Specification","text":"

The exposure spec is used to specify the virtual host that Drove Gateway exposes to the outside world for communication with the containers.

The following information needs to be specified:

| Name | Option | Description |
|------|--------|-------------|
| Virtual Host | vhost | The virtual host to be exposed on NGinx. This should be a fully qualified domain name. |
| Port Name | portName | The port name to be exposed on the vhost. Port names are defined in the exposedPorts section. |
| Exposure Mode | mode | Use ALL here for now. Signifies that all healthy instances of the app are exposed to traffic. |

Sample:

{\n    \"vhost\": \"teastapp.mydomain\",\n    \"port\": \"main\",\n    \"mode\": \"ALL\"\n}\n

Note

Application instances in any state other than HEALTHY are not considered for exposure. Please check Application Instance State Machine for an understanding of states of instances.

"},{"location":"applications/specification.html#configuring-pre-shutdown-behaviour","title":"Configuring Pre Shutdown Behaviour","text":"

Before a container is shut down, it is desirable to ensure things are spun down properly. This behaviour can be configured in the preShutdown section of the configuration.

| Name | Option | Description |
|------|--------|-------------|
| Hooks | hooks | List of API calls and commands to be run on the container before it is killed. Each hook is either an HTTP Call Spec or a Command Spec. |
| Wait Time | waitBeforeKill | Time to wait before killing the container. |

Sample

{\n    \"hooks\": [\n        {\n            \"type\": \"HTTP\",\n            \"protocol\": \"HTTP\",\n            \"portName\": \"main\",\n            \"path\": \"/\",\n            \"verb\": \"GET\",\n            \"successCodes\": [\n                200\n            ],\n            \"payload\": \"\",\n            \"connectionTimeout\": \"1 second\"\n        }\n    ],\n    \"waitBeforeKill\": \"3 seconds\"//(48)!\n}\n

Note

The waitBeforeKill timed wait kicks in after all the hooks have been executed.

"},{"location":"applications/specification.html#logging-specification","title":"Logging Specification","text":"

Can be used to configure how container logs are managed on the system.

Note

This section affects the docker log driver. Drove will continue to stream logs to its own logger, which can be configured at the executor level through the executor configuration file.

"},{"location":"applications/specification.html#local-logger-configuration","title":"Local Logger configuration","text":"

This is used to configure the json-file log driver.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to LOCAL |
| Max Size | maxSize | Maximum file size. Anything bigger than this will lead to rotation. |
| Max Files | maxFiles | Maximum number of log files to keep. Range: 1-100 |
| Compress | compress | Enable log file compression. |
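A sketch of a LOCAL logging spec based on the options above, mirroring the defaults mentioned in the tip below (values are illustrative):

{\n    \"type\": \"LOCAL\",\n    \"maxSize\": \"10m\",\n    \"maxFiles\": 3,\n    \"compress\": true\n}\n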

Tip

If the logging section is omitted, the following configuration is applied by default:
- File size: 10m
- Number of files: 3
- Compression: on

"},{"location":"applications/specification.html#rsyslog-configuration","title":"Rsyslog configuration","text":"

In case users want to stream logs to an rsyslog server, the logging configuration needs to be set to RSYSLOG mode.

| Name | Option | Description |
|------|--------|-------------|
| Type | type | Set the value to RSYSLOG |
| Server | server | URL for the rsyslog server. |
| Tag Prefix | tagPrefix | Prefix to add at the start of a tag. |
| Tag Suffix | tagSuffix | Suffix to add at the end of a tag. |
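A sketch of an RSYSLOG spec; the server endpoint format shown here is an assumption and should be checked against your rsyslog setup:

{\n    \"type\": \"RSYSLOG\",\n    \"server\": \"tcp://rsyslog.mydomain.net:514\",\n    \"tagPrefix\": \"drove-\",\n    \"tagSuffix\": \"-prod\"\n}\n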

Note

The default tag is the DROVE_INSTANCE_ID. The tagPrefix and tagSuffix will be added before and after this tag respectively.

"},{"location":"cluster/cluster.html","title":"Anatomy of a Drove Cluster","text":"

The following diagram provides a high level overview of a typical Drove cluster. The overall topology consists of the following components:

"},{"location":"cluster/cluster.html#apache-zookeeper","title":"Apache ZooKeeper","text":"

Zookeeper is a central component in a Drove cluster. It is used in the following manner:

"},{"location":"cluster/cluster.html#controller","title":"Controller","text":"

The controller service is the brains of a Drove cluster. The role of the controller consists of the following:

"},{"location":"cluster/cluster.html#executors","title":"Executors","text":"

Executors are the agents running on the nodes where the containers are deployed. Role of the executors is the following:

"},{"location":"cluster/cluster.html#nginx-and-drove-gateway","title":"NGinx and Drove-Gateway","text":"

Almost all of the traffic between service containers is routed via the internal Ranger based service discovery system at PhonePe. However, traffic from the edge, as well as between different protected environments, is routed using well-established virtual host based (and additionally, in some unusual cases, header based) routing.

Tip

The NGinx deployment is standard across all Drove clusters. However, for clusters that receive a lot of traffic, the NGinx cluster exposing the VHost for Drove itself might be separated from the one exposing the application virtual hosts, to allow for easy scalability of the latter. The templates for these are configured differently as needed.

"},{"location":"cluster/cluster.html#other-components","title":"Other components","text":"

There are a few more components that are used for operational management and observability.

"},{"location":"cluster/cluster.html#telegraf","title":"Telegraf","text":"

PhonePe\u2019s internal metric management system uses an HTTP based metric collector. Telegraf is installed on all Drove nodes to collect metrics from the metrics port (the admin connector on Dropwizard) and push that information to our metric ingestion system. This information is then used to build dashboards, as well as by our anomaly detection and alerting systems.

"},{"location":"cluster/cluster.html#log-management","title":"Log Management","text":"

Drove provides a special logger called drove that can be configured to handle compression, rotation and archival of container logs. Such container logs are stored on specialised partitions by application/application-instance-id for application instances, or by source app name/task id for task instances. PhonePe\u2019s standardised log rotation tools are used to monitor and ship out such logs to our central log management system. The same can be replaced or enhanced by running something like promtail on Drove logs to ship logs out to tools like Grafana Loki.

"},{"location":"cluster/setup/controller.html","title":"Setting up Controllers","text":"

Controllers are the brains of a Drove cluster. For HA, at least 2 controllers should be set up.

Please note the following behaviour about controllers:

"},{"location":"cluster/setup/controller.html#controller-configuration-file-reference","title":"Controller Configuration File Reference","text":"

The Drove Controller is built on the Dropwizard framework. The service is configured using a YAML file which needs to be injected into the container. A typical controller configuration file will look like the following:

server: #(1)!\n  applicationConnectors: #(2)!\n    - type: http\n      port: 4000\n  adminConnectors: #(3)!\n    - type: http\n      port: 4001\n  applicationContextPath: / #(4)!\n  requestLog: #(5)!\n    appenders:\n      - type: console\n        timeZone: ${DROVE_TIMEZONE}\n      - type: file\n        timeZone: ${DROVE_TIMEZONE}\n        currentLogFilename: /logs/drove-controller-access.log\n        archivedLogFilenamePattern: /logs/drove-controller-access.log-%d-%i\n        archivedFileCount: 3\n        maxFileSize: 100MiB\n\n\nlogging: #(6)!\n  level: INFO\n  loggers:\n    com.phonepe.drove: ${DROVE_LOG_LEVEL}\n\n  appenders:\n    - type: console #(7)!\n      threshold: ALL\n      timeZone: ${DROVE_TIMEZONE}\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n    - type: file #(8)!\n      threshold: ALL\n      timeZone: ${DROVE_TIMEZONE}\n      currentLogFilename: /logs/drove-controller.log\n      archivedLogFilenamePattern: /logs/drove-controller.log-%d-%i\n      archivedFileCount: 3\n      maxFileSize: 100MiB\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n      archive: true\n\n\nzookeeper: #(9)!\n  connectionString: ${ZK_CONNECTION_STRING}\n\nclusterAuth: #(10)!\n  secrets:\n  - nodeType: CONTROLLER\n    secret: ${DROVE_CONTROLLER_SECRET}\n  - nodeType: EXECUTOR\n    secret: ${DROVE_EXECUTOR_SECRET}\n\nuserAuth: #(11)!\n  enabled: true\n  users:\n    - username: admin\n      password: ${DROVE_ADMIN_PASSWORD}\n      role: EXTERNAL_READ_WRITE\n    - username: guest\n      password: ${DROVE_GUEST_PASSWORD}\n      role: EXTERNAL_READ_ONLY\n\ninstanceAuth: #(12)!\n  secret: ${DROVE_INSTANCE_AUTH_SECRET}\n\noptions: #(13)!\n  maxStaleInstancesCount: 3\n  staleCheckInterval: 1m\n  staleAppAge: 1d\n  staleInstanceAge: 18h\n  staleTaskAge: 1d\n  clusterOpParallelism: 4\n
  1. Server listener configuration. See Dropwizard Server Configuration for the different options.
  2. Main port configuration. This is where the UI and APIs will be exposed. Check connector configuration docs for details.
  3. Admin port. You can take thread dumps, metrics, run healthchecks on the Drove controller on this port.
  4. Base path for UI. Keep this as is.
  5. Access logs configuration. See requestLog docs.
  6. Main logging configuration. See logging docs.
  7. Log to console. Useful in docker-compose.
  8. Log to rotating files. Useful for running servers.
  9. Configure how to connect to Zookeeper See Zookeeper Config for details.
  10. Configuration for authentication between nodes in the cluster. Please check intra node auth config for details.
  11. Configure user authentication to access the cluster. Please check User auth config for details.
  12. Signing secret for JWT to be embedded in application and task instances. Check Instance auth config for details.
  13. Special options to configure controller behaviour. See Controller Options for details.

Tip

In case you do not want to expose admin APIs outside the host, please set bindHost in the admin connectors section.

adminConnectors:\n  - type: http\n    port: 10001\n    bindHost: 127.0.0.1\n
"},{"location":"cluster/setup/controller.html#zookeeper-connection-configuration","title":"Zookeeper Connection Configuration","text":"

The following details can be configured.

| Name | Option | Description |
|------|--------|-------------|
| Connection String | connectionString | The connection string of the form: zkserver:2181,zkserver2:2181... |
| Data namespace | namespace | The top level node inside which all Drove data will be scoped. Defaults to drove if not set. |

Sample

zookeeper:\n  connectionString: \"192.168.3.10:2181,192.168.3.11:2181,192.168.3.12:2181\"\n  namespace: drovetest\n
"},{"location":"cluster/setup/controller.html#intra-node-authentication-configuration","title":"Intra Node Authentication Configuration","text":"

Communication between controller and executor is protected by shared-secret based authentication. The following configuration is meant to configure this. This section consists of a list of two members, one for each node type:

Each section consists of the following:

| Name | Option | Description |
|------|--------|-------------|
| Node Type | nodeType | Type of node in the cluster. Can be CONTROLLER or EXECUTOR. |
| Secret | secret | The actual secret to be passed. |

Sample

clusterAuth:\n  secrets:\n  - nodeType: CONTROLLER\n    secret: ControllerSecretValue\n  - nodeType: EXECUTOR\n    secret: ExecutorSecret\n

Danger

The values are passed in the header as is. Please manage the config file ownership to ensure that the files are not world readable.

Tip

You can use pwgen -s 32 to generate secure random strings for usage as secrets.

"},{"location":"cluster/setup/controller.html#user-authentication-configuration","title":"User Authentication Configuration","text":"

This section is used to configure user details for humans and other systems that need to call Drove APIs or access the Drove UI. This is implemented using basic auth.

The configuration consists of:

| Name | Option | Description |
|------|--------|-------------|
| Enabled | enabled | Enable basic auth for the cluster. |
| Encoding | encoding | The actual encoding of the password. Can be PLAIN or CRYPT. |
| Caching | cachingPolicy | Caching policy for the authentication and authorization of the user. Please check CaffeineSpec docs for more details. Set to maximumSize=500, expireAfterAccess=30m by default. |
| List of users | users | A list of users recognized by the system. |

Each entry in the user list consists of:

| Name | Option | Description |
|------|--------|-------------|
| User Name | username | The actual login username. |
| Password | password | The password for the user. Needs to be set to a bcrypt string of the actual password if encoding is set to CRYPT in the parent section. |
| User Role | role | The role of the user in the cluster. Can be EXTERNAL_READ_WRITE for users who have both read and write permissions, or EXTERNAL_READ_ONLY for users with read-only permissions. |

Sample

userAuth:\n  enabled: true\n  encoding: CRYPT\n  users:\n    - username: admin\n      password: \"$2y$10$pfGnPkYrJEGzasvVNPjRu.IJldV9TDa0Vh.u1UdimILWDuhvapc2O\"\n      role: EXTERNAL_READ_WRITE\n    - username: guest\n      password: \"$2y$10$uCJ7WxIvd13C.1oOTs28p.xpJShGiTWuDLY/sGH9JE8nrkSGBFkc6\"\n      role: EXTERNAL_READ_ONLY\n    - username: noread\n      password: \"$2y$10$8mr/zXL5rMW/s/jlBcgXHu0UvyzfdDDvyc.etfuoR.991sn9UOX/K\"\n

No authentication

To configure a cluster without authentication, remove this section entirely.

Operator role

If role is not set, the user will be able to access the UI but will not have access to application logs. This comes in handy for giving other teams access to explore your deployment topology, without giving them access to your logs, which might contain sensitive information.

Password Hashing

We strongly recommend using bcrypt passwords for authentication. You can use the following command to generate hashed password strings:

htpasswd -nbBC 10 <username> <password>|cut -d ':' -f2\n
"},{"location":"cluster/setup/controller.html#instance-authentication-configuration","title":"Instance Authentication Configuration","text":"

All application and task instances get access to a unique JWT that is injected into them by Drove as the environment variable DROVE_APP_INSTANCE_AUTH_TOKEN. This token is signed using a secret, which can be configured by setting the secret parameter in the instanceAuth section.

Sample

instanceAuth:\n  secret: RandomSecret\n
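Inside a container, the injected token can then be used to call Drove's internal APIs. A rough sketch, assuming standard bearer-token semantics (verify the exact header scheme and endpoint against the internal API documentation; <api> is a placeholder):

curl -H \"Authorization: Bearer ${DROVE_APP_INSTANCE_AUTH_TOKEN}\" http://drove.local:7000/apis/v1/internal/<api>\n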

"},{"location":"cluster/setup/controller.html#controller-options","title":"Controller Options","text":"

The following options can be set to influence the behavior of the Drove cluster and the controller.

| Name | Option | Description |
|------|--------|-------------|
| Stale Check Interval | staleCheckInterval | Interval at which Drove checks for stale application and task metadata for cleanup. Defaults to 1 hour. Expressed in duration. |
| Stale App Age | staleAppAge | Apps in MONITORING state are cleaned up after some time by Drove. This variable can be used to control the max time for which such apps are maintained in the cluster. Defaults to 7 days. Expressed in duration. |
| Stale App Instances Count | maxStaleInstancesCount | Maximum number of application instances metadata for stopped or lost instances to be maintained in the cluster. Defaults to 100. |
| Stale Instance Age | staleInstanceAge | Maximum age for a stale application instance to be retained. Defaults to 7 days. Expressed in duration. |
| Stale Task Age | staleTaskAge | Maximum time for which metadata for a finished task is retained on the cluster. Defaults to 2 days. Expressed in duration. |
| Event Storage Duration | maxEventsStorageDuration | Maximum time for which cluster events are retained on the cluster. Defaults to 1 hour. Expressed in duration. |
| Default Operation Timeout | clusterOpTimeout | Timeout for operations that are initiated by Drove itself, for example instance spin-up in case of executor failure, instance migrations etc. Defaults to 5 minutes. Expressed in duration. |
| Operation threads | clusterOpParallelism | Signifies the parallelism for operations internal to the cluster. Defaults to 1. Range: 1-32. |
| Audited Methods | auditedHttpMethods | Drove prints an audit log with user details when an API is called by a user. Defaults to ["POST", "PUT"]. |
| Allowed mount directories | allowedMountDirs | If provided, Drove will ensure that application and task specs can mount only the directories mentioned in this set on the executor host. |
| Disable read-only auth | disableReadAuth | When userAuth is enabled, setting this option will enforce authorization only on write operations. |
| Disable command line arguments | disableCmdlArgs | When set to true, passing command line arguments will be disabled. Default: false (users can pass arguments). |

Sample

options:\n  staleCheckInterval: 5m\n  staleAppAge: 2d\n  maxStaleInstancesCount: 20\n  staleInstanceAge: 1d\n  staleTaskAge: 2d\n  maxEventsStorageDuration: 30m\n  clusterOpParallelism: 32\n  allowedMountDirs:\n   - /mnt/scratch\n

"},{"location":"cluster/setup/controller.html#stale-data-cleanup","title":"Stale data cleanup","text":"

In order to keep the internal memory footprint low, reduce the amount of data stored on Zookeeper, and provide a faster experience on the UI, Drove keeps cleaning up data for stale applications, application instances, task instances and cluster events.

The retention for such metadata can be controlled using the following config options:

Warning

Configuration changes done to these parameters will have direct impact on memory usage by the controller and memory and disk utilization on the Zookeeper cluster.

"},{"location":"cluster/setup/controller.html#internal-operations","title":"Internal Operations","text":"

Drove may need to create and issue operations on applications and tasks to manage cluster stability, for maintenance and other reasons. The following parameters can be used to control the speed and parallelism of such operations:

Tip

The default value of 1 for the clusterOpParallelism parameter is generally too low for most clusters. Unless there is a specific problem, it would be advisable to set this to at least 4. If the number of instances per application is quite high (order of tens or hundreds), feel free to set this to 32.

Increasing clusterOpParallelism will make recovery faster in case of executor failures, but will increase CPU utilization on the controller slightly.

"},{"location":"cluster/setup/controller.html#security-related-options","title":"Security related options","text":"

The auditedHttpMethods parameter contains a list of all HTTP methods that need to be audited. This means that if auditedHttpMethods contains POST and PUT, any Drove HTTP POST or PUT API being called will lead to an audit entry in the controller logs with the details of the user that made the call.

Warning

It would be advisable to not add GET to the list. This is because the UI keeps making calls to GET APIs on Drove to fetch data to render. These calls are automated and happen every few seconds from the browser, and auditing them will blow up the controller log size.

The allowedMountDirs option whitelists the directories that containers are allowed to mount. If this is not provided, containers will be able to mount any directory on the executors.

Danger

It is highly recommended to set allowedMountDirs to a designated directory that containers might want to use as scratch space if needed. Keeping this empty will almost definitely cause security issues in the long run.

"},{"location":"cluster/setup/controller.html#relevant-directories","title":"Relevant directories","text":"

Locations for data and logs are as follows:

We shall be volume mounting the config and log directories with the same name.

Prerequisite Setup

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

"},{"location":"cluster/setup/controller.html#setup-the-config-file","title":"Setup the config file","text":"

Create a relevant configuration file in /etc/drove/controller/controller.yml.

Sample

server:\n  applicationConnectors:\n    - type: http\n      port: 10000\n  adminConnectors:\n    - type: http\n      port: 10001\n  requestLog:\n    appenders:\n      - type: file\n        timeZone: IST\n        currentLogFilename: /var/log/drove/controller/drove-controller-access.log\n        archivedLogFilenamePattern: /var/log/drove/controller/drove-controller-access.log-%d-%i\n        archivedFileCount: 3\n        maxFileSize: 100MiB\n\nlogging:\n  level: INFO\n  loggers:\n    com.phonepe.drove: INFO\n\n\n  appenders:\n    - type: file\n      threshold: ALL\n      timeZone: IST\n      currentLogFilename: /var/log/drove/controller/drove-controller.log\n      archivedLogFilenamePattern: /var/log/drove/controller/drove-controller.log-%d-%i\n      archivedFileCount: 3\n      maxFileSize: 100MiB\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n\nzookeeper:\n  connectionString: \"192.168.56.10:2181\"\n\nclusterAuth:\n  secrets:\n  - nodeType: CONTROLLER\n    secret: \"0v8XvJrDc7r86ZY1QCByPTDPninI4Xii\"\n  - nodeType: EXECUTOR\n    secret: \"pOd9sIEXhv0wrGOVc7ebwNvR7twZqyTN\"\n\nuserAuth:\n  enabled: true\n  encoding: CRYPT\n  users:\n    - username: admin\n      password: \"$2y$10$pfGnPkYrJEGzasvVNPjRu.IJldV9TDa0Vh.u1UdimILWDuhvapc2O\"\n      role: EXTERNAL_READ_WRITE\n    - username: guest\n      password: \"$2y$10$uCJ7WxIvd13C.1oOTs28p.xpJShGiTWuDLY/sGH9JE8nrkSGBFkc6\"\n      role: EXTERNAL_READ_ONLY\n\n\ninstanceAuth:\n  secret: \"bd2SIgz9OMPG2L8wA6zxj21oLVLbuLFC\"\n\noptions:\n  maxStaleInstancesCount: 3\n  staleCheckInterval: 1m\n  staleAppAge: 2d\n  staleInstanceAge: 1d\n  staleTaskAge: 1d\n  clusterOpParallelism: 4\n  allowedMountDirs:\n   - /dev/null\n

"},{"location":"cluster/setup/controller.html#setup-required-environment-variables","title":"Setup required environment variables","text":"

Environment variables needed to run the Drove controller are set up in /etc/drove/controller/controller.env.

CONFIG_FILE_PATH=/etc/drove/controller/controller.yml\nJAVA_PROCESS_MIN_HEAP=2g\nJAVA_PROCESS_MAX_HEAP=2g\nZK_CONNECTION_STRING=\"192.168.3.10:2181\"\nJAVA_OPTS=\"-Xlog:gc:/var/log/drove/controller/gc.log -Xlog:gc:::filecount=3,filesize=10M -Xlog:gc::time,level,tags -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -Dfile.encoding=utf-8 -Djute.maxbuffer=0x9fffff\"\n
"},{"location":"cluster/setup/controller.html#create-systemd-file","title":"Create systemd file","text":"

Create a systemd file. Put the following in /etc/systemd/system/drove.controller.service:

[Unit]\nDescription=Drove Controller Service\nAfter=docker.service\nRequires=docker.service\n\n[Service]\nUser=drove\nTimeoutStartSec=0\nRestart=always\nExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-controller:latest\nExecStart=/usr/bin/docker run  \\\n    --env-file /etc/drove/controller/controller.env \\\n    --volume /etc/drove/controller:/etc/drove/controller:ro \\\n    --volume /var/log/drove/controller:/var/log/drove/controller \\\n    --publish 10000:10000  \\\n    --publish 10001:10001 \\\n    --hostname %H \\\n    --rm \\\n    --name drove.controller \\\n    ghcr.io/phonepe/drove-controller:latest\n\n[Install]\nWantedBy=multi-user.target\n

Verify the file with the following command:

systemd-analyze verify drove.controller.service\n

Set permissions

chmod 664 /etc/systemd/system/drove.controller.service\n

"},{"location":"cluster/setup/controller.html#start-the-service-on-all-servers","title":"Start the service on all servers","text":"

Use the following to start the service:

systemctl daemon-reload\nsystemctl enable drove.controller\nsystemctl start drove.controller\n

You can tail the logs at /var/log/drove/controller/drove-controller.log.

The console would be available at http://<ip>:10000 and admin functionality will be available on http://<ip>:10001 according to the above config.

Health checks can be performed by running a curl as follows:

curl http://localhost:10001/healthcheck\n

Note

Once controllers are up, one of them will become the leader. You can check the leader by running the following command:

curl http://<ip>:10000/apis/v1/ping\n

Only on the leader will you get the following response, along with an HTTP status of 200 OK:

{\n    \"status\":\"SUCCESS\",\n    \"data\":\"pong\",\n    \"message\":\"success\"\n}\n

"},{"location":"cluster/setup/executor-setup.html","title":"Setting up Executor Nodes","text":"

We shall set up the executor nodes by first preparing the hardware and operating system, and then setting up the executor service itself.

"},{"location":"cluster/setup/executor-setup.html#considerations-and-tuning-for-hardware-and-operating-system","title":"Considerations and tuning for hardware and operating system","text":"

In the following sections we discuss some aspects of scheduling, hardware and OS settings to ensure good performance.

"},{"location":"cluster/setup/executor-setup.html#cpu-and-memory-considerations","title":"CPU and Memory considerations","text":"

The executor nodes are the servers that host and run the actual docker containers. Drove will take the NUMA topology of these machines into consideration to optimize container placement and extract maximum performance. Along with this, Drove will cpuset the containers to the allocated cores in a non-overlapping manner, so that the cores allocated to a container are dedicated to it. Memory allocated to a container is pinned as well, and selected from the same NUMA node.

Needless to say, the minimum amount of CPU that can be given to an application or task is 1. Fractional CPU allocation can be achieved in a predictable manner by configuring over provisioning on executor nodes.
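To build intuition, the pinning Drove performs is conceptually similar to what the following manual docker invocation would do; the core numbers, NUMA node, memory size and image name here are purely illustrative:

docker run --cpuset-cpus=4,5 --cpuset-mems=0 --memory=128m docker.mydomain.net/myapp:1.0\n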

"},{"location":"cluster/setup/executor-setup.html#over-provisioning-of-cpu-and-memory","title":"Over Provisioning of CPU and Memory","text":"

Drove does not do any kind of burst scaling or overcommitment, to ensure application performance remains predictable even under load. Instead, Drove has a feature to make executors appear to have more cores (and memory) than they actually have. This can be used to get more utilization out of executor nodes in clusters that do not need guaranteed performance (for example staging or dev testing clusters). This is achieved by enabling over provisioning.

Over provisioning needs to be configured in the executor configuration. It primarily consists of two configs:

VCores (virtual cores) are the internal representation of a CPU core on the executor. If over provisioning is disabled, a vcore corresponds to a physical core. If over provisioning is enabled, 1 CPU core will generate cpuMultiplier number of vcores. Drove does do cpuset even on containers running on nodes that have over provisioning enabled; however, the physical cores that the containers get bound to are chosen at random, albeit from the same NUMA node. cpuset-mem is always done on the same NUMA node as well.

Mixed clusters

In some production clusters you might have applications that are non-critical in terms of performance and unable to utilize a full core. These can be tagged to be spun up on nodes where over provisioning is enabled. Adopting such a cluster topology will ensure that containers that need high performance run on nodes without over provisioning, while smaller apps (for example, operations consoles etc.) run on separate nodes with over provisioning enabled. Just ensure the latter are tagged properly, and specify this tag in the application spec or task spec during deployment.

"},{"location":"cluster/setup/executor-setup.html#disable-numa-pinning","title":"Disable NUMA Pinning","text":"

There is an option to disable memory and core pinning. In this situation, all cores from all NUMA nodes show up as being part of one node. cpuset-mems is not called if NUMA pinning is disabled, and therefore you will be leaving some memory performance on the table. We recommend not dabbling with this unless you have tasks and containers that need more than the number of cores available on a single NUMA node. This setting is enabled at the executor level by setting disableNUMAPinning: true.

"},{"location":"cluster/setup/executor-setup.html#hyper-threading","title":"Hyper-threading","text":"

Whether Hyper-Threading should be enabled is somewhat dependent on the applications deployed and how effectively they can utilize individual CPU cores. For mixed workloads, we recommend enabling Hyper-Threading on the executor nodes.

"},{"location":"cluster/setup/executor-setup.html#isolating-container-and-os-processes","title":"Isolating container and OS processes","text":"

Typically we would not want containers to share CPU resources with operating system processes, the Drove Executor Service, the Docker engine (if using docker) and so on. While complete isolation would need creating a full scheduler (and passing isolcpus to GRUB parameters), we can get a good middle ground by ensuring such processes utilize only a few CPU cores on the system, and letting the Drove executors deploy and pin containers to the rest.

This is achieved in two steps:

Let's say our server has 2 NUMA nodes, each with 40 hyper-threaded cores. We want to reserve the first 2 cores from each CPU to the OS processes. So we reserve cores [0,1,2,3] for the OS processes.

The following line in /etc/systemd/system.conf

#CPUAffinity=\n

needs to be changed to

CPUAffinity=0 1 2 3\n

Tip

Reboot the machine for this to take effect.

The changes can be validated post reboot by running the following command:

grep Cpus_allowed_list /proc/1/status\n

The expected output should be:

Cpus_allowed_list:  0-3\n

Note

Refer to this for more details.

"},{"location":"cluster/setup/executor-setup.html#gpu-computation","title":"GPU Computation","text":"

Nvidia based GPU compute can be enabled at executor level by installing relevant drivers. Please follow the setup guide to enable this. Remember to tag these nodes to isolate them from the primary cluster and use tags to deploy apps and tasks that need GPU.

"},{"location":"cluster/setup/executor-setup.html#storage-consideration","title":"Storage consideration","text":"

On executor nodes the disk might be under pressure if container (re)deployments are frequent or the containers log very heavily. As such, we recommend the logging directory for Drove be mounted on hardware that will be able to handle this load. Similar considerations need to be given to the log and package directory for docker or podman.

"},{"location":"cluster/setup/executor-setup.html#executor-configuration-reference","title":"Executor Configuration Reference","text":"

The Drove Executor is built on the Dropwizard framework. The service is configured using a YAML file which needs to be injected into the container. A typical executor configuration file will look like the following:

server: #(1)!\n  applicationConnectors: #(2)!\n    - type: http\n      port: 3000\n  adminConnectors: #(3)!\n    - type: http\n      port: 3001\n  applicationContextPath: /\n  requestLog:\n    appenders:\n      - type: console\n        timeZone: ${DROVE_TIMEZONE}\n      - type: file\n        timeZone: ${DROVE_TIMEZONE}\n        currentLogFilename: /logs/drove-executor-access.log\n        archivedLogFilenamePattern: /logs/drove-executor-access.log-%d-%i\n        archivedFileCount: 3\n        maxFileSize: 100MiB\n\nlogging:\n  level: INFO\n  loggers:\n    com.phonepe.drove: ${DROVE_LOG_LEVEL}\n\n  appenders: #(4)!\n    - type: console #(5)!\n      threshold: ALL\n      timeZone: ${DROVE_TIMEZONE}\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{instanceLogId}] %message%n\"\n    - type: file #(6)!\n      threshold: ALL\n      timeZone: ${DROVE_TIMEZONE}\n      currentLogFilename: /logs/drove-executor.log\n      archivedLogFilenamePattern: /logs/drove-executor.log-%d-%i\n      archivedFileCount: 3\n      maxFileSize: 100MiB\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n      archive: true\n\n    - type: drove #(7)!\n      logPath: \"/logs/applogs/\"\n      archivedLogFileSuffix: \"%d\"\n      archivedFileCount: 3\n      threshold: TRACE\n      timeZone: ${DROVE_TIMEZONE}\n      logFormat: \"%(%-5level) | %-23date | %-30logger{0} | %message%n\"\n      archive: true\n\nzookeeper: #(8)!\n  connectionString: ${ZK_CONNECTION_STRING}\n\nclusterAuth: #(9)!\n  secrets:\n  - nodeType: CONTROLLER\n    secret: ${DROVE_CONTROLLER_SECRET}\n  - nodeType: EXECUTOR\n    secret: ${DROVE_EXECUTOR_SECRET}\n\nresources: #(10)!\n  osCores: [ 0, 1 ]\n  exposedMemPercentage: 60\n  disableNUMAPinning: ${DROVE_DISABLE_NUMA_PINNING}\n  enableNvidiaGpu: ${DROVE_ENABLE_NVIDIA_GPU}\n\noptions: #(11)!\n  cacheImages: true\n  maxOpenFiles: 10_000\n  logBufferSize: 5m\n  cacheFileSize: 10m\n  cacheFileCount: 3\n
  1. Server listener configuration. See Dropwizard Server Configuration for the different options.
  2. Main port configuration. This is where the UI and APIs will be exposed. Check connector configuration docs for details.
  3. Admin port. You can take thread dumps, metrics, run healthchecks on the Drove executor on this port.
  4. Logging configuration. See logging docs.
  5. Log to console. Useful in docker-compose.
  6. Log to rotating files. Useful for running servers.
  7. Drove application logger configuration. See drove logger config for details.
  8. Configure how to connect to Zookeeper See Zookeeper Config for details.
  9. Configuration for authentication between nodes in the cluster. Please check intra node auth config for details.
  10. Resource configuration for this node.
  11. Options to configure executor behaviour. Check executor options section for details.

Tip

In case you do not want to expose admin APIs outside the host, please set bindHost in the admin connectors section.

adminConnectors:\n  - type: http\n    port: 10001\n    bindHost: 127.0.0.1\n
"},{"location":"cluster/setup/executor-setup.html#zookeeper-connection-configuration","title":"Zookeeper Connection Configuration","text":"

The following details can be configured.

| Name | Option | Description |
|------|--------|-------------|
| Connection String | connectionString | The connection string of the form: zkserver:2181,zkserver2:2181... |
| Data namespace | namespace | The top level node inside which all Drove data will be scoped. Defaults to drove if not set. |

Sample

zookeeper:\n  connectionString: \"192.168.3.10:2181,192.168.3.11:2181,192.168.3.12:2181\"\n  namespace: drovetest\n

Note

This section is same across the cluster including both controller and executor.

"},{"location":"cluster/setup/executor-setup.html#intra-node-authentication-configuration","title":"Intra Node Authentication Configuration","text":"

Communication between controller and executor is protected by shared-secret based authentication. The following configuration is meant to configure this. This section consists of a list of two members, one for each node type:

Each section consists of the following:

| Name | Option | Description |
|------|--------|-------------|
| Node Type | nodeType | Type of node in the cluster. Can be CONTROLLER or EXECUTOR. |
| Secret | secret | The actual secret to be passed. |

Sample

clusterAuth:\n  secrets:\n  - nodeType: CONTROLLER\n    secret: ControllerSecretValue\n  - nodeType: EXECUTOR\n    secret: ExecutorSecret\n

Note

This section is same across the cluster including both controller and executor.

"},{"location":"cluster/setup/executor-setup.html#drove-application-logger-configuration","title":"Drove Application Logger Configuration","text":"

Drove will segregate application and task instance logs in a directory of your choice. The path for such files is set as:
- <application id>/<instance id> for application instances
- <sourceAppName>/<task id> for task instances
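For example, with logPath set to /logs/applogs/, the resulting layout would look something like this (the IDs shown are hypothetical):

/logs/applogs/MYAPP-1/AI-1a2b3c4d/       # logs for an application instance\n/logs/applogs/mySourceApp/TASK-9f8e7d/   # logs for a task instance\n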

The Drove log appender is based on Logback's SiftingAppender.

The following configuration options are supported:

| Name | Option | Description |
|------|--------|-------------|
| Path | logPath | Directory to host the logs. |
| Archive old logs | archive | Whether to enable log rotation. |
| Archived File Suffix | archivedLogFileSuffix | Suffix for archived log files. |
| Archived File Count | archivedFileCount | Count of archived log files. Older files are deleted. |
| File Size | maxFileSize | Size of the current log file after which it is archived and a new file is created. Unit: DataSize. |
| Total Size | totalSizeCap | Total size after which deletion takes place. Unit: DataSize. |
| Buffer Size | bufferSize | Buffer size for the logger. (Set to 8KB by default.) Used if immediateFlush is turned off. |
| Immediate Flush | immediateFlush | Flush logs immediately. Set to true by default (recommended). |

Sample

logging:\n  level: INFO\n  ...\n\n  appenders:\n    # Setup appenders for the executor process itself first\n    ...\n\n    - type: drove\n      logPath: \"/logs/applogs/\"\n      archivedLogFileSuffix: \"%d\"\n      archivedFileCount: 3\n      threshold: TRACE\n      timeZone: ${DROVE_TIMEZONE}\n      logFormat: \"%(%-5level) | %-23date | %-30logger{0} | %message%n\"\n      archive: true\n

"},{"location":"cluster/setup/executor-setup.html#resource-configuration","title":"Resource Configuration","text":"

This section can be used to configure how resources are exposed from an executor to the cluster. We have discussed a few of the considerations that will drive this configuration above.

| Name | Option | Description |
|------|--------|-------------|
| OS Cores | osCores | A list of cores reserved for use by operating system processes. See the relevant section for details on the pre-steps needed to achieve this. |
| Exposed Memory | exposedMemPercentage | What percentage of the system memory can be used collectively by the containers running on the host. Range: 50-100, integer. |
| NUMA Pinning | disableNUMAPinning | Disable NUMA and CPU core pinning for containers. Pinning is on by default. (default: false) |
| Nvidia GPU | enableNvidiaGpu | Enable GPU support on containers. This setting makes all available Nvidia GPUs on the current executor machine available for any container running on this executor. GPU resources are not discovered, managed or rationed between containers by the executor. Needs to be used in conjunction with tagging (see tags below) to ensure that only the applications which require a GPU end up on the executors with GPUs. |
| Tags | tags | A set of strings that can be used in TAG placement policy to route application and task instances to this executor. |
| Over Provisioning | overProvisioning | Setup over provisioning configuration. |

Tagging

The current hostname is always added as a tag by default and is handled specially to allow for non-tagged deployments to be routed to this executor. If any tag is specified in the tags config, this node will receive containers only when MATCH_TAG placement is used. Please check relevant sections to specify correct placement policies for applications and tasks.

Sample

resources:\n  osCores: [0,1,2,3]\n  exposedMemPercentage: 90\n

"},{"location":"cluster/setup/executor-setup.html#over-provisioning-configuration","title":"Over provisioning configuration","text":"

Drove strives to ensure that containers can run unencumbered on CPU cores allocated to them. This means that the minimum allocation unit possible is 1 for cores. It does not support fractional CPU.

However, there are situations where we would want some non-critical applications to run on the cluster without wasting CPU. The overProvisioning configuration aims to provide users a way to turn off NUMA pinning on the executor and run more containers than it normally would.

To ensure predictability, we do not want pinned and non-pinned containers running on the same host. Hence, an executor host can either be running in pinned mode or in non-pinned mode.

To enable more containers than we could usually deploy and to still retain some level of control on how small you want a container to go, we specify multipliers on CPU and memory.

Example:
- Let's say your executor server has 40 cores available. If you set cpuMultiplier to 4, this node will now show up as having 160 cores to the controller.
- Let's say your server had 512GB of memory; setting memoryMultiplier to 2 will make Drove see it as 1TB.

| Name | Option | Description |
|------|--------|-------------|
| Enabled | enabled | Set this to true to enable over provisioning. Default: false |
| CPU Multiplier | cpuMultiplier | Multiplier to be applied to enable CPU over provisioning. Default: 1. Range: 1-20 |
| Memory Multiplier | memoryMultiplier | Multiplier to be applied to enable memory over provisioning. Default: 1. Range: 1-20 |

Sample

resources:\n  exposedMemPercentage: 90\n  overProvisioning:\n    enabled: true\n    memoryMultiplier: 1\n    cpuMultiplier: 3\n

Tip

This feature was developed to allow us to run our development environments more cheaply. In such environments there is not much pressure on CPU or memory, but a large number of containers run, as developers can spin up containers for features they are working on. There was no point in wasting a full core on containers that get hit twice a minute or less. On production we tend to err on the side of caution and allocate at least one core even to the most trivial applications, as of the time of writing this.

"},{"location":"cluster/setup/executor-setup.html#executor-options","title":"Executor Options","text":"

The following options can be set to influence the behavior for the Drove executors.

| Name | Option | Description |
|------|--------|-------------|
| Hostname | hostname | Override the hostname that gets exposed to the controller. Make sure this is resolvable. |
| Cache Images | cacheImages | Cache container images. If this is not set, a container image is removed when a container dies and no other instance is using the image. |
| Command Timeout | containerCommandTimeout | Timeout used by the container engine client when issuing container commands to docker or podman. |
| Container Socket Path | dockerSocketPath | The path of the socket for the docker daemon. Comes in handy to configure the socket path when using podman etc. |
| Max Open Files | maxOpenFiles | Override the maximum number of file descriptors a container can open. Default: 470,000 |
| Log Buffer Size | logBufferSize | The size of the buffer the executor uses to read logs from the container. Unit: DataSize. Range: 1-128MB. Default: 10MB |
| Cache File Size | cacheFileSize | To limit disk usage, configure a fixed size log file cache for containers. Unit: DataSize. Range: 10MB-100GB. Default: 20MB. Compression is always enabled. |
| Cache File Count | cacheFileCount | To limit disk usage, configure a fixed count of log file cache for containers. Unit: integer. Max: 1024. Default: 3 |

Sample

options:\n  logBufferSize: 20m\n  cacheFileSize: 30m\n  cacheFileCount: 3\n  cacheImages: true\n

"},{"location":"cluster/setup/executor-setup.html#relevant-directories","title":"Relevant directories","text":"

Locations for configuration and logs are as follows:

- Configuration: /etc/drove/executor
- Logs: /var/log/drove/executor

We shall be volume mounting the config and log directories with the same name.

Prerequisite Setup

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

"},{"location":"cluster/setup/executor-setup.html#setup-the-config-file","title":"Setup the config file","text":"

Create the relevant configuration file in /etc/drove/executor/executor.yml.

Sample

server:\n  applicationConnectors:\n    - type: http\n      port: 11000\n  adminConnectors:\n    - type: http\n      port: 11001\n  requestLog:\n    appenders:\n      - type: file\n        timeZone: IST\n        currentLogFilename: /var/log/drove/executor/drove-executor-access.log\n        archivedLogFilenamePattern: /var/log/drove/executor/drove-executor-access.log-%d-%i\n        archivedFileCount: 3\n        maxFileSize: 100MiB\n\nlogging:\n  level: INFO\n  loggers:\n    com.phonepe.drove: INFO\n\n\n  appenders:\n    - type: file\n      threshold: ALL\n      timeZone: IST\n      currentLogFilename: /var/log/drove/executor/drove-executor.log\n      archivedLogFilenamePattern: /var/log/drove/executor/drove-executor.log-%d-%i\n      archivedFileCount: 3\n      maxFileSize: 100MiB\n      logFormat: \"%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n\"\n    - type: drove\n      logPath: \"/var/log/drove/executor/instance-logs\"\n      archivedLogFileSuffix: \"%d-%i\"\n      archivedFileCount: 0\n      maxFileSize: 1GiB\n      threshold: INFO\n      timeZone: IST\n      logFormat: \"%(%-5level) | %-23date | %-30logger{0} | %message%n\"\n      archive: true\n\nzookeeper:\n  connectionString: \"192.168.56.10:2181\"\n\nclusterAuth:\n  secrets:\n  - nodeType: CONTROLLER\n    secret: \"0v8XvJrDc7r86ZY1QCByPTDPninI4Xii\"\n  - nodeType: EXECUTOR\n    secret: \"pOd9sIEXhv0wrGOVc7ebwNvR7twZqyTN\"\n\nresources:\n  osCores: []\n  exposedMemPercentage: 90\n  disableNUMAPinning: true\n  overProvisioning:\n    enabled: true\n    memoryMultiplier: 10\n    cpuMultiplier: 10\n\noptions:\n  cacheImages: true\n  logBufferSize: 20m\n  cacheFileSize: 30m\n  cacheFileCount: 3\n

"},{"location":"cluster/setup/executor-setup.html#setup-required-environment-variables","title":"Setup required environment variables","text":"

Environment variables needed to run the Drove executor are set up in /etc/drove/executor/executor.env.

CONFIG_FILE_PATH=/etc/drove/executor/executor.yml\nJAVA_PROCESS_MIN_HEAP=1g\nJAVA_PROCESS_MAX_HEAP=1g\nZK_CONNECTION_STRING=\"192.168.56.10:2181\"\nJAVA_OPTS=\"-Xlog:gc:/var/log/drove/executor/gc.log -Xlog:gc:::filecount=3,filesize=10M -Xlog:gc::time,level,tags -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -Dfile.encoding=utf-8 -Djute.maxbuffer=0x9fffff\"\n
"},{"location":"cluster/setup/executor-setup.html#create-systemd-file","title":"Create systemd file","text":"

Create a systemd file. Put the following in /etc/systemd/system/drove.executor.service:

[Unit]\nDescription=Drove Executor Service\nAfter=docker.service\nRequires=docker.service\n\n[Service]\nUser=drove\nTimeoutStartSec=0\nRestart=always\nExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-executor:latest\nExecStart=/usr/bin/docker run  \\\n    --env-file /etc/drove/executor/executor.env \\\n    --volume /etc/drove/executor:/etc/drove/executor:ro \\\n    --volume /var/log/drove/executor:/var/log/drove/executor \\\n    --volume /var/run/docker.sock:/var/run/docker.sock \\\n    --publish 11000:11000  \\\n    --publish 11001:11001 \\\n    --hostname %H \\\n    --rm \\\n    --name drove.executor \\\n    ghcr.io/phonepe/drove-executor:latest\n\n[Install]\nWantedBy=multi-user.target\n
Verify the file with the following command:
systemd-analyze verify drove.executor.service\n

Set permissions

chmod 664 /etc/systemd/system/drove.executor.service\n

"},{"location":"cluster/setup/executor-setup.html#start-the-service-on-all-servers","title":"Start the service on all servers","text":"

Use the following to start the service:

systemctl daemon-reload\nsystemctl enable drove.executor\nsystemctl start drove.executor\n

You can tail the logs at /var/log/drove/executor/drove-executor.log.
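For example:

tail -f /var/log/drove/executor/drove-executor.log\n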

The executor should now show up on the Drove Console.

"},{"location":"cluster/setup/gateway.html","title":"Setting up Drove Gateway","text":"

The Drove Gateway exposes apps running on a Drove cluster to the rest of the world.

The Drove Gateway container uses NGinx and a modified version of Nixy to track drove endpoints. More details about this can be found in the drove-gateway project.

"},{"location":"cluster/setup/gateway.html#drove-gateway-nixy-configuration-reference","title":"Drove Gateway Nixy Configuration Reference","text":"

The nixy running inside the gateway container is configured using a custom TOML file. This section looks into this file:

address = \"127.0.0.1\"# (1)!\nport = \"6000\"\n\n\n# Drove Options\ndrove = [#(2)!\n  \"http://controller1.mydomain:10000\",\n   \"http://controller1.mydomain:10000\"\n   ]\n\nleader_vhost = \"drove-staging.mydomain\"#(3)!\nevent_refresh_interval_sec = 5#(5)!\nuser = \"\"#(6)!\npass = \"\"\naccess_token = \"\"#(7)!\n\n# Parameters to control which apps are exposed as VHost\nrouting_tag = \"externally_exposed\"#(4)!\nrealm = \"api.mydomain,support.mydomain\"#(8)!\nrealm_suffix = \"-external.mydomain\"#(9)!\n\n# Nginx related config\n\nnginx_config = \"/etc/nginx/nginx.conf\"#(10)!\nnginx_template = \"/etc/drove/gateway/nginx.tmpl\"#(11)!\nnginx_cmd = \"nginx\"#(12)!\nnginx_ignore_check = true#(13)!\n\n# NGinx plus specific options\nnginxplusapiaddr=\"127.0.0.1\"#(14)!\nnginx_reload_disabled=true#(15)!\nmaxfailsupstream = 0#(16)!\nfailtimeoutupstream = \"1s\"\nslowstartupstream = \"0s\"\n
  1. Nixy listener configuration. Endpoint for nixy itself.

  2. List of Drove controllers. Add all controller nodes here. Nixy will automatically determine and track the current leader.

    Auto detection is disabled if a single endpoint is specified.

  3. Helps create a vhost entry that tracks the leader on the cluster. Use this to expose the Drove endpoint to users. The value for this will be available to the template engine as the LeaderVHost variable.

  4. If some special routing behaviour needs to be implemented in the template based on some tag metadata of the deployed apps, set the routing_tag option to set the tag name to be used. The actual value is derived from app instances and exposed to the template engine as the variable: RoutingTag. Optional.

In this example, the RoutingTag variable will be set to the value of the routing_tag tag key specified when deploying the Drove application. For example, if we want to expose an app, we can set the tag to yes and filter the vhosts exposed in the NGinx template on RoutingTag == \"yes\" (see the sketch after this list).

  5. Drove Gateway/Nixy polls the controller for events; this is the polling interval. Consider increasing it if the number of NGinx nodes is high. The default is 2 seconds, which, unless the cluster is really busy with a high rate of change of containers, strikes a good balance between apps becoming discoverable quickly and keeping the leader controller's load low.

  6. user and pass are optional params that can be used to set basic auth credentials for the calls made to the Drove controllers if basic auth is enabled on the cluster. Leave empty if no basic auth is required.

  7. If the cluster uses some custom header-based auth, this can be used. The contents of this parameter are passed verbatim in the Authorization HTTP header. Leave empty if no token auth is enabled on the cluster.

  8. By default drove-gateway will expose all vhosts declared in the spec for all drove apps on a cluster (caveat: filtering can be done using RoutingTag as well). If only specific vhosts need to be exposed, set the realm parameter to a comma separated list of vhosts. Optional.

  9. Besides exact vhost matching, Drove Gateway supports suffix-based matches as well. A single suffix is supported. Optional.

  10. Path to NGinx config.

  11. Path to the template file, based on which the NGinx config will be generated.

  12. NGinx command to use to reload the config. Optionally set this to openresty to use OpenResty.

  13. Skip invoking the NGinx command to test the config. Set this to false or delete this line in production. Default: false.

  14. If using NGinx plus, set the endpoint of the local server here. If left empty, NGinx plus API based vhost updates will be disabled.

  15. If only specific vhosts are exposed, automatic discovery and config updates (and the resulting NGinx reloads) might not be desired, as they cause connection drops. Set this parameter to true to disable reloads; Nixy will then only update upstreams using the nplus APIs. Default: false.

  16. Connection parameters for NGinx plus.
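As a sketch of the routing-tag based filtering mentioned in (4) above — assuming the per-app tag value is exposed to the template as $app.RoutingTag (adapt the name to whatever your nixy build actually surfaces) — the app loop in the NGinx template could be wrapped like this:

{{- range $id, $app := .Apps}}\n{{- if eq $app.RoutingTag \"yes\"}}\n# emit the upstream and server blocks for {{$app.Vhost}} here\n{{- end}}\n{{- end}}\n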

NGinx plus

NGinx plus is not shipped with this Docker image. If you want to use NGinx plus, please build nixy from the source tree here and build your own container.

"},{"location":"cluster/setup/gateway.html#relevant-directories","title":"Relevant directories","text":"

Locations for configuration and logs are as follows:

- Configuration: /etc/drove/gateway
- Logs: /var/log/drove/gateway (mounted to /var/log/nginx inside the container)

We shall be volume mounting the config and log directories with the same name.

Prerequisite Setup

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

Go through the following steps to run drove-gateway as a service.

"},{"location":"cluster/setup/gateway.html#create-the-toml-config-for-nixy","title":"Create the TOML config for Nixy","text":"

Sample config file /etc/drove/gateway/gateway.toml:

address = \"127.0.0.1\"\nport = \"6000\"\n\n\n# Drove Options\ndrove = [\n  \"http://controller1.mydomain:10000\",\n   \"http://controller1.mydomain:10000\"\n   ]\n\nleader_vhost = \"drove-staging.mydomain\"\nevent_refresh_interval_sec = 5\nuser = \"guest\"\npass = \"guest\"\n\n\n# Nginx related config\nnginx_config = \"/etc/nginx/nginx.conf\"\nnginx_template = \"/etc/drove/gateway/nginx.tmpl\"\nnginx_cmd = \"nginx\"\nnginx_ignore_check = true\n

Replace domain names

Please remember to update mydomain to a valid domain you want to use.

"},{"location":"cluster/setup/gateway.html#create-template-for-nginx","title":"Create template for NGinx","text":"

Create an NGinx template with the following config in /etc/drove/gateway/nginx.tmpl:

# Generated by drove-gateway {{datetime}}\n\nuser www-data;\nworker_processes auto;\npid /run/nginx.pid;\n\nevents {\n    use epoll;\n    worker_connections 2048;\n    multi_accept on;\n}\nhttp {\n    server_names_hash_bucket_size  128;\n    add_header X-Proxy {{ .Xproxy }} always;\n    access_log /var/log/nginx/access.log;\n    error_log /var/log/nginx/error.log warn;\n    server_tokens off;\n    client_max_body_size 128m;\n    proxy_buffer_size 128k;\n    proxy_buffers 4 256k;\n    proxy_busy_buffers_size 256k;\n    proxy_redirect off;\n    map $http_upgrade $connection_upgrade {\n        default upgrade;\n        ''      close;\n    }\n    # time out settings\n    proxy_send_timeout 120;\n    proxy_read_timeout 120;\n    send_timeout 120;\n    keepalive_timeout 10;\n\n    server {\n        listen       7000 default_server;\n        server_name  _;\n        # Everything is a 503\n        location / {\n            return 503;\n        }\n    }\n    {{if and .LeaderVHost .Leader.Endpoint}}\n    upstream {{.LeaderVHost}} {\n        server {{.Leader.Host}}:{{.Leader.Port}};\n    }\n    server {\n        listen 7000;\n        server_name {{.LeaderVHost}};\n        location / {\n            proxy_set_header HOST {{.Leader.Host}};\n            proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;\n            proxy_connect_timeout 30;\n            proxy_http_version 1.1;\n            proxy_set_header Upgrade $http_upgrade;\n            proxy_set_header Connection $connection_upgrade;\n            proxy_pass http://{{.LeaderVHost}};\n        }\n    }\n    {{end}}\n    {{- range $id, $app := .Apps}}\n    upstream {{$app.Vhost}} {\n        {{- range $app.Hosts}}\n        server {{ .Host }}:{{ .Port }};\n        {{- end}}\n    }\n    server {\n        listen 7000;\n        server_name {{$app.Vhost}};\n        location / {\n            proxy_set_header HOST $host;\n            proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;\n            proxy_connect_timeout 30;\n            proxy_http_version 1.1;\n            proxy_set_header Upgrade $http_upgrade;\n            proxy_set_header Connection $connection_upgrade;\n            proxy_pass http://{{$app.Vhost}};\n        }\n    }\n    {{- end}}\n}\n

The above template will do the following:

- Serve a default catch-all server on port 7000 that returns 503 for unknown vhosts.
- If leader_vhost is configured and a leader is known, expose the leader controller behind that vhost.
- Create an upstream and a server block for every application vhost on the cluster, load balancing across all its instances.

"},{"location":"cluster/setup/gateway.html#create-environment-file","title":"Create environment file","text":"

We want to configure the drove gateway container using the required environment variables. To do that, put the following in /etc/drove/gateway/gateway.env:

CONFIG_FILE_PATH=/etc/drove/gateway/gateway.toml\nTEMPLATE_FILE_PATH=/etc/drove/gateway/nginx.tmpl\n
"},{"location":"cluster/setup/gateway.html#create-systemd-file","title":"Create systemd file","text":"

Create a systemd file. Put the following in /etc/systemd/system/drove.gateway.service:

[Unit]\nDescription=Drove Gateway Service\nAfter=docker.service\nRequires=docker.service\n\n[Service]\nUser=drove\nTimeoutStartSec=0\nRestart=always\nExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-gateway:latest\nExecStart=/usr/bin/docker run  \\\n    --env-file /etc/drove/gateway/gateway.env \\\n    --volume /etc/drove/gateway:/etc/drove/gateway:ro \\\n    --volume /var/log/drove/gateway:/var/log/nginx \\\n    --network host \\\n    --hostname %H \\\n    --rm \\\n    --name drove.gateway \\\n    ghcr.io/phonepe/drove-gateway:latest\n\n[Install]\nWantedBy=multi-user.target\n

Verify the file with the following command:

systemd-analyze verify drove.gateway.service\n

Set permissions

chmod 664 /etc/systemd/system/drove.gateway.service\n

"},{"location":"cluster/setup/gateway.html#start-the-service-on-all-servers","title":"Start the service on all servers","text":"

Use the following to start the service:

systemctl daemon-reload\nsystemctl enable drove.gateway\nsystemctl start drove.gateway\n
"},{"location":"cluster/setup/gateway.html#checking-logs","title":"Checking Logs","text":"

You can check logs using:

journalctl -u drove.gateway -f\n

NGinx logs would be available at /var/log/drove/gateway.

"},{"location":"cluster/setup/gateway.html#log-rotation-for-nginx","title":"Log rotation for NGinx","text":"

The gateway sets up log rotation for the access and error logs with the following config:

/var/log/nginx/*.log {\n    rotate 5\n    size 10M\n    dateext\n    dateformat -%Y-%m-%d\n    missingok\n    compress\n    delaycompress\n    sharedscripts\n    notifempty\n    postrotate\n        test -r /var/run/nginx.pid && kill -USR1 `cat /var/run/nginx.pid`\n    endscript\n}\n

This will rotate both error and access logs when they hit 10MB and keep 5 logs.

Configure the above if you want, and volume mount your own config to /etc/logrotate.d/nginx to use a different scheme as per your requirements.
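For example — assuming you keep a custom config at /etc/drove/gateway/nginx-logrotate.conf (a hypothetical path) — adding one more volume mount to the docker run command in the systemd unit above would wire it in:

--volume /etc/drove/gateway/nginx-logrotate.conf:/etc/logrotate.d/nginx:ro \\\n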

"},{"location":"cluster/setup/maintenance.html","title":"Maintaining a Drove Cluster","text":"

There are a couple of constructs built into Drove to allow for easy maintenance.

"},{"location":"cluster/setup/maintenance.html#maintenance-mode","title":"Maintenance mode","text":"

Drove supports a maintenance mode to allow for software updates without affecting the containers running on the cluster.

Danger

In maintenance mode, outage detection is turned off and container failures for applications are not acted upon even if detected.

"},{"location":"cluster/setup/maintenance.html#engaging-maintenance-mode","title":"Engaging maintenance mode","text":"

Set cluster to maintenance mode.

Preconditions - Cluster must be in the following state: NORMAL

Drove CLIJSON
drove -c local cluster maintenance-on\n

Sample Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/set' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"state\": \"MAINTENANCE\",\n        \"updated\": 1721630351178\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"cluster/setup/maintenance.html#disengaging-maintenance-mode","title":"Disengaging maintenance mode","text":"

Set cluster to normal mode.

Preconditions - Cluster must be in the following state: MAINTENANCE

Drove CLIJSON
drove -c local cluster maintenance-off\n

Sample Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/maintenance/unset' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"state\": \"NORMAL\",\n        \"updated\": 1721630491296\n    },\n    \"message\": \"success\"\n}\n

"},{"location":"cluster/setup/maintenance.html#updating-drove-version-across-the-cluster-quickly","title":"Updating drove version across the cluster quickly","text":"

We recommend the following sequence of steps (a consolidated sketch follows the list):

  1. Find the leader controller for the cluster using drove ... cluster leader.
  2. Update the controller container on the nodes that are not the leader.

    If you are using the systemd file given here, you just need to restart the controller service using systemctl restart drove.controller

  3. Set cluster to maintenance mode using drove ... cluster maintenance-on.

  4. Update the leader controller.

    If you are using the systemd file given here, you just need to restart the leader controller service: systemctl restart drove.controller

  5. Update the executors.

    If you are using the systemd file given here, you just need to restart all executors: systemctl restart drove.executor

  6. Take cluster out of maintenance mode: drove ... cluster maintenance-off
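Put together, a minimal sketch of this sequence — assuming the systemd units from this guide, a Drove CLI config named prod, and that the systemctl commands are run on the respective hosts — looks like this:

# 1. Find the current leader controller\ndrove -c prod cluster leader\n\n# 2. On every non-leader controller host\nsystemctl restart drove.controller\n\n# 3. Engage maintenance mode\ndrove -c prod cluster maintenance-on\n\n# 4. On the leader controller host\nsystemctl restart drove.controller\n\n# 5. On every executor host\nsystemctl restart drove.executor\n\n# 6. Disengage maintenance mode\ndrove -c prod cluster maintenance-off\n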

"},{"location":"cluster/setup/maintenance.html#executor-blacklisting","title":"Executor blacklisting","text":"

In cases where we want to take an executor node out of the cluster for planned maintenance, we need to ensure application instances running on the node are replaced by containers on other nodes and the ones running here are shut down cleanly.

This is achieved by blacklisting the node.

Tip

Whenever blacklisting is done, it causes some flux in the application topology as containers migrate from blacklisted to normal nodes. To reduce the number of times this happens, plan to perform multiple operations together, and blacklist and un-blacklist executors in batches.

Drove will optimize bulk blacklisting related app migrations and will migrate containers together for an app only once rather than once for every node.

Danger

Task instances are not migrated out. This is because it is impossible for Drove to know if a task can be migrated or not (i.e. killed and spun up on a new node in any order).

To blacklist executors do the following:

Drove CLIJSON
drove -c local executor blacklist dd2cbe76-9f60-3607-b7c1-bfee91c15623 ex1 ex2 \n

Sample Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/blacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d&id=ex1&id=ex2' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"failed\": [\n            \"ex2\",\n            \"ex1\"\n        ],\n        \"successful\": [\n            \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\"\n        ]\n    },\n    \"message\": \"success\"\n}\n

To un-blacklist executors do the following:

Drove CLIJSON
drove -c local executor unblacklist dd2cbe76-9f60-3607-b7c1-bfee91c15623 ex1 ex2 \n

Sample Request

curl --location --request POST 'http://drove.local:7000/apis/v1/cluster/executors/unblacklist?id=a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d&id=ex1&id=ex2' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data ''\n

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"failed\": [\n            \"ex2\",\n            \"ex1\"\n        ],\n        \"successful\": [\n            \"a45442a1-d4d0-3479-ab9e-3ed0aa5f7d2d\"\n        ]\n    },\n    \"message\": \"success\"\n}\n

Note

Drove will not re-evaluate placement of existing Applications in RUNNING state once executors are brought back into rotation.

"},{"location":"cluster/setup/planning.html","title":"Planning your cluster","text":"

Running a drove cluster in production for critical workloads involves planning and preparation around factors like availability, scale, security and access management. The following issues should be considered while planning your drove cluster.

"},{"location":"cluster/setup/planning.html#criteria-for-planning","title":"Criteria for planning","text":"

The simplest form of a drove cluster would run controller, zookeeper, executor and gateway services all on the same machine, while a highly available cluster would separate out all components according to the following considerations:

"},{"location":"cluster/setup/planning.html#cluster-configuration","title":"Cluster configuration","text":""},{"location":"cluster/setup/planning.html#controllers","title":"Controllers","text":"

Controllers will manage the cluster with application instances spread across multiple executors as per different placement policies. Controllers use leader election to coordinate and act as a single entity, while each executor acts independently, running many different application instances.

"},{"location":"cluster/setup/planning.html#zookeeper","title":"Zookeeper","text":""},{"location":"cluster/setup/planning.html#executors","title":"Executors","text":""},{"location":"cluster/setup/planning.html#gateways","title":"Gateways","text":""},{"location":"cluster/setup/prerequisites.html","title":"Setting up the prerequisites","text":"

On all machines on the drove cluster, we would want to use the same user and have a consistent storage structure for configuration, logs etc.

Note

All commands are to be issued as root. To get to admin/root mode, issue the following command:

sudo su\n
"},{"location":"cluster/setup/prerequisites.html#setting-up-user","title":"Setting up user","text":"

We shall create a user called drove to run all services and containers, and assign the file ownership to this user.

adduser --system --group \"drove\" --home /var/lib/misc --no-create-home > /dev/null\n
We want this user to be able to run docker containers, so we add it to the docker group:

groupadd docker\nusermod -aG docker drove\n
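Optionally, verify that the user exists and the group membership took effect:

id drove\n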
"},{"location":"cluster/setup/prerequisites.html#create-directories","title":"Create directories","text":"

We shall use the following locations to store configurations, logs etc:

- /etc/drove: configuration
- /var/lib/drove: data
- /var/log/drove: logs

We go ahead and create these locations and set up the correct permissions:

mkdir -p /etc/drove\nchown -R drove.drove /etc/drove\nchmod 700 /etc/drove\nchmod g+s /etc/drove\n\nmkdir -p /var/lib/drove\nchown -R drove.drove /var/lib/drove\nchmod 700 /var/lib/drove\n\nmkdir -p /var/log/drove\n

Danger

Ensure you run the chmod commands to remove read access for everyone other than the owner.
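A quick way to confirm ownership and permissions is:

ls -ld /etc/drove /var/lib/drove /var/log/drove\n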

"},{"location":"cluster/setup/units.html","title":"Units Reference","text":"

In the configuration files for Drove, we use the Duration and DataSize units to make configuration easier.

"},{"location":"cluster/setup/units.html#data-size","title":"Data Size","text":"

Use the following shortcuts to express sizes in human readable form such as 2GB etc:

"},{"location":"cluster/setup/units.html#duration","title":"Duration","text":"

Time durations in Drove can be expressed in human readable form, for example: 3d can be used to signify 3 days and so on. The list of valid duration unit suffixes is:
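As an illustration of both unit types, the fragment below uses values of the kind already seen in the configs in this guide; the combination of keys is purely illustrative, not a real config section:

timeout: 5m                      # Duration: five minutes\ndockerPullTimeout: 100 seconds   # Duration: long form also works\nlogBufferSize: 10MB              # DataSize\nmaxFileSize: 100MiB              # DataSize\n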

"},{"location":"cluster/setup/zookeeper.html","title":"Setting Up Zookeeper","text":"

We shall be running Zookeeper using the official Docker images. All data volumes etc will be mounted on the host machines.

The following ports will be exposed:

- 2181: client port
- 2888: peer connection port
- 3888: leader election port

Danger

The ZK admin server does not shut down cleanly from time to time, and is not needed for anything related to Drove. If you don't need it, turn it off.

We assume the following to be the IPs for the 3 zookeeper nodes: 192.168.3.10, 192.168.3.11 and 192.168.3.12.

"},{"location":"cluster/setup/zookeeper.html#relevant-directories","title":"Relevant directories","text":"

Locations for data and logs are as follows:

- Data: /var/lib/drove/zk
- Logs: /var/log/drove/zk

"},{"location":"cluster/setup/zookeeper.html#important-files","title":"Important files","text":"

The zookeeper container stores snapshots, transaction logs and application logs in the /data, /datalog and /logs directories respectively. We shall be volume mounting the following:

- /var/lib/drove/zk/data as /data
- /var/lib/drove/zk/datalog as /datalog
- /var/log/drove/zk as /logs

Docker will create these directories when container comes up for the first time.

Tip

The zk server id (as set above using the ZOO_MY_ID) can also be set by putting the server number in a file named myid in the /data directory.
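For example, on the first node (where ZOO_MY_ID=1), assuming the data volume mount used in the systemd unit later on this page, that would be:

echo 1 > /var/lib/drove/zk/data/myid\n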

Prerequisite Setup

If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.

"},{"location":"cluster/setup/zookeeper.html#setup-configuration-files","title":"Setup configuration files","text":"

Let's create the config directory:

mkdir -p /etc/drove/zk\n

We shall be creating 3 different configuration files to configure zookeeper: zk.env for environment variables, java.env for JVM parameters and logback.xml for logging.

"},{"location":"cluster/setup/zookeeper.html#setup-environment-variables","title":"Setup environment variables","text":"

Let us prepare the configuration. Put the following in a file: /etc/drove/zk/zk.env:

#(1)!\nZOO_TICK_TIME=2000\nZOO_INIT_LIMIT=10\nZOO_SYNC_LIMIT=5\nZOO_STANDALONE_ENABLED=false\nZOO_ADMINSERVER_ENABLED=false\n\n#(2)!\nZOO_AUTOPURGE_PURGEINTERVAL=12\nZOO_AUTOPURGE_SNAPRETAINCOUNT=5\n\n#(3)!\nZOO_MY_ID=1\nZOO_SERVERS=server.1=192.168.3.10:2888:3888;2181 server.2=192.168.3.11:2888:3888;2181 server.3=192.168.3.12:2888:3888;2181\n
  1. This is cluster level configuration to ensure the cluster topology remains stable through minor flaps
  2. This will control how much data we retain
  3. This section needs to change per server. Each server should have a different ZOO_MY_ID set. And the same numbers get referred to in ZOO_SERVERS section.

Warning

Info

Exhaustive set of options can be found on the Official Docker Page.

"},{"location":"cluster/setup/zookeeper.html#setup-jvm-parameters","title":"Setup JVM parameters","text":"

Put the following in /etc/drove/zk/java.env

export SERVER_JVMFLAGS='-Djute.maxbuffer=0x9fffff -Xmx4g -Xms4g -Dfile.encoding=utf-8 -XX:+UseG1GC -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError'\n

Configuring Max Data Size

Drove data per node can get a bit on the larger side from time to time depending on your application configuration. To be on the safe side, we need to increase the maximum data size per node. This is achieved by setting the JVM option -Djute.maxbuffer=0x9fffff on all cluster nodes in Drove. This is approximately 10MB. Actual payloads don't come anywhere close to this limit; payload compression is planned for a future version, which will remove the need to set this variable.

For the Zookeeper Docker, the environment variable SERVER_JVMFLAGS needs to be set to -Djute.maxbuffer=0x9fffff.

Please refer to Zookeeper Advanced Configuration for further properties that can be tuned.

JVM Size

We set 4GB JVM heap size for ZK by adding appropriate options in SERVER_JVMFLAGS. Please make sure you have sized your machines to have 10-16GB of RAM at the very least. Tune the JVM size and machine size according to your needs.


JVMFLAGS environment variable

Do not set this variable in zk.env, for a couple of reasons:

"},{"location":"cluster/setup/zookeeper.html#configure-logging","title":"Configure logging","text":"

We want to have physical log files on disk for debugging and audits and want the container to be ephemeral to allow for easy updates etc. To achieve this, put the following in /etc/drove/zk/logback.xml:

<!--\n Copyright 2022 The Apache Software Foundation\n\n Licensed to the Apache Software Foundation (ASF) under one\n or more contributor license agreements.  See the NOTICE file\n distributed with this work for additional information\n regarding copyright ownership.  The ASF licenses this file\n to you under the Apache License, Version 2.0 (the\n \"License\"); you may not use this file except in compliance\n with the License.  You may obtain a copy of the License at\n\n     http://www.apache.org/licenses/LICENSE-2.0\n\n Unless required by applicable law or agreed to in writing, software\n distributed under the License is distributed on an \"AS IS\" BASIS,\n WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n See the License for the specific language governing permissions and\n limitations under the License.\n\n Define some default values that can be overridden by system properties\n-->\n<configuration>\n  <!-- Uncomment this if you would like to expose Logback JMX beans -->\n  <!--jmxConfigurator /-->\n\n  <property name=\"zookeeper.console.threshold\" value=\"INFO\" />\n\n  <property name=\"zookeeper.log.dir\" value=\"/logs\" />\n  <property name=\"zookeeper.log.file\" value=\"zookeeper.log\" />\n  <property name=\"zookeeper.log.threshold\" value=\"INFO\" />\n  <property name=\"zookeeper.log.maxfilesize\" value=\"256MB\" />\n  <property name=\"zookeeper.log.maxbackupindex\" value=\"20\" />\n\n  <!--\n    console\n    Add \"console\" to root logger if you want to use this\n  -->\n  <appender name=\"CONSOLE\" class=\"ch.qos.logback.core.ConsoleAppender\">\n    <encoder>\n      <pattern>%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n</pattern>\n    </encoder>\n    <filter class=\"ch.qos.logback.classic.filter.ThresholdFilter\">\n      <level>${zookeeper.console.threshold}</level>\n    </filter>\n  </appender>\n\n  <!--\n    Add ROLLINGFILE to root logger to get log file output\n  -->\n  <appender name=\"ROLLINGFILE\" class=\"ch.qos.logback.core.rolling.RollingFileAppender\">\n    <File>${zookeeper.log.dir}/${zookeeper.log.file}</File>\n    <encoder>\n      <pattern>%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n</pattern>\n    </encoder>\n    <filter class=\"ch.qos.logback.classic.filter.ThresholdFilter\">\n      <level>${zookeeper.log.threshold}</level>\n    </filter>\n    <rollingPolicy class=\"ch.qos.logback.core.rolling.FixedWindowRollingPolicy\">\n      <maxIndex>${zookeeper.log.maxbackupindex}</maxIndex>\n      <FileNamePattern>${zookeeper.log.dir}/${zookeeper.log.file}.%i</FileNamePattern>\n    </rollingPolicy>\n    <triggeringPolicy class=\"ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy\">\n      <MaxFileSize>${zookeeper.log.maxfilesize}</MaxFileSize>\n    </triggeringPolicy>\n  </appender>\n\n  <!--\n    Add TRACEFILE to root logger to get log file output\n    Log TRACE level and above messages to a log file\n  -->\n  <!--property name=\"zookeeper.tracelog.dir\" value=\"${zookeeper.log.dir}\" />\n  <property name=\"zookeeper.tracelog.file\" value=\"zookeeper_trace.log\" />\n  <appender name=\"TRACEFILE\" class=\"ch.qos.logback.core.FileAppender\">\n    <File>${zookeeper.tracelog.dir}/${zookeeper.tracelog.file}</File>\n    <encoder>\n      <pattern>%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n</pattern>\n    </encoder>\n    <filter class=\"ch.qos.logback.classic.filter.ThresholdFilter\">\n      <level>TRACE</level>\n    </filter>\n  </appender-->\n\n  <!--\n    zk audit logging\n  -->\n  <property 
name=\"zookeeper.auditlog.file\" value=\"zookeeper_audit.log\" />\n  <property name=\"zookeeper.auditlog.threshold\" value=\"INFO\" />\n  <property name=\"audit.logger\" value=\"INFO, RFAAUDIT\" />\n\n  <appender name=\"RFAAUDIT\" class=\"ch.qos.logback.core.rolling.RollingFileAppender\">\n    <File>${zookeeper.log.dir}/${zookeeper.auditlog.file}</File>\n    <encoder>\n      <pattern>%d{ISO8601} %p %c{2}: %m%n</pattern>\n    </encoder>\n    <filter class=\"ch.qos.logback.classic.filter.ThresholdFilter\">\n      <level>${zookeeper.auditlog.threshold}</level>\n    </filter>\n    <rollingPolicy class=\"ch.qos.logback.core.rolling.FixedWindowRollingPolicy\">\n      <maxIndex>10</maxIndex>\n      <FileNamePattern>${zookeeper.log.dir}/${zookeeper.auditlog.file}.%i</FileNamePattern>\n    </rollingPolicy>\n    <triggeringPolicy class=\"ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy\">\n      <MaxFileSize>10MB</MaxFileSize>\n    </triggeringPolicy>\n  </appender>\n\n  <logger name=\"org.apache.zookeeper.audit.Slf4jAuditLogger\" additivity=\"false\" level=\"${audit.logger}\">\n    <appender-ref ref=\"RFAAUDIT\" />\n  </logger>\n\n  <root level=\"INFO\">\n    <appender-ref ref=\"CONSOLE\" />\n    <appender-ref ref=\"ROLLINGFILE\" />\n  </root>\n</configuration>\n

Tip

This is a customization of the original file from Zookeeper source tree. Please refer to documentation to configure logging.

"},{"location":"cluster/setup/zookeeper.html#create-systemd-file","title":"Create Systemd File","text":"

Create a systemd file. Put the following in /etc/systemd/system/drove.zookeeper.service:

[Unit]\nDescription=Drove Zookeeper Service\nAfter=docker.service\nRequires=docker.service\n\n[Service]\nUser=drove\nTimeoutStartSec=0\nRestart=always\nExecStartPre=-/usr/bin/docker pull zookeeper:3.8\nExecStart=/usr/bin/docker run \\\n    --env-file /etc/drove/zk/zk.env \\\n    --volume /var/lib/drove/zk/data:/data \\\n    --volume /var/lib/drove/zk/datalog:/datalog \\\n    --volume /var/log/drove/zk:/logs \\\n    --volume /etc/drove/zk/logback.xml:/conf/logback.xml \\\n    --volume /etc/drove/zk/java.env:/conf/java.env \\\n    --publish 2181:2181 \\\n    --publish 2888:2888 \\\n    --publish 3888:3888 \\\n    --rm \\\n    --name drove.zookeeper \\\n    zookeeper:3.8\n\n[Install]\nWantedBy=multi-user.target\n

Verify the file with the following command:

systemd-analyze verify drove.zookeeper.service\n

Set permissions

chmod 664 /etc/systemd/system/drove.zookeeper.service\n

"},{"location":"cluster/setup/zookeeper.html#start-the-service-on-all-servers","title":"Start the service on all servers","text":"

Use the following to start the service:

systemctl daemon-reload\nsystemctl enable drove.zookeeper\nsystemctl start drove.zookeeper\n

You can check server status using the following:

echo srvr | nc localhost 2181\n

Tip

Replace localhost on the above command with the actual ZK server IPs to test remote connectivity.

Note

You can access the ZK client from the container using the following command:

docker exec -it drove.zookeeper bin/zkCli.sh\n

To connect to remote host you can use the following:

docker exec -it drove.zookeeper bin/zkCli.sh -server <server name or ip>:2181\n

"},{"location":"extra/cli.html","title":"Drove CLI","text":"

Details for the Drove CLI, including installation and usage can be found in the cli repo.

Repo link: https://github.com/PhonePe/drove-cli.

"},{"location":"extra/epoch.html","title":"Epoch","text":"

Epoch is a cron type scheduler that spins up container jobs on Drove.

Details for using epoch can be found in the epoch repo.

Link for Epoch repo: https://github.com/PhonePe/epoch.

"},{"location":"extra/epoch.html#epoch-cli","title":"Epoch CLI","text":"

There is a cli client for interaction with epoch. Details for installation and usage can be found in the epoch CLI repo.

Link for Epoch CLI repo: https://github.com/phonepe/epoch-cli.

"},{"location":"extra/libraries.html","title":"Libraries","text":"

Drove is written in Java. We provide a few libraries that can be used to integrate with a Drove cluster.

"},{"location":"extra/libraries.html#setup","title":"Setup","text":"

Setup the drove version

<properties>\n    <!--other properties-->\n    <drove.version>1.29</drove.version>\n</properties>\n

Checking the latest version

Latest version can be checked at the github packages page here

All libraries are located in sub packages of the top level package com.phonepe.drove.

Java Version Compatibility

Using Drove libraries requires Java 17+.

"},{"location":"extra/libraries.html#drove-model","title":"Drove Model","text":"

The model library contains the classes used in requests and responses. It depends on jackson and dropwizard-validation.

"},{"location":"extra/libraries.html#dependency","title":"Dependency","text":"
<dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-models</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n
"},{"location":"extra/libraries.html#drove-client","title":"Drove Client","text":"

We provide a client library that can be used to connect to a Drove cluster. The client accepts controller endpoints as a parameter (among other things) and automatically tracks the leader controller. If a single controller endpoint is provided, this functionality is turned off.

Please note that the client does not provide specific functions corresponding to the different API calls on the controller; it acts as a simple endpoint discovery mechanism for the Drove cluster. Please refer to the API section for details on individual APIs.

"},{"location":"extra/libraries.html#transport","title":"Transport","text":"

The transport layer in the client is used to actually make HTTP calls to the Drove server. A new transport can be used by implementing the get(), post(), put() and delete() methods in the DroveHttpTransport interface.

By default the Drove client uses the JDK's built-in HTTP client as a trivial transport implementation. We also provide an Apache HttpComponents based implementation.

Tip

Do not use the default transport in production. Please use the HTTP Components based transport or your custom ones.

"},{"location":"extra/libraries.html#dependencies","title":"Dependencies","text":"
 <dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-client</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n<dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-client-httpcomponent-transport</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n
"},{"location":"extra/libraries.html#sample-code","title":"Sample code","text":"
public class DroveCluster implements AutoCloseable {\n\n    @Getter\n    private final DroveClient droveClient;\n\n    public DroveCluster() {\n        final var config = new DroveConfig()\n            .setEndpoints(List.of(\"http://controller1:4000\", \"http://controller2:4000\"));\n\n        this.droveClient = new DroveClient(config,\n                                      List.of(new BasicAuthDecorator(\"guest\", \"guest\")),\n                                           new DroveHttpComponentsTransport(config.getCluster()));\n    }\n\n    @Override\n    public void close() throws Exception {\n        this.droveClient.close();\n    }\n}\n

RequestDecorator

This interface can be implemented to augment requests with special headers such as Authorization, as well as for other purposes like setting the content type.

"},{"location":"extra/libraries.html#drove-event-listener","title":"Drove Event Listener","text":"

This library provides callbacks that can be used to listen and react to events happening on the Drove cluster.

"},{"location":"extra/libraries.html#dependencies_1","title":"Dependencies","text":"
<!--Include Drove client-->\n<dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-events-client</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n
"},{"location":"extra/libraries.html#sample-code_1","title":"Sample Code","text":"
final var droveClient = ... //build your java transport, client here\n\n//Create and setup your object mapper\nfinal var mapper = new ObjectMapper();\nmapper.registerModule(new ParameterNamesModule());\nmapper.setSerializationInclusion(JsonInclude.Include.NON_EMPTY);\nmapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);\nmapper.disable(SerializationFeature.FAIL_ON_EMPTY_BEANS);\nmapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);\nmapper.enable(MapperFeature.ACCEPT_CASE_INSENSITIVE_ENUMS);\n\nfinal var listener = new DroveRemoteEventListener(droveClient, //Create listener\n                                                    mapper,\n                                                    new DroveEventPollingOffsetInMemoryStore(),\n                                                    Duration.ofSeconds(1));\n\nlistener.onEventReceived() //Connect signal handlers\n    .connect(events -> {\n        log.info(\"Remote Events: {}\", events);\n    });\n\nlistener.start(); //Start listening\n\n\n//Once done close the listener\nlistener.close();\n

Event Types

Please check the com.phonepe.drove.models.events package for the different event types and classes.

Event Polling Offset Store

The event poller library uses polling to find new events based on an offset. The event polling offset store is used to store and retrieve this offset. The default DroveEventPollingOffsetInMemoryStore keeps this information in memory. Implement DroveEventPollingOffsetStore over durable storage if you want the offset to survive restarts.

"},{"location":"extra/libraries.html#drove-hazelcast-cluster-discovery","title":"Drove Hazelcast Cluster Discovery","text":"

Drove provides an implementation of the Hazelcast discovery SPI so that containers deployed on a drove cluster can discover each other. This client uses the token injected by drove in the DROVE_APP_INSTANCE_AUTH_TOKEN environment variable to get sibling information from the controller.

"},{"location":"extra/libraries.html#dependencies_2","title":"Dependencies","text":"
<!--Include Drove client-->\n<!--Include Hazelcast-->\n<dependency>\n    <groupId>com.phonepe.drove</groupId>\n    <artifactId>drove-events-client</artifactId>\n    <version>${drove.version}</version>\n</dependency>\n
"},{"location":"extra/libraries.html#sample-code_2","title":"Sample Code","text":"
//Setup hazelcast\nConfig config = new Config();\n\n// Enable discovery\nconfig.setProperty(\"hazelcast.discovery.enabled\", \"true\");\nconfig.setProperty(\"hazelcast.discovery.public.ip.enabled\", \"true\");\nconfig.setProperty(\"hazelcast.socket.client.bind.any\", \"true\");\nconfig.setProperty(\"hazelcast.socket.bind.any\", \"false\");\n\n//Setup networking\nNetworkConfig networkConfig = config.getNetworkConfig();\nnetworkConfig.getInterfaces().addInterface(\"0.0.0.0\").setEnabled(true);\nnetworkConfig.setPort(port); //Port is the port exposed on the container for hazelcast clustering\n\n// Setup Drove discovery\nJoinConfig joinConfig = networkConfig.getJoin();\n\nDiscoveryConfig discoveryConfig = joinConfig.getDiscoveryConfig();\nDiscoveryStrategyConfig discoveryStrategyConfig =\n        new DiscoveryStrategyConfig(new DroveDiscoveryStrategyFactory());\ndiscoveryStrategyConfig.addProperty(\"drove-endpoint\", \"http://controller1:4000,http://controller2:4000\"); //Controller endpoints\ndiscoveryStrategyConfig.addProperty(\"port-name\", \"hazelcast\"); // Name of the hazelcast port defined in Application spec\ndiscoveryStrategyConfig.addProperty(\"transport\", \"com.phonepe.drove.client.transport.httpcomponent.DroveHttpComponentsTransport\");\ndiscoveryStrategyConfig.addProperty(\"cluster-by-app-name\", true); //Cluster container across multiple app versions\ndiscoveryConfig.addDiscoveryStrategyConfig(discoveryStrategyConfig);\n\n//Create hazelcast node\nval node = Hazelcast.newHazelcastInstance(config);\n\n//Once connected, node.getCluster() will be non null\n

Peer discovery modes

By default the containers will only discover and connect to containers from the same application id. If you need to connect to containers from all versions of the same application please set the cluster-by-app-name property to true as in the above example.

"},{"location":"extra/nvidia.html","title":"Setting up Nvidia GPU computation on executor","text":"

Prerequisite: Docker version 19.0.3+. Check Docker versions and nvidia for details.

The steps below are primarily for Ubuntu; for other distros, check the associated links.

"},{"location":"extra/nvidia.html#install-nvidia-drivers-on-hosts","title":"Install nvidia drivers on hosts","text":"

Ubuntu provides packaged drivers for nvidia. Driver installation Guide

Recommended

ubuntu-drivers list --gpgpu\nubuntu-drivers install --gpgpu nvidia:535-server\n

Alternatively, apt can be used, but it may require additional steps (Manual install):

# Check for the latest stable version \napt search nvidia-driver.*server\napt install -y nvidia-driver-535-server  nvidia-utils-535-server \n

For other distros check Guide

"},{"location":"extra/nvidia.html#install-nvidia-container-toolkit","title":"Install Nvidia-container-toolkit","text":"

Add nvidia repo

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg   && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list\n\napt install -y nvidia-container-toolkit\n
For other distros check guide here

Configure docker with nvidia toolkit

nvidia-ctk runtime configure --runtime=docker\n\nsystemctl restart docker #Restart Docker\n
"},{"location":"extra/nvidia.html#verify-installation","title":"Verify installation","text":"

On the host: nvidia-smi -l

In a docker container: docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

+-----------------------------------------------------------------------------+\n| NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |\n|-------------------------------+----------------------+----------------------+\n| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n|                               |                      |               MIG M. |\n|===============================+======================+======================|\n|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |\n| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |\n|                               |                      |                  N/A |\n+-------------------------------+----------------------+----------------------+\n\n+-----------------------------------------------------------------------------+\n| Processes:                                                                  |\n|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n|        ID   ID                                                   Usage      |\n|=============================================================================|\n|  No running processes found                                                 |\n+-----------------------------------------------------------------------------+\n
Verification guide

"},{"location":"extra/nvidia.html#enable-nvidia-support-on-drove","title":"Enable nvidia support on drove","text":"

Enable Nvidia support in drove-executor.yml and restart drove-executor

...\nresources:\n  ...\n  enableNvidiaGpu: true\n...\n

"},{"location":"tasks/index.html","title":"Introduction","text":"

A task is a representation for transient containerized workloads on the cluster. A task instance is supposed to have a much shorter life-time than an application instance. Use tasks to spin up things like automation scripts etc.

"},{"location":"tasks/index.html#primary-differences-with-an-application","title":"Primary differences with an application","text":"

Please note the following important differences between a task instance and application instances

Tip

Use epoch to spin up tasks in a periodic manner

A task specification contains the following sections:

"},{"location":"tasks/index.html#task-id","title":"Task ID","text":"

Identification of a task is a bit more involved on Drove. There is a Task ID ({sourceAppName}-{taskId}) which is used internally in Drove. This is returned to the client when the task is created.

However, clients are supposed to use the {sourceAppName,taskId} combo they have sent in the task spec to address and send commands to their tasks.

"},{"location":"tasks/index.html#task-states-and-operations","title":"Task States and operations","text":"

Tasks on Drove have their own life cycle modelled as a state machine. State transitions can be triggered by issuing operations using the APIs.

"},{"location":"tasks/index.html#states","title":"States","text":"

Tasks on a Drove cluster can be one of the following states:

"},{"location":"tasks/index.html#operations","title":"Operations","text":"

The following task operations are recognized by Drove:

Tip

All operations need Cluster Operation Spec which can be used to control the timeout and parallelism of tasks generated by the operation.

"},{"location":"tasks/index.html#task-state-machine","title":"Task State Machine","text":"

The following state machine signifies the states and transitions as affected by cluster state and operations issued.

"},{"location":"tasks/operations.html","title":"Task Operations","text":"

This page discusses operations relevant to Task management. Please go over the Task State Machine to understand the different states a task can be in and how operations applied (and external changes) move a task from one state to another.

Note

Please go through Cluster Op Spec to understand the operation parameters being sent.

For tasks only the timeout parameter is relevant.

Note

Only one operation can be active on a particular task identified by a {sourceAppName,taskId} at a time.

Warning

Only the leader controller will accept and process operations. To avoid confusion, use the controller endpoint exposed by Drove Gateway to issue commands.

"},{"location":"tasks/operations.html#cluster-operation-specification","title":"Cluster Operation Specification","text":"

When an operation is submitted to the cluster, a cluster op spec needs to be specified. This is needed to control different aspects of the operation, such as its parallelism, its timeout and so on.

The following aspects of an operation can be configured:

| Name | Option | Description |
|------|--------|-------------|
| Timeout | timeout | The duration after which Drove considers the operation to have timed out. |
| Parallelism | parallelism | Parallelism of the task operation. (Range: 1-32) |
| Failure Strategy | failureStrategy | Set this to STOP. |

Note

For internal recovery operations, Drove generates its own operations. For these, Drove applies the following cluster operation spec:

The default operation spec can be configured in the controller configuration file. It is recommended to set the parallelism there to something like 8 for faster recovery.

"},{"location":"tasks/operations.html#how-to-initiate-an-operation","title":"How to initiate an operation","text":"

Tip

Use the Drove CLI to perform all manual operations.

All operations for task lifecycle management need to be issued via a POST HTTP call to the leader controller endpoint on the path /apis/v1/tasks/operations. API will return HTTP OK/200 and relevant json response as payload.

Sample api call:

curl --location 'http://drove.local:7000/apis/v1/tasks/operations' \\\n--header 'Content-Type: application/json' \\\n--header 'Authorization: Basic YWRtaW46YWRtaW4=' \\\n--data '{\n    \"type\": \"KILL\",\n    \"sourceAppName\" : \"TEST_APP\",\n    \"taskId\" : \"T0012\",\n    \"opSpec\": {\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}'\n

Note

In the above examples, http://drove.local:7000 is the endpoint of the leader. TEST_APP is the name of the application that started this task and taskId is a unique client generated id. Authorization is basic auth.

Warning

Task operations are not cancellable.

"},{"location":"tasks/operations.html#create-a-task","title":"Create a task","text":"

A task can be created by issuing the following command.

Preconditions: - Task with same {sourceAppName,taskId} should not exist on the cluster.

State Transition:

To create a task a Task Spec needs to be created first.

Once ready, CLI command needs to be issued or the following payload needs to be sent:

Drove CLIJSON
drove -c local tasks create sample/test_task.json\n

Sample Request Payload

{\n    \"type\": \"CREATE\",\n    \"spec\": {...}, //(1)!\n    \"opSpec\": { //(2)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Spec as mentioned in Task Specification
  2. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"taskId\": \"TEST_APP-T0012\"\n    },\n    \"message\": \"success\"\n}\n

Warning

There are no separate create/run steps in a task. Creation will start execution automatically and immediately.

"},{"location":"tasks/operations.html#kill-a-task","title":"Kill a task","text":"

A task can be killed by issuing the following command.

Preconditions: - Task with same {sourceAppName,taskId} needs to exist on the cluster.

State Transition:

CLI command needs to be issued or the following payload needs to be sent:

Drove CLIJSON
drove -c local tasks kill TEST_APP T0012\n

Sample Request Payload

{\n    \"type\": \"KILL\",\n    \"sourceAppName\" : \"TEST_APP\",//(1)!\n    \"taskId\" : \"T0012\",//(2)!\n    \"opSpec\": {//(3)!\n        \"timeout\": \"5m\",\n        \"parallelism\": 1,\n        \"failureStrategy\": \"STOP\"\n    }\n}\n

  1. Source app name as mentioned in spec during task creation
  2. Task ID as mentioned in the spec
  3. Operation spec as mentioned in Cluster Op Spec

Sample response

{\n    \"status\": \"SUCCESS\",\n    \"data\": {\n        \"taskId\": \"T0012\"\n    },\n    \"message\": \"success\"\n}\n

Note

Task metadata will remain on the cluster for some time. Metadata cleanup for tasks is automatic and can be configured in the controller configuration.

"},{"location":"tasks/specification.html","title":"Task Specification","text":"

A task is defined using JSON. We use a sample configuration below to explain the options.

"},{"location":"tasks/specification.html#sample-task-definition","title":"Sample Task Definition","text":"
{\n    \"sourceAppName\": \"TEST_APP\",//(1)!\n    \"taskId\": \"T0012\",//(2)!\n    \"executable\": {//(3)!\n        \"type\": \"DOCKER\", // (4)!\n        \"url\": \"ghcr.io/appform-io/test-task\",//(5)!\n        \"dockerPullTimeout\": \"100 seconds\"//(6)!\n    },\n     \"resources\": [//(7)!\n        {\n            \"type\": \"CPU\",\n            \"count\": 1//(8)!\n        },\n        {\n            \"type\": \"MEMORY\",\n            \"sizeInMB\": 128//(9)!\n        }\n    ],\n    \"volumes\": [//(10)!\n        {\n            \"pathInContainer\": \"/data\",//(11)!\n            \"pathOnHost\": \"/mnt/datavol\",//(12)!\n            \"mode\" : \"READ_WRITE\"//(13)!\n        }\n    ],\n    \"configs\" : [//(14)!\n        {\n            \"type\" : \"INLINE\",//(15)!\n            \"localFilename\": \"/testfiles/drove.txt\",//(16)!\n            \"data\" : \"RHJvdmUgdGVzdA==\"//(17)!\n        }\n    ],\n    \"placementPolicy\": {//(18)!\n        \"type\": \"ANY\"//(19)!\n    },\n    \"env\": {//(20)!\n        \"CORES\": \"8\"\n    },\n    \"args\" : [] //(27)!\n    \"tags\": { //(21)!\n        \"superSpecialApp\": \"yes_i_am\",\n        \"say_my_name\": \"heisenberg\"\n    },\n    \"logging\": {//(22)!\n        \"type\": \"LOCAL\",//(23)!\n        \"maxSize\": \"100m\",//(24)!\n        \"maxFiles\": 3,//(25)!\n        \"compress\": true//(26)!\n    }\n}\n
  1. Name of the application that has started the task. Make sure this is a valid application on the cluster.
  2. A unique ID for this task. Uniqueness is up to the user; Drove will scope it in the sourceAppName namespace.
  3. Coordinates for the executable. Refer to Executable Specification for details.
  4. Right now the only type supported is DOCKER.
  5. Docker container address.
  6. Timeout for the container pull.
  7. List of resources required to run this task. Check Resource Requirements Specification for more details.
  8. Number of CPU cores to be allocated.
  9. Amount of memory to be allocated, expressed in Megabytes.
  10. Volumes to be mounted. Refer to Volume Specification for details.
  11. Path that will be visible inside the container for this mount.
  12. Actual path on the host machine for the mount.
  13. Mount mode can be READ_WRITE or READ_ONLY.
  14. Configuration to be injected as a file inside the container. Please refer to Config Specification for details.
  15. Type of config. Can be INLINE, EXECUTOR_LOCAL_FILE, CONTROLLER_HTTP_FETCH and EXECUTOR_HTTP_FETCH. Specifies how Drove will get the contents to be injected.
  16. File name for the config inside the container.
  17. Serialized form of the data; this and the other parameters will vary according to the type specified above.
  18. Specifies how the container will be placed on the cluster. Check Placement Policy for details.
  19. Type of placement can be ANY, ONE_PER_HOST, MATCH_TAG, NO_TAG, RULE_BASED and COMPOSITE. The rest of the parameters in this section depend on the type.
  20. Custom environment variables. Additional variables are injected by Drove as well. See the Environment Variables section for details.
  21. Key-value metadata that can be used in external systems.
  22. Specifies how docker log files are configured. Refer to the Logging Specification.
  23. Log to local file.
  24. Maximum file size.
  25. Number of latest log files to retain.
  26. Log files will be compressed.
  27. List of command line arguments. See Command Line Arguments for details.

Warning

Please make sure sourceAppName is set to a correct application name as specified in the name parameter of a running application on the cluster.

If this is not done, stale task metadata will not be cleaned up and your metadata store performance will get affected over time.

"},{"location":"tasks/specification.html#executable-specification","title":"Executable Specification","text":"

Right now Drove supports only docker containers. However as engines, both docker and podman are supported. Drove executors will fetch the executable directly from the registry based on the configuration provided.

Name Option Description Type type Set type to DOCKER. URL url Docker container URL`. Timeout dockerPullTimeout Timeout for docker image pull.

Note

Drove supports docker registry authentication. This can be configured in the executor configuration file.

"},{"location":"tasks/specification.html#resource-requirements-specification","title":"Resource Requirements Specification","text":"

This section specifies the hardware resources required to run the container. Right now only CPU and MEMORY are supported as resource types that can be reserved for a container.

"},{"location":"tasks/specification.html#cpu-requirements","title":"CPU Requirements","text":"

Specifies number of cores to be assigned to the container.

Name Option Description Type type Set type to CPU for this. Count count Number of cores to be assigned."},{"location":"tasks/specification.html#memory-requirements","title":"Memory Requirements","text":"

Specifies amount of memory to be allocated to a container.

Name Option Description Type type Set type to MEMORY for this. Count sizeInMB Amount of memory (in Mega Bytes) to be allocated.

Sample

[\n    {\n        \"type\": \"CPU\",\n        \"count\": 1\n    },\n    {\n        \"type\": \"MEMORY\",\n        \"sizeInMB\": 128\n    }\n]\n

Note

Both CPU and MEMORY configurations are mandatory.

"},{"location":"tasks/specification.html#volume-specification","title":"Volume Specification","text":"

Files and directories can be mounted from the executor host into the container. The volumes section contains a list of volumes that need to be mounted.

Name Option Description Path In Container pathInContainer Path that will be visible inside the container for this mount. Path On Host pathOnHost Actual path on the host machine for the mount. Mount Mode mode Mount mode can be READ_WRITE and READ_ONLY to allow the containerized process to write or read to the volume.

Info

We do not support mounting remote volumes as of now.

"},{"location":"tasks/specification.html#config-specification","title":"Config Specification","text":"

Drove supports injection of configuration files into containers. The specifications for the same are discussed below.

"},{"location":"tasks/specification.html#inline-config","title":"Inline config","text":"

Inline configuration can be added in the Application Specification itself. This will manifest as a file inside the container.

The following details are needed for this:

Name Option Description Type type Set the value to INLINE Local Filename localFilename File name for the config inside the container. Data data Base64 encoded string for the data. The value for this will be masked on UI.

Config file:

port: 8080\nlogLevel: DEBUG\n
Corresponding config specification:
{\n    \"type\" : \"INLINE\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"data\" : \"cG9ydDogODA4MApsb2dMZXZlbDogREVCVUcK\"\n}\n

Warning

The full base 64 encoded config data will get stored in Drove ZK and will be pushed to executors inline. It is not recommended to stream large config files to containers using this method. This will probably need additional configuration on your ZK cluster.

"},{"location":"tasks/specification.html#locally-loaded-config","title":"Locally loaded config","text":"

Config file from a path on the executor directly. Such files can be distributed to the executor host using existing configuration management systems such as OpenTofu, Salt etc.

The following details are needed for this:

Name Option Description Type type Set the value to EXECUTOR_LOCAL_FILE Local Filename localFilename File name for the config inside the container. File path filePathOnHost Path to the config file on executor host.

Sample config specification:

{\n    \"type\" : \"EXECUTOR_LOCAL_FILE\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"data\" : \"/mnt/configs/myservice/config.yml\"\n}\n

"},{"location":"tasks/specification.html#controller-fetched-config","title":"Controller fetched Config","text":"

Config file can be fetched from a remote server by the controller. Once fetched, these will be streamed to the executor as part of the instance specification for starting a container.

The following details are needed for this:

Name Option Description Type type Set the value to CONTROLLER_HTTP_FETCH Local Filename localFilename File name for the config inside the container. HTTP Call Details http HTTP Call related details. Please refer to HTTP Call Specification for details.

Sample config specification:

{\n    \"type\" : \"CONTROLLER_HTTP_FETCH\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"http\" : {\n        \"protocol\" : \"HTTP\",\n        \"hostname\" : \"configserver.internal.yourdomain.net\",\n        \"port\" : 8080,\n        \"path\" : \"/configs/myapp\",\n        \"username\" : \"appuser\",\n        \"password\" : \"secretpassword\"\n    }\n}\n

Note

The controller will make an API call for every single time it asks an executor to spin up a container. Please make sure to account for this in your configuration management system.

"},{"location":"tasks/specification.html#executor-fetched-config","title":"Executor fetched Config","text":"

Config file can be fetched from a remote server by the executor before spinning up a container. Once fetched, the payload will be injected as a config file into the container.

The following details are needed for this:

Name Option Description Type type Set the value to EXECUTOR_HTTP_FETCH Local Filename localFilename File name for the config inside the container. HTTP Call Details http HTTP Call related details. Please refer to HTTP Call Specification for details.

Sample config specification:

{\n    \"type\" : \"EXECUTOR_HTTP_FETCH\",\n    \"localFilename\" : \"/config/service.yml\",\n    \"http\" : {\n        \"protocol\" : \"HTTP\",\n        \"hostname\" : \"configserver.internal.yourdomain.net\",\n        \"port\" : 8080,\n        \"path\" : \"/configs/myapp\",\n        \"username\" : \"appuser\",\n        \"password\" : \"secretpassword\"\n    }\n}\n

Note

All executors will make an API call for every single time they spin up a container for this application. Please make sure to account for this in your configuration management system.

"},{"location":"tasks/specification.html#http-call-specification","title":"HTTP Call Specification","text":"

This section details the options that can set when making http calls to a configuration management system from controllers or executors.

The following options are available for HTTP call:

Name Option Description Protocol protocol Protocol to use for upstream call. Can be HTTP or HTTPS. Hostname hostname Host to call. Port port Provide custom port. Defaults to 80 for http and 443 for https. API Path path Path component of the URL. Include query parameters here. Defaults to / HTTP Method verb Type of call, use GET, POST or PUT. Defaults to GET. Success Code successCodes List of HTTP status codes which is considered as success. Defaults to [200] Payload payload Data to be used for POST and PUT calls Connection Timeout connectionTimeout Timeout for upstream connection. Operation timeout operationTimeout Timeout for actual operation. Username username Username to be used basic auth. This field is masked out on the UI. Password password Password to be used for basic auth. This field is masked on the UI. Authorization Header authHeader Data to be passed in HTTP Authorization header. This field is masked on the UI. Additional Headers headers Any other headers to be passed to the upstream in the HTTP calls. This is a map of Skip SSL Checks insecure Skip hostname and certification checks during SSL handshake with the upstream."},{"location":"tasks/specification.html#placement-policy-specification","title":"Placement Policy Specification","text":"

Placement policy governs how Drove deploys containers on the cluster. The following sections discuss the different placement policies available and how they can be configured to achieve optimal placement of containers.

Warning

All policies will work only at a {appName, version} combination level. They will not ensure constraints at an appName level. This means that for somethinge like a one per node placement, for the same appName, multiple containers can run on the same host if multiple deployments with different versions are active in a cluster. Same applies for all policies like N per host and so on.

Important details about executor tagging

"},{"location":"tasks/specification.html#any-placement","title":"Any Placement","text":"

Containers for a {appName, version} combination can run on any un-tagged executor host.

Name Option Description Policy Type type Put ANY as policy.

Sample:

{\n    \"type\" : \"ANY\"\n}\n

Tip

For most use-cases this is the placement policy to use.

"},{"location":"tasks/specification.html#one-per-host-placement","title":"One Per Host Placement","text":"

Ensures that only one container for a particular {appName, version} combination is running on an executor host at a time.

Name Option Description Policy Type type Put ONE_PER_HOST as policy.

Sample:

{\n    \"type\" : \"ONE_PER_HOST\"\n}\n

"},{"location":"tasks/specification.html#max-n-per-host-placement","title":"Max N Per Host Placement","text":"

Ensures that at most N containers for a {appName, version} combination is running on an executor host at a time.

Name Option Description Policy Type type Put MAX_N_PER_HOST as policy. Max count max The maximum num of containers that can run on an executor. Range: 1-64

Sample:

{\n    \"type\" : \"MAX_N_PER_HOST\",\n    \"max\": 3\n}\n

"},{"location":"tasks/specification.html#match-tag-placement","title":"Match Tag Placement","text":"

Ensures that containers for a {appName, version} combination are running on an executor host that has the tags as mentioned in the policy.

Name Option Description Policy Type type Put MATCH_TAG as policy. Max count tag The tag to match.

Sample:

{\n    \"type\" : \"MATCH_TAG\",\n    \"tag\": \"gpu_enabled\"\n}\n

"},{"location":"tasks/specification.html#no-tag-placement","title":"No Tag Placement","text":"

Ensures that containers for a {appName, version} combination are running on an executor host that has no tags.

Name Option Description Policy Type type Put NO_TAG as policy.

Sample:

{\n    \"type\" : \"NO_TAG\"\n}\n

Info

The NO_TAG policy is mostly for internal use, and does not need to be specified when deploying containers that do not need any special placement logic.

"},{"location":"tasks/specification.html#composite-policy-based-placement","title":"Composite Policy Based Placement","text":"

Composite policy can be used to combine policies together to create complicated placement requirements.

Name Option Description Policy Type type Put COMPOSITE as policy. Polices policies List of policies to combine Combiner combiner Can be AND and OR and signify all-match and any-match logic on the policies mentioned.

Sample:

{\n    \"type\" : \"COMPOSITE\",\n    \"policies\": [\n        {\n            \"type\": \"ONE_PER_HOST\"\n        },\n        {\n            \"type\": \"MATH_TAG\",\n            \"tag\": \"gpu_enabled\"\n        }\n    ],\n    \"combiner\" : \"AND\"\n}\n
The above policy will ensure that only one container of the relevant {appName,version} will run on GPU enabled machines.

Tip

It is easy to go into situations where no executors match complicated placement policies. Internally, we tend to keep things rather simple and use the ANY placement for most cases and maybe tags in a few places with over-provisioning or for hosts having special hardware

"},{"location":"tasks/specification.html#environment-variables","title":"Environment variables","text":"

This config can be used to inject custom environment variables to containers. The values are defined as part of deployment specification, are same across the cluster and immutable to modifications from inside the container (ie any overrides from inside the container will not be visible across the cluster).

Sample:

{\n    \"MY_VARIABLE_1\": \"fizz\",\n    \"MY_VARIABLE_2\": \"buzz\"\n}\n

The following environment variables are injected by Drove to all containers:

Variable Name Value HOST Hostname where the container is running. This is for marathon compatibility. PORT_PORT_NUMBER A variable for every port specified in exposedPorts section. The value is the actual port on the host, the specified port is mapped to. For example if ports 8080 and 8081 are specified, two variables called PORT_8080 and PORT_8081 will be injected. DROVE_EXECUTOR_HOST Hostname where container is running. DROVE_CONTAINER_ID Container that is deployed DROVE_APP_NAME App name as specified in the Application Specification DROVE_INSTANCE_ID Actual instance ID generated by Drove DROVE_APP_ID Application ID as generated by Drove DROVE_APP_INSTANCE_AUTH_TOKEN A JWT string generated by Drove that can be used by this container to call /apis/v1/internal/... apis.

Warning

Do not pass secrets using environment variables. These variables are all visible on the UI as is. Please use Configs to inject secrets files and so on.

"},{"location":"tasks/specification.html#command-line-arguments","title":"Command line arguments","text":"

A list of command line arguments that are sent to the container engine to execute inside the container. This is provides ways for you to configure your container behaviour based off such arguments. Please refer to docker documentation for details.

Danger

This might have security implications from a system point of view. As such Drove provides administrators a way to disable passing arguments at the cluster level by setting disableCmdlArgs to true in the controller configuration.

"},{"location":"tasks/specification.html#logging-specification","title":"Logging Specification","text":"

Can be used to configure how container logs are managed on the system.

Note

This section affects the docker log driver. Drove will continue to stream logs to it's own logger which can be configured at executor level through the executor configuration file.

"},{"location":"tasks/specification.html#local-logger-configuration","title":"Local Logger configuration","text":"

This is used to configure the json-file log driver.

Name Option Description Type type Set the value to LOCAL Max Size maxSize Maximum file size. Anything bigger than this will lead to rotation. Max Files maxFiles Maximum number of logs files to keep. Range: 1-100 Compress compress Enable log file compression.

Tip

If logging section is omitted, the following configuration is applied by default: - File size: 10m - Number of files: 3 - Compression: on

"},{"location":"tasks/specification.html#rsyslog-configuration","title":"Rsyslog configuration","text":"

In case suers want to stream logs to an rsyslog server, the logging configuration needs to be set to RSYSLOG mode.

Name Option Description Type type Set the value to RSYSLOG Server server URL for the rsyslog server. Tag Prefix tagPrefix Prefix to add at the start of a tag Tag Suffix tagSuffix Suffix to add at the en of a tag.

Note

The default tag is the DROVE_INSTANCE_ID. The tagPrefix and tagSuffix will to before and after this

"}]} \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml new file mode 100644 index 0000000..7e07226 --- /dev/null +++ b/sitemap.xml @@ -0,0 +1,143 @@ + + + + https://phonepe.github.io/drove-orchestrator/index.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/getting-started.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/apis/index.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/apis/application.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/apis/cluster.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/apis/logs.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/apis/task.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/applications/index.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/applications/instances.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/applications/operations.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/applications/outage.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/applications/specification.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/cluster/cluster.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/cluster/setup/controller.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/cluster/setup/executor-setup.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/cluster/setup/gateway.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/cluster/setup/maintenance.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/cluster/setup/planning.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/cluster/setup/prerequisites.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/cluster/setup/units.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/cluster/setup/zookeeper.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/extra/cli.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/extra/epoch.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/extra/libraries.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/extra/nvidia.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/tasks/index.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/tasks/operations.html + 2024-08-23 + daily + + + https://phonepe.github.io/drove-orchestrator/tasks/specification.html + 2024-08-23 + daily + + \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz new file mode 100644 index 0000000..d664313 Binary files /dev/null and b/sitemap.xml.gz differ diff --git a/stylesheets/extra.css b/stylesheets/extra.css new file mode 100644 index 0000000..ee2c783 --- /dev/null +++ b/stylesheets/extra.css @@ -0,0 +1,4 @@ +/* Dark mode */ +[data-md-color-scheme="slate"] img { + background-color: lightgray; +} \ No newline at end of file diff --git a/tasks/index.html b/tasks/index.html new file mode 100644 index 0000000..a9bc4c6 --- /dev/null +++ b/tasks/index.html @@ -0,0 +1,1631 @@ + + + + + + + + + + + + + + + + + + + + + + + + + Introduction - Drove Container Orchestrator + + + + + + + + + + + + + + 
Introduction

A task is a representation of transient containerized workloads on the cluster. A task instance is expected to have a much shorter lifetime than an application instance. Use tasks to spin up short-lived workloads such as automation scripts.

Primary differences with an application

Please note the following important differences between task instances and application instances:

  • Tasks cannot expose ports and virtual hosts for incoming traffic
  • There are no readiness checks, health checks or shutdown hooks for a task
  • Task instances cannot be scaled up or down
  • Tasks cannot be restarted
  • A task is typically owned by an application running on the cluster
  • Task instances are not replaced if the corresponding executor node goes down during execution
  • Unlike applications, there is no separate task and task instance; the only representation of a task on a Drove cluster is a task instance
Tip

Use epoch to spin up tasks in a periodic manner.

A task specification contains the following sections:

  • Source App Name - Name of the application that created this task
  • Task ID - User supplied task ID, unique in the same sourceAppName scope
  • Executable - The container to deploy on the cluster
  • Resources - CPU and memory required for the container
  • Placement Policy - How containers are to be placed in the cluster
  • Environment Variables - Environment variables and values
  • Volumes - Volumes to be mounted into the container
  • Configs - Configs/files to be mounted into the container
  • Logging details - Logging spec (for example, an rsyslog server)
  • Tags - A map of strings for additional metadata

Task ID

Identification of a task is a bit more involved on Drove. There is a Task ID of the form {sourceAppName}-{taskId} which is used internally by Drove; this is returned to the client when a task is created.

However, clients are supposed to use the {sourceAppName, taskId} combination they sent in the task spec to address their tasks and send commands to them.
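For example, using the sample values that appear on the operations page, a spec submitted with the identifiers below:

```
"sourceAppName": "TEST_APP",
"taskId": "T0012"
```

is tracked internally (and returned by the create API) as the Task ID TEST_APP-T0012, while operations such as kill still address the task as the pair {TEST_APP, T0012}.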

Task States and operations

Tasks on Drove have their own lifecycle, modelled as a state machine. State transitions can be triggered by issuing operations using the APIs.

States

Tasks on a Drove cluster can be in one of the following states:

  • PENDING - Task has been submitted, yet to be provisioned
  • PROVISIONING - Task is assigned to an executor and the docker image is being downloaded
  • PROVISIONING_FAILED - Docker image download failed
  • STARTING - Docker run is starting
  • RUNNING - Task is currently running
  • RUN_COMPLETED - Task run has completed. Whether it passed or failed needs to be checked from the task result
  • DEPROVISIONING - Docker image cleanup underway
  • STOPPED - Task cleanup completed. This is a terminal state
  • LOST - Task disappeared while the executor was down
  • UNKNOWN - All running tasks are put in this state when the executor has been restarted and startup recovery has not kicked in yet

Operations

The following task operations are recognized by Drove:

  • CREATE - Create a task. The task definition/spec is provided as an argument to this operation.
  • KILL - Kill a task. The task ID is taken as a parameter.
Tip

All operations need a Cluster Operation Spec, which can be used to control the timeout and parallelism of the tasks generated by the operation.

Task State Machine

The following state machine shows the states and the transitions triggered by cluster state changes and issued operations.

Task State Machine

Task Operations

This page discusses operations relevant to task management. Please go over the Task State Machine to understand the different states a task can be in and how applied operations (and external changes) move a task from one state to another.

Note

Please go through Cluster Op Spec to understand the operation parameters being sent. For tasks, only the timeout parameter is relevant.

Note

Only one operation can be active at a time on a particular task, identified by its {sourceAppName, taskId}.

Warning

Only the leader controller will accept and process operations. To avoid confusion, use the controller endpoint exposed by Drove Gateway to issue commands.

Cluster Operation Specification

When an operation is submitted to the cluster, a cluster op spec needs to be specified. It controls different aspects of the operation, such as its parallelism and timeout.

The following aspects of an operation can be configured:

| Name             | Option          | Description                                                               |
|------------------|-----------------|---------------------------------------------------------------------------|
| Timeout          | timeout         | The duration after which Drove considers the operation to have timed out. |
| Parallelism      | parallelism     | Parallelism of the task. (Range: 1-32)                                    |
| Failure Strategy | failureStrategy | Set this to STOP.                                                         |
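For reference, the op spec used in the samples on this page looks like this:

```
{
    "timeout": "5m",
    "parallelism": 1,
    "failureStrategy": "STOP"
}
```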
Note

For internal recovery operations, Drove generates its own operations. For these, Drove applies the following cluster operation spec:

  • timeout - 300 seconds
  • parallelism - 1
  • failureStrategy - STOP

The default operation spec can be configured in the controller configuration file. It is recommended to set the parallelism there to something like 8 for faster recovery.

How to initiate an operation

Tip

Use the Drove CLI to perform all manual operations.

All operations for task lifecycle management need to be issued via a POST HTTP call to the leader controller endpoint on the path /apis/v1/tasks/operations. The API will return HTTP 200/OK and a relevant JSON response as payload.

Sample API call:
```
curl --location 'http://drove.local:7000/apis/v1/tasks/operations' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
--data '{
    "type": "KILL",
    "sourceAppName" : "TEST_APP",
    "taskId" : "T0012",
    "opSpec": {
        "timeout": "5m",
        "parallelism": 1,
        "failureStrategy": "STOP"
    }
}'
```

Note

In the above example, http://drove.local:7000 is the endpoint of the leader, TEST_APP is the name of the application that started the task, and T0012 is a unique, client-generated task ID. Authorization is basic auth.

Warning

Task operations are not cancellable.

Create a task

A task can be created by issuing the following command.

Preconditions:
- A task with the same {sourceAppName, taskId} should not exist on the cluster.

State Transition:

  • none → PENDING → PROVISIONING → STARTING → RUNNING → RUN_COMPLETED → DEPROVISIONING → STOPPED

To create a task, a Task Spec needs to be created first.

Once ready, the following CLI command needs to be issued, or the payload below sent:

Drove CLI
```
drove -c local tasks create sample/test_task.json
```

Sample Request Payload
```
{
    "type": "CREATE",
    "spec": {...}, //(1)!
    "opSpec": { //(2)!
        "timeout": "5m",
        "parallelism": 1,
        "failureStrategy": "STOP"
    }
}
```

  1. Spec as mentioned in Task Specification
  2. Operation spec as mentioned in Cluster Op Spec

Sample response
```
{
    "status": "SUCCESS",
    "data": {
        "taskId": "TEST_APP-T0012"
    },
    "message": "success"
}
```

Warning

There are no separate create/run steps in a task. Creation will start execution automatically and immediately.

Kill a task

A task can be killed by issuing the following command.

Preconditions:
- A task with the same {sourceAppName, taskId} needs to exist on the cluster.

State Transition:

  • RUNNING → RUN_COMPLETED → DEPROVISIONING → STOPPED

The following CLI command needs to be issued, or the payload below sent:

Drove CLI
```
drove -c local tasks kill TEST_APP T0012
```

Sample Request Payload
```
{
    "type": "KILL",
    "sourceAppName" : "TEST_APP",//(1)!
    "taskId" : "T0012",//(2)!
    "opSpec": {//(3)!
        "timeout": "5m",
        "parallelism": 1,
        "failureStrategy": "STOP"
    }
}
```

  1. Source app name as mentioned in the spec during task creation
  2. Task ID as mentioned in the spec
  3. Operation spec as mentioned in Cluster Op Spec

Sample response
```
{
    "status": "SUCCESS",
    "data": {
        "taskId": "T0012"
    },
    "message": "success"
}
```

Note

Task metadata will remain on the cluster for some time. Metadata cleanup for tasks is automatic and can be configured in the controller configuration.

Task Specification

A task is defined using JSON. We use a sample configuration below to explain the options.

Sample Task Definition
```
{
    "sourceAppName": "TEST_APP",//(1)!
    "taskId": "T0012",//(2)!
    "executable": {//(3)!
        "type": "DOCKER", // (4)!
        "url": "ghcr.io/appform-io/test-task",//(5)!
        "dockerPullTimeout": "100 seconds"//(6)!
    },
    "resources": [//(7)!
        {
            "type": "CPU",
            "count": 1//(8)!
        },
        {
            "type": "MEMORY",
            "sizeInMB": 128//(9)!
        }
    ],
    "volumes": [//(10)!
        {
            "pathInContainer": "/data",//(11)!
            "pathOnHost": "/mnt/datavol",//(12)!
            "mode" : "READ_WRITE"//(13)!
        }
    ],
    "configs" : [//(14)!
        {
            "type" : "INLINE",//(15)!
            "localFilename": "/testfiles/drove.txt",//(16)!
            "data" : "RHJvdmUgdGVzdA=="//(17)!
        }
    ],
    "placementPolicy": {//(18)!
        "type": "ANY"//(19)!
    },
    "env": {//(20)!
        "CORES": "8"
    },
    "args" : [], //(27)!
    "tags": { //(21)!
        "superSpecialApp": "yes_i_am",
        "say_my_name": "heisenberg"
    },
    "logging": {//(22)!
        "type": "LOCAL",//(23)!
        "maxSize": "100m",//(24)!
        "maxFiles": 3,//(25)!
        "compress": true//(26)!
    }
}
```

  1. Name of the application that has started the task. Make sure this is a valid application on the cluster.
  2. A unique ID for this task. Uniqueness is up to the user; Drove will scope it within the sourceAppName namespace.
  3. Coordinates for the executable. Refer to Executable Specification for details.
  4. Right now the only type supported is DOCKER.
  5. Docker container address.
  6. Timeout for the container pull.
  7. List of resources required to run this task. Check Resource Requirements Specification for more details.
  8. Number of CPU cores to be allocated.
  9. Amount of memory to be allocated, expressed in megabytes.
  10. Volumes to be mounted. Refer to Volume Specification for details.
  11. Path that will be visible inside the container for this mount.
  12. Actual path on the host machine for the mount.
  13. Mount mode can be READ_WRITE or READ_ONLY.
  14. Configuration to be injected as a file inside the container. Please refer to Config Specification for details.
  15. Type of config. Can be INLINE, EXECUTOR_LOCAL_FILE, CONTROLLER_HTTP_FETCH and EXECUTOR_HTTP_FETCH. Specifies how Drove will get the contents to be injected.
  16. File name for the config inside the container.
  17. Serialized form of the data; this and other parameters will vary according to the type specified above.
  18. Specifies how the container will be placed on the cluster. Check Placement Policy for details.
  19. Type of placement. Can be ANY, ONE_PER_HOST, MAX_N_PER_HOST, MATCH_TAG, NO_TAG, RULE_BASED and COMPOSITE. The rest of the parameters in this section depend on the type.
  20. Custom environment variables. Additional variables are injected by Drove as well. See the Environment Variables section for details.
  21. Key-value metadata that can be used in external systems.
  22. Specifies how docker log files are configured. Refer to Logging Specification.
  23. Log to local file.
  24. Maximum file size.
  25. Number of latest log files to retain.
  26. Log files will be compressed.
  27. List of command line arguments. See Command Line Arguments for details.
Warning

Please make sure sourceAppName is set to a correct application name, as specified in the name parameter of a running application on the cluster.

If this is not done, stale task metadata will not be cleaned up and your metadata store performance will get affected over time.

Executable Specification

Right now Drove supports only docker containers. However, both docker and podman are supported as engines. Drove executors will fetch the executable directly from the registry, based on the configuration provided.

| Name    | Option            | Description                    |
|---------|-------------------|--------------------------------|
| Type    | type              | Set type to DOCKER.            |
| URL     | url               | Docker container URL.          |
| Timeout | dockerPullTimeout | Timeout for docker image pull. |
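For instance, the executable section from the sample task definition above, shown on its own:

```
{
    "type": "DOCKER",
    "url": "ghcr.io/appform-io/test-task",
    "dockerPullTimeout": "100 seconds"
}
```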
Note

Drove supports docker registry authentication. This can be configured in the executor configuration file.

Resource Requirements Specification

This section specifies the hardware resources required to run the container. Right now, only CPU and MEMORY are supported as resource types that can be reserved for a container.

CPU Requirements

Specifies the number of cores to be assigned to the container.

| Name  | Option | Description                     |
|-------|--------|---------------------------------|
| Type  | type   | Set type to CPU for this.       |
| Count | count  | Number of cores to be assigned. |

Memory Requirements

Specifies the amount of memory to be allocated to a container.

| Name | Option   | Description                                      |
|------|----------|--------------------------------------------------|
| Type | type     | Set type to MEMORY for this.                     |
| Size | sizeInMB | Amount of memory (in megabytes) to be allocated. |

Sample

```
[
    {
        "type": "CPU",
        "count": 1
    },
    {
        "type": "MEMORY",
        "sizeInMB": 128
    }
]
```

Note

Both CPU and MEMORY configurations are mandatory.

Volume Specification

Files and directories can be mounted from the executor host into the container. The volumes section contains a list of volumes that need to be mounted.

| Name              | Option          | Description                                                                                               |
|-------------------|-----------------|-----------------------------------------------------------------------------------------------------------|
| Path In Container | pathInContainer | Path that will be visible inside the container for this mount.                                            |
| Path On Host      | pathOnHost      | Actual path on the host machine for the mount.                                                            |
| Mount Mode        | mode            | Mount mode can be READ_WRITE or READ_ONLY, allowing the containerized process to write to or only read from the volume. |
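For instance, the volumes section from the sample task definition above mounts a single read-write volume:

```
[
    {
        "pathInContainer": "/data",
        "pathOnHost": "/mnt/datavol",
        "mode" : "READ_WRITE"
    }
]
```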
Info

We do not support mounting remote volumes as of now.

Config Specification

Drove supports injection of configuration files into containers. The specifications for the same are discussed below.

Inline config

Inline configuration can be added in the task specification itself. This will manifest as a file inside the container.

The following details are needed for this:

| Name           | Option        | Description                                                                      |
|----------------|---------------|----------------------------------------------------------------------------------|
| Type           | type          | Set the value to INLINE                                                          |
| Local Filename | localFilename | File name for the config inside the container.                                   |
| Data           | data          | Base64 encoded string for the data. The value for this will be masked on the UI. |

Config file:

```
port: 8080
logLevel: DEBUG
```

Corresponding config specification:

```
{
    "type" : "INLINE",
    "localFilename" : "/config/service.yml",
    "data" : "cG9ydDogODA4MApsb2dMZXZlbDogREVCVUcK"
}
```

Warning

The full base64 encoded config data will get stored in Drove ZK and will be pushed to executors inline. It is not recommended to stream large config files to containers using this method. This will probably need additional configuration on your ZK cluster.

Locally loaded config

The config file is loaded directly from a path on the executor. Such files can be distributed to the executor host using existing configuration management systems such as OpenTofu, Salt etc.

The following details are needed for this:

| Name           | Option         | Description                                    |
|----------------|----------------|------------------------------------------------|
| Type           | type           | Set the value to EXECUTOR_LOCAL_FILE           |
| Local Filename | localFilename  | File name for the config inside the container. |
| File path      | filePathOnHost | Path to the config file on the executor host.  |

Sample config specification:

```
{
    "type" : "EXECUTOR_LOCAL_FILE",
    "localFilename" : "/config/service.yml",
    "filePathOnHost" : "/mnt/configs/myservice/config.yml"
}
```

Controller fetched Config

The config file can be fetched from a remote server by the controller. Once fetched, it will be streamed to the executor as part of the instance specification for starting a container.

The following details are needed for this:

| Name              | Option        | Description                                                                     |
|-------------------|---------------|---------------------------------------------------------------------------------|
| Type              | type          | Set the value to CONTROLLER_HTTP_FETCH                                          |
| Local Filename    | localFilename | File name for the config inside the container.                                  |
| HTTP Call Details | http          | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

```
{
    "type" : "CONTROLLER_HTTP_FETCH",
    "localFilename" : "/config/service.yml",
    "http" : {
        "protocol" : "HTTP",
        "hostname" : "configserver.internal.yourdomain.net",
        "port" : 8080,
        "path" : "/configs/myapp",
        "username" : "appuser",
        "password" : "secretpassword"
    }
}
```

Note

The controller will make an API call every time it asks an executor to spin up a container. Please make sure to account for this in your configuration management system.

Executor fetched Config

The config file can be fetched from a remote server by the executor before spinning up a container. Once fetched, the payload will be injected as a config file into the container.

The following details are needed for this:

| Name              | Option        | Description                                                                     |
|-------------------|---------------|---------------------------------------------------------------------------------|
| Type              | type          | Set the value to EXECUTOR_HTTP_FETCH                                            |
| Local Filename    | localFilename | File name for the config inside the container.                                  |
| HTTP Call Details | http          | HTTP call related details. Please refer to HTTP Call Specification for details. |

Sample config specification:

```
{
    "type" : "EXECUTOR_HTTP_FETCH",
    "localFilename" : "/config/service.yml",
    "http" : {
        "protocol" : "HTTP",
        "hostname" : "configserver.internal.yourdomain.net",
        "port" : 8080,
        "path" : "/configs/myapp",
        "username" : "appuser",
        "password" : "secretpassword"
    }
}
```

Note

Every executor will make an API call each time it spins up a container for this task. Please make sure to account for this in your configuration management system.

HTTP Call Specification

This section details the options that can be set when making HTTP calls to a configuration management system from controllers or executors.

The following options are available for the HTTP call:

| Name                 | Option            | Description                                                                                                |
|----------------------|-------------------|------------------------------------------------------------------------------------------------------------|
| Protocol             | protocol          | Protocol to use for the upstream call. Can be HTTP or HTTPS.                                               |
| Hostname             | hostname          | Host to call.                                                                                              |
| Port                 | port              | Provide a custom port. Defaults to 80 for HTTP and 443 for HTTPS.                                          |
| API Path             | path              | Path component of the URL. Include query parameters here. Defaults to /                                    |
| HTTP Method          | verb              | Type of call, use GET, POST or PUT. Defaults to GET.                                                       |
| Success Codes        | successCodes      | List of HTTP status codes which are considered as success. Defaults to [200]                               |
| Payload              | payload           | Data to be used for POST and PUT calls.                                                                    |
| Connection Timeout   | connectionTimeout | Timeout for the upstream connection.                                                                       |
| Operation Timeout    | operationTimeout  | Timeout for the actual operation.                                                                          |
| Username             | username          | Username to be used for basic auth. This field is masked out on the UI.                                    |
| Password             | password          | Password to be used for basic auth. This field is masked on the UI.                                        |
| Authorization Header | authHeader        | Data to be passed in the HTTP Authorization header. This field is masked on the UI.                        |
| Additional Headers   | headers           | Any other headers to be passed to the upstream in the HTTP calls. This is a map of header names to values. |
| Skip SSL Checks      | insecure          | Skip hostname and certificate checks during the SSL handshake with the upstream.                           |
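As a sketch, a more complete http section than the ones in the samples above might look like the following; the hostname, path, header and timeout values here are illustrative placeholders, not defaults:

```
{
    "protocol" : "HTTPS",
    "hostname" : "configserver.internal.yourdomain.net",
    "port" : 8443,
    "path" : "/configs/myapp?env=prod",
    "verb" : "GET",
    "successCodes" : [200],
    "connectionTimeout" : "5 seconds",
    "operationTimeout" : "10 seconds",
    "authHeader" : "my-secret-token",
    "headers" : {
        "X-Environment" : "prod"
    },
    "insecure" : false
}
```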

Placement Policy Specification

Placement policy governs how Drove deploys containers on the cluster. The following sections discuss the different placement policies available and how they can be configured to achieve optimal placement of containers.

Warning

All policies work only at an {appName, version} combination level; they do not enforce constraints at the appName level. This means that for something like one-per-node placement, multiple containers for the same appName can run on the same host if multiple deployments with different versions are active in the cluster. The same applies to all policies, such as N per host and so on.

Important details about executor tagging

  • All hosts have at least one tag: their own hostname.
  • The MATCH_TAG policy considers these valid tags, so they can be used to place containers on specific hosts if needed (see the sketch after this list).
  • All other policy types handle the hostname tag specially and consider executors having only the hostname tag as untagged.
  • A host with a tag (other than the hostname) will not have any containers running on it unless containers are placed there specifically using the MATCH_TAG policy.
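For example, since every host carries its own hostname as a tag, a container could be pinned to a specific host (the hostname below is hypothetical) with:

```
{
    "type" : "MATCH_TAG",
    "tag" : "executor001.internal"
}
```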
Any Placement

Containers for an {appName, version} combination can run on any un-tagged executor host.

| Name        | Option | Description        |
|-------------|--------|--------------------|
| Policy Type | type   | Put ANY as policy. |

Sample:

```
{
    "type" : "ANY"
}
```

Tip

For most use-cases, this is the placement policy to use.

One Per Host Placement

Ensures that only one container for a particular {appName, version} combination is running on an executor host at a time.

| Name        | Option | Description                 |
|-------------|--------|-----------------------------|
| Policy Type | type   | Put ONE_PER_HOST as policy. |

Sample:

```
{
    "type" : "ONE_PER_HOST"
}
```

Max N Per Host Placement

Ensures that at most N containers for an {appName, version} combination are running on an executor host at a time.

| Name        | Option | Description                                                                |
|-------------|--------|-----------------------------------------------------------------------------|
| Policy Type | type   | Put MAX_N_PER_HOST as policy.                                              |
| Max count   | max    | The maximum number of containers that can run on an executor. Range: 1-64  |

Sample:

```
{
    "type" : "MAX_N_PER_HOST",
    "max": 3
}
```

Match Tag Placement

Ensures that containers for an {appName, version} combination run only on executor hosts that have the tag mentioned in the policy.

| Name        | Option | Description              |
|-------------|--------|--------------------------|
| Policy Type | type   | Put MATCH_TAG as policy. |
| Tag         | tag    | The tag to match.        |

Sample:

```
{
    "type" : "MATCH_TAG",
    "tag": "gpu_enabled"
}
```

No Tag Placement

Ensures that containers for an {appName, version} combination run only on executor hosts that have no tags.

| Name        | Option | Description           |
|-------------|--------|-----------------------|
| Policy Type | type   | Put NO_TAG as policy. |

Sample:

```
{
    "type" : "NO_TAG"
}
```

Info

The NO_TAG policy is mostly for internal use and does not need to be specified when deploying containers that do not need any special placement logic.

Composite Policy Based Placement

Composite policy can be used to combine multiple policies to create complicated placement requirements.

| Name        | Option   | Description                                                                          |
|-------------|----------|----------------------------------------------------------------------------------------|
| Policy Type | type     | Put COMPOSITE as policy.                                                             |
| Policies    | policies | List of policies to combine.                                                         |
| Combiner    | combiner | Can be AND or OR, signifying all-match and any-match logic over the listed policies. |

Sample:

```
{
    "type" : "COMPOSITE",
    "policies": [
        {
            "type": "ONE_PER_HOST"
        },
        {
            "type": "MATCH_TAG",
            "tag": "gpu_enabled"
        }
    ],
    "combiner" : "AND"
}
```

The above policy will ensure that at most one container of the relevant {appName, version} runs on each GPU-enabled machine.

Tip

It is easy to get into situations where no executors match complicated placement policies. Internally, we tend to keep things rather simple and use ANY placement for most cases, with tags in a few places for over-provisioned hosts or hosts having special hardware 🙂

Environment variables

This config can be used to inject custom environment variables into containers. The values are defined as part of the deployment specification, are the same across the cluster, and are immune to modifications from inside the container (i.e. any overrides from inside the container will not be visible across the cluster).

Sample:

```
{
    "MY_VARIABLE_1": "fizz",
    "MY_VARIABLE_2": "buzz"
}
```

The following environment variables are injected by Drove into all containers:

| Variable Name                 | Value                                                                         |
|-------------------------------|-------------------------------------------------------------------------------|
| HOST                          | Hostname where the container is running. This is for marathon compatibility. |
| PORT_{PORT_NUMBER}            | A variable for every port specified in the exposedPorts section. The value is the actual host port the specified port is mapped to. For example, if ports 8080 and 8081 are specified, two variables called PORT_8080 and PORT_8081 will be injected. |
| DROVE_EXECUTOR_HOST           | Hostname where the container is running.                                      |
| DROVE_CONTAINER_ID            | ID of the deployed container.                                                 |
| DROVE_APP_NAME                | App name as specified in the Application Specification.                       |
| DROVE_INSTANCE_ID             | Actual instance ID generated by Drove.                                        |
| DROVE_APP_ID                  | Application ID as generated by Drove.                                         |
| DROVE_APP_INSTANCE_AUTH_TOKEN | A JWT string generated by Drove that can be used by this container to call /apis/v1/internal/... APIs. |

Warning

Do not pass secrets using environment variables. These variables are all visible on the UI as is. Please use Configs to inject secret files and so on.

Command line arguments

A list of command line arguments that are sent to the container engine for execution inside the container. This provides a way for you to configure your container's behaviour based on such arguments; a short sketch follows. Please refer to the docker documentation for details.
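As a minimal sketch, an args section might look like the following; the argument values are hypothetical and depend entirely on what the container's entrypoint understands:

```
"args" : ["--mode", "cleanup", "--dry-run"]
```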

Danger

This might have security implications from a system point of view. As such, Drove provides administrators a way to disable the passing of arguments at the cluster level by setting disableCmdlArgs to true in the controller configuration.

Logging Specification

Can be used to configure how container logs are managed on the system.

Note

This section affects the docker log driver. Drove will continue to stream logs to its own logger, which can be configured at the executor level through the executor configuration file.

Local Logger configuration

This is used to configure the json-file log driver.

| Name      | Option   | Description                                                         |
|-----------|----------|---------------------------------------------------------------------|
| Type      | type     | Set the value to LOCAL                                              |
| Max Size  | maxSize  | Maximum file size. Anything bigger than this will lead to rotation. |
| Max Files | maxFiles | Maximum number of log files to keep. Range: 1-100                   |
| Compress  | compress | Enable log file compression.                                        |
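For instance, the logging section from the sample task definition above:

```
{
    "type": "LOCAL",
    "maxSize": "100m",
    "maxFiles": 3,
    "compress": true
}
```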
Tip

If the logging section is omitted, the following configuration is applied by default:
- File size: 10m
- Number of files: 3
- Compression: on

Rsyslog configuration

In case users want to stream logs to an rsyslog server, the logging configuration needs to be set to RSYSLOG mode.

| Name       | Option    | Description                            |
|------------|-----------|----------------------------------------|
| Type       | type      | Set the value to RSYSLOG               |
| Server     | server    | URL for the rsyslog server.            |
| Tag Prefix | tagPrefix | Prefix to add at the start of the tag. |
| Tag Suffix | tagSuffix | Suffix to add at the end of the tag.   |
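A minimal sketch of an RSYSLOG logging section; the server URL and the tag affixes below are placeholders, not defaults:

```
{
    "type": "RSYSLOG",
    "server": "tcp://rsyslog.internal.yourdomain.net:514",
    "tagPrefix": "drove-",
    "tagSuffix": "-prod"
}
```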
Note

The default tag is the DROVE_INSTANCE_ID. The tagPrefix and tagSuffix go before and after this tag, respectively.