Proposal

Add a policy which allows listing a summary of allocations, or a topology-tailored resource, so that the topology view can be filled with relevant data without exposing all job properties. For instance, the topology view only needs a subset of the allocation information (RAM, CPU, node, maybe job name and namespace – in the best case this could even be "redacted") and basic information about nodes to be meaningful.
Use-cases
Our users want to know the current resource allocation across nodes (ideally they would even see used GPUs, but that's a separate issue).
However, they only get access to a limited number of namespaces which results in an incomplete view. For some users, the cluster looks almost empty when they navigate to http://localhost:4646/ui/topology, making the view useless at best but potentially even misleading.
The main reason is that they want to have an overview of all the resources to get a better idea why their allocations could not be scheduled and how much is currently in use to make better decisions (e.g., attempt to schedule on a different node type or try with less RAM, etc.).
Attempted Solutions
To view http://localhost:4646/ui/topology, one needs the following minimal permissions (topology-read.policy.hcl in the example below):
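The policy file itself was not preserved in this copy of the issue; the following is a sketch of what such a topology-read.policy.hcl plausibly contained, reconstructed from the API endpoints discussed later in the thread (the exact rules in the original may have differed):

```hcl
# topology-read.policy.hcl -- reconstruction; the original file was not
# preserved. Rule names follow the Nomad ACL policy syntax.

# read-job is what lets /v1/allocations list allocations
namespace "*" {
  capabilities = ["read-job"]
}

# needed so /v1/nodes can list nodes
node {
  policy = "read"
}

# needed so /v1/node/pools can list node pools (Nomad 1.6+)
node_pool "*" {
  policy = "read"
}
```

Applied with `nomad acl policy apply` and attached to a token, something along these lines is enough to render the topology view – but the wildcard namespace block is exactly the part that over-grants.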
You can change topology-read.policy.hcl and re-apply it to try out other combinations, but in general I found that these are the minimal capabilities required to read the topology.
However, granting read-job capabilities across all namespaces is exactly what we do not want to do: some projects are confidential, and their job specs should not leak (they might contain secrets or other confidential data – despite our best efforts to educate about and promote Vault secrets, template stanzas, etc.).
Alternative Solution
We are currently also considering setting up a service which gets the topology-read policy and essentially filters the JSON output, so that we can provide our own visualization to our users.
Since we already run an additional service to simplify scheduling of common jobs and to fetch Nomad job logs via OpenSearch, this could be an option, although it would require yet another service to be set up and maintained (that service works with the Nomad token for API access and does not have any "service" permissions in the background).
Hi @shoeffner, and thanks for raising this issue and including such great detail. To start, I wanted to detail the API endpoints that the topology page uses, some of which serve the filtering options:
- `/v1/allocations` – lists all allocations within the target region
- `/v1/nodes` – lists all nodes within the target region
- `/v1/node/pools` – lists all node pools in the target region and provides filtering and info panels
- `/v1/namespaces` – lists all namespaces in the target region and provides filtering and info panels
- `/v1/job/{jobID}` – used when highlighting an allocation on a node, to see more detail about it
Importantly, the read-job namespace capability provides access to the allocation list API endpoint, which is how we calculate the resource consumption (allocated resources) on each node.
> Add a policy which allows to list a summary of allocations or a topology-tailored resource so that the topology view can be filled with relevant data without exposing all job properties.
Adding such a policy or capability would likely also require new API and RPC endpoints. My initial feeling is that this is not something we would want to allow, as it would open the door to similar future requests to expose certain job specification fields via custom capabilities outside read-job.
I think your alternative solution makes sense here. Another option would be to use the telemetry that Nomad emits around client resource allocation within a tool such as Prometheus/Grafana. This could be as simple as showing nomad.client.unallocated.cpu and nomad.client.unallocated.memory for each client. The jq query below shows how this can be pulled out manually to check:
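The jq snippet itself was not preserved in this copy of the thread. The following is a sketch of the kind of query meant here; the gauge names match the telemetry metrics mentioned above, but the sample payload is illustrative – on a live cluster you would pipe `curl -s http://localhost:4646/v1/metrics` into the filter instead of `echo`:

```shell
# Illustrative sample shaped like a /v1/metrics response; values are made up.
echo '{"Gauges":[
  {"Name":"nomad.client.unallocated.cpu","Labels":{"node_id":"abc123"},"Value":2400},
  {"Name":"nomad.client.unallocated.memory","Labels":{"node_id":"abc123"},"Value":1024},
  {"Name":"nomad.client.allocs.running","Labels":{"node_id":"abc123"},"Value":3}
]}' |
# Keep only the unallocated CPU/memory gauges and reshape them per node.
jq -c '.Gauges[]
  | select(.Name | startswith("nomad.client.unallocated"))
  | {name: .Name, node: .Labels.node_id, value: .Value}'
```

This prints one compact JSON object per client gauge, which is enough to chart free CPU/memory per node without any job-read permissions.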
Thanks for your thoughtful reply, @jrasell. I had not thought about the metrics; that makes sense and would not require any permissions to read jobs, so that's a good idea and we'll have a look at it, thanks! It's likely easier to implement a consumer for those metrics than to put a service layer in the middle.
I understand and agree that this feature could potentially open up more requests for fine-grained APIs and permissions, so we will go with the metrics endpoint.
Thanks for taking the time to review this and your response!