Optimize inference system AKS cluster #204

Open

micya opened this issue Sep 29, 2024 · 1 comment
Labels: Azure (Issues relating to Azure infrastructure or deployment to Azure), inference system (Code to perform inference with the trained model(s))

Comments

micya commented Sep 29, 2024

The inference system currently runs on an AKS cluster with 3 Standard B4ms (4 vCPUs, 16 GiB memory) VMs. Optimize the usage:

  1. Reduce the pod resource requests if the current allocations are not fully used
  2. Adjust the VM SKU to match the actual resource usage more closely
  3. Can we use spot instances? Spot VMs are cheaper, though at least one non-spot VM must remain running for cluster health (see the sketch after this list). Relevant doc: https://learn.microsoft.com/en-us/azure/aks/spot-node-pool
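
For reference, a minimal sketch of adding a spot node pool with the Azure CLI, following the doc linked above. The resource group and cluster names are placeholders, and the node count is illustrative, not a sizing recommendation:

# Hypothetical example: add a spot node pool alongside the existing
# non-spot pool; <resource-group> and <cluster-name> are placeholders.
$ az aks nodepool add \
    --resource-group <resource-group> \
    --cluster-name <cluster-name> \
    --name spotpool \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --node-count 2

Per the doc, --spot-max-price -1 means the pool pays up to the current on-demand price rather than being capped, so nodes are never evicted for price reasons alone.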
micya added the Azure and inference system labels on Sep 29, 2024
micya commented Sep 30, 2024

Node resource usage:

$ kubectl top node
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-agentpool-41025176-vmss00000q   2368m        61%    5595Mi          41%       
aks-agentpool-41025176-vmss00000y   1983m        51%    3987Mi          29%       
aks-agentpool-41025176-vmss000011   2707m        70%    6567Mi          48%       

Inference system pod resource usage:

$ kubectl top pod -A | grep inference
NAMESPACE        NAME                                  CPU(cores)   MEMORY(bytes)   
bush-point       inference-system-d8d67c775-2lvsp      817m         1873Mi          
mast-center      inference-system-6cc784cb6c-k4flr     835m         1705Mi          
north-sjc        inference-system-5485f96798-r6mpk     998m         2606Mi          
orcasound-lab    inference-system-76c476fdc7-9pzh9     823m         2167Mi          
point-robinson   inference-system-7f78758bdd-xf7cw     24m          1470Mi          
port-townsend    inference-system-6f84d95d79-qtk64     980m         1393Mi          
sunset-bay       inference-system-66bb79b8c7-pjzcv     980m         1439Mi  

Based on the above, estimated resource usage is 7 CPU + 14 GB RAM, plus cluster services. Cluster services look like roughly 0.8 CPU + 3 GB RAM per node. I don't remember if this is the same behavior if we use a user node pool.
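
To double-check the per-node overhead from cluster services, one option (assuming nothing beyond standard kubectl) is to compare a node's allocated requests against the pod usage above:

# Show the allocated-resources summary for one of the nodes listed above
$ kubectl describe node aks-agentpool-41025176-vmss00000q | grep -A 7 "Allocated resources"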

Recommend:

  1. Change the pod request to 1 CPU + 1.5 GiB RAM, with a limit of 1 CPU + 2.5 GiB RAM (see the sketch below).
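
A minimal sketch of applying that change with kubectl, assuming the deployment is named inference-system in each namespace (matching the pod names above); for a permanent change this belongs in the deployment manifest rather than a one-off command:

# Hypothetical one-off example for a single namespace; repeat per namespace
# or update the manifests for a durable change.
$ kubectl set resources deployment inference-system -n bush-point \
    --requests=cpu=1,memory=1.5Gi \
    --limits=cpu=1,memory=2.5Gi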

Potential follow ups:

  1. If we continue to use the system node pool, change the VM SKU to a compute-optimized F-tier. But F-tier VMs seem more expensive than B-tier for the same number of CPUs.
  2. If we select a low SKU for the system pool and a different SKU for the user pool, but do not need to run cluster services per node, what would be the expected pricing? See the sketch after this list.
  3. Can we guarantee spot availability?
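
For follow-up 2, a rough sketch of splitting the cluster into a small system pool plus a separate user pool. The pool names, SKU, and node count are placeholders for illustration, not a sizing recommendation:

# Hypothetical example: keep the existing pool as the system pool and add
# a compute-optimized user pool for the inference workload.
$ az aks nodepool update \
    --resource-group <resource-group> \
    --cluster-name <cluster-name> \
    --name agentpool \
    --mode System
$ az aks nodepool add \
    --resource-group <resource-group> \
    --cluster-name <cluster-name> \
    --name inferpool \
    --mode User \
    --node-vm-size Standard_F4s_v2 \
    --node-count 2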
