GPU Comparisons

Contents

  1. My lab's computing setup; link
  2. Cloud GPU and whole-machine price comparison and notes; link
  3. Free compute available from companies (free computing resources that professors can apply for); link
  4. Useful learning material on GPUs and setting up your clusters (how to build a compute cluster); link

Computing Setup of My Lab

In my lab, the Precognition Lab, using the start-up funds provided by the university, I have built 9 stand-alone machines equipped with a total of 32 RTX 3090/4090 GPUs (including 4-GPU and some 2-GPU machines). Additionally, I have established a cluster with 3 compute nodes, comprising a total of 24 RTX A6000 GPUs, and a 100TB NAS.

The rationale behind this hybrid setup is twofold: the stand-alone machines cost only 40% of what the cluster does, and they can be acquired quickly without necessitating additional machine room space.

As for the cluster, I've found that 100 Gb Ethernet (100GbE) suffices for the computing network, eliminating the need to invest in an InfiniBand switch, which can cost two to three times more. With 3 nodes on this network, I can achieve essentially linear scaling in multi-node training (for example, a job that takes 6 hours on 1 node takes about 2 hours on 3 nodes).
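To make the "linear scaling" claim concrete, here is a minimal sketch that turns the two timings quoted above into a speedup and a scaling efficiency. The function name `scaling_efficiency` is just an illustrative helper, not part of any library.

```python
def scaling_efficiency(single_node_hours: float, multi_node_hours: float, num_nodes: int) -> tuple:
    """Return (speedup, efficiency) of a multi-node run versus a 1-node baseline."""
    speedup = single_node_hours / multi_node_hours  # how many times faster the job finishes
    efficiency = speedup / num_nodes                # 1.0 means perfectly linear scaling
    return speedup, efficiency


# Timings from the text: 6 hours on 1 node, 2 hours on 3 nodes.
speedup, efficiency = scaling_efficiency(6.0, 2.0, 3)
print(f"speedup = {speedup:.1f}x, efficiency = {efficiency:.0%}")
# prints "speedup = 3.0x, efficiency = 100%"
```

An efficiency noticeably below 100% on your own jobs would suggest the interconnect (or data loading) is becoming the bottleneck.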

Price Comparison

Vendors in mainland China (Updated 07/2022):

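| Vendor | Machine | Duration | Price (RMB) | Note |
| --- | --- | --- | --- | --- |
| Alibaba Cloud (阿里) | 8xV100 (16GB) | 1 year | 800,000 | CentOS only |
| | | 1 month | 71,000 | |
| | | 1 hour | 248.42 | |
| Huawei Cloud (华为云) | 8xV100 (32GB) | 1 year | 630,000 | |
| | | 1 month | 63,000 | |
| | | 1 hour | 131.5 | |
| Tencent Cloud (腾讯云) | 8xV100 (32GB) | 1 year | 458,000 (17% off) | link |
| | | 1 month | 46,000 | |
| | | 1 hour (TIONE) | 147 | |
| | 8xA100 (40GB) | 1 year | 1,135,000 (17% off) | |
| | | 1 month | 114,000 | |
| Baidu Cloud (百度云) | 8xA100 (40GB) | 1 year | 997,000 (17% off) | link |
| | | 1 month | 100,000 | |
| | 8xV100 (32GB) | 1 year | 593,000 | |
| | | 1 month | 59,000 | |
| | | 1 hour | 124.14 | |
| 矩池云 | 8xV100 (16GB) | 1 hour | 48 | |
| 智星云 | 8x3090 (24GB) | 1 month | 21,000 | |
| | | 1 hour | 36 | |
| | 8xA100 (40GB) | 1 month | 45,000 | |
| | | 1 hour | 76 | |
| | 8xV100 (32GB) | 1 month | 28,000 | |
| | | 1 hour | 48 | |
| 极链AI云 | | | | |
| 恒源云 | | | | |
| AutoDL | | | | link, most popular |
| OpenBayes | | | | link |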
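When a vendor lists both an hourly and a monthly price, the monthly plan only pays off above a certain number of GPU-hours per month. The sketch below computes that break-even point for a few rows from the table above. It assumes a 30-day month and that prices are the only consideration; `break_even_hours` is an illustrative helper name.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def break_even_hours(hourly_price: float, monthly_price: float) -> float:
    """Hours of use per month above which the monthly plan is cheaper than hourly billing."""
    return monthly_price / hourly_price


# (hourly RMB, monthly RMB), copied from the table above.
offers = {
    "Huawei Cloud 8xV100 (32GB)": (131.5, 63_000),
    "Baidu Cloud 8xV100 (32GB)": (124.14, 59_000),
    "智星云 8x3090 (24GB)": (36, 21_000),
}

for name, (hourly, monthly) in offers.items():
    hours = break_even_hours(hourly, monthly)
    utilization = hours / HOURS_PER_MONTH
    print(f"{name}: monthly plan wins above {hours:.0f} h/month ({utilization:.0%} utilization)")
```

For the Huawei Cloud row this comes out to roughly 480 hours, so hourly billing is the cheaper choice unless the machine is busy for about two thirds of the month.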

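Whole-machine purchase (quotes requested 08/2022)

| Source | Machine | Price (RMB) | Note |
| --- | --- | --- | --- |
| dbcloud深脑云 (Taobao) | 8x3090 | from about 200,000 | |
| Prof. Ming-Ming Cheng's experience | 8xV100 | | link |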

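Junwei: GPU prices have dropped sharply recently (09/2022), so buying whole machines is clearly the more economical option. The 3090's compute is roughly on par with the V100's, which makes it the most cost-effective card. I therefore think that several 8x3090 machines + a network-attached storage (NAS) + kubeflow is the most economical and scalable setup; see the later section on how to build your own computing cluster.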
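As a rough check on the buy-versus-rent argument, the sketch below estimates how many months of renting would cost as much as buying. It uses the two 8x3090 figures quoted above (about 200,000 RMB to buy, 21,000 RMB per month to rent from 智星云). The `monthly_operating_cost` parameter and the 3,000 RMB value passed to it are hypothetical placeholders for electricity, hosting, and maintenance, not numbers from this document.

```python
import math


def months_to_break_even(purchase_price: float, monthly_rent: float,
                         monthly_operating_cost: float = 0.0) -> float:
    """Months after which owning the machine is cheaper than renting an equivalent one."""
    monthly_saving = monthly_rent - monthly_operating_cost
    if monthly_saving <= 0:
        return math.inf  # owning never pays off if running costs exceed the rent
    return purchase_price / monthly_saving


# Purchase and rental prices from the tables above; no operating cost.
print(f"{months_to_break_even(200_000, 21_000):.1f} months")

# Same comparison with a HYPOTHETICAL 3,000 RMB/month operating cost.
print(f"{months_to_break_even(200_000, 21_000, monthly_operating_cost=3_000):.1f} months")
```

Under these assumptions the machine pays for itself in under a year of continuous use; the estimate ignores depreciation, resale value, and idle time.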

Vendors in NA (Updated 07/2022):

| Vendor | Machine | Duration | Price (USD) | Note |
| --- | --- | --- | --- | --- |
| Google Cloud asia-Taiwan | 8xV100 (32GB) | 1 month | $12,837.30 | |
| | | 1 hour | $17 | |
| Google Cloud asia-Tokyo | 8xA100 (40GB) | 1 month | $18,216.98 | |
| vast.ai NA | 8xV100 (16GB) | 1 hour | $2.80 | |
| | 8xA100 (40GB) | 1 hour | $8.80 | |
| | 8xA6000 (48GB) | 1 hour | $4.40 | |
| | 10x1080Ti (11GB) | 1 hour | $2 | |
| | 8xA5000 (24GB) | 1 hour | $2.40 | |
| | 4x3090 (24GB) | 1 hour | $1.20 | |
| lambda NA | 8xV100 (16GB) | 1 hour | $4.40 | |
| | 8xV100 (16GB) | 1 hour (>3 months) | $3.20 | |
| | 8xA100 (40GB) | 1 hour (>3 months) | $8.00 | |
| | 1xA100 (40GB) | 1 hour | $1.10 | link |
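Because the machines above bundle different numbers of GPUs, the hourly prices are easier to compare once normalized to a price per GPU-hour. This is a small sketch using a subset of the on-demand rows from the table; `per_gpu_hour` is an illustrative helper, and the comparison deliberately ignores differences in per-GPU speed and memory.

```python
# (vendor, GPU model, GPUs per machine, USD per machine-hour), copied from the table above.
offers = [
    ("vast.ai", "V100 (16GB)", 8, 2.80),
    ("vast.ai", "A100 (40GB)", 8, 8.80),
    ("vast.ai", "A6000 (48GB)", 8, 4.40),
    ("vast.ai", "1080Ti (11GB)", 10, 2.00),
    ("vast.ai", "A5000 (24GB)", 8, 2.40),
    ("vast.ai", "3090 (24GB)", 4, 1.20),
    ("lambda", "V100 (16GB)", 8, 4.40),
    ("lambda", "A100 (40GB)", 1, 1.10),
]


def per_gpu_hour(offer: tuple) -> float:
    """Price of one GPU for one hour, in USD."""
    _vendor, _gpu, gpu_count, machine_price = offer
    return machine_price / gpu_count


# List offers from cheapest to most expensive per GPU-hour.
for offer in sorted(offers, key=per_gpu_hour):
    vendor, gpu, _count, _price = offer
    print(f"{vendor:8s} {gpu:14s} ${per_gpu_hour(offer):.2f} per GPU-hour")
```

A cheaper GPU-hour is not automatically cheaper per training run, so these figures are best combined with a throughput measurement on your own workload.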

Free Stuff

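| Source | Note | Link |
| --- | --- | --- |
| 幻方AI (High-Flyer AI) | Compute on the scale of 10,000 GPUs, free to apply for ("a summer of exhilarating research") | link |
| NVIDIA | Has a grant program that provides one free GPU | |
| AWS | When I took classes at CMU, the professor of each course could apply for about $100 of cloud credit for every student | |
| Google Cloud | Similar to AWS | |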

Learning Stuff

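| Note | Link |
| --- | --- |
| GPU guide from Lambda | link |
| Understanding GPUs and DL | link |
| Tencent TEG 星辰 (Xingchen) and 机智 (Jizhi) teams | link |
| MPIJob | link |
| Machine learning platform | link |
| Cluster storage, Ceph cluster | link |
| A discussion on machine price on Twitter (for NA) | link |
| A discussion on 1xA100 vs 6x3090 on Zhihu | link |
| Prof. Ming-Ming Cheng's GPU cluster experience | link |
| Good cluster building guide from Lambda | link |
| How to decide on cloud GPUs vs. on-prem vs. hybrid | link |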