- My lab's computing setup; link
- Cloud GPU and whole-machine price comparison and notes; link
- Free compute available from companies (resources professors can apply for); link
- Useful learning material on GPUs and setting up your own compute cluster; link
In my lab, the Precognition Lab, I used the start-up funds provided by the university to build 9 stand-alone machines with a total of 32 RTX 3090/4090 GPUs (including 4-GPU machines and some 2-GPU machines). I have also set up a cluster of 3 compute nodes with a total of 24 RTX A6000 GPUs, plus a 100TB NAS.
The rationale behind this hybrid setup is twofold: the stand-alone machines cost only 40% of what the cluster does, and they can be acquired quickly without requiring additional machine-room space.
As for the cluster, I've found that 100 Gb Ethernet suffices for the compute network, which removes the need to invest in an InfiniBand switch that can cost two to three times more. With 3 nodes on this network, multi-node training scales essentially linearly (for example, a job that takes 6 hours on 1 node takes 2 hours on 3 nodes).
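
The scaling claim above can be checked with a small calculation. The sketch below is illustrative only: the function name `scaling_efficiency` is hypothetical, and the inputs are the 6-hour and 2-hour figures quoted in the text, not new measurements.

```python
def scaling_efficiency(t_single_h, t_multi_h, n_nodes):
    """Return (speedup, efficiency) of a multi-node run versus a single-node run.

    speedup    = single-node time / multi-node time
    efficiency = speedup / number of nodes (1.0 means perfectly linear scaling)
    """
    speedup = t_single_h / t_multi_h
    efficiency = speedup / n_nodes
    return speedup, efficiency


# Figures quoted in the text: 6 hours on 1 node, 2 hours on 3 nodes.
speedup, efficiency = scaling_efficiency(6.0, 2.0, 3)
print(f"speedup = {speedup:.2f}x, efficiency = {efficiency:.0%}")
```

An efficiency close to 100% is what "linear scaling" means here; a network bottleneck would show up as a value well below that.
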
Vendors in mainland China (updated 07/2022):
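Vendor | Machine | Duration | Price (RMB) | Note |
---|---|---|---|---|
Alibaba Cloud (阿里) | 8xV100 (16GB) | 1 year | 800,000 | CentOS only |
Alibaba Cloud (阿里) | 8xV100 (16GB) | 1 month | 71,000 | |
Alibaba Cloud (阿里) | 8xV100 (16GB) | 1 hour | 248.42 | |
Huawei Cloud (华为云) | 8xV100 (32GB) | 1 year | 630,000 | |
Huawei Cloud (华为云) | 8xV100 (32GB) | 1 month | 63,000 | |
Huawei Cloud (华为云) | 8xV100 (32GB) | 1 hour | 131.5 | |
Tencent Cloud (腾讯云) | 8xV100 (32GB) | 1 year | 458,000 (after 17% discount) | link |
Tencent Cloud (腾讯云) | 8xV100 (32GB) | 1 month | 46,000 | |
Tencent Cloud (腾讯云) | 8xV100 (32GB) | 1 hour (TIONE) | 147 | |
Tencent Cloud (腾讯云) | 8xA100 (40GB) | 1 year | 1,135,000 (after 17% discount) | |
Tencent Cloud (腾讯云) | 8xA100 (40GB) | 1 month | 114,000 | |
Baidu Cloud (百度云) | 8xA100 (40GB) | 1 year | 997,000 (after 17% discount) | link |
Baidu Cloud (百度云) | 8xA100 (40GB) | 1 month | 100,000 | |
Baidu Cloud (百度云) | 8xV100 (32GB) | 1 year | 593,000 | |
Baidu Cloud (百度云) | 8xV100 (32GB) | 1 month | 59,000 | |
Baidu Cloud (百度云) | 8xV100 (32GB) | 1 hour | 124.14 | |
矩池云 | 8xV100 (16GB) | 1 hour | 48 | |
智星云 | 8x3090 (24GB) | 1 month | 21,000 | |
智星云 | 8x3090 (24GB) | 1 hour | 36 | |
智星云 | 8xA100 (40GB) | 1 month | 45,000 | |
智星云 | 8xA100 (40GB) | 1 hour | 76 | |
智星云 | 8xV100 (32GB) | 1 month | 28,000 | |
智星云 | 8xV100 (32GB) | 1 hour | 48 | |
极链AI云 | | | | |
恒源云 | | | | |
AutoDL | | | | link, most popular |
OpenBayes | | | | link |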
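Whole-machine purchase (quotes from 08/2022):
Source | Machine | Price / reference |
---|---|---|
dbcloud 深脑云 (Taobao) | 8x3090 | from about 200,000 RMB |
Prof. Ming-Ming Cheng's (程明明) experience | 8xV100 | link |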
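Junwei: GPU prices dropped sharply recently (09/2022), so buying whole machines is clearly the better deal. The 3090's compute is roughly equivalent to the V100's, which makes it the most cost-effective card. I therefore think that several 8x3090 machines plus a network-attached storage (NAS) and kubeflow is the most economical and scalable setup; see the material on building your own compute cluster further below.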
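
The buy-versus-rent argument above can be made concrete with a back-of-the-envelope break-even estimate. This is a rough sketch, not a full cost model: it pairs the roughly 200,000 RMB purchase quote for an 8x3090 machine with the 21,000 RMB monthly rental price for 8x3090 from the tables above, and the `monthly_running_cost` parameter (electricity, hosting, maintenance) is a hypothetical knob that defaults to zero because the source gives no figure for it.

```python
def break_even_months(purchase_price, monthly_rent, monthly_running_cost=0.0):
    """Months of continuous use after which buying becomes cheaper than renting.

    monthly_running_cost is an assumed placeholder for power, hosting and
    maintenance of an owned machine; it is not a number from the source.
    """
    monthly_saving = monthly_rent - monthly_running_cost
    if monthly_saving <= 0:
        raise ValueError("owning never pays off with these inputs")
    return purchase_price / monthly_saving


# 8x3090: about 200,000 RMB to buy versus 21,000 RMB per month to rent.
months = break_even_months(200_000, 21_000)
print(f"break-even after about {months:.1f} months of continuous rental")
```

Under these simplified assumptions the purchase pays for itself in under a year of continuous use; adding realistic running costs pushes the break-even point later.
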
Vendors in North America (NA) (updated 07/2022):
Vendor | Machine | Duration | Price |
---|---|---|---|
Google Cloud asia-Taiwan | 8xV100 (32GB) | 1 month | $12,837.30 |
Google Cloud asia-Taiwan | 8xV100 (32GB) | 1 hour | $17 |
Google Cloud asia-Tokyo | 8xA100 (40GB) | 1 month | $18,216.98 |
vast.ai NA | 8xV100 (16GB) | 1 hour | $2.80 |
vast.ai NA | 8xA100 (40GB) | 1 hour | $8.80 |
vast.ai NA | 8xA6000 (48GB) | 1 hour | $4.40 |
vast.ai NA | 10x1080Ti (11GB) | 1 hour | $2 |
vast.ai NA | 8xA5000 (24GB) | 1 hour | $2.40 |
vast.ai NA | 4x3090 (24GB) | 1 hour | $1.20 |
lambda NA | 8xV100 (16GB) | 1 hour | $4.40 |
lambda NA | 8xV100 (16GB) | 1 hour (>3 months) | $3.20 |
lambda NA | 8xA100 (40GB) | 1 hour (>3 months) | $8.00 |
lambda NA | 1xA100 (40GB) | 1 hour | $1.10 (link) |
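
Because the machines listed above have different GPU counts, the hourly prices are easier to compare once normalized to a price per GPU-hour. The sketch below does this for the vast.ai rows only; the names `VAST_AI_HOURLY` and `price_per_gpu_hour` are illustrative, and the comparison deliberately ignores differences in GPU speed and memory.

```python
# (machine label, number of GPUs, hourly price in USD), copied from the vast.ai rows.
VAST_AI_HOURLY = [
    ("8xV100 (16GB)", 8, 2.80),
    ("8xA100 (40GB)", 8, 8.80),
    ("8xA6000 (48GB)", 8, 4.40),
    ("10x1080Ti (11GB)", 10, 2.00),
    ("8xA5000 (24GB)", 8, 2.40),
    ("4x3090 (24GB)", 4, 1.20),
]


def price_per_gpu_hour(offers):
    """Map each machine label to its hourly price divided by its GPU count."""
    return {name: price / n_gpus for name, n_gpus, price in offers}


per_gpu = price_per_gpu_hour(VAST_AI_HOURLY)
# Print the offers from cheapest to most expensive per GPU-hour.
for name, usd in sorted(per_gpu.items(), key=lambda item: item[1]):
    print(f"{name:<18} ${usd:.2f} per GPU-hour")
```
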
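Provider | Note | Link |
---|---|---|
幻方AI (High-Flyer AI) | Compute on the scale of ten thousand GPUs, free to apply for ("a summer of unrestrained research") | link |
NVIDIA | Grant program that provides one free GPU | |
AWS | For courses at CMU, the professor of each course could apply for roughly $100 of cloud credit per student | |
Google Cloud | Similar to AWS | |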
Topic | Link |
---|---|
GPU guide from Lambda | link |
Understanding GPUs and DL | link |
Tencent TEG 星辰 and 机智 teams | link |
MPIJob | link |
Machine learning platform | link |
Cluster storage, Ceph cluster | link |
A discussion on machine prices on Twitter (for NA) | link |
A discussion on 1xA100 vs 6x3090 on Zhihu (知乎) | link |
Prof. Ming-Ming Cheng's (程明明) GPU cluster experience | link |
Good cluster building guide from Lambda | link |
How to decide on cloud GPUs vs. on-prem vs. hybrid | link |