Docker

Containerd

AI accelerators

ML software stack

Figure: ML software stack (diagram by Shashank Prasanna)

  • Docker Container
    • Typical software stack
      • My Code
      • Tensorflow, PyTorch, Frameworks + Library Dependencies
      • Python
      • CPU ML libraries
    • Hardware Accelerator
      • AI accelerator ML libraries
      • AI accelerator drivers
  • OS
    • AI accelerator drivers (versions must match the drivers inside the container)
    • OS Kernel
    • Host OS
  • Heterogeneous Hardware
    • CPU
    • AI Accelerator

Challenges

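  • Duplicating drivers in every image bloats VMs and containers
  • Driver versions inside the container must match the driver versions on the host
  • Images are no longer portable across hosts, which defeats the point of containers and makes scaling difficult
  • The result is brittle: a driver upgrade on the host can break images built against the old version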

Container Runtimes

runc/libcontainer/process_linux.go

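// Excerpt (truncated) from initProcess.start: runc runs the configured OCI
// hooks while it synchronizes with the container's init process.
func (p *initProcess) start() (retErr error) {
	ierr := parseSync(p.comm.syncSockParent, func(sync *syncT) error {
		switch sync.Type {
		case procHooks:
			if p.config.Config.HasHook(configs.Prestart, configs.CreateRuntime) {
				// hooks and s (the container state) are defined in lines elided from this excerpt.
				if err := hooks.Run(configs.Prestart, s); err != nil {
					return err
				}
				// ... (excerpt ends here; the closing braces are omitted)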
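
To make the hook step above more concrete: under the OCI runtime specification, the runtime executes each configured hook as a separate program and passes it the container's state as JSON on standard input. The following is a minimal Python sketch of the receiving side, not NVIDIA's or runc's implementation. The function name `read_container_state` and all sample values are hypothetical.

```python
import io
import json


def read_container_state(stream):
    """Parse the OCI container state JSON that a runtime writes to a hook's stdin."""
    state = json.load(stream)
    # Field names follow the OCI runtime spec's state object.
    return {
        "id": state["id"],
        "status": state.get("status"),
        "pid": state.get("pid"),        # container init process, as seen from the host
        "bundle": state.get("bundle"),  # path to the OCI bundle directory
    }


# Hypothetical state payload standing in for what a runtime would send on stdin.
sample_stdin = io.StringIO(json.dumps({
    "ociVersion": "1.0.2",
    "id": "example-container",
    "status": "created",
    "pid": 4242,
    "bundle": "/run/example/bundle",
}))

info = read_container_state(sample_stdin)
print(f"hook invoked for container {info['id']} (pid {info['pid']})")
```

A real hook would read from `sys.stdin` instead of the in-memory sample and would typically use the PID to locate the container's namespaces before exposing devices or mounts to it.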

Nvidia

libnvidia-container

Configs

  • /etc/docker/daemon.json
  • /etc/nvidia-container-runtime/config.toml

/etc/docker/daemon.json

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
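
Since `daemon.json` often already holds other settings, the runtime entry usually has to be merged in rather than written from scratch. Here is a small Python sketch of that merge, assuming the configuration has already been loaded into a dict. The helper name `add_nvidia_runtime` and the pre-existing `log-driver` setting are illustrative only.

```python
import json


def add_nvidia_runtime(daemon_config):
    """Return a copy of a daemon.json dict with the nvidia runtime registered."""
    merged = dict(daemon_config)
    # Copy the runtimes table so the caller's dict is left untouched.
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["nvidia"] = {"args": [], "path": "nvidia-container-runtime"}
    merged["runtimes"] = runtimes
    return merged


existing = {"log-driver": "json-file"}  # hypothetical settings already in the file
updated = add_nvidia_runtime(existing)
print(json.dumps(updated, indent=4))
```

Once the entry is present and the Docker daemon has been restarted, the runtime can be selected per container with `docker run --runtime=nvidia ...`.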

Inside a tritonserver container:

docker run --rm -it --gpus all nvcr.io/nvidia/tritonserver:25.01-py3 bash
ls -Fl /dev | grep nvidia

crw-rw-rw- 1 root root 511,   0 Mar  3 03:09 nvidia-uvm
crw-rw-rw- 1 root root 511,   1 Mar  3 03:09 nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Mar  3 03:08 nvidia0
crw-rw-rw- 1 root root 195, 255 Mar  3 03:08 nvidiactl
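
In the listing above, the leading `c` marks a character device and the two comma-separated numbers are the major and minor device numbers: the major number selects the kernel driver, the minor number selects the device instance. Major 195 is the number registered for NVIDIA graphics devices in the Linux device list. As a rough sketch, the Python below extracts those numbers from `ls -l` output. The regular expression and the function name `parse_device_nodes` are assumptions made for this example, not part of any NVIDIA tool.

```python
import re

# Matches `ls -l` lines for character devices, for example:
# "crw-rw-rw- 1 root root 195, 255 Mar  3 03:08 nvidiactl"
CHAR_DEVICE_LINE = re.compile(
    r"^c\S+\s+\d+\s+\S+\s+\S+\s+(?P<major>\d+),\s*(?P<minor>\d+)\s+.*\s(?P<name>\S+)$"
)


def parse_device_nodes(ls_output):
    """Return (name, major, minor) for each character device line."""
    nodes = []
    for line in ls_output.splitlines():
        match = CHAR_DEVICE_LINE.match(line.strip())
        if match:
            nodes.append((match["name"], int(match["major"]), int(match["minor"])))
    return nodes


listing = """\
crw-rw-rw- 1 root root 511,   0 Mar  3 03:09 nvidia-uvm
crw-rw-rw- 1 root root 511,   1 Mar  3 03:09 nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Mar  3 03:08 nvidia0
crw-rw-rw- 1 root root 195, 255 Mar  3 03:08 nvidiactl
"""

nodes = parse_device_nodes(listing)
for name, major, minor in nodes:
    print(f"{name}: major={major} minor={minor}")
```
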

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070        Off |   00000000:2B:00.0  On |                  N/A |
|  0%   50C    P3             49W /  270W |    1256MiB /   8192MiB |     21%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
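
The table above is meant for humans. For scripts, `nvidia-smi` also has a CSV query mode (`--query-gpu=... --format=csv`). The Python sketch below shows one way to use it. The helper names are hypothetical, and the sample line is an illustrative reconstruction of the table's values rather than captured output.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["name", "driver_version", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:
            rows.append(dict(zip(QUERY_FIELDS, (field.strip() for field in record))))
    return rows


def query_gpus():
    """Run nvidia-smi in CSV mode; requires nvidia-smi on PATH (not called here)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_csv(result.stdout)


# Illustrative line built from the values in the table above (memory in MiB).
sample = "NVIDIA GeForce RTX 3070, 570.86.15, 1256, 8192\n"
gpus = parse_gpu_csv(sample)
print(gpus[0]["name"], gpus[0]["driver_version"])
```

Inside a container started with `--gpus all`, calling `query_gpus()` should report the host's driver version, which is a quick way to confirm that the container is using the host driver rather than a bundled copy.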

Neuron