This document is for beginners who use OpenPAI to train machine learning models or execute other commands.
It assumes that you know the IP address or domain name of an OpenPAI cluster and have an account on it. If there isn't an OpenPAI cluster yet, refer to here to deploy one.
A job in OpenPAI defines how to execute command(s) in specified environment(s). A job can be a model training run or any other kind of command, and it can be distributed across multiple servers.
Follow this section to submit a very simple job, like hello-world when learning a programming language. It trains a model, implemented in TensorFlow, on the CIFAR-10 dataset. It downloads data and code from the internet and doesn't copy the trained model out. It helps you get started with OpenPAI. The next sections include more details to help with submitting real jobs.
- Navigate to the OpenPAI web portal. Enter the IP address or domain name of OpenPAI, which is available from the administrator of the OpenPAI cluster. If you are not logged in yet, click the login link at the top right and enter your username and password. After that, OpenPAI shows the dashboard as below.
- Click Submit Job on the left pane to reach this page.
- Click the JSON button. Clear the existing content, paste the content below into the pop-up text box, and then click save. The content is explained in the next sections.
{ "jobName": "tensorflow-cifar10", "image": "ufoym/deepo:tensorflow-py36-cu90", "taskRoles": [ { "name": "default", "taskNumber": 1, "cpuNumber": 4, "memoryMB": 8192, "gpuNumber": 1, "command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=/tmp/data && python train_image_classifier.py --dataset_name=cifar10 --dataset_dir=/tmp/data --max_number_of_steps=1000" } ] }
- The configuration shows as below. Click the Submit button to submit the job to the OpenPAI platform.

  Note, the web portal is only one of the ways to submit jobs. It's not the most efficient way, but it's the simplest way to begin. The OpenPAI VS Code Client is recommended for working with OpenPAI.
- After it's submitted, the page redirects to the job list, and the submitted job appears in the list with waiting status. Clicking Jobs on the left pane also reaches this page.
- Click the job name to view job details. Keep refreshing the details page until the job status changes to Running and an IP address is assigned in the task role pane below. The page has more details and actions, like status, tracking logs, and so on.
Having submitted a hello-world job, this section introduces more about jobs, so that you can write your own job configuration easily.
The job configuration is a JSON file that is posted to OpenPAI. This section uses the hello-world job configuration to explain the key fields.
The JSON file of a job has two levels of entries. The top level includes shared information of the job, such as the job name, docker image, task roles, and so on. The second level is `taskRoles`, which is an array. Each item in the array specifies commands and their corresponding running environment.
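A minimal sketch of this two-level structure is below; all values are placeholders only, not defaults:

```json
{
    "jobName": "<unique job name>",
    "image": "<docker image>",
    "taskRoles": [
        {
            "name": "<task role name>",
            "taskNumber": 1,
            "cpuNumber": 1,
            "memoryMB": 1024,
            "gpuNumber": 0,
            "command": "<command(s) to run>"
        }
    ]
}
```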
Below are the key fields; the full spec of the job configuration is here.
- `jobName` is the unique name of the current job and is also displayed in the web portal. A meaningful name helps manage jobs well.
- `image`
OpenPAI uses docker to provide runtime environments. Docker is a popular technology for providing isolated environments on the same server. With it, OpenPAI can serve multiple resource requests on the same server while providing consistent, clean environments.
The `image` field is the identifier of the docker image, which includes customized Python and system packages, to provide a clean and consistent environment on each run.
The administrator may set up a private docker repository. hub.docker.com is a public docker repository with a lot of docker images, and ufoym/deepo there is recommended for deep learning. The hello-world example uses a TensorFlow image, ufoym/deepo:tensorflow-py36-cu90, from ufoym/deepo.
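If docker is installed locally, the image tag can be sanity-checked before submitting a job; this step is optional and assumes local docker access:

```sh
# Pull the image used by the hello-world job to verify the tag exists.
docker pull ufoym/deepo:tensorflow-py36-cu90

# Optionally start a throwaway container to check the bundled Python.
# (GPU features additionally need the nvidia container runtime.)
docker run --rm ufoym/deepo:tensorflow-py36-cu90 python --version
```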
If an appropriate docker image isn't found, it's not difficult to build one from scratch.
Note, if a docker image doesn't include the openssh-server and curl packages, it cannot use the SSH feature of OpenPAI. If SSH is needed, another docker image can be built on top of it with openssh-server and curl added, as sketched below.
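For example, a minimal Dockerfile sketch like the one below adds the two packages on top of the hello-world image; the base image here is just an illustration, and any Ubuntu-based image works the same way:

```dockerfile
# Start from the image used in the hello-world example.
FROM ufoym/deepo:tensorflow-py36-cu90

# Add the packages needed by OpenPAI's SSH feature.
RUN apt-get update && \
    apt-get install -y --no-install-recommends openssh-server curl && \
    rm -rf /var/lib/apt/lists/*
```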
- `taskRoles` defines the different roles in a job.
For single-machine jobs, there is only one item in `taskRoles`.
For distributed jobs, there may be multiple roles in `taskRoles`. For example, when TensorFlow is used to run a distributed job, it has two roles: parameter server and worker. So there are two task roles in the corresponding job configuration, as sketched below; refer to an example.
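A sketch of such a `taskRoles` array (an excerpt of a job configuration, not a complete file) is below; the role names, resource numbers, and commands are placeholders, not the values of the linked example:

```json
"taskRoles": [
    {
        "name": "ps",
        "taskNumber": 2,
        "cpuNumber": 2,
        "memoryMB": 8192,
        "gpuNumber": 0,
        "command": "python train.py --job_name=ps"
    },
    {
        "name": "worker",
        "taskNumber": 8,
        "cpuNumber": 4,
        "memoryMB": 16384,
        "gpuNumber": 1,
        "command": "python train.py --job_name=worker"
    }
]
```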
- `taskRoles/name` is the name of the current task role, and it's used in environment variables for communication in distributed jobs.
- `taskRoles/taskNumber` is the number of instances of the current task role. For single-server jobs, it should be 1. For distributed jobs, it depends on how many instances the task role needs. For example, if it's 8 in a worker role of TensorFlow, it means 8 docker containers should be instantiated as workers for this task role.
- `taskRoles/cpuNumber`, `taskRoles/memoryMB`, and `taskRoles/gpuNumber` are easy to understand. They specify the corresponding hardware resources: the number of CPU cores, MB of memory, and the number of GPUs.
- `taskRoles/command` is what the user wants to run in this task role. It can be multiple commands, joined by `&&` as in a terminal. In the hello-world job configuration, it clones code from GitHub, downloads data, and then executes the training process, all within one line.

Like the hello-world job, the user needs to construct command(s) to fetch code and data and to trigger execution.
The data here doesn't mean only the dataset for machine learning; it includes all files and information, like code, scripts, trained models, and so on. Most model training and other kinds of jobs need to exchange data between the docker container and the outside.
OpenPAI creates a clean docker container for each run. Data that changes rarely can be built into the docker image directly.
If data needs to be exchanged at runtime, the command passed to docker in the job configuration needs to initiate the data exchange. For example, use `git`, `wget`, `scp`, `sftp`, or other commands to copy data in and out. If a command isn't built into the docker image, it can be installed within the command by `apt install` or `python -m pip install`.
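As a sketch, a single command string can both copy data in and copy results out; the repository, URLs, hostnames, and paths below are all hypothetical:

```sh
# Hypothetical end-to-end command: fetch code and data, train, copy the model out.
git clone https://github.com/user/repo && \
    wget -O /tmp/data.zip https://example.com/dataset.zip && \
    python repo/train.py --data=/tmp/data.zip --output=/tmp/model && \
    scp -r /tmp/model user@storage.example.com:/models/
```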
It's better to check with the administrator of the OpenPAI cluster, since there may already be suggested approaches and examples.
Once the job configuration is ready, the next step is to submit it to OpenPAI. The recommended way to submit a job is the Visual Studio Code extension for OpenPAI. Both the web UI and the extension manage jobs through the RESTful API of OpenPAI, so it's also possible to implement your own script or tool.
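As a sketch of such a script, the requests below submit a job configuration with curl. The endpoint paths and the token flow differ across OpenPAI versions, so both URLs here are assumptions; check the REST API reference of your cluster:

```sh
# Assumed v1-style endpoints; verify against your cluster's REST API docs.
# 1. Get an access token (this endpoint path is an assumption).
TOKEN=$(curl -s -X POST http://<pai-master>/rest-server/api/v1/token \
    -d "username=<user>" -d "password=<password>" \
    | python -c "import json,sys; print(json.load(sys.stdin)['token'])")

# 2. Post the job configuration JSON (this endpoint path is an assumption).
curl -X POST http://<pai-master>/rest-server/api/v1/user/<user>/jobs \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d @tensorflow-cifar10.json
```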
After receiving a job configuration, OpenPAI processes it in the steps below.
- Wait for resource allocation. Based on the job configuration, OpenPAI waits until enough resources, including CPU, memory, and GPU, are allocated. If there are enough resources, the job starts to run very soon. If not, the job is queued and waits for previous jobs to complete.

  Note, distributed jobs start to run once the first environment is ready, and the job status is set to Running on OpenPAI as soon as one container is running. User code can still wait until enough containers are running before executing the actual work.
- Initialize the docker container. OpenPAI pulls the docker image specified in the configuration if it doesn't exist locally. After the docker container starts, OpenPAI executes some initialization and then runs the user's command(s).
- Execute user commands. While user commands execute, OpenPAI outputs stdout and stderr in near real-time. There are also metrics to monitor the workload.
- Finalize the job. Once the user's commands complete, OpenPAI uses the latest exit code as the signal to decide whether the job succeeded: 0 means success, any other value means failure. OpenPAI then recycles the resources for the next job.
Once a job is submitted to OpenPAI, its status changes from waiting, to running, then to succeeded or failed. It may show as stopped if the job is interrupted by the user or the system.