
[Feature] Progress tracking and scheduling #1

Open
wants to merge 103 commits into develop
Conversation

@anibalsolon (Member) commented Jul 23, 2020

Fixes

Related to FCP-INDI/C-PAC#1363 by @sgiavasis

Description

This PR creates a scheduler for C-PAC (or virtually anything else), with an API interface, that allows running containerized images and checking their progress. It has a CLI, so the user can start up and configure the scheduler, and an API interface to communicate mainly with the C-PAC GUI project.

Technical details

The implementation relies heavily on the asyncio API to simplify concurrency. However, it is not parallel: everything executes in a single thread (so there are no race conditions), and the tasks running concurrently must not block the asyncio event loop (e.g. a task may await asyncio.sleep or a non-blocking I/O call). All feature implementations must keep this in mind, which is why it is hard to leverage asyncio's full potential in projects that were not designed to work this way (e.g. nipype → pydra). The good thing is that, since everything is single-threaded, it is much easier to manage the different moving parts, whereas a parallel setup would require other communication mechanisms to avoid race conditions.
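To make the cooperative (non-parallel) model concrete, here is a minimal, self-contained asyncio sketch, unrelated to the PR's actual code:

```python
import asyncio

# Two coroutines sharing one event loop: neither blocks the loop, so they
# interleave on a single thread without threads or locks.
async def ping(name: str, delay: float) -> None:
    for i in range(3):
        await asyncio.sleep(delay)   # yields control back to the event loop
        print(f"{name}: tick {i}")

async def main() -> None:
    await asyncio.gather(ping("status", 0.5), ping("logs", 0.3))

asyncio.run(main())
```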

That said, there are six main parts to this implementation: Scheduler, Backend, Schedule (and its children), Message, Result, and the API.
Beginning with the Schedule: a Schedule is an abstraction of the task to be executed. For C-PAC, we have three tasks (a hypothetical sketch of this hierarchy follows the list):

  • DataSettings: A task to generate data configs from a provided data settings file;
  • DataConfig: A task to schedule a pipeline for the subjects from a data config, spawning new tasks for each participant;
  • ParticipantPipeline: A task to execute a pipeline for a single subject.
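A sketch of that hierarchy, with illustrative class and attribute names that are not necessarily the PR's:

```python
from dataclasses import dataclass
from typing import Optional

class Schedule:
    """Abstract description of a task, independent of any backend."""

@dataclass
class DataSettingsSchedule(Schedule):
    data_settings: str                 # path to a data settings file

@dataclass
class DataConfigSchedule(Schedule):
    data_config: str                   # path to a data config
    pipeline: Optional[str] = None     # spawns one child schedule per participant

@dataclass
class ParticipantPipelineSchedule(Schedule):
    subject: str                       # a single participant from the data config
    pipeline: Optional[str] = None
```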

Conceptually, the Schedule handles the logical aspects of the abstract task it performs. More technical aspects, such as running containers, are handled by a specialization of the Schedule class: BackendSchedule. BackendSchedules are specific to a Backend, an interface between Python and the software of a specific backend (e.g. the Singularity binaries). The Backend must contain the parameters required for the BackendSchedules to properly communicate with the underlying software, such as the Docker image to be used or the SSH connection to access a SLURM cluster.
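For instance, a Backend could be little more than a container for those parameters; the names and defaults below are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class DockerBackend:
    # Image tag is an assumption, not necessarily the PR's default.
    image: str = "fcpindi/c-pac:latest"

@dataclass
class SLURMBackend:
    host: str                                   # SSH entry point of the cluster
    username: str
    control_path: str = "~/.ssh/cpac-%r@%h:%p"  # socket reused by multiplexed SSH
```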

The Scheduler is the central part of this implementation, and maybe the simplest. It stores the Schedules in a tree-like structure, since Schedules can spawn new Schedules, and manages the Messages received from each Schedule, together with the callbacks associated with each Schedule Message type. When a Schedule is scheduled, the Scheduler sends it to its Backend, and the Backend specializes this "naive" Schedule into a BackendSchedule for that Backend:

ParticipantPipelineSchedule + DockerBackend = DockerParticipantPipelineSchedule
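One possible way to express that composition, purely illustrative and not the PR's actual mechanism:

```python
class Schedule: ...
class ParticipantPipelineSchedule(Schedule): ...

class BackendSchedule(Schedule):
    """Adds backend-specific behavior (e.g. running containers)."""

class DockerBackend:
    prefix = "Docker"
    base_schedule = BackendSchedule

def specialize(schedule_cls, backend):
    # Build e.g. DockerParticipantPipelineSchedule from the two pieces.
    name = f"{backend.prefix}{schedule_cls.__name__}"
    return type(name, (backend.base_schedule, schedule_cls), {})

specialized = specialize(ParticipantPipelineSchedule, DockerBackend())
print(specialized.__name__)   # DockerParticipantPipelineSchedule
```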

This "backend-aware" Schedule (from the superclass BackendSchedule) will then be executed by the Scheduler. The BackendSchedule behave as a Python generator, so the Scheduler simply iterate this object, and the items of this iteration are defined as Messages. The Messages are data classes (i.e. only store data, not methods), to give information for the Scheduler about the execution. The Messages are relayed to Scheduler watchers, which are external agents that provide a callback function for the Scheduler to call when it receives a specific type of Message. For the Spawn Message, the Scheduler schedules a new Schedule, with the parameters contained in the Spawn message.

The Docker and Singularity backends are actually the same: they share the same base code for container execution, differing only in how the container is created.
When the container is created, three tasks run concurrently for the Schedule: container status, log listener, and file listener. The first yields Messages of type Status, as a ping, so we know the container is running fine. The second connects to the websocket server running in the container to capture which nodes have run so far, yielding Messages of type Log. The last one watches the output directory for logs and crashes, storing the files as Results in the Schedule and yielding Messages of type Result.
Only the ParticipantPipeline Schedule has the second and the third; the others have just the container status Messages.
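A rough, self-contained sketch of three concurrent monitors feeding one message stream; all names, timings, and payloads below are invented for illustration:

```python
import asyncio

async def container_status(queue):
    for _ in range(3):
        await asyncio.sleep(1.0)
        await queue.put(("Status", "container is running"))

async def log_listener(queue):
    # In the PR this connects to a websocket server inside the container;
    # here we just fake two node-completion events.
    for node in ("anat_preproc", "func_preproc"):
        await asyncio.sleep(1.5)
        await queue.put(("Log", f"finished node {node}"))

async def file_listener(queue):
    # In the PR this watches the output directory for logs and crash files.
    await asyncio.sleep(2.0)
    await queue.put(("Result", "output/logs/pypeline.log"))

async def main():
    queue = asyncio.Queue()
    monitors = asyncio.gather(
        container_status(queue), log_listener(queue), file_listener(queue)
    )
    # Drain messages until every monitor has finished and the queue is empty.
    while not monitors.done() or not queue.empty():
        try:
            kind, payload = await asyncio.wait_for(queue.get(), timeout=0.5)
            print(kind, payload)
        except asyncio.TimeoutError:
            continue
    await monitors

asyncio.run(main())
```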

For SLURM, the Backend starts by connecting to the cluster via SSH. It uses SSH connection multiplexing, so the authentication process happens only once, which helps with connections that have a multi-factor authentication layer. After connecting to the cluster, the Backend allocates nodes to execute the Schedules and installs Miniconda & CPACpy on them. Using the API provided by CPACpy, the local CPACpy communicates with the node CPACpy (yes, via HTTP & WS) to run the Schedules. It uses the same API to gather the results and keep the local Schedule state updated. By default, the node CPACpy uses the Singularity Backend to run the Schedules.
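For illustration, the multiplexing part could be done with plain OpenSSH options; the host, socket path, and job script name below are placeholders, not taken from the PR:

```python
import subprocess

HOST = "user@cluster.example.edu"                 # placeholder login node
MUX = [
    "-o", "ControlMaster=auto",                   # open or reuse a master connection
    "-o", "ControlPath=~/.ssh/cpac-%r@%h:%p",     # where the shared socket lives
    "-o", "ControlPersist=600",                   # keep it open for later commands
]

# First call authenticates once (including any MFA prompt) and leaves the
# master connection behind.
subprocess.run(["ssh", *MUX, HOST, "true"], check=True)

# Later calls reuse the multiplexed connection without re-authenticating,
# e.g. to submit a job script (placeholder name) through SLURM.
subprocess.run(["ssh", *MUX, HOST, "sbatch", "run_cpac.sh"], check=True)
```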

The Results are basically files that would be too large to transfer via WS. The API for gathering Results allows slicing the content using HTTP range headers (Range/Content-Range). This is essential for results that keep growing during the execution (i.e. logs): using slices, one does not need to request the whole file again, only the part it does not yet have:

/result/cpac_pipeline.log from bytes 0-    # Returns 200 bytes

# The file has some increments from the nipype log

/result/cpac_pipeline.log from bytes 200-  # Returns 100 bytes

# The file has some increments from the nipype log

/result/cpac_pipeline.log from bytes 300-  # Returns 10 bytes
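A hedged sketch of a client using that range mechanism; the URL and port are assumptions, not the actual API address:

```python
import urllib.request

def fetch_from(url: str, offset: int) -> bytes:
    # Ask only for bytes from `offset` onward; a range-aware server answers
    # 206 Partial Content with a Content-Range header describing the slice.
    request = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
    with urllib.request.urlopen(request) as response:
        return response.read()

log_url = "http://localhost:3333/result/cpac_pipeline.log"   # assumed address
first = fetch_from(log_url, offset=0)              # everything available so far
later = fetch_from(log_url, offset=len(first))     # only the new tail
```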

Tests

Screenshots

I mean, I can show some code, I guess...

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the develop branch of the repository.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I updated the changelog.
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@shnizzedy shnizzedy deleted the branch FCP-INDI:develop February 1, 2022 21:11
@shnizzedy shnizzedy closed this Feb 1, 2022
@shnizzedy shnizzedy reopened this Feb 22, 2022
@shnizzedy shnizzedy mentioned this pull request Apr 11, 2024