Automatic Extrae deployment on Distributed.Workers #8

Open
mofeing opened this issue Jan 24, 2023 · 1 comment
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

mofeing (Member) commented Jan 24, 2023

Extrae automatically deploys on multi-node jobs if MPI communication is detected, which is not always the case in Julia. Automatically deploying to remote workers at execution time and subsequently merging the intermediate trace files would make Extrae a perfect match for Julia.

mofeing added the enhancement and help wanted labels on Jan 24, 2023
clasqui (Contributor) commented Jan 25, 2023

I think this issue contains several subtopics, so I will try to put down some thoughts on each of them.

Initialization of Extrae

When using the libseqtrace library, Extrae is automatically initialized at library load. This can be avoided by setting the environment variable EXTRAE_SKIP_AUTO_LIBRARY_INITIALIZE=1.
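
For example, a minimal sketch of deferring initialization (this assumes the underlying library is loaded when the Extrae package is imported, so the variable must be set beforehand):

# Skip the automatic initialization performed at library load,
# so init can be deferred until Distributed workers are set up.
ENV["EXTRAE_SKIP_AUTO_LIBRARY_INITIALIZE"] = "1"
using Extrae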

To simplify, our first implementation can initialize Extrae manually, once Distributed is already loaded and the number of processes is already set (addprocs()) and therefore known. This requires setting up the two resource functions (Extrae_set_threadid_function and Extrae_set_numthreads_function), correctly identifying the master process (worker id 1) as process 0 for Extrae. These functions therefore have to be implemented and passed to Extrae before initialization:

function distributed_taskid()::Cuint
    # Map the master process (worker id 1) to Extrae task 0,
    # and workers with ids 2..N to tasks 1..N-1.
    return Distributed.myid() - 1
end

function distributed_numtasks()::Cuint
    # Total number of Extrae tasks: the master plus all workers.
    return Distributed.nprocs()
end
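
A hedged sketch of how these could be registered before initialization; the library name "libseqtrace" and the exact C signatures are assumptions here, based on Extrae exposing Extrae_set_threadid_function and Extrae_set_numthreads_function as function-pointer setters:

# Create C-callable pointers for the two resource functions.
taskid_cb = @cfunction(distributed_taskid, Cuint, ())
numtasks_cb = @cfunction(distributed_numtasks, Cuint, ())

# Register the callbacks with Extrae before calling Extrae.init().
@ccall "libseqtrace".Extrae_set_threadid_function(taskid_cb::Ptr{Cvoid})::Cvoid
@ccall "libseqtrace".Extrae_set_numthreads_function(numtasks_cb::Ptr{Cvoid})::Cvoid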

Also, the second requirement before initialization is to set a program name for the mpits trace file. This is needed when using the libseqtrace library because each Extrae process is not aware of the existence of the others, so running with the default configuration would end up producing files with the same name and causing conflicts. This can be solved by setting an environment variable before initialization, more or less like this:

# Set an environment variable so each process writes a different .mpits trace file
id = Distributed.myid() - 1
name = "JULIATRACE" * string(id)
var = "EXTRAE_PROGRAM_NAME"
@ccall setenv(var::Cstring, name::Cstring, 1::Cint)::Cint

Then initialization can be done by just executing Extrae.init() in all workers with @everywhere.
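
For instance, a minimal sketch of the manual flow (the worker count is arbitrary here, and it assumes the resource functions and program name from the snippets above are set up on every process before init):

using Distributed
addprocs(2)  # example worker count

@everywhere using Extrae
@everywhere Extrae.init()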

If we wanted to automatically initialize Extrae when setting up new workers, we could implement this in the prehook of addprocs, or add a specialized addprocs function, and then initialize Extrae on those workers. This would also require notifying Extrae of changes in the number of processes during execution.
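
A rough sketch of the specialized-addprocs variant; the name addprocs_extrae is hypothetical, and how to notify Extrae of the new task count at runtime is left open:

using Distributed

function addprocs_extrae(n; kwargs...)
    pids = addprocs(n; kwargs...)
    # Initialize Extrae on the freshly launched workers.
    Distributed.remotecall_eval(Main, pids, quote
        using Extrae
        Extrae.init()
    end)
    # TODO: Extrae would also need to be told that the number of tasks changed.
    return pids
end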

Setting up context overdubbing in workers

The second technical subtopic related to automatic Extrae deployment is how we execute the runtime in the workers under the Extrae context, so we can overdub the Distributed runtime calls. For now, two paths have been found:

  1. First, as a proof of concept, we can ignore the "runtime" side of the workers and only overdub the workload function itself, by sending an overdubbed version of the function to the workers. This way, on the workers' side we can detect when the workload is executing and correctly differentiate between useful and non-useful work (see the sketch after this list).
  2. Then, a more final implementation would consist in running a whole overdubbed version of the event loop. To do this, we need to launch the workers in a way that makes Julia execute our code when it starts. Right now the only possible way to do this without modifying the Julia image is to implement a custom ClusterManager whose worker launcher meets these requirements.
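
A hedged sketch of path 1, assuming Cassette.jl is the overdubbing mechanism (the issue does not name it) and that Cassette and the context type are also loaded on the workers, e.g. via @everywhere; traced_remotecall is a hypothetical helper:

using Distributed, Cassette

Cassette.@context TraceCtx

# Run `f` on worker `pid` under the tracing context instead of calling it directly,
# so worker-side execution of the workload can be told apart from runtime work.
function traced_remotecall(f, pid, args...)
    return remotecall_fetch(pid, args...) do args...
        Cassette.overdub(TraceCtx(), f, args...)
    end
end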

Finally, I would discuss this second topic in a separate issue.
