Automatic Extrae deployment on Distributed.Workers #8

Open
mofeing opened this issue Jan 24, 2023 · 1 comment
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

mofeing (Member) commented Jan 24, 2023

Extrae automatically deploys on multi-node jobs if MPI communication is detected, which is not always the case in Julia. Automatically deploying to remote workers at execution time and subsequently merging the intermediate trace files would make Extrae a perfect match for Julia.

mofeing added the enhancement and help wanted labels on Jan 24, 2023
clasqui (Contributor) commented Jan 25, 2023

I think this issue contains several subtopics, so I will try to put down some thoughts on each of them.

Initialization of Extrae

When using the libseqtrace library, Extrae is automatically initialized at library load. This can be avoided by setting the environment variable EXTRAE_SKIP_AUTO_LIBRARY_INITIALIZE=1.
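
For example, a minimal sketch of deferring initialization (this assumes the underlying library is loaded when the Extrae package is imported, so the variable must be set beforehand):

# Skip the automatic initialization performed at library load,
# so init can be deferred until Distributed workers are set up.
ENV["EXTRAE_SKIP_AUTO_LIBRARY_INITIALIZE"] = "1"
using Extrae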

To simplify, our first implementation can initialize Extrae manually, once Distributed is already loaded and the number of processes is already set (addprocs()) and therefore known. This requires setting up the two resource functions (Extrae_set_threadid_function and Extrae_set_numthreads_function), correctly identifying the master process (worker id 1) as process 0 for Extrae. These functions therefore have to be implemented and passed to Extrae before initialization:

function distributed_taskid()::Cuint
    # Map the master process (worker id 1) to Extrae task 0,
    # and workers with ids 2..N to tasks 1..N-1.
    return Distributed.myid() - 1
end

function distributed_numtasks()::Cuint
    # Total number of Extrae tasks: the master plus all workers.
    return Distributed.nprocs()
end
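
A hedged sketch of how these could be registered before initialization; the library name "libseqtrace" and the exact C signatures are assumptions here, based on Extrae exposing Extrae_set_threadid_function and Extrae_set_numthreads_function as function-pointer setters:

# Create C-callable pointers for the two resource functions.
taskid_cb = @cfunction(distributed_taskid, Cuint, ())
numtasks_cb = @cfunction(distributed_numtasks, Cuint, ())

# Register the callbacks with Extrae before calling Extrae.init().
@ccall "libseqtrace".Extrae_set_threadid_function(taskid_cb::Ptr{Cvoid})::Cvoid
@ccall "libseqtrace".Extrae_set_numthreads_function(numtasks_cb::Ptr{Cvoid})::Cvoid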

Also, the second requirement before initialization is to set a program name for the mpits trace file. This is needed when using the libseqtrace library because each Extrae process is not aware of the existence of the others, so running with the default configuration would end up producing files with the same name and causing conflicts. This can be solved by setting an environment variable before initialization, more or less like this:

# Set an environment variable so each process writes a different .mpits trace file
id = Distributed.myid() - 1
name = "JULIATRACE" * string(id)
var = "EXTRAE_PROGRAM_NAME"
@ccall setenv(var::Cstring, name::Cstring, 1::Cint)::Cint

Then initialization can be done by just executing Extrae.init() in all workers with @everywhere.
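
For instance, a minimal sketch of the manual flow (the worker count is arbitrary here, and it assumes the resource functions and program name from the snippets above are set up on every process before init):

using Distributed
addprocs(2)  # example worker count

@everywhere using Extrae
@everywhere Extrae.init()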

If we wanted to automatically initialize Extrae when setting up new workers, we could implement this in the prehook of addprocs, or add a specialized addprocs function, and then initialize Extrae on those workers. This would also require notifying Extrae of changes in the number of processes during execution.
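
A rough sketch of the specialized-addprocs variant; the name addprocs_extrae is hypothetical, and how to notify Extrae of the new task count at runtime is left open:

using Distributed

function addprocs_extrae(n; kwargs...)
    pids = addprocs(n; kwargs...)
    # Initialize Extrae on the freshly launched workers.
    Distributed.remotecall_eval(Main, pids, quote
        using Extrae
        Extrae.init()
    end)
    # TODO: Extrae would also need to be told that the number of tasks changed.
    return pids
end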

Setting up context overdubbing in workers

The second technical subtopic related to automatic Extrae deployment is how we execute the runtime in the workers under the Extrae context, so we can overdub the Distributed runtime calls. For now, two paths have been found:

  1. First, as a proof of concept, we can ignore the "runtime" side of the workers and only overdub the workload function itself, by sending an overdubbed version of the function to the workers. This way, on the workers' side we can detect when the workload is executing and correctly differentiate between useful and non-useful work (see the sketch after this list).
  2. Then, a more final implementation would consist in running a whole overdubbed version of the event loop. To do this, we need to launch the workers in a way that makes Julia execute our code when it starts. Right now the only possible way to do this without modifying the Julia image is to implement a custom ClusterManager whose worker launcher meets these requirements.
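
A hedged sketch of path 1, assuming Cassette.jl is the overdubbing mechanism (the issue does not name it) and that Cassette and the context type are also loaded on the workers, e.g. via @everywhere; traced_remotecall is a hypothetical helper:

using Distributed, Cassette

Cassette.@context TraceCtx

# Run `f` on worker `pid` under the tracing context instead of calling it directly,
# so worker-side execution of the workload can be told apart from runtime work.
function traced_remotecall(f, pid, args...)
    return remotecall_fetch(pid, args...) do args...
        Cassette.overdub(TraceCtx(), f, args...)
    end
end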

Finally, I would discuss this second topic in a separate issue.
