Extrae automatically deploys on multi-node jobs if MPI communication is detected, which is not always the case in Julia. Automatically deploying to remote workers on execution and subsequently merging the intermediate trace files would make Extrae a perfect match for Julia.
I think that this issue contains different subtopics, so I will try to put some thoughts on them.
Initialization of Extrae
When using the libseqtrace library, Extrae is automatically initialized at library load. This can be avoided by setting the environment variable EXTRAE_SKIP_AUTO_LIBRARY_INITIALIZE=1.
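For example, the variable can be set from within the Julia session, as long as it happens before the Extrae library is loaded into the process (a minimal sketch):

```julia
# Prevent libseqtrace from auto-initializing at library load.
# Must be set before the Extrae shared library is dlopen'ed.
ENV["EXTRAE_SKIP_AUTO_LIBRARY_INITIALIZE"] = "1"
```

Since Julia's `ENV` writes through to the process environment, the C library will see the variable via `getenv` when it is loaded later.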
To simplify, our first implementation can initialize Extrae manually, once Distributed is already loaded and the number of processes is already set (via addprocs()) and therefore known. This requires setting up the two resource functions (Extrae_set_threadid_function and Extrae_set_numthreads_function), correctly identifying the master process (worker id 1) as process 0 for Extrae. Therefore, these functions have to be implemented and passed to Extrae before initialization:
function distributed_taskid()::Cuint
    id = Distributed.myid() - 1
    return id
end

function distributed_numtasks()::Cuint
    nworkers = Distributed.nworkers()
    return nworkers
end
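To actually hand these callbacks to Extrae before initialization, they can be turned into C function pointers with @cfunction. A minimal self-contained sketch (the `libextrae` handle in the commented registration calls is an assumption; the setter names are the Extrae C API functions mentioned above):

```julia
using Distributed

# Resource callbacks: master (worker id 1) maps to Extrae process 0.
distributed_taskid()::Cuint = Cuint(Distributed.myid() - 1)
distributed_numtasks()::Cuint = Cuint(Distributed.nworkers())

# C-callable pointers that Extrae can invoke from its C side.
const taskid_cb   = @cfunction(distributed_taskid, Cuint, ())
const numtasks_cb = @cfunction(distributed_numtasks, Cuint, ())

# Hypothetical registration against the Extrae library (requires a real
# libextrae handle, so it is left commented here):
# @ccall libextrae.Extrae_set_threadid_function(taskid_cb::Ptr{Cvoid})::Cvoid
# @ccall libextrae.Extrae_set_numthreads_function(numtasks_cb::Ptr{Cvoid})::Cvoid
```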
Also, the second requirement before initialization is to set up a program name for the mpits trace file. This is needed when using the libseqtrace library because each Extrae process is not aware of the existence of the others, so running with the default configuration would produce files with the same name, causing conflicts. This can be solved by setting an environment variable before initialization, more or less like this:
# Set environment variable for different TRACE{Var}.mpits files
name = "JULIATRACE" * string(id)
var = "EXTRAE_PROGRAM_NAME"
@ccall setenv(var::Cstring, name::Cstring)::Cint
Then initialization can be done by just executing Extrae.init() in all workers with @everywhere.
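Putting the pieces together, the per-worker setup could look roughly like this, using only the Distributed stdlib; the Extrae.init() call is left commented since it needs the actual Extrae bindings:

```julia
using Distributed
addprocs(2)  # example: two local workers

# Run on the master and every worker: give each process a unique
# trace name before Extrae is initialized.
@everywhere let id = Distributed.myid() - 1
    ENV["EXTRAE_PROGRAM_NAME"] = "JULIATRACE" * string(id)
    # Extrae.init()  # hypothetical: initialize Extrae on this process
end
```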
If we wanted to automatically initialize Extrae when setting up new workers, we could implement this in the prehook of addprocs, or add a specialized addprocs function, and then initialize Extrae on those workers. This would also require notifying Extrae of changes in the number of processes during execution.
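A sketch of the specialized-addprocs idea (the wrapper name is hypothetical, and the actual Extrae initialization is left as a comment):

```julia
using Distributed

# Hypothetical wrapper: add workers, then run Extrae setup only on the
# newly created pids.
function addprocs_extrae(n; kwargs...)
    pids = addprocs(n; kwargs...)
    @everywhere pids begin
        ENV["EXTRAE_PROGRAM_NAME"] = "JULIATRACE" * string(Distributed.myid() - 1)
        # Extrae.init()  # hypothetical; Extrae would also need to be
        # notified of the new total number of processes here.
    end
    return pids
end
```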
Setting up context overdubbing in workers
The second technical subtopic related to automatic Extrae deployment is how we execute the runtime in the workers under the Extrae context, so we can overdub the Distributed runtime calls. For now, two paths have been found:
First, as a proof of concept, we can ignore the "runtime" side of the workers and only overdub the workload function itself, by sending an overdubbed version of the function to the workers. This way, on the workers' side we can detect when the workload is executing and correctly differentiate between useful and non-useful work.
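A crude version of that proof of concept, without a real overdubbing framework, just wraps the workload so each worker marks the "useful work" region; the Extrae event calls are placeholders:

```julia
using Distributed

# Wrap a workload so the executing worker can emit begin/end markers
# around the useful-work region. The event calls are hypothetical.
function traced(f)
    return function (args...)
        # Extrae.user_event(USEFUL_WORK_BEGIN)  # hypothetical
        result = f(args...)
        # Extrae.user_event(USEFUL_WORK_END)    # hypothetical
        return result
    end
end
```

The master then ships `traced(workload)` to the workers instead of `workload`, e.g. `remotecall_fetch(traced(sum), 2, 1:100)`.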
Then, a more complete implementation would consist of running a whole overdubbed version of the event loop. To do this, we need to launch the workers in a way that makes Julia execute our code when it starts. Right now the only way to do this without modifying the Julia image is to implement a custom ClusterManager with a worker launcher that meets this requirement.
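Short of a full custom ClusterManager, one possible way to run our code at worker startup today is to pass extra flags to the worker executable, e.g. loading a setup file with -L so it runs before the worker starts serving requests (a sketch under that assumption; the setup-file contents are illustrative):

```julia
using Distributed

# Write a setup file that each worker loads into Main at startup.
setup = tempname() * ".jl"
write(setup, """
    # Real code would initialize Extrae or start the overdubbed
    # event loop here (hypothetical placeholder).
    const EXTRAE_SETUP_DONE = true
    """)

# -L loads the file during the worker's startup sequence.
pids = addprocs(1; exeflags = `-L $setup`)
```

Whether this is early enough to overdub the whole event loop is exactly the open question; a custom ClusterManager gives full control over the launch command.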
Finally, I would discuss this second topic in a separate issue.