Support for building GPU enabled wheel with CoreNEURON #1452
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1452 +/- ##
==========================================
+ Coverage 41.11% 41.20% +0.08%
==========================================
Files 550 550
Lines 110261 110262 +1
==========================================
+ Hits 45336 45431 +95
+ Misses 64925 64831 -94
Force-pushed from 049e278 to 39de15f.
Removing RPM: -6GB
* remove gpu_wheel directory, move Dockerfile to the top, remove install scripts
* remove old --bare option
* add coreneuron or coreneuron-gpu as a CLI option for the build_wheel.sh script (see the example below)
* update README.md with instructions to build the wheel
* add a new option in setup.py to enable coreneuron+gpu
* clean up setup.py after rebase
* update coreneuron submodule to master
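For reference, a wheel build with the new option might look like this; the script path and the platform/python arguments are assumptions based on the existing packaging layout, only the coreneuron / coreneuron-gpu keywords come from this PR:

```bash
# illustrative only: build a CPU CoreNEURON wheel, then a GPU-enabled one
bash packaging/python/build_wheel.sh linux 3.8 coreneuron
bash packaging/python/build_wheel.sh linux 3.8 coreneuron-gpu
```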
* do not propagate the NEURON MPI flag if the CoreNEURON MPI option is explicitly specified
* nrnivmodl uses the CORENRNHOME env variable instead of the CNRNHOME variable
* binwrapper.py sets CORENRNHOME to the same value as NRNHOME (see the sketch below)
* update coreneuron to branch BlueBrain/CoreNeuron#634
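A minimal sketch of how the environment is expected to line up, assuming a wheel installed under `site-packages/neuron/.data` (the layout visible in the logs below); the actual variable handling lives in binwrapper.py:

```bash
# illustrative paths only: point both NEURON and CoreNEURON at the wheel's data directory
export NRNHOME=/path/to/venv/lib/python3.8/site-packages/neuron/.data
export CORENRNHOME=$NRNHOME      # binwrapper.py sets this to the same value as NRNHOME
nrnivmodl -coreneuron mod        # nrnivmodl now reads CORENRNHOME to locate CoreNEURON
```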
Update coreneuron to BlueBrain/CoreNeuron#634
* corenrn_embedded_run now accepts the CoreNEURON MPI library to load
* NEURON decides the name of the MPI library to load based on the auto-detection it already has
* NEURON passes the necessary path for dlopen
* the rest of the workflow remains the same (see the example invocation below)
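A sketch of the resulting invocation, with illustrative paths; the real command used on BB5 appears further down in this conversation:

```bash
# CoreNEURON dlopens the MPI shim that matches the detected MPI flavour
nrniv-core -d CoreNeuron/tests/integration/ring --mpi \
    --mpi-lib /path/to/site-packages/neuron/.data/lib/libcorenrnmpi_mpich.so
```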
…ns have bugs)
Set PATH to prefer CUDA 11.0
Force-pushed from 467542f to 9faa982.
Some extra testing is still required for the dynamic MPI aspects, but overall this is ready for review.
Edit: the wheel is built with nvhpc/21.2, and trying it on BB5 produces the following (with or without --gpu, but when --mpi is used):
# allocate BB5 GPU node
# create virtualenv with python3.8
export OMP_NUM_THREADS=1
module load unstable cuda hpe-mpi nvhpc python
pip install /gpfs/bbp.cscs.ch/home/kumbhar/tmp/wheelhouse/NEURON_nightly-8.0a693-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
nrniv-core -e 1 -d CoreNeuron/tests/integration/ring # WORKS
nrniv-core -e 1 -d CoreNeuron/tests/integration/ring --gpu # WORKS
# BUT --mpi
# we use the hmpt library because it's mpich-compatible
(v38) kumbhar@ldir01u09:~/tmp$ LD_PRELOAD=/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/hpe-mpi-2.22.hmpt-r52ypu/lib/libmpi.so nrniv-core -e 1 -d CoreNeuron/tests/integration/ring --mpi-lib /gpfs/bbp.cscs.ch/home/kumbhar/tmp/v38/lib/python3.8/site-packages/neuron/.data/lib/libcorenrnmpi_mpich.so --mpi
INFO : Using neuron-nightly Package (Developer Version)
num_mpi=1
num_omp_thread=1
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 2ffcc0b (2021-10-14 00:09:37 +0200)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Memory (MBs) : After mk_mech : Max 232.0508, Min 232.0508, Avg 232.0508
Memory (MBs) : After MPI_Init : Max 232.0508, Min 232.0508, Avg 232.0508
Memory (MBs) : Before nrn_setup : Max 232.2031, Min 232.2031, Avg 232.2031
Setup Done : 0.00 seconds
Model size : 84.36 kB
Memory (MBs) : After nrn_setup : Max 232.2266, Min 232.2266, Avg 232.2266
GENERAL PARAMETERS
--mpi=true
--mpi-lib=/gpfs/bbp.cscs.ch/home/kumbhar/tmp/v38/lib/python3.8/site-packages/neuron/.data/lib/libcorenrnmpi_mpich.so
--gpu=false
--dt=0.025
--tstop=1
GPU
--nwarp=0
--cell-permute=0
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=CoreNeuron/tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
Start time (t) = 0
Memory (MBs) : After mk_spikevec_buffer : Max 232.2266, Min 232.2266, Avg 232.2266
Memory (MBs) : After nrn_finitialize : Max 232.2266, Min 232.2266, Avg 232.2266
psolve |=========================================================| t: 1.00 ETA: 0h00m00s
Solver Time : 0.00101947
Simulation Statistics
Number of cells: 20
Number of compartments: 804
Number of presyns: 21
Number of input presyns: 0
Number of synapses: 21
Number of point processes: 41
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 0
Number of spikes with non negative gid-s: 0
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
Process ID: 26651, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/v38/lib/python3.8/site-packages/neuron/.data/bin/nrniv-core
MPT Version: HPE HMPT 2.22 03/31/20 16:17:35
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/26651/exe, process 26651
MPT: [New LWP 26676]
MPT: [New LWP 26675]
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: 0x00007fffeb0ab1d9 in waitpid () from /lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: debuginfo-install bbp-nvidia-driver-470.57.02-2.x86_64 glibc-2.17-324.el7_9.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64
MPT: (gdb) #0 0x00007fffeb0ab1d9 in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00007fffed5ea3e6 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffff9710 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 26651, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/v38/lib/python3.8/site-packages/neuron/.data/bin/nrniv-"...) at sig.c:340
MPT: #3 0x00007fffed5ea5d8 in first_arriver_handler (signo=signo@entry=11,
MPT: stack_trace_sem=stack_trace_sem@entry=0x7fffe7ba0080) at sig.c:489
MPT: #4 0x00007fffed5ea8b3 in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:565
MPT: #5 <signal handler called>
MPT: #6 0x00007fffec89ecd2 in ?? ()
MPT: from /gpfs/bbp.cscs.ch/home/kumbhar/tmp/v38/lib/python3.8/site-packages/neuron/.data/bin/../../../NEURON_nightly.libs/libcudart-3f3c6934.so.11.0.221
MPT: #7 0x00007fffec8a2614 in ?? ()
MPT: from /gpfs/bbp.cscs.ch/home/kumbhar/tmp/v38/lib/python3.8/site-packages/neuron/.data/bin/../../../NEURON_nightly.libs/libcudart-3f3c6934.so.11.0.221
MPT: #8 0x00007fffec8921bc in ?? ()
MPT: from /gpfs/bbp.cscs.ch/home/kumbhar/tmp/v38/lib/python3.8/site-packages/neuron/.data/bin/../../../NEURON_nightly.libs/libcudart-3f3c6934.so.11.0.221
MPT: #9 0x00007fffec893cdb in ?? ()
MPT: from /gpfs/bbp.cscs.ch/home/kumbhar/tmp/v38/lib/python3.8/site-packages/neuron/.data/bin/../../../NEURON_nightly.libs/libcudart-3f3c6934.so.11.0.221
MPT: #10 0x00007fffecb19ccb in __pgi_uacc_cuda_unregister_fat_binary (
MPT: pgi_cuda_loc=0x7fffed9b4a40 <__PGI_CUDA_LOC>) at ../../src/cuda_init.c:649
MPT: #11 0x00007fffecb19c6a in __pgi_uacc_cuda_unregister_fat_binaries ()
MPT: at ../../src/cuda_init.c:635
MPT: #12 0x00007fffea12ece9 in __run_exit_handlers () from /lib64/libc.so.6
MPT: #13 0x00007fffea12ed37 in exit () from /lib64/libc.so.6
MPT: #14 0x00007fffea11755c in __libc_start_main () from /lib64/libc.so.6
MPT: #15 0x00000000004122d7 in _start ()
MPT: (gdb) A debugging session is active.
MPT:
MPT: Inferior 1 [process 26651] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/26651/exe, process 26651
MPT: [Inferior 1 (process 26651) detached]
I think we have seen this problem before when there is an issue/incompatibility between the CUDA driver and the CUDA version. Next I am going to try using CUDA v11.0 for building the wheel. I think that's the problem (?).
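A quick way to check for that kind of mismatch on the node (standard CUDA tooling, nothing specific to this wheel):

```bash
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # driver version on the node
nvcc --version                                                # CUDA toolkit used for the build
```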
Importantly, use MPI_SGI_vtune_is_running instead of MPI_SGI_init to identify the HMPT library, because we have to distinguish between the MPT and HMPT versions.
coreneuron submodule update
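A rough way to check which flavour a given libmpi.so is, assuming GNU binutils is available; the symbol name is the one mentioned above and the library path is illustrative:

```bash
# HMPT builds export MPI_SGI_vtune_is_running; plain MPT builds should not
nm -D /path/to/hpe-mpi/lib/libmpi.so | grep -q MPI_SGI_vtune_is_running \
    && echo "HMPT" || echo "not HMPT"
```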
Force-pushed from 9faa982 to c174509.
- correctly auth & upload
Update binwrapper to be aware of new package names for GPU
See BlueBrain/mod2c#72
Enable upload of GPU nightly wheel
If we cancel the stuck build then we see:
+ nrniv -python -c 'import neuron; neuron.test(); quit()'
Warning: no DISPLAY environment variable. --No graphics will be displayed.
NEURON -- VERSION 8.0a-712-g5db9e33e3 HEAD (5db9e33e3) 2021-10-20
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2021
See http://neuron.yale.edu/neuron/credits
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'neuron'
And after cancelling, the job continues until it gets killed. It seems like the system python somehow gets selected, but I am not sure how this is possible because we are in a virtualenv and everything seems to be the same.
Try passing -pyexe explicitly
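A hedged sketch of that idea; `$(which python)` is only an illustration of pointing nrniv at the virtualenv's interpreter explicitly:

```bash
nrniv -pyexe "$(which python)" -python -c 'import neuron; neuron.test(); quit()'
```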
Force-pushed from cfe0d8d to 2c2c609.
Brew python: avoid 3.9 as it seems to have issues
Update coreneuron
This reverts commit bd4c775.
Run brew test on all macOS platforms
Disable upload of wheels on PR
🚀
`neuron-nightly*` package
As part of #1452, we changed the exe wrapper for GPU but introduced a bug where the `neuron` package name was not checked.
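Presumably the fix is along these lines: probe all wheel flavours instead of assuming the package is always called `neuron`. A shell sketch of the idea; the GPU package names are assumptions:

```bash
# illustrative only: find which NEURON wheel flavour is installed
for pkg in neuron neuron-nightly neuron-gpu neuron-gpu-nightly; do
    pip show "$pkg" > /dev/null 2>&1 && { echo "found: $pkg"; break; }
done
```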
A fully functional, production-quality wheel still requires a bit more effort, but this is getting close. This PR summarises the current status, thanks to @ferdonline!
Use the neuronsimulator/neuron_wheel image but install additional NVIDIA HPC toolkit RPMs for building the GPU wheel.
Remove the --bare option from the wheel build script.
Todo's before merge:
* nrniv-core wrapper
cc: @ferdonline @alexsavulescu @iomaganaris