issues with using more than one mpiproc per node #84
Comments
Yes, that's almost certainly a filesystem bug (although it goes slightly in the opposite direction to what you'd expect, I have seen weirder). As ever, we should not be using SQLite at this scale. But maybe post the full traceback in case there is some identifiable lock we can put in.
The problem is that all of my processes die at once, so the traceback is all jumbled... not sure how useful it is. I could just attach the full error file?
I think there's some mpirun / mpiexec variant that will label which line came from which processor?
Interesting... right now each "normal" (non-traceback or error message) line does show the processor number, but the traceback doesn't...
Yes, the 'normal' lines are labelled by tangos itself, but it can't do that for the traceback. I think mpirun can do it, though... you just need to find the right incantation... Or alternatively, just do some manual detective work ;-)
I'll look into it
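As a fallback to a launcher option, the traceback can also be labelled from inside Python. The following is a minimal sketch, not something tangos provides: `format_tagged_traceback` and `install_rank_excepthook` are hypothetical helper names, and the rank is assumed to be passed in by the caller (for example from mpi4py's `MPI.COMM_WORLD.Get_rank()`). On the launcher side, Open MPI's `mpirun` has a `--tag-output` option; whether the MPI launcher on this system offers an equivalent is not established in this thread.

```python
import sys
import traceback


def format_tagged_traceback(rank, exc_type, exc_value, exc_tb):
    """Return the formatted traceback with every line prefixed by the rank."""
    text = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
    prefix = f"[rank {rank}] "
    return "".join(prefix + line + "\n" for line in text.splitlines())


def install_rank_excepthook(rank, stream=sys.stderr):
    """Replace sys.excepthook so uncaught exceptions are written as one tagged block."""

    def hook(exc_type, exc_value, exc_tb):
        # A single write per traceback reduces, but does not rule out,
        # interleaving with output from other processes.
        stream.write(format_tagged_traceback(rank, exc_type, exc_value, exc_tb))
        stream.flush()

    sys.excepthook = hook
```

Calling `install_rank_excepthook(rank)` once at startup in each process would make the jumbled error file sortable by its `[rank N]` prefix, assuming the exception actually propagates to the top level of the interpreter.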
When I run a calculation using 50 processes with 1 process per node, things seem to work alright. However, running the same calculation on the same simulation steps/halos with 2 processes per node and 25 nodes (rather than 50), I quickly get a database error. I'm not sure whether this is a problem with this specific filesystem or not. The database error that comes up is shown below. It happens quickly enough that it seems to occur simply when the code is attempting to read in the existing database properties. This is all done on nobackupp2 on pleiades, using the most up-to-date master branch (except the very latest updates in the past 12 hours or so).
sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) database disk image is malformed [SQL: u'SELECT creators.id AS creators_id, creators.command_line AS creators_command_line, creators.dtime AS creators_dtime, creators.host AS creators_host, creators.username AS creators_username, creators.cwd AS creators_cwd \nFROM creators \nWHERE creators.id = ?\n LIMIT ? OFFSET ?'] [parameters: (234, 1, 0)]
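One way to tell a genuinely damaged file from a transient bad read is to ask SQLite to check the database from a single process, with no MPI involved. This is a rough sketch using only Python's standard `sqlite3` module; `check_sqlite_integrity` is a hypothetical helper name, and the path to the database file is whatever the run was pointed at.

```python
import sqlite3


def check_sqlite_integrity(path):
    """Run PRAGMA integrity_check and return the list of messages SQLite reports."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    # A healthy database yields the single message "ok".
    return [row[0] for row in rows]
```

If this returns `["ok"]` after a failed run, the file on disk is probably intact and the "malformed" error more likely reflects what individual processes saw through the filesystem at the time. A badly damaged file may instead raise `sqlite3.DatabaseError` from the check itself.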