issues with using more than one mpiproc per node #84

Open
mtremmel opened this issue Nov 21, 2018 · 6 comments

@mtremmel (Contributor)

When I run a calculation using 50 processes with 1 process per node, things seem to work alright. However, running the same calculation on the same simulation steps/halos with 2 processes per node and 25 nodes (rather than 50), I quickly get a database error. I'm not sure whether this is a problem with this specific filesystem. The database error is below; it happens quickly enough that it seems to occur while the code is reading in the existing database properties. This is all done on nobackupp2 on Pleiades, using the most up-to-date master branch (except for the very latest updates from the past 12 hours or so).

```
sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) database disk image is malformed [SQL: u'SELECT creators.id AS creators_id, creators.command_line AS creators_command_line, creators.dtime AS creators_dtime, creators.host AS creators_host, creators.username AS creators_username, creators.cwd AS creators_cwd \nFROM creators \nWHERE creators.id = ?\n LIMIT ? OFFSET ?'] [parameters: (234, 1, 0)]
```

@apontzen (Member)

Yes, that's almost certainly a filesystem bug (although it goes slightly in the opposite direction from what you'd expect, I have seen weirder). As ever, we should not be using SQLite at this scale. But maybe post the full traceback in case there is some identifiable lock we can put in.
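
For what it's worth, one quick way to confirm that the on-disk file itself has really been corrupted (rather than this being a transient read error) is SQLite's built-in integrity check. The path below is just a placeholder for the actual tangos database file:

```bash
# Ask SQLite to verify the database's internal structure.
# Replace /path/to/tangos.db with the real database file; a healthy file prints "ok".
sqlite3 /path/to/tangos.db "PRAGMA integrity_check;"
```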

@mtremmel (Contributor, Author)

The problem is that all of my processes die at once, so the traceback is all jumbled... I'm not sure how useful it is. I could just attach the full error file?

@apontzen (Member)

I think there's some mpirun / mpiexec variant that will label which line came from which processor?

@mtremmel (Contributor, Author)

Interesting... right now each "normal" (non-traceback or error message) line does show the processor number, but the traceback doesn't...

@apontzen (Member)

Yes, the 'normal' lines are labelled by tangos itself, but it can't do that for the traceback. I think mpirun can do it, though... you just need to find the right incantation...
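
For example, something along these lines should prefix every output line, including traceback lines, with the rank that produced it (these are the Open MPI and MPICH/Hydra spellings; the MPI stack on Pleiades may need a different option, and `<tangos command>` is a placeholder for the actual job command):

```bash
# Open MPI: --tag-output prefixes each stdout/stderr line with the originating rank.
mpirun --tag-output -np 50 <tangos command>

# MPICH (Hydra launcher): -l labels each output line with the rank that produced it.
mpiexec -l -np 50 <tangos command>
```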

Or alternatively, just do some manual detective work ;-)

@mtremmel (Contributor, Author)

I'll look into it
