issues with using more than one mpiproc per node #84

Open
mtremmel opened this issue Nov 21, 2018 · 6 comments

@mtremmel (Contributor)

When I run a calculation using 50 processes with 1 process per node, things seem to work alright. However, running the same calculation on the same simulation steps/halos with 2 processes per node and 25 nodes (rather than 50), I quickly get a database error. I'm not sure whether this is a problem with this specific filesystem. The database error is below; it happens quickly enough that it seems to occur while the code is reading in the existing database properties. This is all done on nobackupp2 on Pleiades, using the most up-to-date master branch (except for the very latest updates from the past 12 hours or so).

```
sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) database disk image is malformed [SQL: u'SELECT creators.id AS creators_id, creators.command_line AS creators_command_line, creators.dtime AS creators_dtime, creators.host AS creators_host, creators.username AS creators_username, creators.cwd AS creators_cwd \nFROM creators \nWHERE creators.id = ?\n LIMIT ? OFFSET ?'] [parameters: (234, 1, 0)]
```

@apontzen (Member)

Yes, that's almost certainly a filesystem bug (although it goes slightly in the opposite direction from what you'd expect, I have seen weirder). As ever, we should not be using SQLite at this scale. But maybe post the full traceback in case there is some identifiable lock we can put in.
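
For what it's worth, one quick way to confirm that the on-disk file itself has really been corrupted (rather than this being a transient read error) is SQLite's built-in integrity check. The path below is just a placeholder for the actual tangos database file:

```bash
# Ask SQLite to verify the database's internal structure.
# Replace /path/to/tangos.db with the real database file; a healthy file prints "ok".
sqlite3 /path/to/tangos.db "PRAGMA integrity_check;"
```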

@mtremmel (Contributor, Author)

The problem is that all of my processes die at once, so the traceback is all jumbled... I'm not sure how useful it is. I could just attach the full error file?

@apontzen (Member)

I think there's some mpirun / mpiexec variant that will label which line came from which processor?

@mtremmel (Contributor, Author)

Interesting... right now each "normal" (non-traceback or error message) line does show the processor number, but the traceback doesn't...

@apontzen (Member)

Yes, the 'normal' lines are labelled by tangos itself, but it can't do that for the traceback. I think mpirun can do it, though... you just need to find the right incantation...
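
For example, something along these lines should prefix every output line, including traceback lines, with the rank that produced it (these are the Open MPI and MPICH/Hydra spellings; the MPI stack on Pleiades may need a different option, and `<tangos command>` is a placeholder for the actual job command):

```bash
# Open MPI: --tag-output prefixes each stdout/stderr line with the originating rank.
mpirun --tag-output -np 50 <tangos command>

# MPICH (Hydra launcher): -l labels each output line with the rank that produced it.
mpiexec -l -np 50 <tangos command>
```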

Or alternatively, just do some manual detective work ;-)

@mtremmel (Contributor, Author)

I'll look into it
