Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add pidlock to public functions #45

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

willow-ahrens
Copy link

@willow-ahrens willow-ahrens commented Jan 6, 2025

This PR locks the basic functions of pyjuliapkg. I used FileLock instead of pidlock because I needed a reentrant lock. I also used a thread lock. There is a test which I have verified does show that this fix avoids redundant resolves. fixes #19.

@amilsted
Copy link
Contributor

Do you know if this will work on a cluster if the shared filesystem does not support unix file locking? The "pidfile" approach, which stores the hostname and process ID of the acquirer in the lock file, does not depend on filesystem support for locking. I don't think filelock uses this approach, so I'm a bit worried about robustness.

@willow-ahrens
Copy link
Author

Filelock advertises itself as platform-independent, and has windows and unix specific functions. I haven't been able to test this on windows, however.

@willow-ahrens
Copy link
Author

One question I have about this PR is where to write the lock. I have it being written to the source directory of pyjuliapkg, but I suspect it would be better to use some location which is specified by the STATE dict. I just don't know which one makes sense to use here.

@amilsted
Copy link
Contributor

Filelock advertises itself as platform-independent, and has windows and unix specific functions.

It can be platform independent and still not work well across multiple hosts on a shared filesystem.

@willow-ahrens
Copy link
Author

I see what you're asking. The filelock library seems to advise the use of the SoftFileLock type in the event that there are multiple hosts, but this may require sometimes deleting the lock file when someone interrupts the setup. The julia developers seem to have had similar concerns in the design of Pkg, and ultimately I'm not sure of the extent to which this is solved. The solution in this PR should fix bugs on not-shared filesystems, and I'm open to input on how to ensure this also fixes bugs on shared ones. I chose the PID file package off of PyPI with the greatest support that offered a simple reentrant lock, but I'm open to suggestions on others?

@amilsted
Copy link
Contributor

amilsted commented Jan 16, 2025

Yeah, I understand this is a nontrivial choice. I shopped around for a Python package that ticked all the boxes (basically replicating the Julia solution) but didn't find one that worked well out of the... well, box. Any locking is definitely better than none here and I appreciate your taking this on!

Edit: And I don't mean to sound like I'm a maintainer here. I'm not! Just a user quite invested in making it easier to have python packages that use julia (including for use on clusters).

@willow-ahrens
Copy link
Author

willow-ahrens commented Jan 16, 2025

Looking at package managers in python:

@amilsted
Copy link
Contributor

In case it's useful, I used a somewhat modified version of pidlock to externally lock juliapkg operations. This worked okay on a cluster, but it again probably doesn't do everything right. For one thing, it's missing a mechanism to delete stale lock files.

Also the github version has a defect, in that it checks for file existence and then, if the file exists, opens the file without exception handling. The delay between existence check and open allows other processes or hosts to delete the file and cause a crash. This can be fixed through handling exceptions on open instead of checking for existence separately.

@amilsted
Copy link
Contributor

Looks like the conda approach is quite similar to pidlock, but using a directory instead of a file. It also doesn't seem to include the hostname at the moment. Maybe worth stealing?

@willow-ahrens
Copy link
Author

willow-ahrens commented Jan 23, 2025

I'm happy to choose whichever pidlock we would like, but it seems out of scope to try to do better than e.g. virtualenv. @cjdoris, do you have a preference for which file lock you prefer? Can we take on the maintenance burden of maintaining our own lock here, or is this best effort choice of FileLock satisfactory?

@willow-ahrens
Copy link
Author

willow-ahrens commented Jan 23, 2025

I suppose I would also like to see an example of a failing test with the current setup before I roll my own file lock, so that we can verify that a different choice of file lock fixes the test. @amilsted, do you have additional tests to contribute to the PR?

@amilsted
Copy link
Contributor

Very fair. This is tricky, because I think it's hard to break FileLock on a single host and a non-network filesystem. I will think about it. I suppose one could argue that juliapkg just shouldn't be in online mode when running on distributed systems like HPC clusters, so that these cases should not be considered relevant for the locking here. Perhaps one should just always set PYTHON_JULIAPKG_OFFLINE=yes when running a batch job that would load juliacall in parallel on many hosts.

@amilsted
Copy link
Contributor

Counterpoint: Julia core uses https://github.com/vtjnash/Pidfile.jl to do hostname/PID-based locking to protect precompilation and Pkg operations, because this type of locking is robust on HPC clusters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible concurrency issues with resolve on a cluster
2 participants