Added run_batch functionality for the web mode solver, added parallelization using joblib for the local mode solver, and added new functions for unit testing #1654
Conversation
for i in range(num_of_sims):
    results[i] = mode_solver_list[i].solve()

assert any(isinstance(x,ModeSolverData) for x in results)
This should be all instead of any.
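In code, the suggested fix is simply (a sketch of the reviewer's suggestion, not code from the PR):

assert all(isinstance(x, ModeSolverData) for x in results)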
# Run mode solver one at a time
results = msweb.run_batch(mode_solver_list, verbose = False)

assert any(isinstance(x,ModeSolverData) for x in results)
This should be all instead of any. Will fix it when I incorporate all other suggestions.
Looking good! A few new comments, and I'd recommend getting another reviewer (maybe Lucas?). Also, you'll need to use black to format the code for the tests to pass. Check out the link to the wiki that I sent for details. Thanks!
CHANGELOG.md
Outdated
@@ -26,6 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Properties `num_time_steps_adjoint` and `tmesh_adjoint` to `JaxSimulation` to estimate adjoint run time.
- Ability to add `path` to `updated_copy()` method to recursively update sub-components of a tidy3d model. For example `sim2 = sim.updated_copy(size=new_size, path="structures/0/geometry")` creates a recursively updated copy of `sim` where `sim.structures[0].geometry` is updated with `size=new_size`.
- Python 3.12 support. Python 3.8 deprecation. Updated dependencies.
- A batch of ModeSolver objects can be run concurrently using `tidy3d.plugins.mode.web.run_batch()`
Please put `ModeSolver` inside of single backtick quotes.
CHANGELOG.md
Outdated
@@ -36,6 +37,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Changed sign of mode solver PML conductivity to match the `exp(-1j * omega * t)` convention used everywhere else. Previously, loss due to PML was coming out as negative `k_eff` of the mode, while now it comes out as positive `k_eff`, in line with material loss.
- Augmented the PML mode solver profile to the form `s(x) = kappa(x) + 1j * sigma(x) / (omega * EPSILON_0)`. Previously, `kappa` was not used. The parameters are currently set to `kappa_min = 1`, `kappa_max = 3`, `sigma_max = 2`. These are not yet exposed to the user.
- Also scaling `sigma_max` by the average relative impedance in the mode solver PML region.
- Changed `_solve_all_freqs` function of the local mode solver code to run mode solving in parallel.
Since this is a private method (leading `_` in the name), maybe this won't mean much to users. I wonder if we can just explain that "local mode solver will run in parallel over each frequency".
tidy3d/web/api/mode.py
Outdated

# Check type to make sure the user is submitting a list of mode solvers
if not all(isinstance(x, ModeSolver) for x in mode_solvers):
    log.error("[red]Stopped running a batch of mode solver simulations. Please provide a list of mode solvers of type [cyan]`td.plugins.mode.mode_solver.ModeSolver` [red]to run a batch of mode solver simulations.")
We usually just raise ValueError, let's change to that.
The error message could be more descriptive, such as: "`mode_solvers` passed to `mode.web.run_batch()` should be a list containing all `ModeSolver` instances. Can't run the batched mode solver."
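Putting those two suggestions together, the check might look roughly like this (a sketch, assuming the validation sits at the top of run_batch()):

# validate input: every element must be a ModeSolver instance
if not all(isinstance(x, ModeSolver) for x in mode_solvers):
    raise ValueError(
        "`mode_solvers` passed to `mode.web.run_batch()` should be a list "
        "containing all `ModeSolver` instances. Can't run the batched mode solver."
    )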
tidy3d/web/api/mode.py
Outdated

    results_file = results_files[index],
    verbose = False,
    progress_callback_upload = progress_callback_upload,
    progress_callback_download = progress_callback_download
The style guide we follow usually doesn't put spaces around the `=` in kwargs. You can automatically format the code using the black tool. See our notion page in the EM wiki. In fact, the tests won't pass until the code is unchanged under black and ruff. Make sure you have the version prescribed in the pyproject.toml.
https://www.notion.so/flexcompute/Python-Style-Guide-afac5dbbf949463e89c18fe39b7453c9
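For reference, this is roughly how black would format the kwargs in the snippet above (a sketch, not the exact tool output):

    results_file=results_files[index],
    verbose=False,
    progress_callback_upload=progress_callback_upload,
    progress_callback_download=progress_callback_download,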
tidy3d/web/api/mode.py
Outdated

    time.sleep(retry_delay)
    handle_mode_solver(index, retries + 1)
else:
    log.warning(f"The {index}-th mode solver failed after {max_retries} tries.")
What are we going to do about it? Maybe add another sentence that the data returned by this mode solver will be `None`?
Another random thought: is it worth building in a safeguard in case this fails? At least in the local mode solver? Like a try/except, so that if the multi-threaded one fails, we just warn and use the original?
@tylerflex that makes sense, so that we can decouple running the mode solver from potential dependency issues with joblib. I can implement that as well.
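A rough sketch of what that fallback could look like (hypothetical; it assumes the joblib-based parallel call and the `_solve_single_freq` helper discussed later in this thread):

# try the parallel frequency sweep first; on any failure, warn and fall back to the serial path
try:
    results = Parallel(n_jobs=n_jobs)(
        delayed(self._solve_single_freq)(freq=freq, coords=coords, symmetry=symmetry)
        for freq in self.freqs
    )
except Exception:
    log.warning("Parallel mode solving failed; falling back to serial frequency sweep.")
    results = [
        self._solve_single_freq(freq=freq, coords=coords, symmetry=symmetry)
        for freq in self.freqs
    ]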
@tylerflex another thing is that when I run parallel processing for the local mode solver I am using all but one core. The mode solver runs fast, but I can hear the fan going even on my Mac for large frequency sweeps. I suppose this is fine for usability?
tidy3d/plugins/mode/mode_solver.py
Outdated
@@ -10,6 +10,9 @@
import pydantic.v1 as pydantic
import xarray as xr

from joblib import Parallel, delayed
joblib will have to be added as a dependency in pyproject.toml, right? Curious why you chose to use that instead of the native python multiprocessing? As far as I understand, joblib seems to use that under the hood?
One thing I'm noticing is that you're not explicitly setting the openmp, etc. num threads to 1. This was a problem when you were doing multiprocessing? Did using joblib somehow remove the need for that?
@momchil-flex I initially tried using multiprocessing but was getting errors like `AttributeError: Can't pickle local object 'ModeSolver._solve_all_freqs..'`. Also, I was getting errors when the make_simulation function that creates the mode solver loads GDS files. I tried messing around with multiprocessing, including using libraries like dill for serializing, but couldn't make it work. Multithreading for concurrency works fine, but since this task is compute-limited we want multiprocessing. Then I found this library, and it worked right away and was embarrassingly simple, like they say on their website :) By default it uses backend="loky", which starts multiple process workers, with n_jobs defining the number of CPU cores it uses. You can also switch to backend="threading" for multithreading, but this is not what we want for processing multiple frequencies in parallel. More details here: https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html
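For context, the basic joblib pattern is just this (a generic sketch, not the PR code):

from joblib import Parallel, delayed

def square(x):
    return x * x

# each call to square() is dispatched to a separate worker process;
# backend="loky" is the default, and n_jobs sets how many CPU cores are used
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))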
Yeah, I could help you get around the "can't pickle object" problem, which we've hit multiple times by now (and the answer is not to use dill, even though that's what you find on StackOverflow :) you just need to re-arrange things a bit, usually). But I do like this joblib library and we may even switch to that in other places! It seems like it's pretty light and has no dependencies for basic functionality.
yeah, would love to learn how you fixed the "can't pickle object" error. Yeah, joblib just looks simple and very readable!
tidy3d/plugins/mode/mode_solver.py
Outdated

        coords = coords,
        symmetry = symmetry)
    for freq in self.freqs
)
Interestingly, we do something very similar on the backend, which is why we have this `_solve_single_freq` in the first place. We use `multiprocessing` there. I'm even wondering if we shouldn't just be bringing backend code here. But yeah, the explicit thread setting is generally required or else things can get substantially slower. Similarly, one reason I didn't want to put this on the frontend is that this can get stuck or crash if the user runs out of memory (big mode solves in parallel). Do you think this is a problem in your implementation?
It does seem like joblib provides a resource to avoid the over-subscription, but it doesn't seem like you are using it?
https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-resources
@momchil-flex So, what I am doing is looking at the number of cores the local computer has. If the number of cores is greater than 2, I use all but one core for parallel computation. So, if I have 4 cores, the code uses 3 of them.
If the backend code has multiprocessing, I would assume the same implementation also works on the frontend. I have tested this with 100s of frequency simulations but have not tested it with big mode solves in parallel. Do you have some examples I can use to test this? I can also just make a really fine mesh and see how it does?
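In code, the core-count logic described above would look something like this (a sketch of the described behavior, not the exact PR code; the fallback to a single job for two or fewer cores is an assumption):

from multiprocessing import cpu_count

# use all but one core when more than two cores are available, otherwise run serially
num_cores = cpu_count()
n_jobs = num_cores - 1 if num_cores > 2 else 1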
There are two things here. The first one is the memory. I think you should definitely test with some large mode planes. `multiprocessing` does not handle that smoothly, it can definitely freeze if you run out of memory. So on the backend we do some checks to estimate how much memory will be needed to decide how many cores to use.
The other point is about disabling multi-threading within each process to avoid over-allocation, which can completely slow things down. This is what the link to joblib that I posted is about. I remember you experienced this in your initial testing (you needed to set the env variables I sent you to 1), but here you don't seem to be doing that. So I'm wondering what's happening, whether it's joblib doing something (which it seems to be able to, but you're not using that functionality as far as I can tell), or something else.
@momchil-flex okay, that makes sense to set `inner_max_num_threads=1`. Apparently, for the 'loky' backend, "by default each worker process will have environment variables set to allow a maximum of cpu_count() // n_jobs (which is 1 in my case with 12 CPUs and 12-1 n_jobs), so that the total number of threads used by all the workers does not exceed the number of CPUs of the host."
This is probably why it was working fine.
So, now I have something like this written down, and it works the same as before:
with parallel_config(backend="loky", inner_max_num_threads=1):
    results = Parallel(n_jobs=n_jobs)(
        delayed(self._solve_single_freq)(freq = freq, coords = coords, symmetry = symmetry)
        for freq in self.freqs
    )
@momchil-flex okay, tried it with a crazy fine resolution of `min_steps_per_wvl=1000` compared to 15 before (so around 4400 times larger number of grid points). This time it crashed saying the system is out of memory. We need to implement similar checks like you had in the backend then. Could we discuss this tomorrow?
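One possible shape for such a check, as a rough sketch only (psutil and the per-solve memory estimate are hypothetical illustrations here, not part of the PR or the backend code):

import psutil
from multiprocessing import cpu_count

def limit_n_jobs(estimated_bytes_per_solve):
    """Cap the number of parallel workers so the batch fits in available memory (sketch)."""
    available = psutil.virtual_memory().available
    max_by_memory = max(1, available // max(1, estimated_bytes_per_solve))
    return min(max(1, cpu_count() - 1), max_by_memory)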
results = msweb.run_batch(mode_solver_list, verbose = False)

assert any(isinstance(x,ModeSolverData) for x in results)
assert (results[i].n_eff.shape == (num_freqs, i+1) for i in range(num_of_sims))
This should also be all(...)
results[i] = mode_solver_list[i].solve()

assert any(isinstance(x,ModeSolverData) for x in results)
assert (results[i].n_eff.shape == (num_freqs, i+1) for i in range(num_of_sims))
This should also be all(...)
…changes to local mode solver, updated errors in test_mode_solver_web_run_batch test function
@tylerflex implemented all the feedback and ran both black and ruff.
Hi @prashkh, I should have warned you about this. But we need to use a specific version of black and ruff. I can take a more detailed look tomorrow if you don't feel totally comfortable with it, sorry for the inconvenience :(
@tylerflex I tried to fix it by reverting to the commit before this one but not sure it worked. Might need your help tomorrow. Sorry about it. I should have used the current black and ruff versions.
closing this PR as it seems we're going with #1685 |
…ization using joblib for the local mode solver, and added new functions for unit testing.
Since there was some mix-up between the `develop` branch and `pre/2.7` when I rebased, I created a new PR and implemented all the changes you requested. I also added the parallelization for the local mode solver and created two new unit tests, testing both the local mode solver and the run_batch for the web.