
test_organize_nwb_test_data started to take longer? #191

Closed
yarikoptic opened this issue Aug 12, 2020 · 4 comments
Labels: DX Developer eXperience; no solution identified (Issues which are blocked by no proper solution yet found)

@yarikoptic (Member) commented Aug 12, 2020

I haven't tested/timed it yet, but the tests (perceptually) started to take a longish time, whereas AFAIK they should be speedy as long as the metadata reading is cached etc. My suspicion is that caching of get_metadata etc. is somehow no longer in effect. It happens even if I roll back to e.g. tags/0.5.0~1. In the log I see

2020-08-12 14:58:33,706 [    INFO] Loading metadata from 31 files
2020-08-12 14:58:42,151 [   DEBUG] Calling memoized version of <function get_metadata at 0x7fee9fdd5430> for /home/yoh/.tmp/tmp9onmwll2/v2.0.1/test_ClusterWaveforms.nwb

so it is doing something for 9 seconds, even though it then says that it reuses the cached get_metadata from the previous test in the matrix... so the question is what it is doing and whether it could be avoided.
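
For context, a minimal sketch of the kind of on-disk memoization the "Calling memoized version" log line refers to, using joblib.Memory -- the cache location and the returned fields below are purely illustrative, and dandi's actual caching layer may differ:

import os

from joblib import Memory

# Illustrative cache location; dandi's actual cache directory and metadata
# fields differ -- this only demonstrates the memoization pattern.
memory = Memory(os.path.expanduser("~/.cache/dandi-demo"), verbose=0)


@memory.cache
def get_metadata(path):
    # The expensive work (opening the NWB file and reading its fields)
    # runs only on a cache miss; repeated calls with the same path are
    # answered from disk, which is why cached test runs should be fast.
    return {"path": str(path)}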

yarikoptic added the DX Developer eXperience label on Aug 12, 2020

@jwodder (Member) commented Aug 13, 2020

I can reproduce the several seconds of nothing, and it appears to be due to joblib.Parallel somehow. If I disable joblib.Parallel by setting DEVEL_DANDI=1 in the test environment and adding "--devel-debug" to cmd in test_organize_nwb_test_data, the time gap goes away. However, I can't figure out why joblib is doing this; trying to make its logs show up by setting the root logger's log level to DEBUG did not produce any messages from joblib, even though there are references to logging in its code.
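
A minimal sketch of the sequential fallback being described -- the helper and its plumbing are hypothetical; the point is only that under --devel-debug the files are processed in the current process, so the joblib.Parallel worker start-up cost never occurs:

from joblib import Parallel, delayed


def _get_metadata(path):
    # Stand-in for the real metadata reader.
    return {"path": str(path)}


def load_metadata(paths, devel_debug=False):
    # Hypothetical helper: with --devel-debug everything runs sequentially
    # in-process; otherwise the work goes through joblib.Parallel, which is
    # where the multi-second gap before the first "Calling memoized" line
    # shows up.
    if devel_debug:
        return [_get_metadata(p) for p in paths]
    return list(
        Parallel(n_jobs=-1, verbose=10)(delayed(_get_metadata)(p) for p in paths)
    )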

@yarikoptic (Member, Author) commented:

THANK YOU for digging into it!
TL;DR summary from my end -- I would at least partially blame the heavy nature of pynwb imports: it imports even pandas and ends up with matplotlib imported as well. Filed a motivational PR NeurodataWithoutBorders/pynwb#1282 to start a dialog ;-)

If I defer the import of pynwb into the get_metadata call (5e4eb5a), I get an overall test run of 36 sec, whereas master takes 44 sec (well -- only an 8 sec difference, and there are still some 2 sec wasted between "Loading" and "Calling").
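
(Roughly, the deferral looks like the sketch below -- the NWB fields read here are illustrative rather than the actual get_metadata implementation; the point is that a worker process which only hits the cache never pays the pynwb/pandas/matplotlib import cost.)

def get_metadata(path):
    # Deferred import: pynwb (which pulls in pandas and, transitively,
    # matplotlib) is loaded only when metadata actually has to be read,
    # instead of at dandi import time in every spawned subprocess.
    import pynwb

    with pynwb.NWBHDF5IO(str(path), "r", load_namespaces=True) as io:
        nwbfile = io.read()
        return {"session_id": nwbfile.session_id}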

Some additional blurb

oh -- maybe it is busy serializing/pickling things to pass into the subprocesses... and maybe we got some heavy object bound somewhere... I tried to py-spy it but it blew up...
as for logging, it is quite inconsistent across joblib: the package itself does not use a dedicated logger, while some of its vendored submodules do

$> git grep logging
conftest.py:import logging
conftest.py:    """Setup multiprocessing logging for the tests"""
conftest.py:        log = mp.util.log_to_stderr(logging.DEBUG)
conftest.py:        log.handlers[0].setFormatter(logging.Formatter(
joblib/externals/cloudpickle/cloudpickle_fast.py:import logging
joblib/externals/cloudpickle/cloudpickle_fast.py:    return logging.getLogger, (obj.name,)
joblib/externals/cloudpickle/cloudpickle_fast.py:    return logging.getLogger, ()
joblib/externals/cloudpickle/cloudpickle_fast.py:    _dispatch_table[logging.Logger] = _logger_reduce
joblib/externals/cloudpickle/cloudpickle_fast.py:    _dispatch_table[logging.RootLogger] = _root_logger_reduce
joblib/externals/loky/_base.py:import logging
joblib/externals/loky/_base.py:    LOGGER = logging.getLogger("concurrent.futures")
joblib/externals/loky/backend/fork_exec.py:    # Make sure to keep stdout and stderr open for logging purpose
joblib/externals/loky/backend/resource_tracker.py:#  * add some VERBOSE logging
joblib/externals/loky/backend/spawn.py:#  * Improve logging data
joblib/externals/loky/backend/spawn.py:        import logging
joblib/externals/loky/backend/spawn.py:            logging.Formatter(data['log_fmt'])
joblib/func_inspect.py:    # XXX: Not using logging framework
joblib/logger.py:Helpers for logging.
joblib/logger.py:import logging
joblib/logger.py:    """ Base class for logging messages.
joblib/logger.py:        logging.warning("[%s]: %s" % (self, msg))
joblib/logger.py:        logging.debug("[%s]: %s" % (self, msg))
joblib/memory.py:# TODO: Same remark for the logger, and probably use the Python logging
joblib/test/test_memory.py:    # capture stdlib logging output (see
joblib/test/test_memory.py:    # logging destination) to make sure there is no exception while
joblib/test/test_memory.py:    # logging destination) to make sure there is no exception  

In DataLad I usually end up just reassigning handlers from datalad if I want some 3rd party module to log... I tried that here but it led to the opposite effect. Setting the root logger to DEBUG, though, worked. This is the patch:

$> git diff
diff --git a/dandi/__init__.py b/dandi/__init__.py
index a6caf46..d956171 100644
--- a/dandi/__init__.py
+++ b/dandi/__init__.py
@@ -36,3 +36,7 @@ lgr = get_logger()
 set_logger_level(lgr, os.environ.get("DANDI_LOG_LEVEL", logging.INFO))
 FORMAT = "%(asctime)-15s [%(levelname)8s] %(message)s"
 logging.basicConfig(format=FORMAT)
+
+logging.root.setLevel(logging.DEBUG)
+# somehow complitely disables our logging!
+#logging.root.handlers = lgr.handlers

and what I saw was:

$ DANDI_LOG_LEVEL=DEBUG python -m pytest -s -v -k test_organize_nwb_test_data
...
2020-08-13 18:26:05,295 [    INFO] Loading metadata from 31 files
2020-08-13 18:26:08,096 [   DEBUG] CONFIGDIR=/home/yoh/.config/matplotlib
2020-08-13 18:26:08,102 [   DEBUG] (private) matplotlib data path: /usr/share/matplotlib/mpl-data
2020-08-13 18:26:08,103 [   DEBUG] loaded rc file /etc/matplotlibrc
2020-08-13 18:26:08,106 [   DEBUG] matplotlib version 3.2.1
2020-08-13 18:26:08,106 [   DEBUG] interactive is False
2020-08-13 18:26:08,107 [   DEBUG] platform is linux
2020-08-13 18:26:08,107 [   DEBUG] loaded modules: ['sys', 'builtins', '_frozen_importlib',....
... [more or less the above repeated 12 times, I guess one per core/parallel process]
2020-08-13 18:26:08,763 [   DEBUG] CACHEDIR=/home/yoh/.cache/matplotlib
2020-08-13 18:26:08,766 [   DEBUG] matplotlib data path: /usr/share/matplotlib/mpl-data
2020-08-13 18:26:08,773 [   DEBUG] CACHEDIR=/home/yoh/.cache/matplotlib
2020-08-13 18:26:08,773 [   DEBUG] matplotlib data path: /usr/share/matplotlib/mpl-data
2020-08-13 18:26:08,782 [   DEBUG] Using fontManager instance from /home/yoh/.cache/matplotlib/fontlist-v310.json
2020-08-13 18:26:08,799 [   DEBUG] Using fontManager instance from /home/yoh/.cache/matplotlib/fontlist-v310.json
2020-08-13 18:26:08,812 [   DEBUG] CACHEDIR=/home/yoh/.cache/matplotlib
2020-08-13 18:26:08,816 [   DEBUG] matplotlib data path: /usr/share/matplotlib/mpl-data
2020-08-13 18:26:08,824 [   DEBUG] Using fontManager instance from /home/yoh/.cache/matplotlib/fontlist-v310.json
2020-08-13 18:26:10,597 [   DEBUG] Calling memoized version of <function get_metadata at 0x7fb0092c3c10> for /home/yoh/.tmp/tmpcynfjpr3/v2.0.1/test_CurrentClampSeries.nwb

Here it was just 5, not 9, seconds (maybe 0.6.0 got faster? ;)), but still a while -- so it took 3.5 seconds to start the subprocesses?! As for those VERY long lists of modules in each subprocess, I hope they are not actually reimported and are just reported as already imported at that point... then there is an additional 1.5 sec at the end, before calling the memoized function, which is not accounted for...

But overall I now question my initial assessment that I had not experienced this slow operation before... maybe later I will try to go back to the commits where I originally introduced parallelization there and see if the effect was already present... With a quick downgrade of pynwb alone I do get a leaner "imports profile", but the overall "quality" is comparable.

@yarikoptic (Member, Author) commented Aug 14, 2020

I would need to compare on some real use case (dataset) where we actually need to load metadata (so there is no cache) to see how it compares to the multiprocessing approach, but as for the tests:

diff --git a/dandi/cli/cmd_organize.py b/dandi/cli/cmd_organize.py
index 2424bcd..8c0d2e4 100644
--- a/dandi/cli/cmd_organize.py
+++ b/dandi/cli/cmd_organize.py
@@ -182,7 +182,7 @@ def organize(
             # to no benefit from Parallel without using multiproc!  But that would
             # complicate progress bar indication... TODO
             metadata = list(
-                Parallel(n_jobs=-1, verbose=10)(
+                Parallel(n_jobs=-1, verbose=10, backend="threading")(
                     delayed(_get_metadata)(path) for path in paths
                 )
             )

makes those tests run in 10-12 seconds instead of 47... We could expose the type of parallelization as an option, but I think that would be overkill.

edit: the comment in the code says that the work is CPU intensive, so for the cases where there is no cached metadata the parallelization would better be done via parallel processes. We could make it "smarter" by exposing an interface to query the cache for which entries actually need to be computed, and then invoke Parallel only on those (see the sketch below). But that would need some kind of a helper shim on top of Parallel (e.g. CachedParallel) to avoid repeating identical code for such cases everywhere.
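
A rough sketch of what such a CachedParallel-style shim could look like, assuming a joblib.Memory-backed cache; the function name and cache location are hypothetical:

import os

from joblib import Memory, Parallel, delayed

memory = Memory(os.path.expanduser("~/.cache/dandi-demo"), verbose=0)


def cached_parallel_map(func, items, n_jobs=-1):
    # Hypothetical CachedParallel-style helper: serve cache hits from disk
    # in the current process and spin up worker processes only for the
    # items that actually need to be computed.
    cached = memory.cache(func)
    results = [None] * len(items)
    misses = []
    for i, item in enumerate(items):
        if cached.check_call_in_cache(item):
            results[i] = cached(item)  # answered from the on-disk cache
        else:
            misses.append(i)
    if misses:
        computed = Parallel(n_jobs=n_jobs, verbose=10)(
            delayed(cached)(items[i]) for i in misses
        )
        for i, value in zip(misses, computed):
            results[i] = value
    return results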

@yarikoptic (Member, Author) commented:

Didn't bother anyone in a while; might reopen if it becomes pertinent again.
