
test_organize_nwb_test_data started to take longer? #191

Closed
yarikoptic opened this issue Aug 12, 2020 · 4 comments
Labels: DX Developer eXperience; no solution identified (Issues which are blocked by no proper solution yet found)

@yarikoptic (Member) commented Aug 12, 2020

I haven't tested/timed it yet, but the tests (perceptually) started to take a longish time, whereas AFAIK they should be speedy as long as the metadata reading is cached etc. My suspicion is that caching of get_metadata etc. is somehow no longer in effect. It happens even if I roll back to e.g. tags/0.5.0~1. In the log I see

2020-08-12 14:58:33,706 [    INFO] Loading metadata from 31 files
2020-08-12 14:58:42,151 [   DEBUG] Calling memoized version of <function get_metadata at 0x7fee9fdd5430> for /home/yoh/.tmp/tmp9onmwll2/v2.0.1/test_ClusterWaveforms.nwb

so it is doing something for 9 seconds, even though it then says that it reuses the cached get_metadata from the previous test in the matrix... so the question is what it is doing and whether it could be avoided.
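
For context, a minimal sketch of the kind of on-disk memoization the "Calling memoized version" log line refers to, using joblib.Memory -- the cache location and the returned fields below are purely illustrative, and dandi's actual caching layer may differ:

import os

from joblib import Memory

# Illustrative cache location; dandi's actual cache directory and metadata
# fields differ -- this only demonstrates the memoization pattern.
memory = Memory(os.path.expanduser("~/.cache/dandi-demo"), verbose=0)


@memory.cache
def get_metadata(path):
    # The expensive work (opening the NWB file and reading its fields)
    # runs only on a cache miss; repeated calls with the same path are
    # answered from disk, which is why cached test runs should be fast.
    return {"path": str(path)}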

yarikoptic added the DX Developer eXperience label on Aug 12, 2020

@jwodder (Member) commented Aug 13, 2020

I can reproduce the several seconds of nothing, and it appears to be due to joblib.Parallel somehow. If I disable joblib.Parallel by setting DEVEL_DANDI=1 in the test environment and adding "--devel-debug" to cmd in test_organize_nwb_test_data, the time gap goes away. However, I can't figure out why joblib is doing this; trying to make its logs show up by setting the root logger's log level to DEBUG did not produce any messages from joblib, even though there are references to logging in its code.
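
A minimal sketch of the sequential fallback being described -- the helper and its plumbing are hypothetical; the point is only that under --devel-debug the files are processed in the current process, so the joblib.Parallel worker start-up cost never occurs:

from joblib import Parallel, delayed


def _get_metadata(path):
    # Stand-in for the real metadata reader.
    return {"path": str(path)}


def load_metadata(paths, devel_debug=False):
    # Hypothetical helper: with --devel-debug everything runs sequentially
    # in-process; otherwise the work goes through joblib.Parallel, which is
    # where the multi-second gap before the first "Calling memoized" line
    # shows up.
    if devel_debug:
        return [_get_metadata(p) for p in paths]
    return list(
        Parallel(n_jobs=-1, verbose=10)(delayed(_get_metadata)(p) for p in paths)
    )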

@yarikoptic (Member, Author) commented:

THANK YOU for digging into it!
TL;DR summary from my end -- I would at least partially blame the heavy nature of pynwb imports: it imports even pandas and ends up with matplotlib imported as well. Filed a motivational PR NeurodataWithoutBorders/pynwb#1282 to start a dialog ;-)

If I defer the import of pynwb into the get_metadata call (5e4eb5a), I get an overall test run of 36 sec, whereas master takes 44 sec (well -- only an 8 sec difference, and there are still some 2 sec wasted between "Loading" and "Calling").
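
(Roughly, the deferral looks like the sketch below -- the NWB fields read here are illustrative rather than the actual get_metadata implementation; the point is that a worker process which only hits the cache never pays the pynwb/pandas/matplotlib import cost.)

def get_metadata(path):
    # Deferred import: pynwb (which pulls in pandas and, transitively,
    # matplotlib) is loaded only when metadata actually has to be read,
    # instead of at dandi import time in every spawned subprocess.
    import pynwb

    with pynwb.NWBHDF5IO(str(path), "r", load_namespaces=True) as io:
        nwbfile = io.read()
        return {"session_id": nwbfile.session_id}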

Some additional blurb

oh -- maybe it is busy serializing/pickling things to pass into the subprocesses... and maybe we got some heavy object bound somewhere... I tried to py-spy it but it blew up...
as for logging, it is quite inconsistent across joblib: the package itself does not use a dedicated logger, while some of its vendored submodules do

$> git grep logging
conftest.py:import logging
conftest.py:    """Setup multiprocessing logging for the tests"""
conftest.py:        log = mp.util.log_to_stderr(logging.DEBUG)
conftest.py:        log.handlers[0].setFormatter(logging.Formatter(
joblib/externals/cloudpickle/cloudpickle_fast.py:import logging
joblib/externals/cloudpickle/cloudpickle_fast.py:    return logging.getLogger, (obj.name,)
joblib/externals/cloudpickle/cloudpickle_fast.py:    return logging.getLogger, ()
joblib/externals/cloudpickle/cloudpickle_fast.py:    _dispatch_table[logging.Logger] = _logger_reduce
joblib/externals/cloudpickle/cloudpickle_fast.py:    _dispatch_table[logging.RootLogger] = _root_logger_reduce
joblib/externals/loky/_base.py:import logging
joblib/externals/loky/_base.py:    LOGGER = logging.getLogger("concurrent.futures")
joblib/externals/loky/backend/fork_exec.py:    # Make sure to keep stdout and stderr open for logging purpose
joblib/externals/loky/backend/resource_tracker.py:#  * add some VERBOSE logging
joblib/externals/loky/backend/spawn.py:#  * Improve logging data
joblib/externals/loky/backend/spawn.py:        import logging
joblib/externals/loky/backend/spawn.py:            logging.Formatter(data['log_fmt'])
joblib/func_inspect.py:    # XXX: Not using logging framework
joblib/logger.py:Helpers for logging.
joblib/logger.py:import logging
joblib/logger.py:    """ Base class for logging messages.
joblib/logger.py:        logging.warning("[%s]: %s" % (self, msg))
joblib/logger.py:        logging.debug("[%s]: %s" % (self, msg))
joblib/memory.py:# TODO: Same remark for the logger, and probably use the Python logging
joblib/test/test_memory.py:    # capture stdlib logging output (see
joblib/test/test_memory.py:    # logging destination) to make sure there is no exception while
joblib/test/test_memory.py:    # logging destination) to make sure there is no exception  

In DataLad I usually end up just reassigning handlers from datalad if I want some 3rd party module to log... I tried that here but it led to the opposite effect. Setting the root logger to DEBUG, though, worked. This is the patch:

$> git diff
diff --git a/dandi/__init__.py b/dandi/__init__.py
index a6caf46..d956171 100644
--- a/dandi/__init__.py
+++ b/dandi/__init__.py
@@ -36,3 +36,7 @@ lgr = get_logger()
 set_logger_level(lgr, os.environ.get("DANDI_LOG_LEVEL", logging.INFO))
 FORMAT = "%(asctime)-15s [%(levelname)8s] %(message)s"
 logging.basicConfig(format=FORMAT)
+
+logging.root.setLevel(logging.DEBUG)
+# somehow complitely disables our logging!
+#logging.root.handlers = lgr.handlers

and what I saw was:

$ DANDI_LOG_LEVEL=DEBUG python -m pytest -s -v -k test_organize_nwb_test_data
...
2020-08-13 18:26:05,295 [    INFO] Loading metadata from 31 files
2020-08-13 18:26:08,096 [   DEBUG] CONFIGDIR=/home/yoh/.config/matplotlib
2020-08-13 18:26:08,102 [   DEBUG] (private) matplotlib data path: /usr/share/matplotlib/mpl-data
2020-08-13 18:26:08,103 [   DEBUG] loaded rc file /etc/matplotlibrc
2020-08-13 18:26:08,106 [   DEBUG] matplotlib version 3.2.1
2020-08-13 18:26:08,106 [   DEBUG] interactive is False
2020-08-13 18:26:08,107 [   DEBUG] platform is linux
2020-08-13 18:26:08,107 [   DEBUG] loaded modules: ['sys', 'builtins', '_frozen_importlib',....
... [more or less the above repeated 12 times, I guess one per core/parallel process]
2020-08-13 18:26:08,763 [   DEBUG] CACHEDIR=/home/yoh/.cache/matplotlib
2020-08-13 18:26:08,766 [   DEBUG] matplotlib data path: /usr/share/matplotlib/mpl-data
2020-08-13 18:26:08,773 [   DEBUG] CACHEDIR=/home/yoh/.cache/matplotlib
2020-08-13 18:26:08,773 [   DEBUG] matplotlib data path: /usr/share/matplotlib/mpl-data
2020-08-13 18:26:08,782 [   DEBUG] Using fontManager instance from /home/yoh/.cache/matplotlib/fontlist-v310.json
2020-08-13 18:26:08,799 [   DEBUG] Using fontManager instance from /home/yoh/.cache/matplotlib/fontlist-v310.json
2020-08-13 18:26:08,812 [   DEBUG] CACHEDIR=/home/yoh/.cache/matplotlib
2020-08-13 18:26:08,816 [   DEBUG] matplotlib data path: /usr/share/matplotlib/mpl-data
2020-08-13 18:26:08,824 [   DEBUG] Using fontManager instance from /home/yoh/.cache/matplotlib/fontlist-v310.json
2020-08-13 18:26:10,597 [   DEBUG] Calling memoized version of <function get_metadata at 0x7fb0092c3c10> for /home/yoh/.tmp/tmpcynfjpr3/v2.0.1/test_CurrentClampSeries.nwb

Here it was just 5, not 9, seconds (maybe 0.6.0 got faster? ;)), but still a while -- so it took 3.5 seconds to start the subprocesses?! As for those VERY long lists of modules in each subprocess, I hope they are not actually reimported and are just reported as already imported at that point... then there is an additional 1.5 sec at the end, before calling the memoized function, which is not accounted for...

But overall I now question my initial assessment that I had not experienced this slow operation before... maybe later I will try to go back to the commits where I originally introduced parallelization there and see if the effect was already present... With a quick downgrade of pynwb alone I do get a leaner "imports profile", but the overall "quality" is comparable.

@yarikoptic (Member, Author) commented Aug 14, 2020

I would need to compare on some real use case (dataset) where we actually need to load metadata (so there is no cache) to see how it compares to the multiprocessing approach, but as for the tests:

diff --git a/dandi/cli/cmd_organize.py b/dandi/cli/cmd_organize.py
index 2424bcd..8c0d2e4 100644
--- a/dandi/cli/cmd_organize.py
+++ b/dandi/cli/cmd_organize.py
@@ -182,7 +182,7 @@ def organize(
             # to no benefit from Parallel without using multiproc!  But that would
             # complicate progress bar indication... TODO
             metadata = list(
-                Parallel(n_jobs=-1, verbose=10)(
+                Parallel(n_jobs=-1, verbose=10, backend="threading")(
                     delayed(_get_metadata)(path) for path in paths
                 )
             )

makes those tests run in 10-12 seconds instead of 47... We could expose the type of parallelization as an option, but I think that would be overkill.

edit: the comment in the code says that the work is CPU intensive, so for the cases where there is no cached metadata the parallelization would better be done via parallel processes. We could make it "smarter" by exposing an interface to query the cache for which entries actually need to be computed, and then invoke Parallel only on those (see the sketch below). But that would need some kind of a helper shim on top of Parallel (e.g. CachedParallel) to avoid repeating identical code for such cases everywhere.
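
A rough sketch of what such a CachedParallel-style shim could look like, assuming a joblib.Memory-backed cache; the function name and cache location are hypothetical:

import os

from joblib import Memory, Parallel, delayed

memory = Memory(os.path.expanduser("~/.cache/dandi-demo"), verbose=0)


def cached_parallel_map(func, items, n_jobs=-1):
    # Hypothetical CachedParallel-style helper: serve cache hits from disk
    # in the current process and spin up worker processes only for the
    # items that actually need to be computed.
    cached = memory.cache(func)
    results = [None] * len(items)
    misses = []
    for i, item in enumerate(items):
        if cached.check_call_in_cache(item):
            results[i] = cached(item)  # answered from the on-disk cache
        else:
            misses.append(i)
    if misses:
        computed = Parallel(n_jobs=n_jobs, verbose=10)(
            delayed(cached)(items[i]) for i in misses
        )
        for i, value in zip(misses, computed):
            results[i] = value
    return results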

@yarikoptic (Member, Author) commented:

Didn't bother anyone in a while; might reopen if it becomes pertinent again.
