-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OPT: make pandas (and may be scipy, ... anything else?) dependency optional #1282
base: dev
Are you sure you want to change the base?
Conversation
Thanks for the PR @yarikoptic . It is possible to adapt The question then is whether that is the right solution for the problem and whether this solution is appropriate for the majority of users / use cases. While delaying import of pandas and scipy would speed up the initial import, it creates the inconvenience that anyone who wants to use the commonly used The root problem seems to be that you want to read basic metadata information from an NWB file and need a fast, lightweight tool to do so that uses pandas or scipy only on demand. We could solve that by creating forks of pynwb and hdmf (pynwb-lite and hdmf-lite?) with not only delayed imports but also a pickled TypeMap of some sort (see #1050 ) and other efficiencies for this particular use case. This is more work for us, the maintainers, however... but if we only need to update it after every minor release, then it might not be so bad... |
@@ -1,5 +1,4 @@ | |||
h5py==2.10.0 | |||
hdmf==2.1.0 | |||
numpy==1.18.5 | |||
pandas==0.25.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas is still a requirement though, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that having pandas as a requirement for install should not be an issue. If we "just" want to disable certain functionality to speed-up import, then an alternative way would be to use environment variables such that, e.g., if PYNWB_NOPANDAS is set then you would not import Pandas, i.e., the user would have to explicitly set the variable before importing pynwb. Having a special "pynwb-lite" may be the "cleaner" variant but I'm afraid that this will ultimately add a lot of work and I don't think we have sufficient resources right now to support such an effort.
Thank you @ajtritt and @oruebel for your rapid feedback, very much appreciated!
I totally agree and appreciate the difficulty of this issue. Unfortunately I do not know an ideal solution ATM, and some compromise should be made. A similar battle we are fighting in DataLad, to which end we also use delayed imports, extensions (so some functionality is not available until people install domain/purpose specific extensions such as Overall I see it beneficial to separate two aspects: runtime and installation. Although runtime changes would depend on how installed, I think runtime should be made to cater available installation schemes. code / runtimedocstrings
One of the major aspects to appreciate is that no pandas data structures would reach code for checking for So even for type checking no need to import pandas if pandas is not among loaded modules already. It is only to mint docstrings etc where I thought it would be desired to be able to list delayed importsthey do add up some coding burden and easy to "slip away". To that end it is valuable to add a unittest along the lines of my snippet which checks on effect of the import of pynwb... I thought we had such one in datalad as well but failed to find it now (I guess we just refactored all heavy dependencies away anyways). delayed imports sometimes (although I see those less and less frequently) even help to just avoid some tricky "import bug" of some heavy dependency, so with delayed imports it emerges only whenever related code path is hit. re @oruebel env vars idea which is an alternative to delayed imports as far as I see it: I think it would be adding substantially more of user and downstream developers burden. installation: user vs use as a library dependency
Yes. To that end as a compromise $> datalad containers-run
datalad: Unknown command 'containers-run'. See 'datalad --help'.
Hint: Command containers-run is provided by (not installed) extension datalad-container.
|
Installing an extra always also installs everything listed in |
yeap, that is what I suggested in preceding bullet point. But such a feature request ("make specific extra_requires to not demand installation of |
@yarikoptic There's a discussion of a similar feature request going on now in the Python Packaging forum. |
Great find @jwodder - exactly what is needed. I will comment after/if digest it in full - didn't find mentioning of setup.cfg which might get odd if it to be None or an empty string |
Motivation
Disclaimer: I love pandas, matplotlib, scipy!
But AFAIK none of them must be needed for regular IO on .nwb files, unless I do have some pandas DataFrame I am working with and thus would have pandas imported already.
ATM 9 importing of
pynwb
causes import of over 800 modules from about 100 packages into sys.modules:so -- 1.4 sec just to import top level pynwb module, possibly just to read or save a .nwb file. That is seems what "contributes" to our attempts to parallelize some functionality in dandi-cli: see dandi/dandi-cli#191
one of the main "abusers" is
pandas
. it is the one causing matplotlib imports etc. With this draft PR + following quick and dirty patch to hdmf (would need to be send there but I ran out of juice and wanted to first "palpate interest"):import of pynwb drops almost that half a second which takes to import pandas:
Next candidate could be scipy which seems to be needed only at the
hdmf
level and only for sparse matrices...here is the q&d patch
which seems to drop additional 100ms from pynwb import time...
The main problem to address is the
@docval
types description. I have not looked into either it is used only for docs (thus could be just replaced with a string), or needs actual type for type checking may be???How to test the behavior?
See all the snippets above
Checklist
flake8
from the source directory.#XXX
notation whereXXX
is the issue number?