Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyPI source distribution does not include the test data #381

Open
bcbnz opened this issue Aug 27, 2024 · 5 comments
Open

PyPI source distribution does not include the test data #381

bcbnz opened this issue Aug 27, 2024 · 5 comments
Labels
internals Internal machinery invalid This doesn't seem right

Comments

@bcbnz
Copy link

bcbnz commented Aug 27, 2024

I was trying to build a distribution package from the 0.1.16 source tarball on PyPI. Running the tests gave the following error during startup:

argopy/tests/conftest.py:8: in <module>
    from mocked_http import mocked_httpserver
argopy/tests/helpers/mocked_http.py:101: in <module>
    class HTTPTestHandler(BaseHTTPRequestHandler):
argopy/tests/helpers/mocked_http.py:103: in HTTPTestHandler
    "": get_html_landing_page(),
argopy/tests/helpers/mocked_http.py:92: in get_html_landing_page
    html.append("<h1>Mocked HTTP server is up and running, serving %i files</h1>" % len(URI))
E   NameError: name 'URI' is not defined

Working back through the code, I found the underlying problem was that argopy/tests/test_data is not included in the PyPI source tarball. Switching to the tarball that GitHib provides from the 0.1.16 tag works as that includes the test data.

@gmaze gmaze added invalid This doesn't seem right internals Internal machinery labels Aug 27, 2024
@gmaze
Copy link
Member

gmaze commented Aug 27, 2024

Hi @bcbnz
thanks for raising the issue and I'm glad you found a way to fix it

indeed we exclude the test data from the pypi distribution because it is about 420Mb in size, while the entire code only is about a few Mb, and we don't want to clutter the package with data that won't be used by most installers

if you have some indications with regards to:

  • what is your argopy usage ? in other words, what kind of users would need to build a distribution package ? (we can only think of: regular users installing from pip or conda that don't need test data, or code contributors working with a fork and thus having the test data)
  • how we may design this differently for a better user experience ? should test data be on another repo ? should we update the documentation to clarify this internal specificity ? etc

thanks for your feedbacks !

@bcbnz
Copy link
Author

bcbnz commented Aug 27, 2024

Hi @gmaze

In my case this is for an Arch AUR package which allows Arch users to build a package which can then be installed system-wide by the system package manager. Standard practice for Python packages is to get the source code, build a wheel locally, test it and then generate an Arch package from it. Where possible, I like to use sources from PyPI because (a) they have a checksum of the uploaded file available and (b) GitHub has changed the way they generate archives in the past, which means the checksum changes even if the contents are identical, which breaks packaging. (Note that the current Arch system version of xarray is not compatible with NumPy 2.0 so I am advising users to use a virtual environment if that causes them problems -- for use in developing other projects this should be preferred anyway, but sometimes it is useful to have a system-wide install).

I can think of three options to avoid this issue:

  1. Include the test data in the source distribution (note I am not suggesting the PyPI wheel should include the test data, just the .tar.gz source). Yes, it makes it bigger (note the GitHub tarball is 144 MB so I guess the test data compresses well) but this will only impact people building a local copy -- pip will use the wheel from PyPI unless instructed otherwise, and other tools like conda rely on their own prebuilt packages which should also exclude this data.
  2. Make a download of the test data easily available somewhere. This could be done with each release by uploading a .tar.gz or similar with the data to the GitHub release page. A separate repository (which could be included into the main repository as a submodule) could be easier as it avoids the need to keep uploading new versions of the test data. As part of this, modify argopy/tests/helpers/mocked_http.py to check if the data is available, and if not raise a RuntimeError or similar with instructions of how to get the test data to make the problem clear (it took some investigation to figure out what was causing the error in my original report as it is not obvious).
  3. Remove the tests entirely from the source uploaded to PyPI. If people building from source want to test the package they generate, they will note the lack of tests and look for another source which includes them, such as the GitHub tarball or a direct clone of the repository. This is my least preferred option, but in my opinion this is better than having tests which don't run with a non-obvious error which requires the user to try and debug the problem.

@gmaze
Copy link
Member

gmaze commented Aug 28, 2024

Ok I understand better the use case, thanks for the explanation and your propositions !

  • @ocefpaf what do you think of prop 1 above ? do you think you could implement it ? Otherwise may be @bcbnz you could through a PR to do this ?
  • In the meantime, I will also update argopy/tests/helpers/mocked_http.py to be more verbose in case test_data are missing

Copy link

This issue was marked as staled automatically because it has not seen any activity in 90 days

@github-actions github-actions bot added the stale No activity over the last 90 days label Nov 26, 2024
@ocefpaf
Copy link
Collaborator

ocefpaf commented Nov 26, 2024

@gmaze somehow I missed this ping and only notice it now that the bot marked as stale. As a packager, I usually go for 1 when I need the test data that is not on PyPI.

Not that, while running tests locally is good practice, pure Python packages rarely require that. Only compiled extensions are required to run tests locally IMO.

I would not do option 3 but option 2, while somewhat more laborious, is also fine. (I do 2 for my packages. See https://github.com/pyoceans/pocean-core/releases and https://github.com/pyoceans/pocean-core/blob/main/pocean/tests/download_test_data.py for an example.)

@github-actions github-actions bot removed the stale No activity over the last 90 days label Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internals Internal machinery invalid This doesn't seem right
Projects
None yet
Development

No branches or pull requests

3 participants