Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perform 1-3 HTTP requests for each wheel using fast-deps #12208

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Aug 6, 2023

Continued motivation for fast-deps

While PEP 658 is the standards-compliant solution and metadata from there is already preferred when available, --use-feature=fast-deps avoids downloading wheels against --find-links repos and any pypi index not supporting PEP 658 yet. Most non-pypi indices will be either of these, because it's very easy to expose those to pip with a simple file server, so improving fast-deps (and turning it on by default) is necessary to extend the benefits from the recent metadata resolve work to most users hosting their own index, especially corporations running pip in their internal CI.

Problem

Solution

  1. The HTTP Range header accepts a negative value bytes=-N, which acts like negative slice indices in Python, returning a chunk from the end of the file. This avoids a HEAD request to get the file length.
  2. The *.dist-info/ directory is all that's going to be extracted from our lazy wheels, and this directory's contents form a contiguous substring of the total file content. After extracting the central directory from the end of the file with our first request, we can perform a single range request to populate the contents of every file in the *.dist-info/ directory in the lazy wheel, so no further HTTP requests need to be made to continue the resolution.

Additional Fixes

Two additional issues have popped up since #11447:

  1. pypi no longer supports negative byte ranges, returning an error from the varnish cache (see lazier lazy_wheel #11481 (comment)). This is expected behavior from pypi: see Support Negative Offset Range Requests on files.pythonhosted.org pypi/warehouse#12823.
  2. Some wheels such as tensorflow-gpu==2.5.3 have begun to put their own *.dist-info/ directories at the beginning of the zip file, possibly as a result of generating them from other build tools which sort zip file entries lexicographically.

Both of these are considered to be reasonable behavior, and this change handles both cases gracefully.

Result

This halves the time to resolve dependencies from the below requirements for pip install --dry-run --report when the fixes of #12186 are merged:

> git checkout main
> rm -rf ~/.cache/pip
> time python3.8 -m pip install --report test.json --dry-run --ignore-installed 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m33.531s
user    0m13.335s
sys     0m3.219s
> git checkout -
> rm -rf ~/.cache/pip
> time python3.8 -m pip install --report test.json --dry-run --ignore-installed --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m18.417s
user    0m8.847s
sys     0m0.710s

As with PEP 658 metadata, in pathological cases which involve lots of backtracking, this will avoid downloading more than a single version of each wheel even for pip download or pip install without --dry-run. If --use-feature=fast-deps is enabled by default, this will also significantly improve performance of all resolves involving tensorflow-gpu==2.5.3 and other wheels which do not have PEP 658 metadata available on pypi, or against indices which do not serve PEP 658 metadata. I therefore propose turning on fast-deps by default, either in this PR or in #12186 which will be merged after this one.

@cosmicexplorer
Copy link
Contributor Author

@dholth: got this working again; on my connection to pypi, this new code (adapted from your solution) is much much faster than multiple small range requests. A couple of things that popped up:

  • As discussed, pypi isn't generating negative byte ranges for some reason; I still have that as a TODO.
  • Apparently, some wheels have taken it upon themselves to make this absolutely as hard as possible:
# download Keras wheel
> curl -L -O 'https://files.pythonhosted.org/packages/44/e1/dc0757b20b56c980b5553c1b5c4c32d378c7055ab7bfa92006801ad359ab/Keras-2.4.3-py2.py3-none-any.whl'
# this is normal; the dist-info is at the end
> unzip -l Keras-2.4.3-py2.py3-none-any.whl | tail -n10
      143  2020-06-24 22:38   keras/utils/vis_utils.py
       28  2020-06-24 22:38   keras/wrappers/__init__.py
      117  2020-06-24 22:38   keras/wrappers/scikit_learn.py
     1616  2020-06-24 22:40   Keras-2.4.3.dist-info/LICENSE
     1496  2020-06-24 22:40   Keras-2.4.3.dist-info/METADATA
      110  2020-06-24 22:40   Keras-2.4.3.dist-info/WHEEL
       11  2020-06-24 22:40   Keras-2.4.3.dist-info/top_level.txt
     6766  2020-06-24 22:40   Keras-2.4.3.dist-info/RECORD
---------                     -------
    73432                     83 files
# download tensorflow-gpu wheel
> curl -L -O 'https://files.pythonhosted.org/packages/80/4d/3a008dc31225768318e7ba0f7f95aa4677b0936805be40e37036b7755d62/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl'
# hell
> unzip -l tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl | head -n10
Archive:  tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
      111  2022-01-30 11:38   tensorflow_gpu-2.5.3.dist-info/WHEEL
       11  2022-01-30 17:16   tensorflow_gpu-2.5.3.dist-info/top_level.txt
     2810  2022-01-30 17:17   tensorflow_gpu-2.5.3.dist-info/METADATA
      550  2022-01-30 17:16   tensorflow_gpu-2.5.3.dist-info/entry_points.txt
  1046068  2022-01-30 11:38   tensorflow_gpu-2.5.3.dist-info/RECORD
    13777  2022-01-30 17:17   tensorflow_gpu-2.5.3.dist-info/LICENSE
    25049  2022-01-30 17:16   tensorflow/__init__.py

So this change had to expand your approach to cover the case of dist-info dirs placed at the very start of a zip file. I'm assuming this was done intentionally by google in order to do zip file hacks like we're trying to do, but from the front of the file. Regardless, this also works now.

@cosmicexplorer cosmicexplorer force-pushed the lazier-wheel branch 2 times, most recently from 33f2431 to 49ad8c5 Compare August 7, 2023 08:07
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Aug 7, 2023

cc @ewdurbin: are you familiar with why pypi might be failing to accept negative byte range requests (of the form Range: bytes=-N)? It does not block the work here, since that was the lesser of the optimizations, but since wheels like tensorflow-gpu==2.5.3 (downloadable at https://files.pythonhosted.org/packages/80/4d/3a008dc31225768318e7ba0f7f95aa4677b0936805be40e37036b7755d62/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl) don't have PEP 658 metadata generated (yet?), pip will fall back to fast-deps metadata from this PR. Not asking you to do any more work, since it's fully accounted for in this change, but was curious is all since @dholth mentioned this used to be a header pypi was able to interpret.

@cosmicexplorer
Copy link
Contributor Author

Still marking this as draft until we get some test cases going, which will have to be sometime next week.

@ewdurbin
Copy link
Member

ewdurbin commented Aug 7, 2023

cc @ewdurbin: are you familiar with why pypi might be failing to accept negative byte range requests (of the form Range: bytes=-N)? It does not block the work here, since that was the lesser of the optimizations, but since wheels like tensorflow-gpu==2.5.3 (downloadable at https://files.pythonhosted.org/packages/80/4d/3a008dc31225768318e7ba0f7f95aa4677b0936805be40e37036b7755d62/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl) don't have PEP 658 metadata generated (yet?), pip will fall back to fast-deps metadata from this PR. Not asking you to do any more work, since it's fully accounted for in this change, but was curious is all since @dholth mentioned this used to be a header pypi was able to interpret.

See pypi/warehouse#12823

@cosmicexplorer
Copy link
Contributor Author

Thanks so much!

@cosmicexplorer cosmicexplorer force-pushed the lazier-wheel branch 2 times, most recently from ec80945 to a80dc04 Compare August 9, 2023 02:08
@cosmicexplorer cosmicexplorer force-pushed the lazier-wheel branch 7 times, most recently from 4a03cc8 to 35ae8bc Compare August 9, 2023 04:35
@cosmicexplorer cosmicexplorer changed the title reduce HTTP requests for fast-deps, improving performance for high-latency connections perform 1-2 HTTP requests for each wheel using fast-deps Aug 9, 2023
@cosmicexplorer cosmicexplorer force-pushed the lazier-wheel branch 2 times, most recently from ec4f803 to e024a28 Compare August 9, 2023 05:27
@cosmicexplorer cosmicexplorer changed the title perform 1-2 HTTP requests for each wheel using fast-deps perform 1-3 HTTP requests for each wheel using fast-deps Aug 9, 2023
@cosmicexplorer cosmicexplorer force-pushed the lazier-wheel branch 2 times, most recently from 12c78b0 to ac6cda5 Compare August 9, 2023 10:08
@cosmicexplorer
Copy link
Contributor Author

@notatallshaw @pradyunsg: macOS runners have all appeared to cancel themselves again at 95% (https://github.com/pypa/pip/actions/runs/10046709654/job/27766793866?pr=12208, saying "Error: The operation was canceled."). I'm not quite sure how to debug this?

@pradyunsg
Copy link
Member

6h 0m 13s

That looks like it failed because the CI timed out. I'd suggest looking at the logs to find out which tests started but never finished. You'll be able to tell by checking which ones don't have a status reported about them within the raw logs.

@cosmicexplorer
Copy link
Contributor Author

6h 0m 13s

That looks like it failed because the CI timed out. I'd suggest looking at the logs to find out which tests started but never finished. You'll be able to tell by checking which ones don't have a status reported about them within the raw logs.

So the tests that the logs fail at (https://github.com/pypa/pip/actions/runs/10046709654/job/27766794821?pr=12208) don't really seem to make sense; e.g. the 3.13 osx shard last displays tests/unit/test_wheel.py::TestInstallUnpackedWheel::test_invalid_entrypoints_fail[console_scripts-hello = hello], but that's an XFAIL. The raw logs for that shard (https://productionresultssa11.blob.core.windows.net/actions-results/d85a634c-7a9f-43e7-9ca3-f0d852b61367/workflow-job-run-35e056e7-49e2-502f-c6e6-3e89d12fbcfd/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-08-10T06%3A22%3A05Z&sig=%2F%2F3XcmB53D19L4lgn7ZCtKIBZiAIqZIKsgaZJZVMXvM%3D&ske=2024-08-10T15%3A32%3A24Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-08-10T03%3A32%3A24Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-05-04&sp=r&spr=https&sr=b&st=2024-08-10T06%3A12%3A00Z&sv=2024-05-04) fail at the same point. We did change a small amount of wheel test setup here, so I will try reverting those next.

However, I realized that this may well be about threaded server shutdown behavior which may well differ across OS, so in 0d72998 I tried setting ThreadingMixIn.block_on_close = False. I will see if that changes anything.

- handle short files `416`; prefetch entire dist-info
- lazy_wheel: translate BadZipfile to InvalidWheel
- handle 501 on negative bytes range from pypi
- catch UnsupportedWheel
- make lazy wheel work against tensorflow-gpu
- link to pypi discussion on negative byte ranges
- check for case when server doesn't support byte ranges at all
- remove _check_zip() method since we do that in prefetch_dist_info() now
- clean up error handling code and don't warn when negative ranges fail
- remove TODO, as Cache-Control: no-cache is the correct behavior
- rearrange explanatory comments
- specify how we handle 200 OK and interpret 405 as no range requests
- rename LazyZipOverHTTP to LazyWheelOverHTTP because of the assumption of the structure of *.dist-info/
- stream the initial chunk request too, since it's 1MB
- add note about fast-deps being on by default
- add testing for request limiting
- fix download metadata testing
- factor out the laziness from the wheel-specific logic
- factor out LazyRemoteResource from FixedSizeLazyResource
- add metadata_first=True arg to make_wheel
- reduce initial chunk size to 10K
- remove hardcoded compilewheel and instead generate wheels
- catch new requests decoding error
- support NegativeRangeOverflowing
- support SneakilyCoerceNegativeRange
- time out test server thread joining within 3 seconds

Co-authored-by: Tzu-ping Chung <[email protected]>

- make FakePackageSource to abstract over generated and hardcoded whls
- ensure InvalidWheel retains context from inner errors

Co-authored-by: Randy Döring <[email protected]>

- add link to perf evaluation from radoering
@cosmicexplorer
Copy link
Contributor Author

Ok, that didn't work, and I can't immediately reproduce the error on my mac laptop. I've just tried reverting all of the changes to the tests/ directory to see if that changes anything.

@cosmicexplorer cosmicexplorer force-pushed the lazier-wheel branch 2 times, most recently from b7c6abc to e4b2aaf Compare August 10, 2024 11:03
ideally, this should avoid collisions. no clue if this matters at all
@cosmicexplorer
Copy link
Contributor Author

Ok, I'm trying to skip unit tests on macos shards and seeing if they complete. That way we at least nail down that it's probably a test that's causing a hang and not some separate weird CI mechanic.

@cosmicexplorer
Copy link
Contributor Author

Ugh!!!! So that did pass, which means it is definitely a specific test (which is actually much less confusing, but frustrating since I can't seem to reproduce it). Will try running the exact command in CI on my mac laptop now.

@cosmicexplorer
Copy link
Contributor Author

Wow, that was quick! It appears to be test_dist_from_wheel_url_no_range()! Will work on figuring this out!

@cosmicexplorer
Copy link
Contributor Author

The key was using --numprocesses 1, using multiple processes appears to make it quite difficult to figure out when a single test is stopped!

@cosmicexplorer
Copy link
Contributor Author

Hm, it's actually probably test_dist_from_wheel_url_not_zip(), which makes sense but is non-obvious since the test name is never actually printed out! It remains to be seen why it's only on macos, but I'm very glad it finally reproduces ^_^

@cosmicexplorer
Copy link
Contributor Author

Ok, making a few fixtures use scope="session" seems to fix the hang on my mac laptop! Just making the PipSession use session scope alone doesn't work, it needs the MockServer to be session scope too. @uranusjr let's see if this actually passes CI first, but I would love your insight on why b06d73b seems to work, given that mypy_whl_no_range seems to only be used once in that file (so maybe it's the MockServer at session scope that actually solves the issue?).

@cosmicexplorer
Copy link
Contributor Author

nox -e test-3.10 -- -vs -k lazy_wheel reliably reproduces the issue on a mac device, and b06d73b reliably fixes it.

@cosmicexplorer
Copy link
Contributor Author

yes!!!!!!!! omg praise

@cosmicexplorer
Copy link
Contributor Author

cc @pradyunsg @notatallshaw: no rush at all, but CI passes extremely quickly now and the fix was very simple.

Copy link
Member

@pfmoore pfmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a partial review - I haven't looked at the changes to lazy_wheel.py itself yet. They are much more extensive, and I've not had time to review the context of all this. It's been a long while since I last looked at lazy_wheel, and given that PyPI now serves package metadata separately (and other indexes should do the same), I'd like to review how important the whole lazy_wheel code path even is these days, before spending too much time diving into the code itself.

If you have any current data on how often this code path gets hit in typical usage of pip, and how much of a speedup it provides, then that would be very useful. Not microbenchmarks showing that the code is x% faster than it was before, but real-world figures like how much does this improve installing Jupyter, or Django, or big packages with complex dependency trees such as Apache Airflow or Home Assistant?

@@ -25,12 +25,12 @@
}


@pytest.fixture
@pytest.fixture(scope="session")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change in scope here (and in mypy_whl_no_range)?

If this is an unrelated change, it should go in its own PR. If it's related, it could do with a comment explaining why it's needed.

if src_fp := source.fp:
src_name = getattr(src_fp, "name", None)
logger.debug("extracting entry '%s' from zip '%s'", path, src_name)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it's debugging output only indirectly related to this change. While it's not a bad change, it should probably be separate from this PR. Also, printing a raw None when the source doesn't have a name attribute seems unfriendly in user-facing output.

link.netloc,
str(e),
)
self._domains_without_http_range.add(link.netloc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the question here is, which is faster? To not use lazy wheel on sites where a range request ever failed, or to check every time and use lazy wheel whenever possible. A comment here explaining the trade-off and why you made the choice you did would be useful.

(I haven't looked at the three linked PRs, as IMO this PR should stand alone, so "this check will go away when further PRs are merged" isn't a reason to not document what's going on here).

@@ -362,34 +360,46 @@ def make_wheel(
:param entry_points:
:param record: if provided and None, then no RECORD file is generated;
else if a string then sets the content of the RECORD file
:param metadata_first: Put the .dist-info metadata at the front of the zip file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wheel spec explicitly states that archivers should put the wheel metadata at the end of the archive. We absolutely should not default to True here, and I question why we should make this configurable at all. If it's to test edge cases where pip encounters wheels with the metadata in a non-recommended place, I'd prefer to see the main test suite unchanged, but separate, well-documented tests that explicitly create and test such wheels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this should not be configurable.


def __str__(self) -> str:
return f"Wheel '{self.name}' located at {self.location} is invalid."
suffix = f" ({self.context})" if self.context else "."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
suffix = f" ({self.context})" if self.context else "."
suffix = f" ({self.context})." if self.context else "."

(or better, keep the trailing full stop in the return statement rather than here, but I don't think github will let me add 2-line suggestions...)

@cburroughs
Copy link

but real-world figures like how much does this improve installing Jupyter, or Django, or big packages with complex dependency trees such as Apache Airflow or Home Assistant?

Hiya! I don't have any great insight into the finer details, I'm just a user with > 30minute resolve excited for any performance improvements. Happy to help with any testing grunt work.

@ b06d73b without fast-deps

(cache cleared between runs)

time python3.10 -m pip install --report test.json --cache-dir=.cache-pr12208  --dry-run --ignore-installed 'acryl-datahub[all]==0.13.2.3'
4m14.350s
4m4.991s
4m17.795s

@ b06d73b with fast-deps

(cache cleared between runs)

time python3.10 -m pip install --report test.json --cache-dir=.cache-pr12208-fast --use-feature=fast-deps  --dry-run --ignore-installed 'acryl-datahub[all]==0.13.2.3'
6m43.164s
7m15.692s
7m52.251s

So fast-deps appears slower against PyPI in this naive test. I admit I'm not sure what to make of these results or if I'm testing the right hing. My expectations is that fast-deps is only expected to be beneficial against indexes that do not support PEP 658+backfill (aka not PyPI) and would otherwise be no different in performance. Again I'm just wearing my user hat here, not sure if I'm understanding the case exactly right, and happy to run other tests if I can help.

(acryl-datahub[all] isn't special, it is just a concise test that I know happens to expand out to ~300 transitive dependencies.)

@notatallshaw
Copy link
Member

I don't have any great insight into the finer details, I'm just a user with > 30minute resolve

It would be a big help for me if you can you open a new issue as something like "Slow resolver performance" and give a reproducible example.

I will profile it and use it as a target for improved resolution, which once pip vendors a new version of resolvelib I will be working on hopefully very big improvements to these kinds of cases.

@dholth
Copy link
Member

dholth commented Oct 1, 2024

This needs to be tested with slow round-trip times and/or slow bandwidth. Sometimes a user who lives next to a pypi server (or an alternate index with full range request support) may not see the full story. Is there an easy way to simulate those conditions?

@notatallshaw
Copy link
Member

This needs to be tested with slow round-trip times and/or slow bandwidth. Sometimes a user who lives next to a pypi server (or an alternate index with full range request support) may not see the full story. Is there an easy way to simulate those conditions?

There are lots of HTTP proxies available that allow you to add delay to responses/requests, .e.g https://github.com/ruwanta/delaying-proxy, https://github.com/KevCui/mitm-scripts/blob/master/mitm-delay-request.py, https://github.com/chimurai/http-proxy-middleware/blob/master/recipes/delay.md

@cburroughs
Copy link

I reran the prior `time python3.10 -m pip install --report test.json --cache-dir=.cache-pr12208-fast --use-feature=fast-deps --dry-run --ignore-installed 'acryl-datahub[all]==0.13.2.3' tests, this time interleaving the tests. I think it confirms that variable network bandwidth makes this difficult to measure. Or perhaps that "download a bunch of stuff from PyPI" isn't the right framing. My apologies if the driv-by testing was not helpful.

nofast
--

4m25.297s
4m34.908s
4m26.272s


fast
--

4m28.240s
4m29.895s
5m0.192s

@ichard26 ichard26 added the type: performance Commands take too long to run label Dec 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet