Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow resolution from Azure DevOps feed #11379

Open
thomasaarholt opened this issue Feb 10, 2025 · 11 comments
Open

Slow resolution from Azure DevOps feed #11379

thomasaarholt opened this issue Feb 10, 2025 · 11 comments
Labels
question Asking for clarification or support upstream An upstream dependency is involved

Comments

@thomasaarholt
Copy link

thomasaarholt commented Feb 10, 2025

Summary

Hi!
I'm trying to find the best way to debug why installing from my work's package index is so slow. The dependency resolution is the slow part, at 25 sec for our index vs 0.3 sec for pypi.

I have followed the uv instructions for alternative package indexes. We mostly mirror pypi, and for these packages I don't expect there to be any real difference, except that it should be a bit faster since we probably host fewer packages.

There is no authentication window that I need to click on, everything happens automatically. We've added keyring to uv tool with uv tool install keyring --with artifacts-keyring.

Summary logs

# From pypi
❯ uv pip install pandas numpy scikit-learn --no-cache
Resolved 10 packages in 379ms
Prepared 10 packages in 2.44s
Installed 10 packages in 617ms
# (package versions printed here)
# From our index
uv pip install pandas numpy scikit-learn --no-cache --keyring-provider=subprocess --index-url=<redacted>
⠴ Resolving dependencies...                                                                                                                                                                                  [Information] [CredentialProvider]VstsCredentialProvider - Acquired bearer token using 'REDACTED'
[Information] [CredentialProvider]VstsCredentialProvider - Attempting to exchange the bearer token for an Azure DevOps session token.
Resolved 10 packages in 25.31s # <-------------------- 25 sec vs 0.3 sec!
Prepared 10 packages in 3.70s
Installed 10 packages in 985ms
# (package versions printed here)

Verbose logs

I include full logs when appending --verbose to the calls for the pypi and other index-url case.

Platform

Windows 11 x86_64

Version

uv 0.5.25 (9c07c3f 2025-01-28)

Python version

3.12.8

@thomasaarholt thomasaarholt added the bug Something isn't working label Feb 10, 2025
@konstin
Copy link
Member

konstin commented Feb 10, 2025

When resolving, uv needs the metadata of every version used in the resolution. This information we get from a file called METADATA inside the <name>-<version>.dist-info directory of the wheel. PyPI exposes this file directly in the API (PEP 658). For most other server we can use HTTP Range requests to read only the METADATA file from the zip without downloading and unpacking the entire file. The index you're using does neither support PEP 658 nor HTTP Range requests, so we have to download each wheel file to extract its metadata.

The hint is in this log message:

WARN Range requests not supported for numpy-2.2.2-cp311-cp311-win_amd64.whl; streaming wheel

@konstin konstin added question Asking for clarification or support upstream An upstream dependency is involved and removed bug Something isn't working labels Feb 10, 2025
@thomasaarholt
Copy link
Author

Thank you so much for this quick response, this makes perfect sense. I’m contacting those responsible for adding support for this.

I’m closing this, but any pointers or advice for adding support for this sort of thing to a tracker is very much appreciated! I’m very pro not spending unnecessary energy boiling the oceans, especially in a CI!

@thomasaarholt
Copy link
Author

thomasaarholt commented Feb 12, 2025

Reopening this because I have some new information and am trying to find a minimum repro.

Full disclosure: I actually work at Microsoft, but as a Data Scientist and not affiliated with the Azure teams. My opinions are my own.

I raised this internally within Microsoft, and learnt that we do not support PEP 658 yet, but we do support HTTP range requests. However, we think that uv might not be handling the fact that we redirect from Azure Storage to Azure Front Door, a caching service.

My intention is now to see if we can work out why uv doesn't handle the redirect for HTTP range requests.

I have created a public tracker on Azure DevOps (ADO) where I can add specific python packages from PyPI to make them available to the public on that tracker. I have added PyPI as an "upstream provider", which lets an authenticated user to the ADO project (me) add any PyPI package version to the feed for public consumption. I do that via pip install <package>==<version> while authenticating via artifacts-keyring (I install it directly in the venv, pip didn't pick up the uv tool). There's actually a (possibly redirect-related?) bug with uv here, as I can't get uv to authenticate correctly for packages that aren't already in the index. Will make a new issue on that, but that requires authentication to the server to debug.

export INDEX='https://pkgs.dev.azure.com/thomasaarholt/thomasaarholt-python/_packaging/thomas-python-feed/pypi/simple/'
uv venv --seed --quiet
source .venv/bin/activate.fish # activate however you normally do
uv pip install numpy pandas scikit-learn --index-url=$INDEX --no-cache -v  &| grep stream # I pipe stdout and stderr into grep

I've run the above a few times now, and always get at least one package warning, sometimes two or three:

export INDEX='https://pkgs.dev.azure.com/thomasaarholt/thomasaarholt-python/_packaging/thomas-python-feed/pypi/simple/'
  uv venv --seed --quiet
  source .venv/bin/activate.fish
  uv pip install numpy pandas scikit-learn --index-url=$INDEX --no-cache -v  &| grep stream # I pipe stdout and stderr into grep
# WARN Range requests not supported for scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; streaming wheel

# rerunexport INDEX='https://pkgs.dev.azure.com/thomasaarholt/thomasaarholt-python/_packaging/thomas-python-feed/pypi/simple/'
  uv venv --seed --quiet
  source .venv/bin/activate.fish
  uv pip install numpy pandas scikit-learn --index-url=$INDEX --no-cache -v  &| grep stream # I pipe stdout and stderr into grep
# WARN Range requests not supported for pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; streaming wheel
# WARN Range requests not supported for scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; streaming wheel
# WARN Range requests not supported for numpy-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl; streaming wheel

Is there any chance I could get some help debugging this from the uv team?

@zanieb
Copy link
Member

zanieb commented Feb 12, 2025

There's actually a (possibly redirect-related?) bug with uv here, as I can't get uv to authenticate correctly for packages that aren't already in the index. Will make a new issue on that, but that requires authentication to the server to debug.

I'm not seeing the opt-in to using the keyring and a username in the index URL.

On the range requests...

Is it feasible to create a simple nginx container or something that replicates the setup?

Can you reproduce with async_http_range_reader directly?

cc @konstin I think you have some familiarity with async_http_range_reader?

@thomasaarholt
Copy link
Author

thomasaarholt commented Feb 12, 2025

I'm not seeing the opt-in to using the keyring and a username in the index URL.

Yes, I'm showing the regular index url here since the user shouldn't need authentication when installing packages that are on the feed. When trying to get uv to authenticate to add packages from PyPI to my public feed, I was using:

export UV_INDEX='https://[email protected]/thomasaarholt/thomasaarholt-python/_packaging/thomas-python-feed/pypi/simple/'

uv venv --seed --quiet
source .venv/bin/activate.fish
uv pip install numpy==2.1.1 --index-url=$UV_INDEX --keyring-provider='subprocess' -v

Here's the log for that in a gist.

To clear one possible point of confusion: One must be an authenticated user in order to install an upstream (PyPI) package through an ADO feed. This is by design (purple box).

On the range requests...

I might be able to. Never worked with nginx before, and only a bit with rust. I will try if I can find the time.

@zanieb
Copy link
Member

zanieb commented Feb 12, 2025

So a range request does work with curl if you provide the -L flag to follow the redirects

❯ curl -H "Range: bytes=0-1048575" "https://pkgs.dev.azure.com/thomasaarholt/5e230929-47ac-48ad-97db-15ab1d2aa0c5/_packaging/56fc1e3a-f592-420b-bd14-a8c967a65c28/pypi/download/pandas/2.2.3/pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=c124333816c3a9b03fbeef3a9f230ba9a737e9e5bb4060aa2107a86cc0a497fc" -f -L > chunk

@zanieb
Copy link
Member

zanieb commented Feb 12, 2025

But... if you make a HEAD request, the index returns a HTTP 405 method not allowed

❯ curl --head -L "https://pkgs.dev.azure.com/thomasaarholt/5e230929-47ac-48ad-97db-15ab1d2aa0c5/_packaging/56fc1e3a-f592-420b-bd14-a8c967a65c28/pypi/download/pandas/2.2.3/pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
HTTP/2 405 

which is what we use to determine if range requests are supported

let req = self
.uncached_client(url)
.head(url.clone())
.header(
"accept-encoding",
http::HeaderValue::from_static("identity"),
)
.build()
.map_err(|err| ErrorKind::from_reqwest(url.clone(), err))?;
// Copy authorization headers from the HEAD request to subsequent requests
let mut headers = HeaderMap::default();
if let Some(authorization) = req.headers().get("authorization") {
headers.append("authorization", authorization.clone());
}
// This response callback is special, we actually make a number of subsequent requests to
// fetch the file from the remote zip.
let read_metadata_range_request = |response: Response| {
async {
let mut reader = AsyncHttpRangeReader::from_head_response(
self.uncached_client(url).clone(),
response,
url.clone(),
headers.clone(),
)
.await
.map_err(|err| ErrorKind::AsyncHttpRangeReader(url.clone(), err))?;
trace!("Getting metadata for {filename} by range request");
let text = wheel_metadata_from_remote_zip(filename, url, &mut reader).await?;
let metadata =
ResolutionMetadata::parse_metadata(text.as_bytes()).map_err(|err| {
Error::from(ErrorKind::MetadataParseError(
filename.clone(),
url.to_string(),
Box::new(err),
))
})?;
Ok::<ResolutionMetadata, CachedClientError<Error>>(metadata)
}
.boxed_local()
.instrument(info_span!("read_metadata_range_request", wheel = %filename))
};

https://github.com/prefix-dev/async_http_range_reader/blob/c2a1b5a63712a34bca661cc94389fb03fb0dcddd/src/lib.rs#L310-L318

@zanieb
Copy link
Member

zanieb commented Feb 12, 2025

I would be curious to hear why they don't support HEAD requests — as that seems to be the crux of the problem here.

You could set RUST_LOG=trace to get verbose logs from the network layer and confirm the HTTP 405 is occurring.

@thomasaarholt
Copy link
Author

That was quick! I've brought it up internally now. Thank you!

@konstin
Copy link
Member

konstin commented Feb 12, 2025

For additional context, we should be following redirects in the HEAD request, but we need the initial HEAD requests to determine whether the server supports range requests (https://github.com/prefix-dev/async_http_range_reader/blob/9c1e2d3082745f3a75d8065a3dd145cd51b2715a/src/lib.rs#L310-L318).

@thomasaarholt
Copy link
Author

You could set RUST_LOG=trace to get verbose logs from the network layer and confirm the HTTP 405 is occurring.

export INDEX='https://pkgs.dev.azure.com/thomasaarholt/thomasaarholt-python/_packaging/thomas-python-feed/pypi/simple/'
export RUST_LOG=trace
uv venv --seed --quiet
source .venv/bin/activate.fish # activate however you normally do
uv pip install numpy --index-url=$INDEX --no-cache -v  &> log.txt
# Ctrl+F 405
TRACE Attempting to retry error: Error { kind: WrappedReqwestError(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("pkgs.dev.azure.com")), port: None, path: "/thomasaarholt/5e230929-47ac-48ad-97db-15ab1d2aa0c5/_packaging/56fc1e3a-f592-420b-bd14-a8c967a65c28/pypi/download/numpy/2.2.2/numpy-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", query: None, fragment: Some("sha256=02935e2c3c0c6cbe9c7955a8efa8908dd4221d7755644c59d1bba28b94fd334f") }, WrappedReqwestError(Reqwest(reqwest::Error { kind: Status(405), url: "https://pkgs.dev.azure.com/thomasaarholt/5e230929-47ac-48ad-97db-15ab1d2aa0c5/_packaging/56fc1e3a-f592-420b-bd14-a8c967a65c28/pypi/download/numpy/2.2.2/numpy-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=02935e2c3c0c6cbe9c7955a8efa8908dd4221d7755644c59d1bba28b94fd334f" }))) }

@thomasaarholt thomasaarholt changed the title Slow package resolution from authenticated package index Slow resolution from Azure DevOps feed Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Asking for clarification or support upstream An upstream dependency is involved
Projects
None yet
Development

No branches or pull requests

3 participants