
Ensure our pre-flight HEAD request is always by-tag if we're pushing a tag #58

Merged
merged 2 commits into docker-library:main from lookup-by-digest on Jun 26, 2024

Conversation

@tianon (Member) commented Jun 26, 2024

This is a latent bug the system has had for large manifests, which (since #56) we now have for all manifests instead! 😄

(This would also artificially inflate our speedups, since all the pre-flight HEAD requests would be by-digest, which are infinitely cacheable and thus very, very fast.)

Follow-up to #56
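
For context, here is a rough sketch (not this repo's actual code; the function names are illustrative and registry token auth is omitted entirely) of the kind of pre-flight check being described: when we're pushing a tag, the HEAD has to be by-tag so we actually learn what the tag currently points at, not just whether the manifest blob exists.

    package main

    import (
    	"fmt"
    	"net/http"
    )

    // manifestHEAD does HEAD /v2/<repo>/manifests/<reference> and returns the
    // Docker-Content-Digest response header ("" if the manifest isn't there).
    // Auth/token handling is omitted for brevity.
    func manifestHEAD(registry, repo, reference string) (string, error) {
    	url := fmt.Sprintf("https://%s/v2/%s/manifests/%s", registry, repo, reference)
    	req, err := http.NewRequest(http.MethodHead, url, nil)
    	if err != nil {
    		return "", err
    	}
    	req.Header.Set("Accept", "application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json")
    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		return "", err
    	}
    	defer resp.Body.Close()
    	if resp.StatusCode != http.StatusOK {
    		return "", nil
    	}
    	return resp.Header.Get("Docker-Content-Digest"), nil
    }

    // needsPush is the crux of this PR: when pushing a tag, the pre-flight HEAD
    // must be by-tag -- a by-digest HEAD only proves the manifest exists, not
    // that the tag currently points at it.
    func needsPush(registry, repo, tag, wantDigest string) (bool, error) {
    	have, err := manifestHEAD(registry, repo, tag)
    	if err != nil {
    		return false, err
    	}
    	return have != wantDigest, nil
    }

    func main() {
    	push, err := needsPush("registry-1.docker.io", "tianon/buildkit", "0.13",
    		"sha256:1b479e0976d87ecf6d76e9d7fff51a24efcdbfef118449bae49b724bd8b14c56")
    	fmt.Println("push needed:", push, err)
    }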

@tianon requested a review from yosifkit as a code owner June 26, 2024 00:14
@tianon (Member, Author) commented Jun 26, 2024

I found this while trying to figure out why my freshly-built tianon/buildkit:0.13 image was not being pushed, even though the output looked good:

Pushing manifest sha256:1b479e0976d87ecf6d76e9d7fff51a24efcdbfef118449bae49b724bd8b14c56:
 - tianon/buildkit:0.13.2
 - tianon/buildkit:0.13

(I'm about to test it on that deploy now to see if it fixes it)

yosifkit previously approved these changes Jun 26, 2024
@tianon (Member, Author) commented Jun 26, 2024

Confirmed: tianon/buildkit:0.13 was stuck at sha256:dab7ad84db6d538ecae96cad77648ee21f47f0c1ecaabf91e5d5f527ecfa1556 before, and is now correctly updated to sha256:1b479e0976d87ecf6d76e9d7fff51a24efcdbfef118449bae49b724bd8b14c56

@tianon (Member, Author) commented Jun 26, 2024

This is dramatically slower, because tag lookups are slow (fair/expected).

My personal no-op deploys were at ~3m30s before #56, dropped to 30s after (and then 10s, once the cache was really hot), and now with this change go back up to ~5m8s, so #56 might actually make things worse, not better 😬 😭

This reverts commit c70a534.

(After fixing the implementation bug, this made things worse, not better 😭)
@tianon (Member, Author) commented Jun 26, 2024

Now that we know #56 does actually make things worse, I've included a partial revert of it here (keeping the addition of the comment).

@tianon (Member, Author) commented Jun 26, 2024

Confirmed: I rebased my personal pipeline on this change and re-ran it while the cache from that (twice-run, consistent) ~5m run was still hot, and we're back down to the ~3m ballpark.

@tianon merged commit d85d365 into docker-library:main Jun 26, 2024
1 check passed
@tianon deleted the lookup-by-digest branch June 26, 2024 00:39
@whalelines (Collaborator) commented:

Do we know why the short-lived #56 made things worse? More requests? More prone to hit rate limiting? Something else?

@tianon (Member, Author) commented Jun 26, 2024

I have some educated guesses, but it's definitely worth trying again with some tracing to find out with stronger data. 👍

@tianon (Member, Author) commented Jun 26, 2024

Hmm, the obvious answer would be a CPU profile, but this isn't CPU-bound, so a CPU profile won't actually tell us anything. We need something that includes the time spent waiting on net/http requests.

Edit: gonna try playing with https://pkg.go.dev/runtime/trace -- I'm hoping we don't have to go all the way to something low-level like https://blog.golang.org/http-tracing or as all-encompassing as OpenTracing to answer this, but will continue exploring options
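
(A minimal sketch of the httptrace angle, purely illustrative and not code from this repo: hanging an httptrace.ClientTrace off the request context shows where the wall-clock time goes -- connection reuse, time-to-first-byte -- which is exactly the blocked-on-network time a CPU profile can't see.)

    package main

    import (
    	"fmt"
    	"net/http"
    	"net/http/httptrace"
    	"time"
    )

    // timedHead issues a request with an httptrace.ClientTrace attached so we
    // can see where the wall-clock time goes while blocked on the network.
    func timedHead(url string) error {
    	start := time.Now()

    	trace := &httptrace.ClientTrace{
    		GotConn: func(info httptrace.GotConnInfo) {
    			fmt.Printf("got conn (reused=%v) after %v\n", info.Reused, time.Since(start))
    		},
    		GotFirstResponseByte: func() {
    			fmt.Printf("first response byte after %v\n", time.Since(start))
    		},
    	}

    	req, err := http.NewRequest(http.MethodHead, url, nil)
    	if err != nil {
    		return err
    	}
    	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		return err
    	}
    	defer resp.Body.Close()
    	fmt.Printf("%s after %v\n", resp.Status, time.Since(start))
    	return nil
    }

    func main() {
    	_ = timedHead("https://registry-1.docker.io/v2/")
    }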

@tianon (Member, Author) commented Jun 26, 2024

Ironically, I think it's actually our read-only caching proxy at fault here -- that proxy works really well when we're doing lookups by digest because it can cache them effectively indefinitely, but when doing a lookup by-tag it has to proxy to Docker Hub on every request, and I think what we're seeing is the result of that extra multi-hop latency.

@tianon (Member, Author) commented Jun 26, 2024

Good news -- I found our proxy had several bad behaviors that exacerbated this:

  • our proxy changes a GET of a tag into a HEAD of the tag that then redirects to a GET of the digest (good for cache!)
  • an unintentional side effect of that was that a HEAD of a tag was also being turned into a redirect, which is what caused the really bad behavior we were seeing here (every HEAD of a tag becomes two HEADs, which is understandably much slower, and the requests are all so tiny they're not really cacheable)
  • additionally, a system in front of our proxy was "helpfully" converting all our HEAD requests into GET requests to the backend, since it's a violation of the spec for HEAD/GET to have differing behavior/responses (and that system is the cache layer, so it would need the response body in order to cache it)

I've now fixed the proxy such that:

  • HEAD does not redirect -- it just returns the upstream HEAD data directly (with munging of cache headers only for cacheable requests like by-digest)
  • GET on a tag still redirects, giving us better cache behavior
  • the cache in front of the proxy will only cache requests with sha256: in them (thus allowing HEAD of a tag to stay HEAD all the way back)

Before making these fixes, I had tested with the proxy totally bypassed (all requests going directly to Docker Hub), which is how I narrowed down that the proxy was responsible. After making these fixes, I re-tested via the proxy and got ~exactly the same speed as the direct Hub requests, so we should be good to re-apply #56 and see across-the-board speed gains 🥳
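
For the record, here is a minimal sketch of the fixed behavior described above. This is not the actual proxy: the upstream host, route matching, and cache handling are illustrative assumptions (the real front cache is a separate layer; here it's only represented by Cache-Control headers), and auth is omitted. It shows HEAD passing through as HEAD, GET-of-a-tag redirecting to GET-of-the-digest, and only by-digest responses being marked cacheable.

    package main

    import (
    	"fmt"
    	"net/http"
    	"net/http/httputil"
    	"net/url"
    	"strings"
    )

    const upstreamHost = "https://registry-1.docker.io" // illustrative; auth omitted

    func main() {
    	upstream, _ := url.Parse(upstreamHost)
    	proxy := httputil.NewSingleHostReverseProxy(upstream)

    	// Only by-digest content is immutable, so only it gets long cache headers;
    	// everything else (notably HEAD of a tag) is explicitly uncacheable.
    	proxy.ModifyResponse = func(resp *http.Response) error {
    		if strings.Contains(resp.Request.URL.Path, "sha256:") {
    			resp.Header.Set("Cache-Control", "public, max-age=31536000, immutable")
    		} else {
    			resp.Header.Set("Cache-Control", "no-store")
    		}
    		return nil
    	}

    	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    		byDigest := strings.Contains(r.URL.Path, "sha256:")

    		// GET of a tag: look up the digest with an upstream HEAD, then redirect
    		// the client to the immutable by-digest URL (good for the cache).
    		if r.Method == http.MethodGet && !byDigest && strings.Contains(r.URL.Path, "/manifests/") {
    			head, err := http.NewRequest(http.MethodHead, upstreamHost+r.URL.Path, nil)
    			if err != nil {
    				http.Error(w, err.Error(), http.StatusBadGateway)
    				return
    			}
    			head.Header = r.Header.Clone()
    			resp, err := http.DefaultClient.Do(head)
    			if err != nil {
    				http.Error(w, err.Error(), http.StatusBadGateway)
    				return
    			}
    			resp.Body.Close()
    			if digest := resp.Header.Get("Docker-Content-Digest"); digest != "" {
    				dir := r.URL.Path[:strings.LastIndex(r.URL.Path, "/")+1]
    				http.Redirect(w, r, dir+digest, http.StatusTemporaryRedirect)
    				return
    			}
    		}

    		// Everything else (HEAD of a tag, anything by-digest) is proxied as-is,
    		// so a HEAD stays a HEAD all the way to the upstream.
    		proxy.ServeHTTP(w, r)
    	})

    	fmt.Println(http.ListenAndServe(":8080", nil))
    }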
