
Fix slow repo_list with many refs #310

Merged: 5 commits merged into master from repo-list-many-refs on Apr 7, 2023
Conversation

jonashaag (Owner):

@benaryorg please try this, closes #309

klaus/repo.py (Outdated):

    Cache is invalidated if one of the ref targets changes,
    eg. a new commit has been made and 'refs/heads/master' was changed.
    """
    if len(self.refs.keys()) > 10000:
Contributor:

10000 still seems like quite a lot; there's no way to meaningfully display that, and it still means 10k random accesses of the repository.

What if there are more than 10000 heads, as is the case with the nix repo?

jonashaag (Owner, Author):


Ah damn, I thought that most heads are branches, but e.g. GH PRs are also heads

jonashaag (Owner, Author):


Note that the number used here is not displayed anywhere, only the maximum timestamp of the refs.
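
For context, here is a minimal sketch (not klaus's actual code; the helper name and cache layout are assumptions, only the dulwich calls are standard) of the scheme described above: compute the newest commit timestamp across all ref targets once, and key the cache on the ref targets themselves so that any ref update invalidates it.

# Illustrative sketch only; not the code from this PR.
from dulwich.repo import Repo

_timestamp_cache = {}  # cache key -> newest commit timestamp

def newest_commit_timestamp(repo):
    refs = repo.get_refs()  # {ref name: target sha}
    # Any changed, added, or removed ref target yields a new cache key.
    cache_key = hash(frozenset(refs.items()))
    if cache_key not in _timestamp_cache:
        timestamps = []
        for sha in set(refs.values()):
            try:
                obj = repo[sha]  # one random access per unique target
            except KeyError:
                continue
            if obj.type_name == b"commit":
                timestamps.append(obj.commit_time)
        _timestamp_cache[cache_key] = max(timestamps, default=0)
    return _timestamp_cache[cache_key]

On a cold cache this still performs one object access per unique ref target, which is exactly the cost being discussed in this thread.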

jonashaag (Owner, Author):


> Ah damn, I thought that most heads are branches, but e.g. GH PRs are also heads

Wait, no they are not.

@benaryorg:

> @benaryorg please try this

curl -so /dev/null http://localhost:8000/  0.01s user 0.01s system 0% cpu 26.207 total max RSS 8

That's the initial request, still taking quite a long time (in fact, almost hitting the default gunicorn timeout again).
After that it's cached, of course, and returns almost instantly.
My quick-and-dirty version of the patch, which only checked HEAD, ran in ~25.514s, barely any better, so I'd say this patch is as fast as this is going to get on this (admittedly slow) machine.
So yeah, this solves my issue!
Thanks a lot!


For posterity's sake, the nixpkgs repository itself (which at this point already had its timestamp cached):

curl -so /dev/null http://localhost:8000/nixpkgs/  0.01s user 0.01s system 0% cpu 2.610 total max RSS 8

@jonashaag (Owner, Author):

Hmm I wonder why it's still so slow. For me it takes around 1s. Is the repo on a slow hard drive?

@jelmer is there a way to batch-lookup timestamps with Dulwich that's faster?

@jelmer (Contributor) commented Apr 5, 2023:

> Hmm I wonder why it's still so slow. For me it takes around 1s. Is the repo on a slow hard drive?
>
> @jelmer is there a way to batch-lookup timestamps with Dulwich that's faster?

The main alternative is to iterate over most of the repository using something like Repo.object_store.iterobjects_subset(). That will at least remove the need to re-inflate delta bases more than once (which you would probably be doing with random access).

That's a new API (in 0.21.0, I think), and not used much outside of dulwich itself yet.
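
To illustrate the suggestion, a batch lookup of ref timestamps might look like the sketch below (not code from the PR; iterobjects_subset and allow_missing appear later in this thread, everything else is assumed):

# Hedged sketch of the suggested batch lookup (dulwich >= 0.21).
from dulwich.repo import Repo

repo = Repo(".")
ref_targets = set(repo.get_refs().values())
timestamps = [
    obj.commit_time
    for obj in repo.object_store.iterobjects_subset(ref_targets, allow_missing=True)
    if obj.type_name == b"commit"  # skip annotated tags etc.
]
latest = max(timestamps, default=0)

Iterating the pack once this way is meant to avoid re-inflating shared delta bases for every individual lookup.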

@jonashaag (Owner, Author):

next(repo.object_store.iterobjects_subset([repo[b"HEAD"].id]))
KeyError: b'5804b7fb5fa07ac9b3720a19aa207edfb1521133'

Am I using it wrong?

@jelmer (Contributor) commented Apr 5, 2023:

> next(repo.object_store.iterobjects_subset([repo[b"HEAD"].id]))
> KeyError: b'5804b7fb5fa07ac9b3720a19aa207edfb1521133'
>
> Am I using it wrong?

That seems right, and works here (although on a different repo, obviously). Does repo[repo[b'HEAD'].id] work? What version of dulwich are you on?

(it's possible this is a bug, but curious what triggers it)
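
The suggested checks, collected into one illustrative snippet (assumes a local repository; not code from the thread):

# Run the version and lookup checks in one go (illustrative).
import dulwich
from dulwich.repo import Repo

print(dulwich.__version__)   # which dulwich version is in use
repo = Repo(".")
head = repo[b"HEAD"].id      # resolve HEAD to a commit sha
print(repo[head])            # plain random-access lookup
print(next(repo.object_store.iterobjects_subset([head])))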

@benaryorg:

> Hmm I wonder why it's still so slow. For me it takes around 1s. Is the repo on a slow hard drive?

A slow SSD, which should still be a lot faster than an HDD, but not up to speed with current SSDs.
However, these load times aren't IO bound but CPU bound: the thread the request runs on uses 100% of a single core during those ~20s, so it's likely not limited by IO speed at all.
Recalling the stacktrace I got at one point, I'm pretty sure it's going through a lot of zlib-compressed data, though I can't say for sure what exactly takes so long.
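
One way to confirm a CPU-bound zlib suspicion like this (not something done in the thread; the helper below is hypothetical) is to profile a pass over the ref targets and look for decompression frames near the top:

# Illustrative profiling sketch.
import cProfile
import pstats
from dulwich.repo import Repo

repo = Repo(".")

def touch_all_ref_targets():
    for sha in set(repo.get_refs().values()):
        repo[sha]  # forces object lookup and decompression

cProfile.run("touch_all_ref_targets()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(15)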

@jonashaag (Owner, Author):

> Repo.object_store.iterobjects_subset()

Unfortunately it is actually slower:

In [11]: %timeit list(repo.object_store.iterobjects_subset(random.sample(list(refs.values()), 1000), allow_missing=True))
94.2 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit [repo[x] for x in random.sample(list(refs.values()), 1000)]
66.7 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
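
For readers without the IPython session, an illustrative standalone equivalent of this comparison (repo path and sampling are assumptions):

# Standalone version of the %timeit comparison above (illustrative).
import random
import timeit
from dulwich.repo import Repo

repo = Repo(".")
refs = repo.get_refs()
sample = random.sample(list(refs.values()), min(1000, len(refs)))

batched = timeit.timeit(
    lambda: list(repo.object_store.iterobjects_subset(sample, allow_missing=True)),
    number=10,
)
random_access = timeit.timeit(lambda: [repo[x] for x in sample], number=10)
print(f"iterobjects_subset: {batched / 10:.4f}s/loop, random access: {random_access / 10:.4f}s/loop")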

@jonashaag (Owner, Author):

@benaryorg what's your CPU? I wonder how it can be 20x slower than mine

@benaryorg:

> @benaryorg what's your CPU? I wonder how it can be 20x slower than mine

Intel(R) Atom(TM) CPU N2800 @ 1.86GHz

This is probably one of my oldest and slowest servers *sigh*

@jonashaag jonashaag merged commit defa9ca into master Apr 7, 2023
@jonashaag jonashaag deleted the repo-list-many-refs branch April 7, 2023 06:33
Successfully merging this pull request may close the issue "large repository makes index page very slow" (#309).

3 participants