
Fix slow repo_list with many refs #310

Merged: 5 commits merged into master from repo-list-many-refs on Apr 7, 2023
Conversation

jonashaag (Owner):

@benaryorg please try this, closes #309

klaus/repo.py (Outdated):

    Cache is invalidated if one of the ref targets changes,
    eg. a new commit has been made and 'refs/heads/master' was changed.
    """
    if len(self.refs.keys()) > 10000:
Contributor:

10000 still seems like quite a lot; there's no way to meaningfully display that, and it still means 10k random accesses of the repository.

What if there are more than 10000 heads, as is the case with the nix repo?

jonashaag (Owner, Author):


Ah damn, I thought that most heads are branches, but e.g. GH PRs are also heads

jonashaag (Owner, Author):


Note that the number used here is not displayed anywhere, only the maximum timestamp of the refs.
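
For context, here is a minimal sketch (not klaus's actual code; the helper name and cache layout are assumptions, only the dulwich calls are standard) of the scheme described above: compute the newest commit timestamp across all ref targets once, and key the cache on the ref targets themselves so that any ref update invalidates it.

# Illustrative sketch only; not the code from this PR.
from dulwich.repo import Repo

_timestamp_cache = {}  # cache key -> newest commit timestamp

def newest_commit_timestamp(repo):
    refs = repo.get_refs()  # {ref name: target sha}
    # Any changed, added, or removed ref target yields a new cache key.
    cache_key = hash(frozenset(refs.items()))
    if cache_key not in _timestamp_cache:
        timestamps = []
        for sha in set(refs.values()):
            try:
                obj = repo[sha]  # one random access per unique target
            except KeyError:
                continue
            if obj.type_name == b"commit":
                timestamps.append(obj.commit_time)
        _timestamp_cache[cache_key] = max(timestamps, default=0)
    return _timestamp_cache[cache_key]

On a cold cache this still performs one object access per unique ref target, which is exactly the cost being discussed in this thread.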

jonashaag (Owner, Author):


> Ah damn, I thought that most heads are branches, but e.g. GH PRs are also heads

Wait, no they are not.

@benaryorg:

> @benaryorg please try this

curl -so /dev/null http://localhost:8000/  0.01s user 0.01s system 0% cpu 26.207 total max RSS 8

That's the initial request, still taking quite a long time (in fact, almost hitting the default gunicorn timeout again).
After that it's cached, of course, and returns almost instantly.
My quick-and-dirty version of the patch, which only checked HEAD, ran in ~25.514s, barely any better, so I'd say this patch is as fast as this is going to get on this (admittedly slow) machine.
So yeah, this solves my issue!
Thanks a lot!


For posterity's sake, the nixpkgs repository itself (which at this point already had its timestamp cached):

curl -so /dev/null http://localhost:8000/nixpkgs/  0.01s user 0.01s system 0% cpu 2.610 total max RSS 8

@jonashaag (Owner, Author):

Hmm I wonder why it's still so slow. For me it takes around 1s. Is the repo on a slow hard drive?

@jelmer is there a way to batch-lookup timestamps with Dulwich that's faster?

@jelmer (Contributor) commented Apr 5, 2023:

> Hmm I wonder why it's still so slow. For me it takes around 1s. Is the repo on a slow hard drive?
>
> @jelmer is there a way to batch-lookup timestamps with Dulwich that's faster?

The main alternative is to iterate over most of the repository using something like Repo.object_store.iterobjects_subset(). That will at least remove the need to re-inflate delta bases more than once (which you would probably be doing with random access).

That's a new API (in 0.21.0, I think), and not used much outside of dulwich itself yet.
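
To illustrate the suggestion, a batch lookup of ref timestamps might look like the sketch below (not code from the PR; iterobjects_subset and allow_missing appear later in this thread, everything else is assumed):

# Hedged sketch of the suggested batch lookup (dulwich >= 0.21).
from dulwich.repo import Repo

repo = Repo(".")
ref_targets = set(repo.get_refs().values())
timestamps = [
    obj.commit_time
    for obj in repo.object_store.iterobjects_subset(ref_targets, allow_missing=True)
    if obj.type_name == b"commit"  # skip annotated tags etc.
]
latest = max(timestamps, default=0)

Iterating the pack once this way is meant to avoid re-inflating shared delta bases for every individual lookup.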

@jonashaag (Owner, Author):

next(repo.object_store.iterobjects_subset([repo[b"HEAD"].id]))
KeyError: b'5804b7fb5fa07ac9b3720a19aa207edfb1521133'

Am I using it wrong?

@jelmer (Contributor) commented Apr 5, 2023:

> next(repo.object_store.iterobjects_subset([repo[b"HEAD"].id]))
> KeyError: b'5804b7fb5fa07ac9b3720a19aa207edfb1521133'
>
> Am I using it wrong?

That seems right, and works here (although on a different repo, obviously). Does repo[repo[b'HEAD'].id] work? What version of dulwich are you on?

(it's possible this is a bug, but curious what triggers it)
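
The suggested checks, collected into one illustrative snippet (assumes a local repository; not code from the thread):

# Run the version and lookup checks in one go (illustrative).
import dulwich
from dulwich.repo import Repo

print(dulwich.__version__)   # which dulwich version is in use
repo = Repo(".")
head = repo[b"HEAD"].id      # resolve HEAD to a commit sha
print(repo[head])            # plain random-access lookup
print(next(repo.object_store.iterobjects_subset([head])))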

@benaryorg:

> Hmm I wonder why it's still so slow. For me it takes around 1s. Is the repo on a slow hard drive?

A slow SSD, which should still be a lot faster than an HDD, but not up to speed with current SSDs.
However, these load times aren't IO bound but CPU bound: the thread the request runs on uses 100% of a single core during those ~20s, so it's likely not limited by IO speed at all.
Recalling the stacktrace I got at one point, I'm pretty sure it's going through a lot of zlib-compressed data, though I can't say for sure what exactly takes so long.
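
One way to confirm a CPU-bound zlib suspicion like this (not something done in the thread; the helper below is hypothetical) is to profile a pass over the ref targets and look for decompression frames near the top:

# Illustrative profiling sketch.
import cProfile
import pstats
from dulwich.repo import Repo

repo = Repo(".")

def touch_all_ref_targets():
    for sha in set(repo.get_refs().values()):
        repo[sha]  # forces object lookup and decompression

cProfile.run("touch_all_ref_targets()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(15)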

@jonashaag (Owner, Author):

> Repo.object_store.iterobjects_subset()

Unfortunately it is actually slower:

In [11]: %timeit list(repo.object_store.iterobjects_subset(random.sample(list(refs.values()), 1000), allow_missing=True))
94.2 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit [repo[x] for x in random.sample(list(refs.values()), 1000)]
66.7 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
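
For readers without the IPython session, an illustrative standalone equivalent of this comparison (repo path and sampling are assumptions):

# Standalone version of the %timeit comparison above (illustrative).
import random
import timeit
from dulwich.repo import Repo

repo = Repo(".")
refs = repo.get_refs()
sample = random.sample(list(refs.values()), min(1000, len(refs)))

batched = timeit.timeit(
    lambda: list(repo.object_store.iterobjects_subset(sample, allow_missing=True)),
    number=10,
)
random_access = timeit.timeit(lambda: [repo[x] for x in sample], number=10)
print(f"iterobjects_subset: {batched / 10:.4f}s/loop, random access: {random_access / 10:.4f}s/loop")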

@jonashaag (Owner, Author):

@benaryorg what's your CPU? I wonder how it can be 20x slower than mine

@benaryorg:

> @benaryorg what's your CPU? I wonder how it can be 20x slower than mine

Intel(R) Atom(TM) CPU N2800 @ 1.86GHz

This is probably one of my oldest and slowest servers *sigh*

@jonashaag jonashaag merged commit defa9ca into master Apr 7, 2023
@jonashaag jonashaag deleted the repo-list-many-refs branch April 7, 2023 06:33
Successfully merging this pull request may close the issue "large repository makes index page very slow" (#309).

3 participants