generated from oracle/template-repo
perf: use partial clone to reduce clone time #389
Merged
oracle-contributor-agreement bot added the OCA Verified label (All contributors have signed the Oracle Contributor Agreement.) on Jul 28, 2023
nathanwn force-pushed the perf-partial-clone branch from 263a485 to 1e3786d on July 28, 2023 15:04
nathanwn force-pushed the perf-partial-clone branch 2 times, most recently from ccc900a to 76faf19 on September 24, 2023 11:46
nathanwn force-pushed the perf-partial-clone branch 2 times, most recently from 15a2d23 to cfa344b on October 4, 2023 10:12
tromai force-pushed the perf-partial-clone branch from cfa344b to 51aeba7 on October 16, 2023 22:23
tromai force-pushed the perf-partial-clone branch from 51aeba7 to c3b5360 on October 26, 2023 01:42
tromai force-pushed the perf-partial-clone branch from c3b5360 to 2a6c569 on October 26, 2023 05:14
tromai added the git_operations label (The issues related to the git operations that Macaron makes on the target repository) on Oct 31, 2023
nicallen reviewed on Oct 31, 2023
…on directly Signed-off-by: Trong Nhan Mai <[email protected]>
nicallen reviewed on Nov 1, 2023
Signed-off-by: Trong Nhan Mai <[email protected]>
Signed-off-by: Trong Nhan Mai <[email protected]>
nicallen approved these changes on Nov 1, 2023
behnazh-w reviewed on Nov 3, 2023
Signed-off-by: Trong Nhan Mai <[email protected]>
Signed-off-by: Trong Nhan Mai <[email protected]>
Signed-off-by: Trong Nhan Mai <[email protected]>
Signed-off-by: Trong Nhan Mai <[email protected]>
behnazh-w approved these changes on Nov 3, 2023
Rationale
For analysis, Macaron currently needs to do full clones of repositories. This takes a long time for large repositories (the likes of google/guava can take a few minutes).
Although subsequent runs of Macaron on the same repository are fast thanks to Macaron storing already cloned repositories in `output/git_repos/...`, the first analysis can be slow. For first-time users, or for someone who just wants to try out Macaron, this behavior is discouraging. For example, our tutorial uses guava as one of the example dependencies. Because guava is a large repository, a full clone takes a long time (at the moment, the full clone size is more than 400MB).
Optimizing clone time with treeless cloning
General Idea
For a single `macaron analyze` run, we only care about one revision of a repository, rather than the whole repository. Therefore, full cloning is not the best option in terms of performance in a single run, since it downloads far more than necessary.
To optimize cloning time, we should follow an on-demand approach, i.e. only download things required for a single analysis. At the same time, we also need to make sure that other git operations that Macaron needs to carry out still work properly. One key operation is checking out a specific commit, branch, or tag of a repo.
Note about shallow clone
Shallow cloning is not considered. This is because shallow cloning does not keep a local copy of a repo's history (the commit tree). In general, we plan to not support shallow cloning altogether, since it creates more cases to handle and is known to have caused issues in the past.
Git filter spec and partial cloning
The selected approach utilizes the git feature of filter-spec to do partial cloning.
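There are 3 basic object types in git: commits, trees, and blobs.
These form a hierarchy of nodes, in which a commit points to a root tree, and a tree points to blobs (file contents) and other trees (subdirectories).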
While the commit graph should be available locally at all times to enable `git checkout`, blobs and trees can be downloaded on demand, i.e. blobs and trees under a commit are only downloaded once that commit is checked out. This approach is generally called "partial cloning".
The exact partial cloning strategy we use is called "treeless cloning", which can be achieved by providing `git clone` with the argument `--filter=tree:0`. With this approach, all blobs and trees are downloaded on demand. Another strategy that we have evaluated is "blobless cloning", in which only blobs are retrieved on demand. Treeless cloning is slightly faster than blobless cloning during `git clone` and has not caused any issues so far, so we decided to go ahead with it.
The downside to treeless cloning is that blobs and trees need to be downloaded every time Macaron checks out a version of a repo that it has not checked out before. However, since these are only the blobs and trees of a single commit rather than the whole history of the repo, this is relatively fast (within a few seconds). This is a trade-off that we are more than happy to make, since a few seconds per checkout is small compared to the few minutes a full clone can take. At the same time, optimizing for first-time use is important, as we want to provide a good user experience for first-time users of Macaron.
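As a rough illustration of the two strategies compared above, the sketch below builds the `git clone` argument list for each. The helper name `build_clone_command` and the strategy labels are hypothetical and are not taken from Macaron; `tree:0` and `blob:none` are the git filter specs for treeless and blobless clones.

```python
from __future__ import annotations

# Filter specs accepted by `git clone --filter=...`.
# The dictionary keys are labels made up for this sketch.
FILTER_SPECS = {
    "treeless": "tree:0",     # trees and blobs are fetched on demand
    "blobless": "blob:none",  # only blobs are fetched on demand
}


def build_clone_command(url: str, dest: str, strategy: str | None = "treeless") -> list[str]:
    """Return the argv for cloning `url` into `dest` (hypothetical helper).

    Passing `strategy=None` yields a plain full clone.
    """
    cmd = ["git", "clone"]
    if strategy is not None:
        cmd.append(f"--filter={FILTER_SPECS[strategy]}")
    cmd += [url, dest]
    return cmd
```

After a clone made this way, checking out a commit that has not been checked out before makes git fetch the missing trees and blobs for that commit from the remote.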
To learn more about git partial cloning, please see the following resources:
Implementation details
Use subprocess to perform the cloning
In this commit.
Previously, we used `GitPython`'s `Repo.clone_from` (implementation) method to perform the cloning. However, after further investigation, we have decided to move away from this method and use the `subprocess.run` function to run the `git clone --filter=tree:0` command directly, for the following reasons:
- We get better control over the environment variables in the shell where `git clone` is executed (e.g. `GIT_TERMINAL_PROMPT`).
- The errors raised by `GitPython` are usually not well-documented (according to our past experience) and require us to read their actual implementation to make sure we catch all errors that could be raised.
At the same time, according to our investigation, there is no particular feature in `GitPython`'s `Repo.clone_from` that we can benefit from. Hence, we settle on using `subprocess.run` for now.
Use a context manager to patch the environment variables in `os.environ`
This PR adds a utility context manager to patch `os.environ`. This is helpful to better control the environment variables in the shell where a `subprocess.run` is executed.
In this PR specifically, the context manager is used with the `git clone` command to set `GIT_TERMINAL_PROMPT=0`.
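A minimal sketch of how such a context manager and the clone call could fit together follows. The names `patched_env` and `clone_treeless` are assumptions made for illustration; the utility actually added in this PR may differ in name, signature, and behavior.

```python
import os
import subprocess
from contextlib import contextmanager
from typing import Iterator, Mapping


@contextmanager
def patched_env(patch: Mapping[str, str]) -> Iterator[None]:
    """Temporarily apply `patch` to os.environ, restoring it on exit."""
    original = dict(os.environ)
    os.environ.update(patch)
    try:
        yield
    finally:
        # Restore the exact environment that existed before entering.
        os.environ.clear()
        os.environ.update(original)


def clone_treeless(url: str, dest: str) -> None:
    """Run a treeless clone with terminal prompts disabled (hypothetical helper)."""
    with patched_env({"GIT_TERMINAL_PROMPT": "0"}):
        # Child processes inherit os.environ by default, so git sees the patch.
        subprocess.run(
            ["git", "clone", "--filter=tree:0", url, dest],
            check=True,           # raise CalledProcessError on a non-zero exit
            capture_output=True,  # keep git's output out of the console
        )
```

With `GIT_TERMINAL_PROMPT=0`, git fails instead of prompting on the terminal for credentials, so a clone that needs authentication surfaces as a `subprocess.CalledProcessError` rather than blocking on input.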