perf: use partial clone to reduce clone time #389

Merged
merged 9 commits into from
Nov 3, 2023

@nathanwn nathanwn commented Jul 28, 2023

Rationale

To analyze a repository, Macaron currently needs a full clone of it. This takes a long time for large repositories (cloning google/guava, for example, can take a few minutes).

Subsequent runs of Macaron on the same repository are fast because Macaron stores already-cloned repositories in output/git_repos/..., but the first analysis can be slow. For first-time users, or for someone who just wants to try out Macaron, this behavior is discouraging. For example, our tutorial here uses guava as one of the example dependencies. Because guava is a large repository, a full clone of it takes a long time (at the moment, the full clone size is more than 400MB).

Optimizing clone time with treeless cloning

General Idea

For a single macaron analyze run, we only care about one revision of a repository rather than the whole repository. A full clone is therefore not the best option for a single run, since it downloads far more than necessary.

To optimize cloning time, we should follow an on-demand approach, i.e. only download things required for a single analysis. At the same time, we also need to make sure that other git operations that Macaron needs to carry out still work properly. One key operation is checking out a specific commit, branch, or tag of a repo.

Note about shallow clone

Shallow cloning is not considered, because a shallow clone does not keep a local copy of a repo's full history (the commit graph). In general, we do not plan to support shallow cloning at all, since it creates more cases to handle and is known to have caused issues in the past.

Git filter spec and partial cloning

The selected approach utilizes the git feature of filter-spec to do partial cloning.

There are 3 basic object types in git:

  • blobs: git's representation for files
  • trees: git's representation for directories
  • commits

These form a hierarchy of nodes, in which:

  • commits link with each other to form a "commit graph", which is the actual history of a repository.
  • each commit has a tree (the root directory of that revision) as its child.
  • each tree has blobs and other trees (i.e. subdirectories) as its children

While the commit graph should be available locally at all times to enable git checkout, blobs and trees can be downloaded on demand, i.e. the blobs and trees under a commit are downloaded only when that commit is checked out. This approach is generally called "partial cloning".

The exact partial cloning strategy we use is called "treeless cloning", which can be achieved by passing the argument --filter=tree:0 to git clone. With this approach, all blobs and trees are downloaded on demand. Another strategy that we evaluated is "blobless cloning", in which only blobs are retrieved on demand. Treeless cloning is slightly faster than blobless cloning during git clone and has not caused any issues so far, so we decided to go ahead with it.
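As a rough illustration of how the two strategies differ on the command line, the sketch below builds the git clone argument list for each one. The function name build_clone_command and the FILTER_SPECS mapping are hypothetical names chosen for this example and are not part of Macaron; only the filter specs tree:0 and blob:none come from git itself.

```python
# Filter specs that `git clone --filter=<spec>` accepts for the two
# partial-clone strategies discussed above.
FILTER_SPECS = {
    "treeless": "tree:0",     # fetch trees and blobs on demand
    "blobless": "blob:none",  # fetch only blobs on demand
}


def build_clone_command(url: str, dest: str, strategy: str = "treeless") -> list:
    """Return the argument list for a partial clone of `url` into `dest`.

    Hypothetical helper for illustration only; it does not run git.
    """
    if strategy not in FILTER_SPECS:
        raise ValueError(f"unknown partial clone strategy: {strategy!r}")
    # `--` separates options from the positional repository and directory.
    return ["git", "clone", f"--filter={FILTER_SPECS[strategy]}", "--", url, dest]
```

Passing the result to a process runner as a list (rather than as a single shell string) avoids quoting problems with unusual URLs or paths.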

The downside of treeless cloning is that blobs and trees need to be downloaded every time Macaron checks out a version of a repo that it has not checked out before. However, since these are only the blobs and trees of a single commit rather than of the whole history of the repo, this is relatively fast (within a few seconds). We are more than happy to make this trade-off, since a few seconds per checkout is small compared to the few minutes a full clone takes. At the same time, optimizing for first-time use is important, as we want to provide a good experience for first-time users of Macaron.

To learn more about git partial cloning, please see the following resources:

Implementation details

Use subprocess to perform the cloning

In this commit.
Previously, we used GitPython's Repo.clone_from (implementation) method to perform the cloning. After further investigation, we have decided to move away from this method and use the subprocess.run function to run the git clone --filter=tree:0 command directly, for the following reasons:

  • We could directly handle the captured output from the command.
  • We have better control of the environment variables passed to the shell in which git clone is executed (e.g. GIT_TERMINAL_PROMPT).
  • We have better control of error handling. Errors raised by GitPython are usually not well-documented (according to our past experience) and require us to read their actual implementation to make sure we catch all errors that could be raised.

At the same time, according to our investigation, there is no particular feature in GitPython's Repo.clone_from that we would benefit from. Hence, we have settled on subprocess.run for now.
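The three reasons above can be sketched in one small function. This is a minimal illustration under stated assumptions, not the code in this PR: the names clone_treeless and CloneError are hypothetical, and the environment is passed through the env argument of subprocess.run for brevity, whereas the PR patches os.environ with a context manager.

```python
import os
import subprocess
from typing import Optional


class CloneError(Exception):
    """Hypothetical error type for a failed clone (not Macaron's own)."""


def clone_treeless(url: str, dest: str, timeout: Optional[float] = None) -> str:
    """Run `git clone --filter=tree:0` and return git's captured stderr."""
    # Copy the parent environment and disable interactive credential prompts,
    # so that git fails instead of waiting for input on a terminal.
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")
    try:
        result = subprocess.run(
            ["git", "clone", "--filter=tree:0", "--", url, dest],
            capture_output=True,  # keep stdout/stderr for logging
            text=True,            # decode output to str
            check=True,           # raise on a non-zero exit code
            env=env,
            timeout=timeout,
        )
    except subprocess.CalledProcessError as error:
        raise CloneError(f"git clone failed: {error.stderr.strip()}") from error
    except (subprocess.TimeoutExpired, OSError) as error:
        # OSError covers the case where the git executable is not found.
        raise CloneError(f"could not run git clone: {error}") from error
    # git writes its progress and status messages to stderr.
    return result.stderr
```

The full set of exceptions that can escape the call is visible in the function itself, which is the error-handling control that the list above refers to.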

Use a context manager to patch the environment variables in os.environ

This PR adds a utility context manager to patch os.environ. This is helpful to better control the environment variables in the shell where a subprocess.run is executed.

In this PR specifically, the context manager is used with the git clone command to set GIT_TERMINAL_PROMPT=0.
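A context manager of this kind could look roughly like the sketch below. The name patched_env, its signature, and the convention that a value of None unsets a variable are assumptions made for this example; the utility added in src/macaron/env.py may differ.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Mapping, Optional


@contextmanager
def patched_env(patch: Mapping[str, Optional[str]]) -> Iterator[None]:
    """Temporarily apply `patch` to os.environ.

    A value of None removes the variable for the duration of the block.
    The previous state of every patched variable is restored on exit,
    even if the block raises.
    """
    # Remember the old values; None here means "was not set".
    saved = {key: os.environ.get(key) for key in patch}
    try:
        for key, value in patch.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old
```

A subprocess.run call made inside `with patched_env({"GIT_TERMINAL_PROMPT": "0"}):` without an explicit env argument inherits the patched environment, and the variable returns to its previous state once the block exits.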

@tromai tromai marked this pull request as ready for review October 26, 2023 23:45
@tromai tromai removed their request for review October 26, 2023 23:45
@tromai tromai added the git_operations The issues related to the git operations that Macaron makes on the target repository label Oct 31, 2023
@tromai tromai merged commit ec4e190 into staging Nov 3, 2023
9 checks passed
@tromai tromai deleted the perf-partial-clone branch November 3, 2023 01:42