Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack-objects: add --path-walk option for better deltas #1813

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Commits on Nov 8, 2024

  1. path-walk: introduce an object walk by path

    In anticipation of a few planned applications, introduce the most basic form
    of a path-walk API. It currently assumes that there are no UNINTERESTING
    objects, and does not include any complicated filters. It calls a function
    pointer on groups of tree and blob objects as grouped by path. This only
    includes objects the first time they are discovered, so an object that
    appears at multiple paths will not be included in two batches.
    
    These batches are collected in 'struct type_and_oid_list' objects, which
    store an object type and an oid_array of objects.
    
    The data structures are documented in 'struct path_walk_context', but in
    summary the most important are:
    
      * 'paths_to_lists' is a strmap that connects a path to a
        type_and_oid_list for that path. To avoid conflicts in path names,
        we make sure that tree paths end in "/" (except the root path with
        is an empty string) and blob paths do not end in "/".
    
      * 'path_stack' is a string list that is added to in an append-only
        way. This stores the stack of our depth-first search on the heap
        instead of using recursion.
    
      * 'path_stack_pushed' is a strmap that stores path names that were
        already added to 'path_stack', to avoid repeating paths in the
        stack. Mostly, this saves us from quadratic lookups from doing
        unsorted checks into the string_list.
    
    The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
    push_to_stack() method. Call this instead of inserting into these
    structures directly.
    
    The walk_objects_by_path() method initializes these structures and
    starts walking commits from the given rev_info struct. The commits are
    used to find the list of root trees which populate the start of our
    depth-first search.
    
    The core of our depth-first search is in a while loop that continues
    while we have not indicated an early exit and our 'path_stack' still has
    entries in it. The loop body pops a path off of the stack and "visits"
    the path via the walk_path() method.
    
    The walk_path() method gets the list of OIDs from the 'path_to_lists'
    strmap and executes the callback method on that list with the given path
    and type. If the OIDs correspond to tree objects, then iterate over all
    trees in the list and run add_children() to add the child objects to
    their own lists, adding new entries to the stack if necessary.
    
    In testing, this depth-first search approach was the one that used the
    least memory while iterating over the object lists. There is still a
    chance that repositories with too-wide path patterns could cause memory
    pressure issues. Limiting the stack size could be done in the future by
    limiting how many objects are being considered in-progress, or by
    visiting blob paths earlier than trees.
    
    There are many future adaptations that could be made, but they are left for
    future updates when consumers are ready to take advantage of those features.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    b7e9b81 View commit details
    Browse the repository at this point in the history
  2. test-lib-functions: add test_cmp_sorted

    This test helper will be helpful to reduce repeated logic in
    t6601-path-walk.sh, but may be helpful elsewhere, too.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    cf2ed61 View commit details
    Browse the repository at this point in the history
  3. t6601: add helper for testing path-walk API

    Add some tests based on the current behavior, doing interesting checks
    for different sets of branches, ranges, and the --boundary option. This
    sets a baseline for the behavior and we can extend it as new options are
    introduced.
    
    Store and output a 'batch_nr' value so we can demonstrate that the paths are
    grouped together in a batch and not following some other ordering. This
    allows us to test the depth-first behavior of the path-walk API. However, we
    purposefully do not test the order of the objects in the batch, so the
    output is compared to the expected output through a sort.
    
    It is important to mention that the behavior of the API will change soon as
    we start to handle UNINTERESTING objects differently, but these tests will
    demonstrate the change in behavior.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    a3c754d View commit details
    Browse the repository at this point in the history
  4. path-walk: allow consumer to specify object types

    We add the ability to filter the object types in the path-walk API so
    the callback function is called fewer times.
    
    This adds the ability to ask for the commits in a list, as well. We
    re-use the empty string for this set of objects because these are passed
    directly to the callback function instead of being part of the
    'path_stack'.
    
    Future changes will add the ability to visit annotated tags.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    83b746f View commit details
    Browse the repository at this point in the history
  5. path-walk: visit tags and cached objects

    The rev_info that is specified for a path-walk traversal may specify
    visiting tag refs (both lightweight and annotated) and also may specify
    indexed objects (blobs and trees). Update the path-walk API to walk
    these objects as well.
    
    When walking tags, we need to peel the annotated objects until reaching
    a non-tag object. If we reach a commit, then we can add it to the
    pending objects to make sure we visit in the commit walk portion. If we
    reach a tree, then we will assume that it is a root tree. If we reach a
    blob, then we have no good path name and so add it to a new list of
    "tagged blobs".
    
    When the rev_info includes the "--indexed-objects" flag, then the
    pending set includes blobs and trees found in the cache entries and
    cache-tree. The cache entries are usually blobs, though they could be
    trees in the case of a sparse index. The cache-tree stores
    previously-hashed tree objects but these are cleared out when staging
    objects below those paths. We add tests that demonstrate this.
    
    The indexed objects come with a non-NULL 'path' value in the pending
    item. This allows us to prepopulate the 'path_to_lists' strmap with
    lists for these paths.
    
    The tricky thing about this walk is that we will want to combine the
    indexed objects walk with the commit walk, especially in the future case
    of walking objects during a command like 'git repack'.
    
    Whenever possible, we want the objects from the index to be grouped with
    similar objects in history. We don't want to miss any paths that appear
    only in the index and not in the commit history.
    
    Thus, we need to be careful to let the path stack be populated initially
    with only the root tree path (and possibly tags and tagged blobs) and go
    through the normal depth-first search. Afterwards, if there are other
    paths that are remaining in the paths_to_lists strmap, we should then
    iterate through the stack and visit those objects recursively.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    97765aa View commit details
    Browse the repository at this point in the history
  6. path-walk: mark trees and blobs as UNINTERESTING

    When the input rev_info has UNINTERESTING starting points, we want to be
    sure that the UNINTERESTING flag is passed appropriately through the
    objects. To match how this is done in places such as 'git pack-objects', we
    use the mark_edges_uninteresting() method.
    
    This method has an option for using the "sparse" walk, which is similar in
    spirit to the path-walk API's walk. To be sure to keep it independent, add a
    new 'prune_all_uninteresting' option to the path_walk_info struct.
    
    To check how the UNINTERSTING flag is spread through our objects, extend the
    'test-tool path-walk' command to output whether or not an object has that
    flag. This changes our tests significantly, including the removal of some
    objects that were previously visited due to the incomplete implementation.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    a4aaa3b View commit details
    Browse the repository at this point in the history
  7. pack-objects: extract should_attempt_deltas()

    This will be helpful in a future change.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    4f0d7b4 View commit details
    Browse the repository at this point in the history
  8. pack-objects: add --path-walk option

    In order to more easily compute delta bases among objects that appear at the
    exact same path, add a --path-walk option to 'git pack-objects'.
    
    This option will use the path-walk API instead of the object walk given by
    the revision machinery. Since objects will be provided in batches
    representing a common path, those objects can be tested for delta bases
    immediately instead of waiting for a sort of the full object list by
    name-hash. This has multiple benefits, including avoiding collisions by
    name-hash.
    
    The objects marked as UNINTERESTING are included in these batches, so we
    are guaranteeing some locality to find good delta bases.
    
    After the individual passes are done on a per-path basis, the default
    name-hash is used to find other opportunistic delta bases that did not
    match exactly by the full path name.
    
    The current implementation performs delta calculations while walking
    objects, which is not ideal for a few reasons. First, this will cause
    the "Enumerating objects" phase to be much longer than usual. Second, it
    does not take advantage of threading during the path-scoped delta
    calculations. Even with this lack of threading, the path-walk option is
    sometimes faster than the usual approach. Future changes will refactor
    this code to allow for threading, but that complexity is deferred until
    later to keep this patch as simple as possible.
    
    This new walk is incompatible with some features and is ignored by
    others:
    
     * Object filters are not currently integrated with the path-walk API,
       such as sparse-checkout or tree depth. A blobless packfile could be
       integrated easily, but that is deferred for later.
    
     * Server-focused features such as delta islands, shallow packs, and
       using a bitmap index are incompatible with the path-walk API.
    
     * The path walk API is only compatible with the --revs option, not
       taking object lists or pack lists over stdin. These alternative ways
       to specify the objects currently ignores the --path-walk option
       without even a warning.
    
    Future changes will create performance tests that demonstrate the power
    of this approach.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    6f5514b View commit details
    Browse the repository at this point in the history
  9. pack-objects: update usage to match docs

    The t0450 test script verifies that builtin usage matches the synopsis
    in the documentation. Adjust the builtin to match and then remove 'git
    pack-objects' from the exception list.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    644bcfd View commit details
    Browse the repository at this point in the history
  10. p5313: add performance tests for --path-walk

    The previous change added a --path-walk option to 'git pack-objects'.
    Create a performance test that demonstrates the time and space benefits
    of the feature.
    
    In order to get an appropriate comparison, we need to avoid reusing
    deltas and recompute them from scratch.
    
    Compare the creation of a thin pack representing a small push and the
    creation of a relatively large non-thin pack.
    
    Running on my copy of the Git repository results in this data:
    
    Test                                      HEAD
    --------------------------------------------------------------
    5313.2: thin pack                         0.00(0.00+0.00)
    5313.3: thin pack size                                589
    5313.4: thin pack with --path-walk        0.00(0.00+0.00)
    5313.5: thin pack size with --path-walk               589
    5313.6: big pack                          2.76(7.19+0.27)
    5313.7: big pack size                               14.0M
    5313.8: big pack with --path-walk         5.76(6.72+0.16)
    5313.9: big pack size with --path-walk              13.2M
    
    Note that the timing is slower because there is no threading in the
    --path-walk case (yet).
    
    The cases where the --path-walk option really shines is when the default
    name-hash is overwhelmed with collisions. An open source example can be
    found in the microsoft/fluentui repo [1] at a certain commit [2].
    
    [1] https://github.com/microsoft/fluentui
    [2] e70848ebac1cd720875bccaa3026f4a9ed700e08
    
    Running the tests on this repo results in the following output:
    
    Test                                      HEAD
    --------------------------------------------------------------
    5313.2: thin pack                         0.28(0.40+0.03)
    5313.3: thin pack size                               1.2M
    5313.4: thin pack with --path-walk        0.07(0.06+0.00)
    5313.5: thin pack size with --path-walk             18.4K
    5313.6: big pack                          4.05(29.48+0.41)
    5313.7: big pack size                               19.7M
    5313.8: big pack with --path-walk         6.01(9.17+0.20)
    5313.9: big pack size with --path-walk              16.5M
    
    Notice in particular that in the small thin pack, the time performance
    has improved from 0.28s to 0.08s and this is likely due to the improved
    size of the resulting pack: 18.4K instead of 1.2M.
    
    Finally, running this on a copy of the Linux kernel repository results
    in these data points:
    
    Test                                      HEAD
    --------------------------------------------------------------
    5313.2: thin pack                         0.03(0.02+0.00)
    5313.3: thin pack size                               4.6K
    5313.4: thin pack with --path-walk        0.03(0.01+0.01)
    5313.5: thin pack size with --path-walk              4.6K
    5313.6: big pack                          21.06(60.57+1.45)
    5313.7: big pack size                              158.3M
    5313.8: big pack with --path-walk         37.65(57.83+0.67)
    5313.9: big pack size with --path-walk             152.3M
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    82069da View commit details
    Browse the repository at this point in the history
  11. pack-objects: introduce GIT_TEST_PACK_PATH_WALK

    There are many tests that validate whether 'git pack-objects' works as
    expected. Instead of duplicating these tests, add a new test environment
    variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
    when specified.
    
    This was useful in testing the implementation of the --path-walk
    implementation, especially in conjunction with test such as:
    
     - t0411-clone-from-partial.sh : One test fetches from a repo that does
       not have the boundary objects. This causes the path-based walk to
       fail. Disable the variable for this test.
    
     - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
       without a boundary object.
    
     - t5310-pack-bitmaps.sh : One test compares the case when packing with
       bitmaps to the case when packing without them. Since we disable the
       test variable when writing bitmaps, this causes a difference in the
       object list (the --path-walk option adds an extra object). Specify
       --no-path-walk in both processes for the comparison. Another test
       checks for a specific delta base, but when computing dynamically
       without using bitmaps, the base object it too small to be considered
       in the delta calculations so no base is used.
    
     - t5316-pack-delta-depth.sh : This script cares about certain delta
       choices and their chain lengths. The --path-walk option changes how
       these chains are selected, and thus changes the results of this test.
    
     - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
       the --sparse option and how it combines with --path-walk.
    
     - t5332-multi-pack-reuse.sh : This test verifies that the preferred
       pack is used for delta reuse when possible. The --path-walk option is
       not currently aware of the preferred pack at all, so finds a
       different delta base.
    
     - t7406-submodule-update.sh : When using the variable, the --depth
       option collides with the --path-walk feature, resulting in a warning
       message. Disable the variable so this warning does not appear.
    
    I want to call out one specific test change that is only temporary:
    
     - t5530-upload-pack-error.sh : One test cares specifically about an
       "unable to read" error message. Since the current implementation
       performs delta calculations within the path-walk API callback, a
       different "unable to get size" error message appears. When this
       is changed in a future refactoring, this test change can be reverted.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    eba56bf View commit details
    Browse the repository at this point in the history
  12. t5538: add tests to confirm deltas in shallow pushes

    It can be notoriously difficult to detect if delta bases are being
    computed properly during 'git push'. Construct an example where it will
    make a kilobyte worth of difference when a delta base is not found. We
    can then use the progress indicators to distinguish between bytes and
    KiB depending on whether the delta base is found and used.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    71dc5af View commit details
    Browse the repository at this point in the history
  13. repack: add --path-walk option

    Since 'git pack-objects' supports a --path-walk option, allow passing it
    through in 'git repack'. This presents interesting testing opportunities for
    comparing the different repacking strategies against each other.
    
    Add the --path-walk option to the performance tests in p5313.
    
    For the microsoft/fluentui repo [1] checked out at a specific commit [2],
    the results are very interesting:
    
    Test                                      HEAD
    ----------------------------------------------------------
    5313.2: thin pack                         0.41(0.48+0.03)
    5313.3: thin pack size                               1.2M
    5313.4: thin pack with --path-walk        0.08(0.05+0.02)
    5313.5: thin pack size with --path-walk             18.4K
    5313.6: big pack                          4.47(30.62+0.40)
    5313.7: big pack size                               19.6M
    5313.8: big pack with --path-walk         6.76(9.87+0.23)
    5313.9: big pack size with --path-walk              16.5M
    5313.10: repack                           96.87(664.29+2.75)
    5313.11: repack size                               439.5M
    5313.12: repack with --path-walk          95.68(109.90+0.92)
    5313.13: repack size with --path-walk              122.6M
    
    [1] https://github.com/microsoft/fluentui
    [2] e70848ebac1cd720875bccaa3026f4a9ed700e08
    
    This repo suffers from having a lot of paths that collide in the name
    hash, so examining them in groups by path leads to better deltas. Also,
    in this case, the single-threaded implementation is competitive with the
    full repack. This is saving time diffing files that have significant
    differences from each other.
    
    A similar, but private, repo has even more extremes during repacking:
    
    Test                                      HEAD
    -----------------------------------------------------------------
    5313.10: repack                           2138.22(11961.00+17.67)
    5313.11: repack size                                 6.4G
    5313.12: repack with --path-walk          1351.46(1418.28+3.96)
    5313.13: repack size with --path-walk              804.1M
    
    There are small benefits in size for my copy of the Git repository:
    
    Test                                      HEAD
    -----------------------------------------------------------
    5313.10: repack                           22.11(98.37+1.64)
    5313.11: repack size                               126.4M
    5313.12: repack with --path-walk          66.89(75.61+0.58)
    5313.13: repack size with --path-walk              109.6M
    
    As well as in the nodejs/node repository [3]:
    
    Test                                      HEAD
    -------------------------------------------------------------
    5313.2: thin pack                         0.01(0.01+0.00)
    5313.3: thin pack size                               1.6K
    5313.4: thin pack with --path-walk        0.02(0.01+0.00)
    5313.5: thin pack size with --path-walk              1.6K
    5313.6: big pack                          5.35(12.43+0.32)
    5313.7: big pack size                               52.2M
    5313.8: big pack with --path-walk         7.12(11.97+0.27)
    5313.9: big pack size with --path-walk              52.1M
    5313.10: repack                           87.74(342.90+4.24)
    5313.11: repack size                               739.7M
    5313.12: repack with --path-walk          212.79(245.05+1.78)
    5313.13: repack size with --path-walk              697.6M
    
    [3] https://github.com/nodejs/node
    
    This benefit also repeats in this instance in the Linux kernel repository:
    
    Test                                      HEAD
    ---------------------------------------------------------------
    5313.2: thin pack                         0.04(0.00+0.03)
    5313.3: thin pack size                               4.6K
    5313.4: thin pack with --path-walk        0.03(0.01+0.01)
    5313.5: thin pack size with --path-walk              4.6K
    5313.6: big pack                          21.16(62.81+1.35)
    5313.7: big pack size                              158.3M
    5313.8: big pack with --path-walk         36.09(55.25+0.67)
    5313.9: big pack size with --path-walk             152.2M
    5313.10: repack                           734.26(2149.62+31.24)
    5313.11: repack size                                 2.5G
    5313.12: repack with --path-walk          1457.23(1618.15+7.00)
    5313.13: repack size with --path-walk                2.2G
    
    It is important to see that even when the repository shape does not have
    many name-hash collisions, there is a slight space boost to be found
    using this method. Also, there is no known case where the space is
    worse with --path-walk. This is of course due to the second pass where
    all objects to be packed are sorted in the usual way and checked for
    deltas. This second pass is usually very fast as the path-walk has
    primed many objects with quality deltas that short-circuit other delta
    computation attempts.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    8d15862 View commit details
    Browse the repository at this point in the history
  14. repack: update usage to match docs

    The t0450 test script verifies that the builtin usage matches the
    synopsis in the documentation. Update 'git repack' to match and remove
    it from the exception list.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    222abd2 View commit details
    Browse the repository at this point in the history
  15. pack-objects: enable --path-walk via config

    Users may want to enable the --path-walk option for 'git pack-objects' by
    default, especially underneath commands like 'git push' or 'git repack'.
    
    This should be limited to client repositories, since the --path-walk option
    disables bitmap walks, so would be bad to include in Git servers when
    serving fetches and clones. There is potential that it may be helpful to
    consider when repacking the repository, to take advantage of improved deltas
    across historical versions of the same files.
    
    Much like how "pack.useSparse" was introduced and included in
    "feature.experimental" before being enabled by default, use the repository
    settings infrastructure to make the new "pack.usePathWalk" config enabled by
    "feature.experimental" and "feature.manyFiles".
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    318190a View commit details
    Browse the repository at this point in the history
  16. scalar: enable path-walk during push via config

    Repositories registered with Scalar are expected to be client-only
    repositories that are rather large. This means that they are more likely to
    be good candidates for using the --path-walk option when running 'git
    pack-objects', especially under the hood of 'git push'. Enable this config
    in Scalar repositories.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    a38e033 View commit details
    Browse the repository at this point in the history
  17. pack-objects: refactor path-walk delta phase

    Previously, the --path-walk option to 'git pack-objects' would compute
    deltas inline with the path-walk logic. This would make the progress
    indicator look like it is taking a long time to enumerate objects, and
    then very quickly computed deltas.
    
    Instead of computing deltas on each region of objects organized by tree,
    store a list of regions corresponding to these groups. These can later
    be pulled from the list for delta compression before doing the "global"
    delta search.
    
    This presents a new progress indicator that can be used in tests to
    verify that this stage is happening.
    
    The current implementation is not integrated with threads, but could be
    done in a future update.
    
    Since we do not attempt to sort objects by size until after exploring
    all trees, we can remove the previous change to t5530 due to a different
    error message appearing first.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    1f84e2d View commit details
    Browse the repository at this point in the history
  18. pack-objects: thread the path-based compression

    Adapting the implementation of ll_find_deltas(), create a threaded
    version of the --path-walk compression step in 'git pack-objects'.
    
    This involves adding a 'regions' member to the thread_params struct,
    allowing each thread to own a section of paths. We can simplify the way
    jobs are split because there is no value in extending the batch based on
    name-hash the way sections of the object entry array are attempted to be
    grouped. We re-use the 'list_size' and 'remaining' items for the purpose
    of borrowing work in progress from other "victim" threads when a thread
    has finished its batch of work more quickly.
    
    Using the Git repository as a test repo, the p5313 performance test
    shows that the resulting size of the repo is the same, but the threaded
    implementation gives gains of varying degrees depending on the number of
    objects being packed. (This was tested on a 16-core machine.)
    
    Test                                      HEAD~1    HEAD
    -----------------------------------------------------------------
    5313.2: thin pack                         0.00        0.00  =
    5313.3: thin pack size                     589         589  +0.0%
    5313.4: thin pack with --path-walk        0.00        0.00  =
    5313.5: thin pack size with --path-walk    589         589  +0.0%
    5313.6: big pack                          2.84        2.80  -1.4%
    5313.7: big pack size                    14.0M       14.1M  +0.3%
    5313.8: big pack with --path-walk         5.46        3.77 -31.0%
    5313.9: big pack size with --path-walk   13.2M       13.2M  -0.0%
    5313.10: repack                           22.11      21.50  -2.8%
    5313.11: repack size                     126.4M     126.2M  -0.2%
    5313.12: repack with --path-walk          66.89      26.41 -60.5%
    5313.13: repack size with --path-walk    109.6M     109.6M  +0.0%
    
    This 60% reduction in 'git repack --path-walk' time is typical across
    all repos I used for testing. What is interesting is to compare when the
    overall time improves enough to outperform the standard case. These time
    improvements correlate with repositories with data shapes that
    significantly improve their data size as well.
    
    For example, the microsoft/fluentui repo has a 439M to 122M size
    reduction, and the repack time is now 36.6 seconds with --path-walk
    compared to 95+ seconds without it:
    
    Test                                      HEAD~!    HEAD
    -----------------------------------------------------------------
    5313.2: thin pack                         0.41        0.42  +2.4%
    5313.3: thin pack size                    1.2M        1.2M  +0.0%
    5313.4: thin pack with --path-walk        0.08        0.05 -37.5%
    5313.5: thin pack size with --path-walk  18.4K       18.4K  +0.0%
    5313.6: big pack                          4.47        4.53  +1.3%
    5313.7: big pack size                    19.6M       19.7M  +0.3%
    5313.8: big pack with --path-walk         6.76        3.51 -48.1%
    5313.9: big pack size with --path-walk   16.5M       16.4M  -0.2%
    5313.10: repack                          96.87       99.05  +2.3%
    5313.11: repack size                    439.5M      439.0M  -0.1%
    5313.12: repack with --path-walk         95.68       36.55 -61.8%
    5313.13: repack size with --path-walk   122.6M      122.6M  +0.0%
    
    In a more extreme example, an internal repository that has a similar
    name-hash collision issue to microsoft/fluentui reduces its size from
    6.4G to 805M with the --path-walk option. This also reduces the
    repacking time from 2,138 seconds to 478 seconds.
    
    Test                                      HEAD~1    HEAD
    ------------------------------------------------------------------
    5313.10: repack                           2138.22   2138.19  -0.0%
    5313.11: repack size                         6.4G      6.4G  -0.0%
    5313.12: repack with --path-walk          1351.46    477.91 -64.6%
    5313.13: repack size with --path-walk      804.1M    804.1M  -0.0%
    
    Finally, the Linux kernel repository is a good test for this repacking
    time change, even though the space savings is more reasonable:
    
    Test                                      HEAD~1      HEAD
    ----------------------------------------------------------------
    5313.10: repack                           734.26   735.11  +0.1%
    5313.11: repack size                        2.5G     2.5G  -0.0%
    5313.12: repack with --path-walk         1457.23   598.17 -59.0%
    5313.13: repack size with --path-walk       2.2G     2.2G  +0.0%
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    Configuration menu
    Copy the full SHA
    bc37596 View commit details
    Browse the repository at this point in the history