path: root/pack-bitmap-write.c
2024-03-28  Merge branch 'eb/hash-transition'  (Junio C Hamano, 1 file, -1/+1)
Work to support a repository that works with both SHA-1 and SHA-256
hash algorithms has started.

* eb/hash-transition: (30 commits)
  t1016-compatObjectFormat: add tests to verify the conversion between objects
  t1006: test oid compatibility with cat-file
  t1006: rename sha1 to oid
  test-lib: compute the compatibility hash so tests may use it
  builtin/ls-tree: let the oid determine the output algorithm
  object-file: handle compat objects in check_object_signature
  tree-walk: init_tree_desc take an oid to get the hash algorithm
  builtin/cat-file: let the oid determine the output algorithm
  rev-parse: add an --output-object-format parameter
  repository: implement extensions.compatObjectFormat
  object-file: update object_info_extended to reencode objects
  object-file-convert: convert commits that embed signed tags
  object-file-convert: convert commit objects when writing
  object-file-convert: don't leak when converting tag objects
  object-file-convert: convert tag objects when writing
  object-file-convert: add a function to convert trees between algorithms
  object: factor out parse_mode out of fast-import and tree-walk into in object.h
  cache: add a function to read an OID of a specific algorithm
  tag: sign both hashes
  commit: export add_header_signature to support handling signatures on tags
  ...
2024-01-12  Merge branch 'tb/multi-pack-verbatim-reuse'  (Junio C Hamano, 1 file, -1/+8)
Streaming spans of packfile data used to be done only from a single,
primary pack in a repository with multiple packfiles. It has been
extended to allow reuse from other packfiles, too.

* tb/multi-pack-verbatim-reuse: (26 commits)
  t/perf: add performance tests for multi-pack reuse
  pack-bitmap: enable reuse from all bitmapped packs
  pack-objects: allow setting `pack.allowPackReuse` to "single"
  t/test-lib-functions.sh: implement `test_trace2_data` helper
  pack-objects: add tracing for various packfile metrics
  pack-bitmap: prepare to mark objects from multiple packs for reuse
  pack-revindex: implement `midx_pair_to_pack_pos()`
  pack-revindex: factor out `midx_key_to_pack_pos()` helper
  midx: implement `midx_preferred_pack()`
  git-compat-util.h: implement checked size_t to uint32_t conversion
  pack-objects: include number of packs reused in output
  pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
  pack-objects: prepare `write_reused_pack()` for multi-pack reuse
  pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
  pack-objects: keep track of `pack_start` for each reuse pack
  pack-objects: parameterize pack-reuse routines over a single pack
  pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
  pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
  ewah: implement `bitmap_is_empty()`
  pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
  ...
2023-12-26  treewide: remove unnecessary includes in source files  (Elijah Newren, 1 file, -3/+0)
Each of these was checked with

    gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE}

to ensure that removing the direct inclusion of the header actually
resulted in that header no longer being included at all (i.e. that no
other header pulled it in transitively).

...except for a few cases where we verified that although the header
was brought in transitively, nothing from it was directly used in that
source file. These cases were:

  * builtin/credential-cache.c
  * builtin/pull.c
  * builtin/send-pack.c

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14  pack-bitmap-write: deep-clear the `bb_commit` slab  (Taylor Blau, 1 file, -1/+8)
The `bb_commit` commit slab is used by the pack-bitmap-write machinery
to track various pieces of bookkeeping used to generate reachability
bitmaps.

Even though we clear the slab when freeing the bitmap_builder struct
(with `bitmap_builder_clear()`), there are still pointers which point
to locations in memory that have not yet been freed, resulting in a
leak.

Plug the leak by introducing a suitable `free_fn` for the `struct
bb_commit` type, and make sure it is called on each member of the slab
via the `deep_clear_bb_data()` function.

Note that it is possible for both of the arguments to `bitmap_free()`
to be NULL, but `bitmap_free()` is a noop for NULL arguments, so it is
OK to pass them unconditionally.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
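To make the deep-clear pattern concrete, here is a minimal,
self-contained C sketch. The names (`bb_commit_like`, `deep_clear`)
are hypothetical stand-ins for git's commit-slab machinery, not its
actual API; the point is that a per-element free_fn releases each
entry's owned pointers before the backing storage itself is freed.

    #include <stdlib.h>

    /* Hypothetical stand-in for `struct bb_commit`: two owned pointers. */
    struct bb_commit_like {
        void *commit_mask; /* may be NULL, like an unallocated bitmap */
        void *bitmap;      /* may be NULL */
    };

    typedef void (*free_fn)(struct bb_commit_like *);

    static void clear_bb_commit(struct bb_commit_like *ent)
    {
        /* free(NULL) is a no-op, mirroring bitmap_free()'s behavior */
        free(ent->commit_mask);
        free(ent->bitmap);
    }

    /* Deep-clear: run the free_fn over every entry, then drop the slab. */
    static void deep_clear(struct bb_commit_like *slab, size_t nr, free_fn fn)
    {
        for (size_t i = 0; i < nr; i++)
            fn(&slab[i]);
        free(slab);
    }

    int main(void)
    {
        struct bb_commit_like *slab = calloc(4, sizeof(*slab));
        slab[0].bitmap = malloc(8); /* only some entries own memory */
        deep_clear(slab, 4, clear_bb_commit);
        return 0;
    }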
2023-10-02  tree-walk: init_tree_desc take an oid to get the hash algorithm  (Eric W. Biederman, 1 file, -1/+1)
To make it possible for git ls-tree to display the tree encoded in the
hash algorithm of the oid specified to git ls-tree, update
init_tree_desc to take as a parameter the oid of the tree object.

Update all callers of init_tree_desc and init_tree_desc_gently to pass
the oid of the tree object. Use the oid of the tree object to discover
the hash algorithm of the oid and store that hash algorithm in struct
tree_desc. Use the hash algorithm in decode_tree_entry and
update_tree_entry_internal to handle reading a tree object encoded in
a hash algorithm that differs from the repository's hash algorithm.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02  pack-bitmap-write: rebuild using new bitmap when remapping  (Christian Couder, 1 file, -1/+5)
`git repack` is about to learn a new `--filter=<filter-spec>` option
and we will want to check that this option is incompatible with
`--write-bitmap-index`. Unfortunately it appears that a test like:

    test_expect_success '--filter fails with --write-bitmap-index' '
        test_must_fail \
            env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
            git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
    '

sometimes fails because, when rebuilding bitmaps, it appears that we
are reusing existing bitmap information. So instead of detecting that
some objects are missing and erroring out as it should, the `git
repack --write-bitmap-index --filter=...` command succeeds.

Let's fix that by making sure we rebuild bitmaps using new bitmaps
instead of existing ones.

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05  git-compat-util: move alloc macros to git-compat-util.h  (Calvin Wan, 1 file, -1/+0)
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for dynamic array allocation. Moving these macros to git-compat-util.h with the other alloc macros focuses alloc.[ch] to allocation for Git objects and additionally allows us to remove inclusions to alloc.h from files that solely used the above macros. Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21  object-store-ll.h: split this header out of object-store.h  (Elijah Newren, 1 file, -1/+1)
The vast majority of files including object-store.h did not need dir.h
nor khash.h. Split the header into two files, and let most just depend
upon object-store-ll.h, while letting the two callers that need it
depend on the full object-store.h.

After this patch:

    $ git grep -h include..object-store | sort | uniq -c
          2 #include "object-store.h"
        129 #include "object-store-ll.h"

Diff best viewed with `--color-moved`.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21  repository: remove unnecessary include of path.h  (Elijah Newren, 1 file, -0/+1)
This also made it clear that several .c files that depended upon path.h were missing a #include for it; add the missing includes while at it. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-24  diff.h: reduce unnecessary includes  (Elijah Newren, 1 file, -0/+1)
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-24  commit.h: reduce unnecessary includes  (Elijah Newren, 1 file, -0/+1)
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11  treewide: remove cache.h inclusion due to object.h changes  (Elijah Newren, 1 file, -1/+1)
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-04  Merge branch 'ab/remove-implicit-use-of-the-repository' into en/header-split-cache-h  (Junio C Hamano, 1 file, -1/+2)
* ab/remove-implicit-use-of-the-repository:
  libs: use "struct repository *" argument, not "the_repository"
  post-cocci: adjust comments for recent repo_* migration
  cocci: apply the "revision.h" part of "the_repository.pending"
  cocci: apply the "rerere.h" part of "the_repository.pending"
  cocci: apply the "refs.h" part of "the_repository.pending"
  cocci: apply the "promisor-remote.h" part of "the_repository.pending"
  cocci: apply the "packfile.h" part of "the_repository.pending"
  cocci: apply the "pretty.h" part of "the_repository.pending"
  cocci: apply the "object-store.h" part of "the_repository.pending"
  cocci: apply the "diff.h" part of "the_repository.pending"
  cocci: apply the "commit.h" part of "the_repository.pending"
  cocci: apply the "commit-reach.h" part of "the_repository.pending"
  cocci: apply the "cache.h" part of "the_repository.pending"
  cocci: add missing "the_repository" macros to "pending"
  cocci: sort "the_repository" rules by header
  cocci: fix incorrect & verbose "the_repository" rules
  cocci: remove dead rule from "the_repository.pending.cocci"
2023-03-28  cocci: apply the "commit.h" part of "the_repository.pending"  (Ævar Arnfjörð Bjarmason, 1 file, -1/+2)
Apply the part of "the_repository.pending.cocci" pertaining to "commit.h". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21  csum-file.h: remove unnecessary inclusion of cache.h  (Elijah Newren, 1 file, -1/+2)
With the change in the last commit to move several functions to
write-or-die.h, csum-file.h no longer needs to include cache.h.
However, removing that include forces several other C files, which
directly or indirectly depended upon csum-file.h's inclusion of
cache.h, to now be more explicit about their dependencies.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21  environment.h: move declarations for environment.c functions from cache.h  (Elijah Newren, 1 file, -0/+1)
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21  treewide: be explicit about dependence on gettext.h  (Elijah Newren, 1 file, -0/+1)
Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23  cache.h: remove dependence on hex.h; make other files include it explicitly  (Elijah Newren, 1 file, -0/+1)
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23  alloc.h: move ALLOC_GROW() functions from cache.h  (Elijah Newren, 1 file, -1/+2)
This allows us to replace includes of cache.h with includes of the much smaller alloc.h in many places. It does mean that we also need to add includes of alloc.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-13  pack-bitmap-write.c: instrument number of reused bitmaps  (Taylor Blau, 1 file, -1/+7)
When debugging bitmap generation performance, it is useful to know how many bitmaps were generated from scratch, and how many were the result of permuting the bit-order of an existing bitmap. Keep track of the latter, and emit the count as a trace2_data line to aid in debugging. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-28  pack-bitmap-write: drop unused pack_idx_entry parameters  (Jeff King, 1 file, -6/+2)
Our write_selected_commits_v1() function takes an array of pack_idx_entry structs. We used to need them for computing commit positions, but since aa30162559 (bitmap: move `get commit positions` code to `bitmap_writer_finish`, 2022-08-14), the caller passes in a separate array of positions for us. We can drop the unused array (and its matching length parameter). Likewise, when we added write_lookup_table() in 93eb41e240 (pack-bitmap-write.c: write lookup table extension, 2022-08-14), it receives the same array of positions. So its "index" parameter was never used at all. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-26  pack-bitmap-write.c: write lookup table extension  (Abhradeep Chakraborty, 1 file, -2/+89)
The bitmap lookup table extension was documented by an earlier change,
but Git does not yet know how to write that extension.

Teach Git to write the bitmap lookup table extension. The table
contains the list of `N` `<commit_pos, offset, xor_row>` triplets.
These triplets are sorted according to their commit_pos (in ascending
order). The meaning of each field in the i'th triplet is given below:

  - commit_pos stores the commit position (in the pack-index or midx).
    It is a 4 byte network byte order unsigned integer.

  - offset is the position (in the bitmap file) from which that
    commit's bitmap can be read.

  - xor_row is the position of the triplet in the lookup table whose
    bitmap is used to compress this bitmap, or `0xffffffff` if no such
    bitmap exists.

Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
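As a rough illustration, one triplet could be serialized as in the C
sketch below, assuming the 8-byte width that the bitmap-format
documentation gives for the offset field (the message above only
specifies commit_pos as 4 bytes). `lookup_triplet` and
`write_triplet()` are illustrative names, not git's own, and git's
real writer goes through its internal hashfile API rather than stdio:

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <stdio.h>

    struct lookup_triplet {
        uint32_t commit_pos; /* position in the pack-index or midx */
        uint64_t offset;     /* where this commit's bitmap starts */
        uint32_t xor_row;    /* xor base row, or 0xffffffff for none */
    };

    static void write_triplet(FILE *out, const struct lookup_triplet *t)
    {
        uint32_t pos = htonl(t->commit_pos);
        /* split the 64-bit offset into two big-endian 32-bit halves */
        uint32_t off_hi = htonl((uint32_t)(t->offset >> 32));
        uint32_t off_lo = htonl((uint32_t)(t->offset & 0xffffffffu));
        uint32_t xor_row = htonl(t->xor_row);

        fwrite(&pos, sizeof(pos), 1, out);
        fwrite(&off_hi, sizeof(off_hi), 1, out);
        fwrite(&off_lo, sizeof(off_lo), 1, out);
        fwrite(&xor_row, sizeof(xor_row), 1, out);
    }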
2022-08-26  bitmap: move `get commit positions` code to `bitmap_writer_finish`  (Abhradeep Chakraborty, 1 file, -9/+20)
The `write_selected_commits_v1` function takes care of writing commit
positions along with their corresponding bitmaps to disk. That is OK
because this `search commit position of a given commit` algorithm is
needed only once here. But in later changes of the `lookup table
extension` series, we need the same commit positions, which means we
would have to run the above-mentioned algorithm one more time.

Move the `search commit position of a given commit` algorithm to
`bitmap_writer_finish()` and use the `commit_positions` array to get
commit positions of their corresponding bitmaps.

Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-19  pack-bitmap-write: use const for hashes  (Derrick Stolee, 1 file, -1/+1)
The next change will use a const array when calling this method. There is no need for the non-const version, so let's do this cleanup quickly. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-07  Merge branch 'ab/plug-leak-in-revisions'  (Junio C Hamano, 1 file, -0/+1)
Plug the memory leaks from the trickiest API of all, the revision
walker.

* ab/plug-leak-in-revisions: (27 commits)
  revisions API: add a TODO for diff_free(&revs->diffopt)
  revisions API: have release_revisions() release "topo_walk_info"
  revisions API: have release_revisions() release "date_mode"
  revisions API: call diff_free(&revs->pruning) in revisions_release()
  revisions API: release "reflog_info" in release revisions()
  revisions API: clear "boundary_commits" in release_revisions()
  revisions API: have release_revisions() release "prune_data"
  revisions API: have release_revisions() release "grep_filter"
  revisions API: have release_revisions() release "filter"
  revisions API: have release_revisions() release "cmdline"
  revisions API: have release_revisions() release "mailmap"
  revisions API: have release_revisions() release "commits"
  revisions API users: use release_revisions() for "prune_data" users
  revisions API users: use release_revisions() with UNLEAK()
  revisions API users: use release_revisions() in builtin/log.c
  revisions API users: use release_revisions() in http-push.c
  revisions API users: add "goto cleanup" for release_revisions()
  stash: always have the owner of "stash_info" free it
  revisions API users: use release_revisions() needing REV_INFO_INIT
  revision.[ch]: document and move code declared around "init"
  ...
2022-04-13  revisions API users: add straightforward release_revisions()  (Ævar Arnfjörð Bjarmason, 1 file, -0/+1)
Add a release_revisions() to various users of "struct rev_list" in those straightforward cases where we only need to add the release_revisions() call to the end of a block, and don't need to e.g. refactor anything to use a "goto cleanup" pattern. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10  core.fsync: introduce granular fsync control infrastructure  (Neeraj Singh, 1 file, -1/+2)
This commit introduces the infrastructure for the core.fsync configuration knob. The repository components we want to sync are identified by flags so that we can turn on or off syncing for specific components. If core.fsyncObjectFiles is set and the core.fsync configuration also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any loose objects. This picks the strictest data integrity behavior if core.fsync and core.fsyncObjectFiles are set to conflicting values. This change introduces the currently unused fsync_component helper, which will be used by a later patch that adds fsyncing to the refs backend. Actual configuration and documentation of the fsync components list are in other patches in the series to separate review of the underlying mechanism from the policy of how it's configured. Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
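A minimal sketch of the flag-based scheme this message describes: each
syncable component gets one bit, the configured set is a bitmask, and
a small helper answers "should this component be fsynced?". Only
FSYNC_COMPONENT_LOOSE_OBJECT and the fsync_component helper are named
in the message above; the other enum values here, and the standalone
framing, are illustrative placeholders:

    #include <stdio.h>

    enum fsync_component {
        FSYNC_COMPONENT_LOOSE_OBJECT = 1 << 0,
        FSYNC_COMPONENT_PACK         = 1 << 1, /* placeholder name */
        FSYNC_COMPONENT_COMMIT_GRAPH = 1 << 2, /* placeholder name */
    };

    /* Bitmask built from core.fsync (and core.fsyncObjectFiles). */
    static unsigned enabled_components;

    static int fsync_component_enabled(enum fsync_component c)
    {
        return !!(enabled_components & c);
    }

    int main(void)
    {
        /* e.g. core.fsyncObjectFiles=true ORs in the loose-object bit */
        enabled_components |= FSYNC_COMPONENT_LOOSE_OBJECT;

        if (fsync_component_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
            puts("loose object writes would be fsynced");
        return 0;
    }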
2022-02-25  Merge branch 'ab/only-single-progress-at-once'  (Junio C Hamano, 1 file, -3/+3)
Further tweaks on progress API.

* ab/only-single-progress-at-once:
  pack-bitmap-write.c: don't return without stop_progress()
  progress API: unify stop_progress{,_msg}(), fix trace2 bug
  progress.c: refactor stop_progress{,_msg}() to use helpers
  progress.c: use dereferenced "progress" variable, not "(*p_progress)"
  progress.h: format and be consistent with progress.c naming
  progress.c tests: test some invalid usage
  progress.c tests: make start/stop commands on stdin
  progress.c test helper: add missing braces
  leak tests: fix a memory leak in "test-progress" helper
2022-02-03  pack-bitmap-write.c: don't return without stop_progress()  (Ævar Arnfjörð Bjarmason, 1 file, -3/+3)
Fix a bug that has been here since 7cc8f971085 (pack-objects:
implement bitmap writing, 2013-12-21): we did not call stop_progress()
if we reached the early exit in this function.

We could call stop_progress() before we return, but better yet is to
defer calling start_progress() until we need it.

For now this only matters in practice because we'd previously omit the
"region_leave" for the progress trace2 event.

Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
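The shape of the fix, as a self-contained C sketch; the progress
struct and helpers are simplified stand-ins for git's progress API,
and `write_selected_bitmaps()` is a hypothetical caller. Deferring
start_progress() past the early return means every started progress
region is stopped exactly once:

    #include <stdio.h>
    #include <stdlib.h>

    /* Simplified stand-ins for git's progress API, enough to compile. */
    struct progress { const char *title; unsigned total; };

    static struct progress *start_progress(const char *title, unsigned total)
    {
        struct progress *p = malloc(sizeof(*p));
        p->title = title;
        p->total = total;
        return p;
    }

    static void stop_progress(struct progress **p)
    {
        free(*p); /* also where a trace2 region_leave would be emitted */
        *p = NULL;
    }

    static void write_selected_bitmaps(unsigned nr_selected)
    {
        struct progress *progress;

        if (!nr_selected)
            return; /* early exit: nothing started, nothing dangles */

        progress = start_progress("Writing bitmaps", nr_selected);
        for (unsigned i = 0; i < nr_selected; i++)
            printf("%s: %u/%u\n", progress->title, i + 1, progress->total);
        stop_progress(&progress);
    }

    int main(void)
    {
        write_selected_bitmaps(0); /* exercises the early-exit path */
        write_selected_bitmaps(3);
        return 0;
    }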
2021-09-01  pack-bitmap: read multi-pack bitmaps  (Taylor Blau, 1 file, -1/+1)
This prepares the code in pack-bitmap to interpret the new multi-pack bitmaps described in Documentation/technical/bitmap-format.txt, which mostly involves converting bit positions to accommodate looking them up in a MIDX. Note that there are currently no writers who write multi-pack bitmaps, and that this will be implemented in the subsequent commit. Note also that get_midx_checksum() and get_midx_filename() are made non-static so they can be called from pack-bitmap.c. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-24  pack-bitmap-write.c: free existing bitmaps  (Taylor Blau, 1 file, -0/+1)
When writing a new bitmap, the bitmap writer code attempts to read the existing bitmap (if one is present). This is done in order to quickly permute the bits of any bitmaps for commits which appear in the existing bitmap, and were also selected for the new bitmap. But since this code was added in 341fa34887 (pack-bitmap-write: use existing bitmaps, 2020-12-08), the resources associated with opening an existing bitmap were never released. It's fine to ignore this, but it's bad hygiene. It will also cause a problem for the multi-pack-index builtin, which will be responsible not only for writing bitmaps, but also for expiring any old multi-pack bitmaps. If an existing bitmap was reused here, it will also be expired. That will cause a problem on platforms which require file resources to be closed before unlinking them, like Windows. Avoid this by ensuring we close reused bitmaps with free_bitmap_index() before removing them. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-24  pack-bitmap-write.c: gracefully fail to write non-closed bitmaps  (Taylor Blau, 1 file, -23/+53)
The set of objects covered by a bitmap must be closed under reachability, since it must be the case that there is a valid bit position assigned for every possible reachable object (otherwise the bitmaps would be incomplete). Pack bitmaps are never written from 'git repack' unless repacking all-into-one, and so we never write non-closed bitmaps (except in the case of partial clones where we aren't guaranteed to have all objects). But multi-pack bitmaps change this, since it isn't known whether the set of objects in the MIDX is closed under reachability until walking them. Plumb through a bit that is set when a reachable object isn't found. As soon as a reachable object isn't found in the set of objects to include in the bitmap, bitmap_writer_build() knows that the set is not closed, and so it now fails gracefully. A test is added in t0410 to trigger a bitmap write without full reachability closure by removing local copies of some reachable objects from a promisor remote. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-28  oid_pos(): access table through const pointers  (Jeff King, 1 file, -2/+2)
When we are looking up an oid in an array, we obviously don't need to
write to the array. Let's mark it as const in the function interfaces,
as well as in the local variables we use to dereference the void
pointer (note a few cases use pointers-to-pointers, so we mark
everything const).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-28  hash_pos(): convert to oid_pos()  (Jeff King, 1 file, -3/+3)
All of our callers are actually looking up an object_id, not a bare hash. Likewise, the arrays they are looking in are actual arrays of object_id (not just raw bytes of hashes, as we might find in a pack .idx; those are handled by bsearch_hash()). Using an object_id gives us more type safety, and makes the callers slightly shorter. It also gets rid of the word "sha1" from several access functions, though we could obviously also rename those with s/sha1/hash/. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-15  Merge branch 'ma/sha1-is-a-hash'  (Junio C Hamano, 1 file, -2/+2)
Retire more names with "sha1" in it.

* ma/sha1-is-a-hash:
  hash-lookup: rename from sha1-lookup
  sha1-lookup: rename `sha1_pos()` as `hash_pos()`
  object-file.c: rename from sha1-file.c
  object-name.c: rename from sha1-name.c
2021-01-04  hash-lookup: rename from sha1-lookup  (Martin Ågren, 1 file, -1/+1)
Change all remnants of "sha1" in hash-lookup.c and .h and rename them to reflect that we're not just able to handle SHA-1 these days. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04  sha1-lookup: rename `sha1_pos()` as `hash_pos()`  (Martin Ågren, 1 file, -1/+1)
Rename this function to reflect that we're not just able to handle SHA-1 these days. There are a few instances of "sha1" left in sha1-lookup.[ch] after this, but those will be addressed in the next commit. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08  pack-bitmap-write: better reuse bitmaps  (Derrick Stolee, 1 file, -2/+38)
If the old bitmap file contains a bitmap for a given commit, then that
commit does not need help from intermediate commits in its history to
compute its final bitmap. Eject that commit from the walk and insert
it into a separate list of reusable commits that are eventually stored
in the list of commits for computing bitmaps.

This helps the repeat bitmap computation task, even if the selected
commits shift drastically. This helps when a previously-bitmapped
commit exists in the first-parent history of a newly-selected commit.
Since we stop the walk at these commits and we use a first-parent
walk, it is harder to walk "around" these bitmapped commits. It's not
impossible, but we can greatly reduce the computation time for many
selected commits.

               |   runtime (sec)    |   peak heap (GB)   |
               |                    |                    |
               |   from  |   with   |   from  |   with   |
               | scratch | existing | scratch | existing |
    -----------+---------+----------+---------+----------
    last patch |  88.478 |   53.218 |   2.157 |    2.224 |
    this patch |  86.681 |   16.164 |   2.157 |    2.222 |

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
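The "permute the bits into the new order" reuse that this and the
surrounding commits rely on can be sketched in a few lines of
self-contained C. The flat byte-array bitmap and the `reorder` table
are simplified stand-ins for git's EWAH bitmaps and pack-order data;
the idea is just that a reused bitmap is rewritten by mapping each set
bit's old object position to its position in the new pack order:

    #include <stdint.h>
    #include <stdlib.h>

    static int get_bit(const uint8_t *bm, uint32_t pos)
    {
        return (bm[pos / 8] >> (pos % 8)) & 1;
    }

    static void set_bit(uint8_t *bm, uint32_t pos)
    {
        bm[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }

    /*
     * reorder[old_pos] == new_pos for each of nr_objects objects; the
     * result is the old bitmap expressed in the new object order.
     */
    static uint8_t *rebuild_bitmap(const uint8_t *old_bm, uint32_t nr_objects,
                                   const uint32_t *reorder)
    {
        uint8_t *new_bm = calloc((nr_objects + 7) / 8, 1);

        for (uint32_t pos = 0; pos < nr_objects; pos++)
            if (get_bit(old_bm, pos))
                set_bit(new_bm, reorder[pos]);
        return new_bm;
    }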
2020-12-08  pack-bitmap-write: relax unique revwalk condition  (Derrick Stolee, 1 file, -9/+5)
The previous commits improved the bitmap computation process for very
long, linear histories with many refs by removing quadratic growth in
how many objects were walked. The strategy of computing "intermediate
commits" using bitmasks for which refs can reach those commits
partitioned the poset of reachable objects so each part could be
walked exactly once. This was effective for linear histories.

However, there was a (significant) drawback: wide histories with many
refs had an explosion of memory costs to compute the commit bitmasks
during the exploration that discovers these intermediate commits.
Since these wide histories are unlikely to repeat walking objects, the
benefit of walking objects multiple times was not expensive before.
But now, the commit walk *before computing bitmaps* is incredibly
expensive.

In an effort to discover a happy medium, this change reduces the walk
for intermediate commits to only the first-parent history. This
focuses the walk on how the histories converge, which still has
significant reduction in repeat object walks. It is still possible to
create quadratic behavior in this version, but it is probably less
likely in realistic data shapes.

Here is some data taken on a fresh clone of the kernel:

               |   runtime (sec)    |   peak heap (GB)   |
               |                    |                    |
               |   from  |   with   |   from  |   with   |
               | scratch | existing | scratch | existing |
    -----------+---------+----------+---------+----------
    original   |  64.044 |   83.241 |   2.088 |    2.194 |
    last patch |  45.049 |   37.624 |   2.267 |    2.334 |
    this patch |  88.478 |   53.218 |   2.157 |    2.224 |

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08  pack-bitmap-write: use existing bitmaps  (Derrick Stolee, 1 file, -4/+36)
When constructing new bitmaps, we perform a commit and tree walk in
fill_bitmap_commit() and fill_bitmap_tree(). This walk would benefit
from using existing bitmaps when available. We must track the existing
bitmaps and translate them into the new object order, but this is
generally faster than parsing trees.

In fill_bitmap_commit(), we must reorder things somewhat. The priority
queue walks commits from newest-to-oldest, which means we correctly
stop walking when reaching a commit with a bitmap. However, if we walk
trees interleaved with the commits, then we might be parsing trees
that are actually part of a re-used bitmap. To avoid over-walking
trees, add them to a LIFO queue and walk them after exploring commits
completely.

On git.git, this reduces a second immediate bitmap computation from
2.0s to 1.0s. On linux.git, we go from 32s to 22s. On chromium's fork
network, we go from 227s to 198s.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08  pack-bitmap-write: ignore BITMAP_FLAG_REUSE  (Jeff King, 1 file, -45/+5)
The on-disk bitmap format has a flag to mark a bitmap to be "reused".
This is a rather curious feature, and works like this:

  - a run of pack-objects would decide to mark the last 80% of the
    bitmaps it generates with the reuse flag

  - the next time we generate bitmaps, we'd see those reuse flags from
    the last run, and mark those commits as special:

      - we'd be more likely to select those commits to get bitmaps in
        the new output

      - when generating the bitmap for a selected commit, we'd reuse
        the old bitmap as-is (rearranging the bits to match the new
        pack, of course)

However, neither of these behaviors particularly makes sense. Just
because a commit happened to be bitmapped last time does not make it a
good candidate for having a bitmap this time. In particular, we may
choose bitmaps based on how recent they are in history, or whether a
ref tip points to them, and those things will change. We're better off
re-considering fresh which commits are good candidates.

Reusing the existing bitmap _is_ a reasonable thing to do to save
computation. But only reusing exact bitmaps is a weak form of this. If
we have an old bitmap for A and now want a new bitmap for its child,
we should be able to compute that by looking only at the trees and
commits that are new to the child. But this code would consider only
exact reuse (which is perhaps why it was eager to select those commits
in the first place).

Furthermore, the recent switch to the reverse-edge algorithm for
generating bitmaps dropped this optimization entirely (and yet still
performs better).

So let's do a few cleanups:

  - drop the whole "reusing bitmaps" phase of generating bitmaps. It's
    not helping anything, and is mostly unused code (or worse, code
    that is using CPU but not doing anything useful)

  - drop the use of the on-disk reuse flag to select commits to bitmap

  - stop setting the on-disk reuse flag in bitmaps we generate (since
    nothing respects it anymore)

We will keep a few innards of the reuse code, which will help us
implement a more capable version of the "reuse" optimization:

  - simplify rebuild_existing_bitmaps() into a function that only
    builds the mapping of bits between the old and new orders, but
    doesn't actually convert any bitmaps

  - make rebuild_bitmap() public; we'll call it lazily to convert
    bitmaps as we traverse (using the mapping created above)

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08  pack-bitmap-write: build fewer intermediate bitmaps  (Derrick Stolee, 1 file, -6/+66)
The bitmap_writer_build() method calls bitmap_builder_init() to
construct a list of commits reachable from the selected commits along
with a "reverse graph". This reverse graph has edges pointing from a
commit to other commits that can reach that commit. After computing a
reachability bitmap for a commit, the values in that bitmap are then
copied to the reachability bitmaps across the edges in the reverse
graph.

We can now relax the role of the reverse graph to greatly reduce the
number of intermediate reachability bitmaps we compute during this
reverse walk. The end result is that we walk objects the same number
of times as before when constructing the reachability bitmaps, but we
also spend much less time copying bits between bitmaps and have much
lower memory pressure in the process.

The core idea is to select a set of "important" commits based on
interactions among the sets of commits reachable from each selected
commit.

The first technical concept is to create a new 'commit_mask' member in
the bb_commit struct. Note that the selected commits are provided in
an ordered array. The first thing to do is to mark the ith bit in the
commit_mask for the ith selected commit. As we walk the commit-graph,
we copy the bits in a commit's commit_mask to its parents. At the end
of the walk, the ith bit in the commit_mask for a commit C stores a
boolean representing "The ith selected commit can reach C."

As we walk, we will discover non-selected commits that are important.
We will get into this later, but those important commits must also
receive bit positions, growing the width of the bitmasks as we walk.
At the true end of the walk, the ith bit means "the ith _important_
commit can reach C."

MAXIMAL COMMITS
---------------

We use a new 'maximal' bit in the bb_commit struct to represent
whether a commit is important or not. The term "maximal" comes from
the partially-ordered set of commits in the commit-graph where C >= P
if P is a parent of C, and then extending the relationship
transitively. Instead of taking the maximal commits across the entire
commit-graph, we instead focus on selecting each commit that is
maximal among commits with the same bits on in their commit_mask. This
definition is important, so let's consider an example.

Suppose we have three selected commits A, B, and C. These are assigned
bitmasks 100, 010, and 001 to start. Each of these can be marked as
maximal immediately because they each will be the uniquely maximal
commit that contains their own bit. Keep in mind that these commits
may have different bitmasks after the walk; for example, if B can
reach C but A cannot, then the final bitmask for C is 011. Even in
these cases, C would still be a maximal commit among all commits with
the third bit on in their masks.

Now define sets X, Y, and Z to be the sets of commits reachable from
A, B, and C, respectively. The intersections of these sets correspond
to different bitmasks:

  * 100: X - (Y union Z)
  * 010: Y - (X union Z)
  * 001: Z - (X union Y)
  * 110: (X intersect Y) - Z
  * 101: (X intersect Z) - Y
  * 011: (Y intersect Z) - X
  * 111: X intersect Y intersect Z

This can be visualized with the following Hasse diagram:

    100    010    001
    | \   /  \   / |
    |  \ /    \ /  |
    |   \/     \/  |
    |   /\     /\  |
    |  /  \   /  \ |
    110    101    011
      \___   |   ___/
          \  |  /
            111

Some of these bitmasks may not be represented, depending on the
topology of the commit-graph. In fact, we are counting on it, since
the number of possible bitmasks is exponential in the number of
selected commits, but is also limited by the total number of commits.
In practice, very few bitmasks are possible because most commits
converge on a common "trunk" in the commit history.

With this three-bit example, we wish to find commits that are maximal
for each bitmask. How can we identify this as we are walking?

As we walk, we visit a commit C. Since we are walking the commits in
topo-order, we know that C is visited after all of its children are
visited. Thus, when we get C from the revision walk we inspect the
'maximal' property of its bb_data and use that to determine if C is
truly important. Its commit_mask is also nearly final. If C is not one
of the originally-selected commits, then assign a bit position to C
(by incrementing num_maximal) and set that bit on in commit_mask. See
"MULTIPLE MAXIMAL COMMITS" below for more detail on this.

Now that the commit C is known to be maximal or not, consider each
parent P of C. Compute two new values:

  * c_not_p : true if and only if the commit_mask for C contains a bit
    that is not contained in the commit_mask for P.

  * p_not_c : true if and only if the commit_mask for P contains a bit
    that is not contained in the commit_mask for C.

If c_not_p is false, then P already has all of the bits that C would
provide to its commit_mask. In this case, move on to other parents as
C has nothing to contribute to P's state that was not already provided
by other children of P.

We continue with the case that c_not_p is true. This means there are
bits in C's commit_mask to copy to P's commit_mask, so use bitmap_or()
to add those bits. If p_not_c is also true, then set the maximal bit
for P to one. This means that if no other commit has P as a parent,
then P is definitely maximal. This is because no child had the same
bitmask. It is important to think about the maximal bit for P at this
point as a temporary state: "P is maximal based on current
information."

In contrast, if p_not_c is false, then set the maximal bit for P to
zero. Further, clear all reverse_edges for P since any edges that were
previously assigned to P are no longer important. P will gain all
reverse edges based on C.

The final thing we need to do is to update the reverse edges for P.
These reverse edges represent "which closest maximal commits
contributed bits to my commit_mask?" Since C contributed bits to P's
commit_mask in this case, C must add to the reverse edges of P.

If C is maximal, then C is a 'closest' maximal commit that contributed
bits to P. Add C to P's reverse_edges list.

Otherwise, C has a list of maximal commits that contributed bits to
its bitmask (and this list is exactly one element). Add all of these
items to P's reverse_edges list. Be careful to ignore duplicates here.

After inspecting all parents P for a commit C, we can clear the
commit_mask for C. This reduces the memory load to be limited to the
"width" of the commit graph.

Consider our ABC/XYZ example from earlier and let's inspect the state
of the commits for an interesting bitmask, say 011. Suppose that D is
the only maximal commit with this bitmask (in the first three bits).
All other commits with bitmask 011 have D as the only entry in their
reverse_edges list. D's reverse_edges list contains B and C.

COMPUTING REACHABILITY BITMAPS
------------------------------

Now that we have our definition, let's zoom out and consider what
happens with our new reverse graph when computing reachability
bitmaps. We walk the reverse graph in reverse-topo-order, so we visit
commits with largest commit_masks first. After we compute the
reachability bitmap for a commit C, we push the bits in that bitmap to
each commit D in the reverse edge list for C. Then, when we finally
visit D we already have the bits for everything reachable from maximal
commits that D can reach and we only need to walk the objects in the
set-difference.

In our ABC/XYZ example, when we finally walk for the commit A we only
need to walk commits with bitmask equal to A's bitmask. If that
bitmask is 100, then we are only walking commits in X - (Y union Z)
because the bitmap already contains the bits for objects reachable
from (X intersect Y) union (X intersect Z) (i.e. the bits from the
reachability bitmaps for the maximal commits with bitmasks 110 and
101).

The behavior is intended to walk each commit (and the trees that
commit introduces) at most once while allocating and copying fewer
reachability bitmaps. There is one caveat: what happens when there are
multiple maximal commits with the same bitmask, with respect to the
initial set of selected commits?

MULTIPLE MAXIMAL COMMITS
------------------------

Earlier, we mentioned that when we discover a new maximal commit, we
assign a new bit position to that commit and set that bit position to
one for that commit. This is absolutely important for interesting
commit-graphs such as git/git and torvalds/linux. The reason is due to
the existence of "butterflies" in the commit-graph partial order. Here
is an example of four commits forming a butterfly:

    I    J
    |\  /|
    | \/ |
    | /\ |
    |/  \|
    M    N
     \  /
      \/
      Q

Here, I and J both have parents M and N. In general, these do not need
to be exact parent relationships, but reachability relationships. The
most important part is that M and N cannot reach each other, so they
are independent in the partial order. If I had commit_mask 10 and J
had commit_mask 01, then M and N would both be assigned commit_mask 11
and be maximal commits with the bitmask 11. Then, what happens when M
and N can both reach a commit Q? If Q is also assigned the bitmask 11,
then it is not maximal but is reachable from both M and N.

While this is not necessarily a deal-breaker for our abstract
definition of finding maximal commits according to a given bitmask, we
have a few issues that can come up in our larger picture of
constructing reachability bitmaps.

In particular, if we do not also consider Q to be a "maximal" commit,
then we will walk commits reachable from Q twice: once when computing
the reachability bitmap for M and another time when computing the
reachability bitmap for N. This becomes much worse if the topology
continues this pattern with multiple butterflies.

The solution has already been mentioned: each of M and N are assigned
their own bits to the bitmask and hence they become uniquely maximal
for their bitmasks. Finally, Q also becomes maximal and thus we do not
need to walk its commits multiple times. The final bitmasks for these
commits are as follows:

    I:10      J:01
    |\        /|
    | \ _____/ |
    | /\____   |
    |/      \  |
    M:111    N:1101
      \      /
       Q:1111

Further, Q's reverse edge list is { M, N }, while M and N both have
reverse edge list { I, J }.

PERFORMANCE MEASUREMENTS
------------------------

Now that we've spent a LOT of time on the theory of this algorithm,
let's show that this is actually worth all that effort.

To test the performance, use GIT_TRACE2_PERF=1 when running 'git
repack -abd' in a repository with no existing reachability bitmaps.
This avoids any issues with keeping existing bitmaps to skew the
numbers. Inspect the "building_bitmaps_total" region in the trace2
output to focus on the portion of work that is affected by this
change.

Here are the performance comparisons for a few repositories. The
timings are for the following versions of Git: "multi" is the timing
from before any reverse graph is constructed, where we might perform
multiple traversals. "reverse" is for the previous change where the
reverse graph has every reachable commit. Finally, "maximal" is the
version introduced here where the reverse graph only contains the
maximal commits.

    Repository: git/git
      multi:   2.628 sec
      reverse: 2.344 sec
      maximal: 2.047 sec

    Repository: torvalds/linux
      multi:    64.7 sec
      reverse: 205.3 sec
      maximal:  44.7 sec

So we've not only recovered any time lost to switching to the
reverse-edge algorithm, but we come out ahead of "multi" in all cases.
Likewise, peak heap has gone back to something reasonable:

    Repository: torvalds/linux
      multi:   2.087 GB
      reverse: 3.141 GB
      maximal: 2.288 GB

While I do not have access to full fork networks on GitHub, Peff has
run this algorithm on the chromium/chromium fork network and reported
a change from 3 hours to ~233 seconds. That network is particularly
beneficial for this approach because it has a long, linear history
along with many tags. The "multi" approach was obviously quadratic and
the new approach is linear.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
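The c_not_p / p_not_c step above is compact enough to capture in a
short, self-contained C sketch. A fixed-width uint64_t mask stands in
for git's growable bitmaps (so this toy handles at most 64 important
commits), `bb_commit_like` is a hypothetical stand-in for bb_commit,
and the reverse-edge bookkeeping is reduced to a comment:

    #include <stdint.h>

    struct bb_commit_like {
        uint64_t commit_mask;
        int maximal;
    };

    /*
     * Push C's mask to parent P, updating P's tentative 'maximal' bit
     * as the message above describes.
     */
    static void push_commit_mask(const struct bb_commit_like *c,
                                 struct bb_commit_like *p)
    {
        int c_not_p = (c->commit_mask & ~p->commit_mask) != 0;
        int p_not_c = (p->commit_mask & ~c->commit_mask) != 0;

        if (!c_not_p)
            return; /* C contributes nothing new to P; skip it */

        p->commit_mask |= c->commit_mask; /* the bitmap_or() step */

        /*
         * If P had bits C lacks, no child seen so far shares P's mask
         * and P stays tentatively maximal; otherwise P's mask now
         * equals C's, P is not maximal, and its reverse_edges would
         * be cleared here in the real code.
         */
        p->maximal = p_not_c;
    }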
2020-12-08  pack-bitmap-write: rename children to reverse_edges  (Derrick Stolee, 1 file, -3/+3)
The bitmap_builder_init() method walks the reachable commits in
topological order and constructs a "reverse graph" along the way. At
the moment, this reverse graph contains an edge from commit A to
commit B if and only if A is a parent of B. Thus, the name "children"
is appropriate for this reverse graph.

In the next change, we will repurpose the reverse graph to not be
directly-adjacent commits in the commit-graph, but instead a more
abstract relationship. The previous changes have already incorporated
the necessary updates to fill_bitmap_commit() that allow these edges
to not be immediate children.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08  pack-bitmap-write: fill bitmap with commit history  (Derrick Stolee, 1 file, -7/+23)
The current implementation of bitmap_writer_build() creates a reachability bitmap for every walked commit. After computing a bitmap for a commit, those bits are pushed to an in-progress bitmap for its children. fill_bitmap_commit() assumes the bits corresponding to objects reachable from the parents of a commit are already set. This means that when visiting a new commit, we only have to walk the objects reachable between it and any of its parents. A future change to bitmap_writer_build() will relax this condition so not all parents have their bits set. Prepare for that by having 'fill_bitmap_commit()' walk parents until reaching commits whose bits are already set. Then, walk the trees for these commits as well. This has no functional change with the current implementation of bitmap_writer_build(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08  pack-bitmap-write: pass ownership of intermediate bitmaps  (Jeff King, 1 file, -2/+8)
Our algorithm to generate reachability bitmaps walks through the
commit graph from the bottom up, passing bitmap data from each commit
to its descendants. For a linear stretch of history like:

    A -- B -- C

our sequence of steps is:

  - compute the bitmap for A by walking its trees, etc

  - duplicate A's bitmap as a starting point for B; we can now free
    A's bitmap, since we only needed it as an intermediate result

  - OR in any extra objects that B can reach into its bitmap

  - duplicate B's bitmap as a starting point for C; likewise, free B's
    bitmap

  - OR in objects for C, and so on...

Rather than duplicating bitmaps and immediately freeing the original,
we can just pass ownership from commit to commit. Note that this
doesn't always work:

  - the recipient may be a merge which already has an intermediate
    bitmap from its other ancestor. In that case we have to OR our
    result into it. Note that the first ancestor to reach the merge
    does get to pass ownership, though.

  - we may have multiple children; we can only pass ownership to one
    of them

However, it happens often enough and copying bitmaps is expensive
enough that this provides a noticeable speedup. On a clone of
linux.git, this reduces the time to generate bitmaps from 205s to 70s.
This is about the same amount of time it took to generate bitmaps
using our old "many traversals" algorithm (the previous commit
measures the identical scenario as taking 63s). It unfortunately
provides only a very modest reduction in the peak memory usage,
though.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
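A self-contained C sketch of the hand-off, using a plain word-array
bitmap as a hypothetical stand-in for git's bitmap type: the first
consumer steals the parent's buffer, and any later consumer (e.g. the
other parent of a merge) must OR into the bitmap that is already
there.

    #include <stdint.h>
    #include <stdlib.h>

    struct bitmap_like {
        uint64_t *words;
        size_t nr_words;
    };

    /* First child to consume the bitmap steals it: no copy, no free. */
    static void pass_ownership(struct bitmap_like *parent,
                               struct bitmap_like *child)
    {
        child->words = parent->words;
        child->nr_words = parent->nr_words;
        parent->words = NULL; /* parent no longer owns the buffer */
        parent->nr_words = 0;
    }

    /*
     * Later consumers OR into the bitmap the child already holds;
     * assumes child->nr_words >= parent->nr_words.
     */
    static void or_into(struct bitmap_like *child,
                        const struct bitmap_like *parent)
    {
        for (size_t i = 0; i < parent->nr_words; i++)
            child->words[i] |= parent->words[i];
    }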
2020-12-08  pack-bitmap-write: reimplement bitmap writing  (Jeff King, 1 file, -123/+161)
The bitmap generation code works by iterating over the set of commits
for which we plan to write bitmaps, and then for each one performing a
traditional traversal over the reachable commits and trees, filling in
the bitmap. Between two traversals, we can often reuse the previous
bitmap result as long as the first commit is an ancestor of the
second. However, our worst case is that we may end up doing "n"
complete traversals to the root in order to create "n" bitmaps.

In a real-world case (the shared-storage repo consisting of all GitHub
forks of chromium/chromium), we perform very poorly: generating
bitmaps takes ~3 hours, whereas we can walk the whole object graph in
~3 minutes.

This commit completely rewrites the algorithm, with the goal of
accessing each object only once. It works roughly like this:

  - generate a list of commits in topo-order using a single traversal

  - invert the edges of the graph (so have parents point at their
    children)

  - make one pass in reverse topo-order, generating a bitmap for each
    commit and passing the result along to child nodes

We generate correct results because each node we visit has already had
all of its ancestors added to the bitmap. And we make only two linear
passes over the commits.

We also visit each tree usually only once. When filling in a bitmap,
we don't bother to recurse into trees whose bit is already set in the
bitmap (since we know we've already done so when setting their bit).
That means that if commit A references tree T, none of its descendants
will need to open T again. I say "usually", though, because it is
possible for a given tree to be mentioned in unrelated parts of
history (e.g., cherry-picking to a parallel branch).

So we've accomplished our goal, and the resulting algorithm is pretty
simple to understand. But there are some downsides, at least with this
initial implementation:

  - we no longer reuse the results of any on-disk bitmaps when
    generating. So we'd expect to sometimes be slower than the
    original when bitmaps already exist. However, this is something
    we'll be able to add back in later.

  - we use much more memory. Instead of keeping one bitmap in memory
    at a time, we're passing them up through the graph. So our memory
    use should scale with the graph width (times the size of a
    bitmap).

So how does it perform?

For a clone of linux.git, generating bitmaps from scratch with the old
algorithm took 63s. Using this algorithm it takes 205s. Which is much
worse, but _might_ be acceptable if it behaved linearly as the size
grew. It also increases peak heap usage by ~1G. That's not impossibly
large, but not encouraging.

On the complete fork-network of torvalds/linux, it increases the peak
RAM usage by 40GB. Yikes. (I forgot to record the time it took, but
the memory usage was too much to consider this reasonable anyway.)

On the complete fork-network of chromium/chromium, I ran out of memory
before succeeding. Some back-of-the-envelope calculations indicate it
would need 80+GB to complete.

So at this stage, we've managed to make things much worse. But because
of the way this new algorithm is structured, there are a lot of
opportunities for optimization on top. We'll start implementing those
in the follow-on patches.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-06  pack-bitmap-write: use hashwrite_be32() in write_hash_cache()  (René Scharfe, 1 file, -2/+1)
Call hashwrite_be32() instead of open-coding it. This is shorter and easier to read. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
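For readers unfamiliar with the helper, the cleanup amounts to
replacing an explicit byte-order conversion plus a raw write with one
call. This standalone C illustration uses stdio and a hypothetical
write_be32() in place of git's hashfile-based hashwrite_be32():

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for hashwrite_be32(): emit one value big-endian. */
    static void write_be32(FILE *f, uint32_t value)
    {
        uint32_t be = htonl(value);
        fwrite(&be, sizeof(be), 1, f);
    }

    int main(void)
    {
        /* Before: callers open-coded htonl() plus a raw buffer write. */
        /* After: one helper call per value, shorter and clearer.      */
        write_be32(stdout, 0xdeadbeefu);
        return 0;
    }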
2019-09-06  pack-objects: drop packlist index_pos optimization  (Jeff King, 1 file, -1/+1)
Once upon a time, the code to add an object to our packing list in pack-objects all lived in a single function. It computed the position within the hash table once, then used it to check if the object was already present, and if not, to add it. Later, in 2834bc27c1 (pack-objects: refactor the packing list, 2013-10-24), this was split into two functions: packlist_find() and packlist_alloc(). We ended up with an "index_pos" variable that gets passed through several functions to make it from one to the other. The resulting code is rather confusing to follow. The "index_pos" variable is sometimes undefined, if we don't yet have a hash table. This works out in practice because in that case packlist_alloc() won't use it at all, since it will have to create/grow the hash table. But it's hard to verify that, and it does cause gcc 9.2.1's -Wmaybe-uninitialized to complain when compiled with "-flto -O3" (rightfully, since we do pass the uninitialized value as a function parameter, even if nobody ends up using it). All of this is to save computing the hash index again when we're inserting into the hash table, which I found doesn't make a measurable difference in the program runtime (which is not surprising, since we're doing all kinds of other heavyweight things for each object). Let's just drop this index_pos variable entirely, simplifying the code (and pleasing the compiler). We might be better still refactoring this custom hash table to use one of our existing implementations (an oidmap, or a kh_oid_map). I stopped short of that here, but this would be the likely first step towards that anyway. Reported-by: Stephan Beyer <s-beyer@gmx.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-20  pack-bitmap: convert khash_sha1 maps into kh_oid_map  (Jeff King, 1 file, -7/+7)
All of the users of our khash_sha1 maps actually have a "struct object_id". Let's use the more descriptive type. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-20  pack-objects: convert packlist_find() to use object_id  (Jeff King, 1 file, -1/+1)
We take a raw hash pointer, but most of our callers have a "struct
object_id" already. Let's switch to taking the full struct, which will
let us continue removing uses of raw sha1 buffers.

There are two callers that do need special attention:

  - in rebuild_existing_bitmaps(), we need to switch to
    nth_packed_object_oid(). This incurs an extra hash copy over
    pointing straight to the mmap'd sha1, but it shouldn't be
    measurable compared to the rest of the operation.

  - in can_reuse_delta() we already spent the effort to copy the sha1
    into a "struct object_id", but now we just have to do so a little
    earlier in the function (we can't easily convert that function's
    callers because they may be pointing at mmap'd REF_DELTA blocks).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-20  pack-bitmap-write: convert some helpers to use object_id  (Jeff King, 1 file, -9/+9)
A few functions take raw hash pointers, but all of their callers actually have a "struct object_id". Let's retain that extra type as long as possible (which will let future patches extend that further, and so on). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01  pack-bitmap: replace sha1_to_hex  (brian m. carlson, 1 file, -3/+3)
Replace the uses of sha1_to_hex in the pack bitmap code with hash_to_hex to allow the use of SHA-256 as well. Rename a few variables since they are no longer limited to SHA-1. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01  pack-bitmap: make bitmap header handling hash agnostic  (brian m. carlson, 1 file, -1/+1)
Increase the checksum field in struct bitmap_disk_header to be GIT_MAX_RAWSZ bytes in length and ensure that we hash the proper number of bytes out when computing the bitmap checksum. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12  pack-*.c: remove the_repository references  (Nguyễn Thái Ngọc Duy, 1 file, -3/+3)
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-30  Merge branch 'bc/hash-transition-part-15'  (Junio C Hamano, 1 file, -1/+1)
More codepaths are moving away from hardcoded hash sizes.

* bc/hash-transition-part-15:
  rerere: convert to use the_hash_algo
  submodule: make zero-oid comparison hash function agnostic
  apply: rename new_sha1_prefix and old_sha1_prefix
  apply: replace hard-coded constants
  tag: express constant in terms of the_hash_algo
  transport: use parse_oid_hex instead of a constant
  upload-pack: express constants in terms of the_hash_algo
  refs/packed-backend: express constants using the_hash_algo
  packfile: express constants in terms of the_hash_algo
  pack-revindex: express constants in terms of the_hash_algo
  builtin/fetch-pack: remove constants with parse_oid_hex
  builtin/mktree: remove hard-coded constant
  builtin/repack: replace hard-coded constants
  pack-bitmap-write: use GIT_MAX_RAWSZ for allocation
  object_id.cocci: match only expressions of type 'struct object_id'
2018-10-19  Merge branch 'nd/the-index'  (Junio C Hamano, 1 file, -1/+1)
Various codepaths in the core-ish part learn to work on an arbitrary
in-core index structure, not necessarily the default instance
"the_index".

* nd/the-index: (23 commits)
  revision.c: reduce implicit dependency the_repository
  revision.c: remove implicit dependency on the_index
  ws.c: remove implicit dependency on the_index
  tree-diff.c: remove implicit dependency on the_index
  submodule.c: remove implicit dependency on the_index
  line-range.c: remove implicit dependency on the_index
  userdiff.c: remove implicit dependency on the_index
  rerere.c: remove implicit dependency on the_index
  sha1-file.c: remove implicit dependency on the_index
  patch-ids.c: remove implicit dependency on the_index
  merge.c: remove implicit dependency on the_index
  merge-blobs.c: remove implicit dependency on the_index
  ll-merge.c: remove implicit dependency on the_index
  diff-lib.c: remove implicit dependency on the_index
  read-cache.c: remove implicit dependency on the_index
  diff.c: remove implicit dependency on the_index
  grep.c: remove implicit dependency on the_index
  diff.c: remove the_index dependency in textconv() functions
  blame.c: rename "repo" argument to "r"
  combine-diff.c: remove implicit dependency on the_index
  ...
2018-10-15pack-bitmap-write: use GIT_MAX_RAWSZ for allocationbrian m. carlson1-1/+1
Reviewed-by: Stefan Beller <sbeller@google.com> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-21revision.c: remove implicit dependency on the_indexNguyễn Thái Ngọc Duy1-1/+1
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-17Merge branch 'ds/reachable'Junio C Hamano1-0/+1
The code for computing history reachability has been shuffled around, has gained a bunch of new tests to cover it, and is being improved. * ds/reachable: commit-reach: correct accidental #include of C file commit-reach: use can_all_from_reach commit-reach: make can_all_from_reach... linear commit-reach: replace ref_newer logic test-reach: test commit_contains test-reach: test can_all_from_reach_with_flags test-reach: test reduce_heads test-reach: test get_merge_bases_many test-reach: test is_descendant_of test-reach: test in_merge_bases test-reach: create new test tool for ref_newer commit-reach: move can_all_from_reach_with_flags upload-pack: generalize commit date cutoff upload-pack: refactor ok_to_give_up() upload-pack: make reachable() more generic commit-reach: move commit_contains from ref-filter commit-reach: move ref_newer from remote.c commit.h: remove method declarations commit-reach: move walk methods from commit.c
2018-07-20commit.h: remove method declarationsDerrick Stolee1-0/+1
These methods are now declared in commit-reach.h. Remove them from commit.h and add new include statements in all files that require these declarations. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-18Merge branch 'jt/remove-pack-bitmap-global'Junio C Hamano1-2/+8
The effort to move globals to per-repository in-core structure continues. * jt/remove-pack-bitmap-global: pack-bitmap: add free function pack-bitmap: remove bitmap_git global variable
2018-07-18Merge branch 'sb/object-store-grafts'Junio C Hamano1-0/+1
The conversion to pass "the_repository" and then "a_repository" throughout the object access API continues. * sb/object-store-grafts: commit: allow lookup_commit_graft to handle arbitrary repositories commit: allow prepare_commit_graft to handle arbitrary repositories shallow: migrate shallow information into the object parser path.c: migrate global git_path_* to take a repository argument cache: convert get_graft_file to handle arbitrary repositories commit: convert read_graft_file to handle arbitrary repositories commit: convert register_commit_graft to handle arbitrary repositories commit: convert commit_graft_pos() to handle arbitrary repositories shallow: add repository argument to is_repository_shallow shallow: add repository argument to check_shallow_file_for_update shallow: add repository argument to register_shallow shallow: add repository argument to set_alternate_shallow_file commit: add repository argument to lookup_commit_graft commit: add repository argument to prepare_commit_graft commit: add repository argument to read_graft_file commit: add repository argument to register_commit_graft commit: add repository argument to commit_graft_pos object: move grafts to object parser object-store: move object access functions to object-store.h
2018-06-29Merge branch 'sb/object-store-grafts' into sb/object-store-lookupJunio C Hamano1-0/+1
* sb/object-store-grafts: commit: allow lookup_commit_graft to handle arbitrary repositories commit: allow prepare_commit_graft to handle arbitrary repositories shallow: migrate shallow information into the object parser path.c: migrate global git_path_* to take a repository argument cache: convert get_graft_file to handle arbitrary repositories commit: convert read_graft_file to handle arbitrary repositories commit: convert register_commit_graft to handle arbitrary repositories commit: convert commit_graft_pos() to handle arbitrary repositories shallow: add repository argument to is_repository_shallow shallow: add repository argument to check_shallow_file_for_update shallow: add repository argument to register_shallow shallow: add repository argument to set_alternate_shallow_file commit: add repository argument to lookup_commit_graft commit: add repository argument to prepare_commit_graft commit: add repository argument to read_graft_file commit: add repository argument to register_commit_graft commit: add repository argument to commit_graft_pos object: move grafts to object parser object-store: move object access functions to object-store.h
2018-06-21pack-bitmap: add free functionJonathan Tan1-0/+4
Add a function to free struct bitmap_index instances, and use it where needed (except when rebuild_existing_bitmaps() is used, since it creates references to the bitmaps within the struct bitmap_index passed to it). Note that the hashes field in struct bitmap_index is not freed because it points to another field within the same struct. The documentation for that field has been updated to clarify that. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
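A sketch of such a free function (the real one lives in pack-bitmap.c; the individual member cleanups are elided here):

    void free_bitmap_index(struct bitmap_index *b)
    {
            if (!b)
                    return;
            /* ... release the type bitmaps, commit bitmaps, etc. ... */
            /*
             * Note: b->hashes is NOT freed here; per the updated
             * documentation, it aliases memory owned by another
             * field of the same struct.
             */
            free(b);
    }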
2018-06-21pack-bitmap: remove bitmap_git global variableJonathan Tan1-2/+4
Remove the bitmap_git global variable. Instead, generate on demand an instance of struct bitmap_index for code that needs to access it. This allows us significant control over the lifetime of instances of struct bitmap_index. In particular, packs can now be closed without worrying if an unnecessarily long-lived "pack" field in struct bitmap_index still points to it. The bitmap API is also clearer in that we need to first obtain a struct bitmap_index, then we use it. This patch raises two potential issues: (1) memory for the struct bitmap_index is allocated without being freed, and (2) prepare_bitmap_git() and prepare_bitmap_walk() can reuse a previously loaded bitmap. For (1), this will be dealt with in a subsequent patch in this patch set that also deals with freeing the contents of the struct bitmap_index (which were not freed previously, because they have global scope). For (2), current bitmap users only load the bitmap once at most (note that pack-objects can use bitmaps or write bitmaps, but not both at the same time), so support for reuse has no effect - and future users can pass around the struct bitmap_index * obtained if they need to do 2 or more things with the same bitmap. Helped-by: Stefan Beller <sbeller@google.com> Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
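The resulting usage pattern, sketched as of this series (prepare_bitmap_git() may legitimately return NULL when no usable bitmap index exists):

    struct bitmap_index *bitmap_git = prepare_bitmap_git();
    if (!bitmap_git)
            return -1;      /* no bitmap available; fall back to a walk */
    /* ... traverse or count objects using bitmap_git ... */
    free_bitmap_index(bitmap_git);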
2018-05-30Merge branch 'js/use-bug-macro'Junio C Hamano1-1/+1
Developer support update, by using BUG() macro instead of die() to mark codepaths that should not happen more clearly. * js/use-bug-macro: BUG_exit_code: fix sparse "symbol not declared" warning Convert remaining die*(BUG) messages Replace all die("BUG: ...") calls by BUG() ones run-command: use BUG() to report bugs, not die() test-tool: help verifying BUG() code paths
2018-05-23Merge branch 'nd/pack-objects-pack-struct'Junio C Hamano1-6/+8
"git pack-objects" needs to allocate tons of "struct object_entry" while doing its work, and shrinking its size helps the performance quite a bit. * nd/pack-objects-pack-struct: ci: exercise the whole test suite with uncommon code in pack-objects pack-objects: reorder members to shrink struct object_entry pack-objects: shrink delta_size field in struct object_entry pack-objects: shrink size field in struct object_entry pack-objects: clarify the use of object_entry::size pack-objects: don't check size when the object is bad pack-objects: shrink z_delta_size field in struct object_entry pack-objects: refer to delta objects by index instead of pointer pack-objects: move in_pack out of struct object_entry pack-objects: move in_pack_pos out of struct object_entry pack-objects: use bitfield for object_entry::depth pack-objects: use bitfield for object_entry::dfs_state pack-objects: turn type and in_pack_type to bitfields pack-objects: a bit of document about struct object_entry read-cache.c: make $GIT_TEST_SPLIT_INDEX boolean
2018-05-23Merge branch 'sb/oid-object-info'Junio C Hamano1-1/+2
The codepath around object-info API has been taught to take the repository object (which in turn tells the API which object store the objects are to be located). * sb/oid-object-info: cache.h: allow oid_object_info to handle arbitrary repositories packfile: add repository argument to cache_or_unpack_entry packfile: add repository argument to unpack_entry packfile: add repository argument to read_object packfile: add repository argument to packed_object_info packfile: add repository argument to packed_to_object_type packfile: add repository argument to retry_bad_packed_offset cache.h: add repository argument to oid_object_info cache.h: add repository argument to oid_object_info_extended
2018-05-16object-store: move object access functions to object-store.hStefan Beller1-0/+1
This should make these functions easier to find and cache.h less overwhelming to read. In particular, this moves:

- read_object_file
- oid_object_info
- write_object_file

As a result, most of the codebase needs to #include object-store.h. In this patch the #include is only added to files that would fail to compile otherwise. It would be better to #include wherever identifiers from the header are used. That can happen later when we have better tooling for it. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-08Merge branch 'ds/commit-graph'Junio C Hamano1-1/+1
Precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking. * ds/commit-graph: commit-graph: implement "--append" option commit-graph: build graph from starting commits commit-graph: read only from specific pack-indexes commit: integrate commit graph with commit parsing commit-graph: close under reachability commit-graph: add core.commitGraph setting commit-graph: implement git commit-graph read commit-graph: implement git-commit-graph write commit-graph: implement write_commit_graph() commit-graph: create git-commit-graph builtin graph: add commit graph design document commit-graph: add format document csum-file: refactor finalize_hashfile() method csum-file: rename hashclose() to finalize_hashfile()
2018-05-06Replace all die("BUG: ...") calls by BUG() onesJohannes Schindelin1-1/+1
In d8193743e08 (usage.c: add BUG() function, 2017-05-12), a new macro was introduced to use for reporting bugs instead of die(). It was then subsequently used to convert one single caller in 588a538ae55 (setup_git_env: convert die("BUG") to BUG(), 2017-05-12). The cover letter of the patch series containing this patch (cf 20170513032414.mfrwabt4hovujde2@sigill.intra.peff.net) is not terribly clear why only one call site was converted, or what the plan is for other, similar calls to die() to report bugs. Let's just convert all remaining ones in one fell swoop. This trick was performed by this invocation:

    sed -i 's/die("BUG: /BUG("/g' $(git grep -l 'die("BUG' \*.c)

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
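The conversion is mechanical; for example (an illustrative call site, not one from this patch):

    /* before */
    if (he->next)
            die("BUG: hash chain corrupted");

    /* after: BUG() also aborts, making the bug easier to debug */
    if (he->next)
            BUG("hash chain corrupted");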
2018-04-26cache.h: add repository argument to oid_object_infoStefan Beller1-1/+2
Add a repository argument to allow the callers of oid_object_info to be more specific about which repository to handle. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Stefan Beller <sbeller@google.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
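The compile-time guard works by token pasting, roughly like this sketch: only `the_repository` expands to an identifier that is actually declared, so passing any other repository fails to compile:

    #define oid_object_info(r, oid, sizep) oid_object_info_##r(oid, sizep)
    int oid_object_info_the_repository(const struct object_id *oid,
                                       unsigned long *sizep);

    /* OK: expands to oid_object_info_the_repository(&oid, &size) */
    type = oid_object_info(the_repository, &oid, &size);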
2018-04-16pack-objects: move in_pack_pos out of struct object_entryNguyễn Thái Ngọc Duy1-3/+5
This field is only needed for pack-bitmap, which is an optional feature. Move it to a separate array that is only allocated when pack-bitmap is used (like objects[], it is not freed, since we need it until the end of the process). Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
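A sketch of the parallel-array layout, assuming the `packing_data` struct and `oe_*` accessor naming used by this series:

    struct packing_data {
            struct object_entry *objects;
            uint32_t nr_objects, nr_alloc;
            uint32_t *in_pack_pos;  /* allocated only when bitmaps are in use */
    };

    static inline void oe_set_in_pack_pos(struct packing_data *pack,
                                          const struct object_entry *e,
                                          uint32_t pos)
    {
            /* index the side array by the entry's offset in objects[] */
            pack->in_pack_pos[e - pack->objects] = pos;
    }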
2018-04-16pack-objects: turn type and in_pack_type to bitfieldsNguyễn Thái Ngọc Duy1-3/+3
An extra field type_valid is added to carry the equivalent of OBJ_BAD in the original "type" field. in_pack_type always contains a valid type so we only need 3 bits for it. A note about accepting OBJ_NONE as a "valid" type: the function read_object_list_from_stdin() can pass this value [1] and it eventually calls create_object_entry(), where the current code skips setting the "type" field if the incoming type is zero. This does not have any bad side effects because the "type" field should be memset()'d anyway. But since we also need to set type_valid now, skipping oe_set_type() leaves type_valid zero/false, which will make oe_type() return OBJ_BAD, not OBJ_NONE anymore. Apparently we do care about OBJ_NONE in prepare_pack(). This switch from OBJ_NONE to OBJ_BAD may trigger fatal: unable to get type of object ... Accepting OBJ_NONE [2] does sound wrong, but this is how it has been for a very long time and I haven't had time to dig in further. [1] See 5c49c11686 (pack-objects: better check_object() performances - 2007-04-16) [2] 21666f1aae (convert object type handling from a string to a number - 2007-02-26) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
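The accessors guard the narrow bitfield along the lines of this sketch:

    struct object_entry {
            /* ... */
            unsigned type_:TYPE_BITS;        /* meaningful only if type_valid */
            unsigned type_valid:1;
            unsigned in_pack_type:TYPE_BITS; /* always holds a valid type */
    };

    static inline enum object_type oe_type(const struct object_entry *e)
    {
            return e->type_valid ? e->type_ : OBJ_BAD;
    }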
2018-04-02csum-file: refactor finalize_hashfile() methodDerrick Stolee1-1/+1
If we want to use a hashfile on the temporary file for a lockfile, then we need finalize_hashfile() to fully write the trailing hash but also keep the file descriptor open. Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional change that checks this flag before writing the checksum to the stream. This differs from previous behavior since it would be written if either CSUM_CLOSE or CSUM_FSYNC is provided. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
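A sketch of the flag in use on a hashfile whose descriptor must outlive the checksum write (for instance, one backed by a lockfile):

    /*
     * Write the trailing hash into the stream, but keep the
     * descriptor open so the lockfile can still be committed
     * atomically afterwards.
     */
    finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);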
2018-04-02csum-file: rename hashclose() to finalize_hashfile()Derrick Stolee1-1/+1
The hashclose() method behaves very differently depending on the flags parameter. In particular, the file descriptor is not always closed. Perform a simple rename of "hashclose()" to "finalize_hashfile()" in preparation for functional changes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-14sha1_file: convert sha1_object_info* to object_idbrian m. carlson1-2/+1
Convert sha1_object_info and sha1_object_info_extended to take pointers to struct object_id and rename them to use "oid" instead of "sha1" in their names. Update the declaration and definition and apply the following semantic patch, plus the standard object_id transforms:

    @@
    expression E1, E2;
    @@
    - sha1_object_info(E1.hash, E2)
    + oid_object_info(&E1, E2)

    @@
    expression E1, E2;
    @@
    - sha1_object_info(E1->hash, E2)
    + oid_object_info(E1, E2)

    @@
    expression E1, E2, E3;
    @@
    - sha1_object_info_extended(E1.hash, E2, E3)
    + oid_object_info_extended(&E1, E2, E3)

    @@
    expression E1, E2, E3;
    @@
    - sha1_object_info_extended(E1->hash, E2, E3)
    + oid_object_info_extended(E1, E2, E3)

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-02csum-file: rename sha1file to hashfilebrian m. carlson1-15/+15
Rename struct sha1file to struct hashfile, along with all of its related functions. The transformation in this commit was made by global search-and-replace. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-24pack-bitmap[-write]: use `object_array_clear()`, don't leakMartin Ågren1-3/+1
Instead of setting the fields of rev->pending to 0/NULL, thereby leaking memory, call `object_array_clear(&rev->pending)`. In pack-bitmap.c, we make copies of those fields as `pending_nr` and `pending_e`. We never update the aliases and the original fields never change, so the aliases are not really needed and just make it harder than necessary to understand the code. While we're here, remove the aliases to make the code easier to follow. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
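The shape of the fix, sketched:

    /* before: drops the array's allocated storage on the floor */
    rev->pending.nr = 0;
    rev->pending.alloc = 0;
    rev->pending.objects = NULL;

    /* after: frees the storage and zeroes the fields */
    object_array_clear(&rev->pending);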
2017-05-08pack: convert struct pack_idx_entry to struct object_idbrian m. carlson1-3/+5
Convert struct pack_idx_entry to use struct object_id by changing the definition and applying the following semantic patch, plus the standard object_id transforms:

    @@
    struct pack_idx_entry E1;
    @@
    - E1.sha1
    + E1.oid.hash

    @@
    struct pack_idx_entry *E1;
    @@
    - E1->sha1
    + E1->oid.hash

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-28odb_mkstemp: write filename into strbufJeff King1-5/+7
The odb_mkstemp() function expects the caller to provide a fixed buffer to write the resulting tempfile name into. But it creates the template using snprintf without checking the return value. This means we could silently truncate the filename. In practice, it's unlikely that the truncation would end in the template-pattern that mkstemp needs to open the file. So we'd probably end up failing either way, unless the path was specially crafted. The simplest fix would be to notice the truncation and die. However, we can observe that most callers immediately xstrdup() the result anyway. So instead, let's switch to using a strbuf, which is easier for them (and isn't a big deal for the other 2 callers, who can just strbuf_release when they're done with it). Note that many of the callers used static buffers, but this was purely to avoid putting a large buffer on the stack. We never passed the static buffers out of the function, so there's no complicated memory handling we need to change. Signed-off-by: Jeff King <peff@peff.net>
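The interface change, sketched (the pattern string here is illustrative):

    /* before: fixed buffer, silently truncatable by snprintf */
    char tmpname[PATH_MAX];
    fd = odb_mkstemp(tmpname, sizeof(tmpname), "pack/tmp_pack_XXXXXX");

    /* after: the strbuf grows as needed, so no truncation is possible */
    struct strbuf tmpname = STRBUF_INIT;
    fd = odb_mkstemp(&tmpname, "pack/tmp_pack_XXXXXX");
    /* ... use tmpname.buf ... */
    strbuf_release(&tmpname);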
2017-03-28do not check odb_mkstemp return value for errorsJeff King1-2/+0
The odb_mkstemp function does not return an error; it dies on failure instead. But many of its callers compare the resulting descriptor against -1 and die themselves. Mostly this is just pointless, but it does raise a question when looking at the callers: if they show the results of the "template" buffer after a failure, what's in it? The answer is: it doesn't matter, because it cannot happen. So let's make that clear by removing the bogus error checks. In bitmap_writer_finish(), we can drop the error-handling code entirely. In the other two cases, it's shared with the open() in another code path; we can just move the error-check next to that open() call. And while we're at it, let's flesh out the function's docstring a bit to make the error behavior clear. Signed-off-by: Jeff King <peff@peff.net>
2016-09-29use QSORTRené Scharfe1-2/+1
Apply the semantic patch contrib/coccinelle/qsort.cocci to the code base, replacing calls of qsort(3) with QSORT. The resulting code is shorter and supports empty arrays with NULL pointers. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
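For example (the array and comparison function are illustrative):

    /* before: open-coded element size, and undefined behavior on
     * a NULL base pointer even when nr is 0 */
    qsort(entries, nr, sizeof(*entries), cmp_entry);

    /* after: QSORT infers the element size and tolerates empty arrays */
    QSORT(entries, nr, cmp_entry);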
2016-02-12list-objects: pass full pathname to callbacksJeff King1-2/+1
When we find a blob at "a/b/c", we currently pass this to our show_object_fn callbacks as two components: "a/b/" and "c". Callbacks which want the full value then call path_name(), which concatenates the two. But this is an inefficient interface; the path is a strbuf, and we could simply append "c" to it temporarily, then roll back the length, without creating a new copy. So we could improve this by teaching the callsites of path_name() this trick (and there are only 3). But we can also notice that no callback actually cares about the broken-down representation, and simply pass each callback the full path "a/b/c" as a string. The callback code becomes even simpler, then, as we do not have to worry about freeing an allocated buffer, nor rolling back our modification to the strbuf. This is theoretically less efficient, as some callbacks would not bother to format the final path component. But in practice this is not measurable. Since we use the same strbuf over and over, our work to grow it is amortized, and we really only pay to memcpy a few bytes. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
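The callback interface before and after, sketched:

    /* before: the path arrives broken into directory + final component */
    typedef void (*show_object_fn)(struct object *obj,
                                   const struct name_path *path,
                                   const char *component,
                                   void *data);

    /* after: one fully assembled name, e.g. "a/b/c" */
    typedef void (*show_object_fn)(struct object *obj,
                                   const char *name,
                                   void *data);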
2016-02-12list-objects: drop name_path entirelyJeff King1-1/+1
In the previous commit, we left name_path as a thin wrapper around a strbuf. This patch drops it entirely. As a result, every show_object_fn callback needs to be adjusted. However, none of their code needs to be changed at all, because the only use was to pass it to path_name(), which now handles the bare strbuf. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-11-20Remove get_object_hash.brian m. carlson1-7/+7
Convert all instances of get_object_hash to use an appropriate reference to the hash member of the oid member of struct object. This provides no functional change, as it is essentially a macro substitution. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>
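The substitution is mechanical; for example (an illustrative call site):

    /* before */
    hashcpy(entry->hash, get_object_hash(*object));

    /* after: dereference the oid member directly */
    hashcpy(entry->hash, object->oid.hash);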
2015-11-20Convert struct object to object_idbrian m. carlson1-1/+1
struct object is one of the major data structures dealing with object IDs. Convert it to use struct object_id instead of an unsigned char array. Convert get_object_hash to refer to the new member as well. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>
2015-11-20Add several uses of get_object_hash.brian m. carlson1-7/+7
Convert most instances where the sha1 member of struct object is dereferenced to use get_object_hash. Most instances that are passed to functions that have versions taking struct object_id, such as get_sha1_hex/get_oid_hex, or instances that can be trivially converted to use struct object_id instead, are not converted. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>
2014-12-12Merge branch 'jk/pack-bitmap'Junio C Hamano1-5/+3
* jk/pack-bitmap: pack-bitmap: do not use gcc packed attribute
2014-11-30pack-bitmap: do not use gcc packed attributeKarsten Blees1-5/+3
The "__attribute__" flag may be a noop on some compilers. That's OK as long as the code is correct without the attribute, but in this case it is not. We would typically end up with a struct that is 2 bytes too long due to struct padding, breaking both reading and writing of bitmaps. Instead of marshalling the data in a struct, let's just provide helpers for reading and writing the appropriate types. Besides being correct on all platforms, the result is more efficient and simpler to read. Signed-off-by: Karsten Blees <blees@dcon.de> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-09-18use REALLOC_ARRAY for changing the allocation size of arraysRené Scharfe1-2/+1
Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
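The subject line says it all; for example (illustrative variables):

    /* before: open-coded size math */
    array = xrealloc(array, alloc * sizeof(*array));

    /* after: the macro derives the element size from the pointer itself */
    REALLOC_ARRAY(array, alloc);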
2014-03-06Use hashcpy() when copying object namesSun He1-1/+1
We invented hashcpy() to keep the abstraction of "object name" behind it. Use it instead of calling memcpy() with hard-coded 20-byte length when moving object names between pieces of memory. Leave ppc/sha1.c as-is, because the function is about the SHA-1 hash algorithm whose output is and will always be 20 bytes. Helped-by: Michael Haggerty <mhagger@alum.mit.edu> Helped-by: Duy Nguyen <pclouds@gmail.com> Signed-off-by: Sun He <sunheehnus@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
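For example (an illustrative call site):

    /* before: hard-coded SHA-1 width */
    memcpy(entry->hash, sha1, 20);

    /* after: the abstraction hides the width of an object name */
    hashcpy(entry->hash, sha1);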
2013-12-30pack-bitmap: implement optional name_hash cacheVicent Marti1-2/+19
When we use pack bitmaps rather than walking the object graph, we end up with the list of objects to include in the packfile, but we do not know the path at which any tree or blob objects would be found. In a recently packed repository, this is fine. A fetch would use the paths only as a heuristic in the delta compression phase, and a fully packed repository should not need to do much delta compression. As time passes, though, we may acquire more objects on top of our large bitmapped pack. If clients fetch frequently, then they never even look at the bitmapped history, and all works as usual. However, a client who has not fetched since the last bitmap repack will have "have" tips in the bitmapped history, but "want" newer objects. The bitmaps themselves degrade gracefully in this circumstance. We manually walk the more recent bits of history, and then use bitmaps when we hit them. But we would also like to perform delta compression between the newer objects and the bitmapped objects (both to delta against what we know the user already has, but also between "new" and "old" objects that the user is fetching). The lack of pathnames makes our delta heuristics much less effective. This patch adds an optional cache of the 32-bit name_hash values to the end of the bitmap file. If present, a reader can use it to match bitmapped and non-bitmapped names during delta compression. Here are perf results for p5310:

    Test                      origin/master       HEAD^                      HEAD
    -----------------------------------------------------------------------------------------------
    5310.2: repack to disk    36.81(37.82+1.43)   47.70(48.74+1.41) +29.6%   47.75(48.70+1.51) +29.7%
    5310.3: simulated clone   30.78(29.70+2.14)   1.08(0.97+0.10)   -96.5%   1.07(0.94+0.12)   -96.5%
    5310.4: simulated fetch   3.16(6.10+0.08)     3.54(10.65+0.06)  +12.0%   1.70(3.07+0.06)   -46.2%
    5310.6: partial bitmap    36.76(43.19+1.81)   6.71(11.25+0.76)  -81.7%   4.08(6.26+0.46)   -88.9%

You can see that the time spent on an incremental fetch goes down, as our delta heuristics are able to do their work. And we save time on the partial bitmap clone for the same reason. Signed-off-by: Vicent Marti <tanoku@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
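The cached value is the pack code's 32-bit path hash, roughly along these lines (a sketch following pack-objects' name_hash):

    static uint32_t name_hash(const char *name)
    {
            uint32_t c, hash = 0;

            if (!name)
                    return 0;
            /*
             * Effectively a sortable number built from the last
             * non-whitespace characters of the path, so paths that
             * end alike (e.g. ".c" files) hash closely together.
             */
            while ((c = *name++) != 0) {
                    if (isspace(c))
                            continue;
                    hash = (hash >> 2) + (c << 24);
            }
            return hash;
    }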
2013-12-30pack-objects: implement bitmap writingVicent Marti1-0/+535
This commit further extends the functionality of `pack-objects` by allowing it to write out a `.bitmap` index next to any written packs, together with the `.idx` index that currently gets written.

If bitmap writing is enabled for a given repository (either by calling `pack-objects` with the `--write-bitmap-index` flag or by having `pack.writebitmaps` set to `true` in the config) and pack-objects is writing a packfile that would normally be indexed (i.e. not piping to stdout), we will attempt to write the corresponding bitmap index for the packfile.

Bitmap index writing happens after the packfile and its index have been successfully written to disk (`finish_tmp_packfile`). The process is performed in several steps:

1. `bitmap_writer_set_checksum`: this call stores the partial checksum for the packfile being written; the checksum will be written in the resulting bitmap index to verify its integrity

2. `bitmap_writer_build_type_index`: this call uses the array of `struct object_entry` that has just been sorted when writing out the actual packfile index to disk to generate 4 type-index bitmaps (one for each object type). These bitmaps have their nth bit set if the given object is of the bitmap's type. E.g. the nth bit of the Commits bitmap will be 1 if the nth object in the packfile index is a commit. This is a very cheap operation because the bitmap writing code has access to the metadata stored in the `struct object_entry` array, and hence the real type for each object in the packfile.

3. `bitmap_writer_reuse_bitmaps`: if an existing bitmap index exists for one of the packfiles we're trying to repack, this call will efficiently rebuild the existing bitmaps so they can be reused on the new index. All the existing bitmaps will be stored in a `reuse` hash table, and the commit selection phase will prioritize these when selecting, as they can be written directly to the new index without having to perform a revision walk to fill the bitmap. This can greatly speed up the repack of a repository that already has bitmaps.

4. `bitmap_writer_select_commits`: if bitmap writing is enabled for a given `pack-objects` run, the sequence of commits generated during the Counting Objects phase will be stored in an array. We then use that array to build up the list of selected commits. Writing a bitmap in the index for each object in the repository would be cost-prohibitive, so we use a simple heuristic to pick the commits that will be indexed with bitmaps. The current heuristics are a simplified version of JGit's original implementation. We select a higher density of commits depending on their age: the 100 most recent commits are always selected; after that we pick one commit out of every 100, and the gap increases as the commits grow older. On top of that, we make sure that every single branch that has not been merged (all the tips that would be required from a clone) gets its own bitmap, and when selecting commits between a gap, we tend to prioritize the commit with the most parents. Do note that there is no right/wrong way to perform commit selection; different selection algorithms will result in different commits being selected, but there's no such thing as "missing a commit". The bitmap walker algorithm implemented in `prepare_bitmap_walk` is able to adapt to missing bitmaps by performing manual walks that complete the bitmap: the ideal selection algorithm, however, would select the commits that are more likely to be used as roots for a walk in the future (e.g. the tips of each branch, and so on) to ensure a bitmap for them is always available.

5. `bitmap_writer_build`: this is the computationally expensive part of bitmap generation. Based on the list of commits that were selected in the previous step, we perform several incremental walks to generate the bitmap for each commit. The walks begin from the oldest commit, and are built up incrementally for each branch. E.g. consider this dag where A, B, C, D, E, F are the selected commits, and a, b, c, e are a chunk of simplified history that will not receive bitmaps:

    A---a---B--b--C--c--D
             \
              E--e--F

We start by building the bitmap for A, using A as the root for a revision walk and marking all the objects that are reachable until the walk is over. Once this bitmap is stored, we reuse the bitmap walker to perform the walk for B, assuming that once we reach A again, the walk will be terminated because A has already been SEEN on the previous walk. This process is repeated for C, and D, but when we try to generate the bitmaps for E, we can reuse neither the current walk nor the bitmap we have generated so far. What we do now is reset the walk, clear the bitmap, and perform the walk from scratch using E as the origin. This new walk, however, does not need to be completed. Once we hit B, we can look up the bitmap we have already stored for that commit and OR it with the existing bitmap we've composed so far, allowing us to limit the walk early. After all the bitmaps have been generated, another iteration through the list of commits is performed to find the best XOR offsets for compression before writing them to disk. Because of the incremental nature of these bitmaps, XORing one of them with its predecessor results in a minimal "bitmap delta" most of the time. We can write this delta to the on-disk bitmap index, and then re-compose the original bitmaps by XORing them again when loaded. This is a phase very similar to pack-objects' `find_delta` (using bitmaps instead of objects, of course), except the heuristics have been greatly simplified: we only check the 10 bitmaps before any given one to find the best compressing one. This gives good results in practice, because there is locality in the ordering of the objects (and therefore bitmaps) in the packfile.

6. `bitmap_writer_finish`: the last step in the process is serializing to disk all the bitmap data that has been generated in the two previous steps. The bitmap is written to a tmp file and then moved atomically to its final destination, using the same process as `pack-write.c:write_idx_file`.

Signed-off-by: Vicent Marti <tanoku@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
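The XOR delta in step 5 is self-inverse, so recomposition at load time is just another XOR. A minimal sketch over raw bitmap words (the real code operates on EWAH-compressed bitmaps, but the idea is the same):

    /*
     * The writer stored: stored[i] = orig[i] ^ base[i],
     * so the reader recovers: orig[i] = stored[i] ^ base[i].
     */
    for (i = 0; i < word_count; i++)
            orig[i] = stored[i] ^ base[i];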