aboutsummaryrefslogtreecommitdiffstats
path: root/commit-graph.c
AgeCommit message (Collapse)AuthorFilesLines
2024-05-07cocci: apply rules to rewrite callers of "refs" interfacesPatrick Steinhardt1-1/+2
Apply the rules that rewrite callers of "refs" interfaces to explicitly pass `struct ref_store`. The resulting patch has been applied with the `--whitespace=fix` option. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-26Merge branch 'ps/commit-graph-write-leakfix'Junio C Hamano1-11/+8
Leakfix. * ps/commit-graph-write-leakfix: commit-graph: fix memory leak when not writing graph
2024-01-16Merge branch 'jk/commit-graph-slab-clear-fix'Junio C Hamano1-0/+3
Clearing in-core repository (happens during e.g., "git fetch --recurse-submodules" with commit graph enabled) made in-core commit object in an inconsistent state by discarding the necessary data from commit-graph too early, which has been corrected. * jk/commit-graph-slab-clear-fix: commit-graph: retain commit slab when closing NULL commit_graph
2024-01-15commit-graph: fix memory leak when not writing graphPatrick Steinhardt1-11/+8
When `write_commit_graph()` bails out writing a split commit-graph early then it may happen that we have already gathered the set of existing commit-graph file names without yet determining the new merged set of files. This can result in a memory leak though because we only clear the preimage of files when we have collected the postimage. Fix this issue by dropping the condition altogether so that we always try to free both preimage and postimage filenames. As the context structure is zero-initialized this simplification is safe to do. Signed-off-by: Patrick Steinhardt <ps@pks.im> Acked-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-08Merge branch 'en/header-cleanup'Junio C Hamano1-2/+1
Remove unused header "#include". * en/header-cleanup: treewide: remove unnecessary includes in source files treewide: add direct includes currently only pulled in transitively trace2/tr2_tls.h: remove unnecessary include submodule-config.h: remove unnecessary include pkt-line.h: remove unnecessary include line-log.h: remove unnecessary include http.h: remove unnecessary include fsmonitor--daemon.h: remove unnecessary includes blame.h: remove unnecessary includes archive.h: remove unnecessary include treewide: remove unnecessary includes in source files treewide: remove unnecessary includes from header files
2024-01-05commit-graph: retain commit slab when closing NULL commit_graphJeff King1-0/+3
This fixes a regression introduced in ac6d45d11f (commit-graph: move slab-clearing to close_commit_graph(), 2023-10-03), in which running: git -c fetch.writeCommitGraph=true fetch --recurse-submodules multiple times in a freshly cloned repository causes a segfault. What happens in the second (and subsequent) runs is this: 1. We make a "struct commit" for any ref tips which we're storing (even if we already have them, they still go into FETCH_HEAD). Because the first run will have created a commit graph, we'll find those commits in the graph. The commit struct is therefore created with a NULL "maybe_tree" entry, because we can load its oid from the graph later. But to do that we need to remember that we got the commit from the graph, which is recorded in a global commit_graph_data_slab object. 2. Because we're using --recurse-submodules, we'll try to fetch each of the possible submodules. That implies creating a separate "struct repository" in-process for each submodule, which will require a later call to repo_clear(). The call to repo_clear() calls raw_object_store_clear(), which in turn calls close_object_store(), which in turn calls close_commit_graph(). And the latter frees the commit graph data slab. 3. Later, when trying to write out a new commit graph, we'll ask for their tree oid via get_commit_tree_oid(), which will see that the object is parsed but with a NULL maybe_tree field. We'd then usually pull it from the graph file, but because the slab was cleared, we don't realize that we can do so! We end up returning NULL and segfaulting. (It seems questionable that we'd write a graph entry for such a commit anyway, since we know we already have one. I didn't double-check, but that may simply be another side effect of having cleared the slab). The bug is in step (2) above. We should not be clearing the slab when cleaning up the submodule repository structs. Prior to ac6d45d11f, we did not do so because it was done inside a helper function that returned early when it saw NULL. So the behavior change from that commit is that we'll now _always_ clear the slab via repo_clear(), even if the repository being closed did not have a commit graph (and thus would have a NULL commit_graph struct). The most immediate fix is to add in a NULL check in close_commit_graph(), making it a true noop when passed in an object_store with a NULL commit_graph (it's OK to just return early, since the rest of its code is already a noop when passed NULL). That restores the pre-ac6d45d11f behavior. And that's what this patch does, along with a test that exercises it (we already have a test that uses submodules along with fetch.writeCommitGraph, but the bug only triggers when there is a subsequent fetch and when that fetch uses --recurse-submodules). So that fixes the regression in the least-risky way possible. I do think there's some fragility here that we might want to follow up on. We have a global commit_graph_data_slab that contains graph positions, and our global commit structs depend on the that slab remaining valid. But close_commit_graph() is just about closing _one_ object store's graph. So it's dangerous to call that function and clear the slab without also throwing away any "struct commit" we might have parsed that depends on it. Which at first glance seems like a bug we could already trigger. In the situation described here, there is no commit graph in the submodule repository, so our commit graph is NULL (in fact, in our test script there is no submodule repo at all, so we immediately return from repo_init() and call repo_clear() only to free up memory). But what would happen if there was one? Wouldn't we see a non-NULL commit_graph entry, and then clear the global slab anyway? The answer is "no", but for very bizarre reasons. Remember that repo_clear() calls raw_object_store_clear(), which then calls close_object_store() and thus close_commit_graph(). But before it does so, raw_object_store_clear() does something else: it frees the commit graph and sets it to NULL! So by this code path we'll _never_ see a non-NULL commit_graph struct, and thus never clear the slab. So it happens to work out. But it still seems questionable to me that we would clear a global slab (which might still be in use) when closing the commit graph. This clearing comes from 957ba814bf (commit-graph: when closing the graph, also release the slab, 2021-09-08), and was fixing a case where we really did need it to be closed (and in that case we presumably call close_object_store() more directly). So I suspect there may still be a bug waiting to happen there, as any object loaded before the call to close_object_store() may be stranded with a bogus maybe_tree entry (and thus looking at it after the call might cause an error). But I'm not sure how to trigger it, nor what the fix should look like (you probably would need to "unparse" any objects pulled from the graph). And so this patch punts on that for now in favor of fixing the recent regression in the most direct way, which should not have any other fallouts. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26treewide: remove unnecessary includes in source filesElijah Newren1-1/+0
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26treewide: add direct includes currently only pulled in transitivelyElijah Newren1-0/+1
The next commit will remove a bunch of unnecessary includes, but to do so, we need some of the lower level direct includes that files rely on to be explicitly specified. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26treewide: remove unnecessary includes in source filesElijah Newren1-1/+0
Each of these were checked with gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE} to ensure that removing the direct inclusion of the header actually resulted in that header no longer being included at all (i.e. that no other header pulled it in transitively). ...except for a few cases where we verified that although the header was brought in transitively, nothing from it was directly used in that source file. These cases were: * builtin/credential-cache.c * builtin/pull.c * builtin/send-pack.c Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-18Merge branch 'ps/commit-graph-less-paranoid'Junio C Hamano1-1/+1
Earlier we stopped relying on commit-graph that (still) records information about commits that are lost from the object store, which has negative performance implications. The default has been flipped to disable this pessimization. * ps/commit-graph-less-paranoid: commit-graph: disable GIT_COMMIT_GRAPH_PARANOIA by default
2023-12-09Merge branch 'jk/chunk-bounds-more'Junio C Hamano1-61/+33
Code clean-up for jk/chunk-bounds topic. * jk/chunk-bounds-more: commit-graph: mark chunk error messages for translation commit-graph: drop verify_commit_graph_lite() commit-graph: check order while reading fanout chunk commit-graph: use fanout value for graph size commit-graph: abort as soon as we see a bogus chunk commit-graph: clarify missing-chunk error messages commit-graph: drop redundant call to "lite" verification midx: check consistency of fanout table commit-graph: handle overflow in chunk_size checks
2023-11-26commit-graph: disable GIT_COMMIT_GRAPH_PARANOIA by defaultPatrick Steinhardt1-1/+1
In 7a5d604443 (commit: detect commits that exist in commit-graph but not in the ODB, 2023-10-31), we have introduced a new object existence check into `repo_parse_commit_internal()` so that we do not parse commits via the commit-graph that don't have a corresponding object in the object database. This new check of course comes with a performance penalty, which the commit put at around 30% for `git rev-list --topo-order`. But there are in fact scenarios where the performance regression is even higher. The following benchmark against linux.git with a fully-build commit-graph: Benchmark 1: git.v2.42.1 rev-list --count HEAD Time (mean ± σ): 658.0 ms ± 5.2 ms [User: 613.5 ms, System: 44.4 ms] Range (min … max): 650.2 ms … 666.0 ms 10 runs Benchmark 2: git.v2.43.0-rc1 rev-list --count HEAD Time (mean ± σ): 1.333 s ± 0.019 s [User: 1.263 s, System: 0.069 s] Range (min … max): 1.302 s … 1.361 s 10 runs Summary git.v2.42.1 rev-list --count HEAD ran 2.03 ± 0.03 times faster than git.v2.43.0-rc1 rev-list --count HEAD While it's a noble goal to ensure that results are the same regardless of whether or not we have a potentially stale commit-graph, taking twice as much time is a tough sell. Furthermore, we can generally assume that the commit-graph will be updated by git-gc(1) or git-maintenance(1) as required so that the case where the commit-graph is stale should not at all be common. With that in mind, default-disable GIT_COMMIT_GRAPH_PARANOIA and restore the behaviour and thus performance previous to the mentioned commit. In order to not be inconsistent, also disable this behaviour by default in `lookup_commit_in_graph()`, where the object existence check has been introduced right at its inception via f559d6d45e (revision: avoid hitting packfiles when commits are in commit-graph, 2021-08-09). This results in another speedup in commands that end up calling this function, even though it's less pronounced compared to the above benchmark. The following has been executed in linux.git with ~1.2 million references: Benchmark 1: GIT_COMMIT_GRAPH_PARANOIA=true git rev-list --all --no-walk=unsorted Time (mean ± σ): 2.947 s ± 0.003 s [User: 2.412 s, System: 0.534 s] Range (min … max): 2.943 s … 2.949 s 3 runs Benchmark 2: GIT_COMMIT_GRAPH_PARANOIA=false git rev-list --all --no-walk=unsorted Time (mean ± σ): 2.724 s ± 0.030 s [User: 2.207 s, System: 0.514 s] Range (min … max): 2.704 s … 2.759 s 3 runs Summary GIT_COMMIT_GRAPH_PARANOIA=false git rev-list --all --no-walk=unsorted ran 1.08 ± 0.01 times faster than GIT_COMMIT_GRAPH_PARANOIA=true git rev-list --all --no-walk=unsorted So whereas 7a5d604443 initially introduced the logic to start doing an object existence check in `repo_parse_commit_internal()` by default, the updated logic will now instead cause `lookup_commit_in_graph()` to stop doing the check by default. This behaviour continues to be tweakable by the user via the GIT_COMMIT_GRAPH_PARANOIA environment variable. Note that this requires us to amend some tests to manually turn on the paranoid checks again. This is because we cause repository corruption by manually deleting objects which are part of the commit graph already. These circumstances shouldn't usually happen in repositories. Reported-by: Jeff King <peff@peff.net> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09commit-graph: mark chunk error messages for translationJeff King1-9/+9
The patches from f32af12cee (Merge branch 'jk/chunk-bounds', 2023-10-23) added many new untranslated error messages. While it's unlikely for most users to see these messages at all, most of the other commit-graph error messages are translated (and likewise for the matching midx messages). Let's mark them all for consistency (and to help any poor unfortunate user who does manage to find a broken graph file). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09commit-graph: drop verify_commit_graph_lite()Jeff King1-20/+0
As we've moved all of the checks from this function directly into the chunk-reading code used by the caller (and there is only one caller), we can just drop it entirely. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09commit-graph: check order while reading fanout chunkJeff King1-12/+13
We read the fanout chunk, storing a pointer to it, but only confirm that the entries are monotonic in a final "lite" verification step. Let's move that into the actual OIDF chunk callback, so that we can report problems immediately (for all the reasons given in the previous "commit-graph: abort as soon as we see a bogus chunk" commit). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09commit-graph: use fanout value for graph sizeJeff King1-5/+3
Commit-graph, midx, and pack idx files all have both a lookup table of oids and an oid fanout table. In midx and pack idx files, we take the final entry of the fanout table as the source of truth for the number of entries, and then verify that the size of the lookup table matches that. But for commit-graph files, we do the opposite: we use the size of the lookup table as the source of truth, and then check the final fanout entry against it. As noted in 4169d89645 (commit-graph: check consistency of fanout table, 2023-10-09), either is correct. But there are a few reasons to prefer the fanout table as the source of truth: 1. The fanout entries are 32-bits on disk, and that defines the maximum number of entries we can store. But since the size of the lookup table is only bounded by the filesystem, it can be much larger. And hence computing it as the commit-graph does means that we may truncate the result when storing it in a uint32_t. 2. We read the fanout first, then the lookup table. If we're verifying the chunks as we read them, then we'd want to take the fanout as truth (we have nothing yet to check it against) and then we can check that the lookup table matches what we already know. 3. It is pointlessly inconsistent with the midx and pack idx code. Since the three have to do similar size and bounds checks, it is easier to reason about all three if they use the same approach. So this patch moves the assignment of g->num_commits to the fanout parser, and then we can check the size of the lookup chunk as soon as we try to load it. There's already a test covering this situation, which munges the final fanout entry to 2^32-1. In the current code we complain that it does not agree with the table size. But now that we treat the munged value as the source of truth, we'll complain that the lookup table is the wrong size (again, either is correct). So we'll have to update the message we expect (and likewise for an earlier test which does similar munging). There's a similar test for this situation on the midx side, but rather than making a very-large fanout value, it just truncates the lookup table. We could do that here, too, but the very-large fanout value actually shows an interesting corner case. On a 32-bit system, multiplying to find the expected table size would cause an integer overflow. Using st_mult() would detect that, but cause us to die() rather than falling back to the non-graph code path. Checking the size using division (as we do with existing chunk-size checks) avoids the overflow entirely, and the test demonstrates this when run on a 32-bit system. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09commit-graph: abort as soon as we see a bogus chunkJeff King1-16/+13
The code to read commit-graph files tries to read all of the required chunks, but doesn't abort if we can't find one (or if it's corrupted). It's only at the end of reading the file that we then do some sanity checks for NULL entries. But it's preferable to detect the errors and bail immediately, for a few reasons: 1. It's less error-prone. It's easy in the reader functions to flag an error but still end up setting some struct fields (an error I in fact made while working on this patch series). 2. It's safer. Since verifying some chunks depends on the values of other chunks, we may be depending on not-yet-verified data. I don't know offhand of any case where this can cause problems, but it's one less subtle thing to worry about in the reader code. 3. It prevents the user from seeing nonsense errors. If we're missing an OIDL chunk, then g->num_commits will be zero. And so we may complain that the size of our CDAT chunk (which should have a fixed-size record for each commit) is wrong unless it's also zero. But that's misleading; the problem is the missing OIDL chunk; the CDAT one might be fine! So let's just check the return value from read_chunk(). This is exactly how the midx chunk-reading code does it. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09commit-graph: clarify missing-chunk error messagesJeff King1-3/+3
When a required commit-graph chunk cannot be loaded, we leave its entry in the struct NULL, and then later complain that it is missing. But that's just one reason we might not have loaded it, as we also do some data quality checks. Let's switch these messages to say "missing or corrupted", which is exactly what the midx code says for the same cases. Likewise, we'll use the same phrasing and capitalization as those for consistency. And while we're here, we can mark them for translation (just like the midx ones). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09commit-graph: drop redundant call to "lite" verificationJeff King1-4/+0
The idea of verify_commit_graph_lite() is to have cheap verification checks both for everyday use of the graph files (to avoid out of bounds reads, etc) as well as for doing a full check via "commit-graph verify" (which will also check the hash, etc). But the expensive verification checks operate on a commit_graph struct, which we get by using the normal everyday-reader code! So any problem we'd find by calling it would have been found before we even got to the verify_one_commit_graph() function. Removing it simplifies the code a bit, but also frees us up to move the "lite" verification steps around within that everyday-reader code. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09commit-graph: handle overflow in chunk_size checksJeff King1-3/+3
We check the size of chunks with fixed records by multiplying the width of each record by the number of commits in the file. Like: if (chunk_size != g->num_commits * GRAPH_DATA_WIDTH) If this multiplication overflows, we may not notice a chunk is too small (which could later lead to out-of-bound reads). In the current code this is only possible for the CDAT chunk, but the reasons are quite subtle. We compute g->num_commits by dividing the size of the OIDL chunk by the hash length (since it consists of a bunch of hashes). So we know that any size_t multiplication that uses a value smaller than the hash length cannot overflow. And the CDAT records are the only ones that are larger (the others are just 4-byte records). So it's worth fixing all of these, to make it clear that they're not subject to overflow (without having to reason about seemingly unrelated code). The obvious thing to do is add an st_mult(), like: if (chunk_size != st_mult(g->num_commits, GRAPH_DATA_WIDTH)) And that certainly works, but it has one downside: if we detect an overflow, we'll immediately die(). But the commit graph is an optional file; if we run into other problems loading it, we'll generally return an error and fall back to accessing the full objects. Using st_mult() means a malformed file will abort the whole process. So instead, we can do a division like this: if (chunk_size / GRAPH_DATA_WIDTH != g->num_commits) where there's no possibility of overflow. We do lose a little bit of precision; due to integer division truncation we'd allow up to an extra GRAPH_DATA_WIDTH-1 bytes of data in the chunk. That's OK. Our main goal here is making sure we don't have too _few_ bytes, which would cause an out-of-bounds read (we could actually replace our "!=" with "<", but I think it's worth being a little pedantic, as a large mismatch could be a sign of other problems). I didn't add a test here. We'd need to generate a very large graph file in order to get g->num_commits large enough to cause an overflow. And a later patch in this series will use this same division technique in a way that is much easier to trigger in the tests. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-08Merge branch 'ps/do-not-trust-commit-graph-blindly-for-existence'Junio C Hamano1-1/+5
The codepath to traverse the commit-graph learned to notice that a commit is missing (e.g., corrupt repository lost an object), even though it knows something about the commit (like its parents) from what is in commit-graph. * ps/do-not-trust-commit-graph-blindly-for-existence: commit: detect commits that exist in commit-graph but not in the ODB commit-graph: introduce envvar to disable commit existence checks
2023-11-01commit-graph: introduce envvar to disable commit existence checksPatrick Steinhardt1-1/+5
Our `lookup_commit_in_graph()` helper tries to look up commits from the commit graph and, if it doesn't exist there, falls back to parsing it from the object database instead. This is intended to speed up the lookup of any such commit that exists in the database. There is an edge case though where the commit exists in the graph, but not in the object database. To avoid returning such stale commits the helper function thus double checks that any such commit parsed from the graph also exists in the object database. This makes the function safe to use even when commit graphs aren't updated regularly. We're about to introduce the same pattern into other parts of our code base though, namely `repo_parse_commit_internal()`. Here the extra sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was a newly introduced helper, and as such there was no performance hit by adding this sanity check. If we added `repo_parse_commit_internal()` with that sanity check right from the beginning as well, this would probably never have been an issue to begin with. But by retrofitting it with this sanity check now we do add a performance regression to preexisting code, and thus there is a desire to avoid this or at least give an escape hatch. In practice, there is no inherent reason why either of those functions should have the sanity check whereas the other one does not: either both of them are able to detect this issue or none of them should be. This also means that the default of whether we do the check should likely be the same for both. To err on the side of caution, we thus rather want to make `repo_parse_commit_internal()` stricter than to loosen the checks that we already have in `lookup_commit_in_graph()`. The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is the default, we will double check that commits looked up in the commit graph via `lookup_commit_in_graph()` also exist in the object database. This same check will also be added in `repo_parse_commit_internal()`. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-23Merge branch 'jk/chunk-bounds'Junio C Hamano1-17/+102
The codepaths that read "chunk" formatted files have been corrected to pay attention to the chunk size and notice broken files. * jk/chunk-bounds: (21 commits) t5319: make corrupted large-offset test more robust chunk-format: drop pair_chunk_unsafe() commit-graph: detect out-of-order BIDX offsets commit-graph: check bounds when accessing BIDX chunk commit-graph: check bounds when accessing BDAT chunk commit-graph: bounds-check generation overflow chunk commit-graph: check size of generations chunk commit-graph: bounds-check base graphs chunk commit-graph: detect out-of-bounds extra-edges pointers commit-graph: check size of commit data chunk midx: check size of revindex chunk midx: bounds-check large offset chunk midx: check size of object offset chunk midx: enforce chunk alignment on reading midx: check size of pack names chunk commit-graph: check consistency of fanout table midx: check size of oid lookup chunk commit-graph: check size of oid fanout chunk midx: stop ignoring malformed oid fanout chunk t: add library for munging chunk-format files ...
2023-10-13Merge branch 'jk/commit-graph-leak-fixes'Junio C Hamano1-21/+19
Leakfix. * jk/commit-graph-leak-fixes: commit-graph: clear oidset after finishing write commit-graph: free write-context base_graph_name during cleanup commit-graph: free write-context entries before overwriting commit-graph: free graph struct that was not added to chain commit-graph: delay base_graph assignment in add_graph_to_chain() commit-graph: free all elements of graph chain commit-graph: move slab-clearing to close_commit_graph() merge: free result of repo_get_merge_bases() commit-reach: free temporary list in get_octopus_merge_bases() t6700: mark test as leak-free
2023-10-09commit-graph: check bounds when accessing BIDX chunkJeff King1-2/+14
We load the bloom_filter_indexes chunk using pair_chunk(), so we have no idea how big it is. This can lead to out-of-bounds reads if it is smaller than expected, since we index it based on the number of commits found elsewhere in the graph file. We can check the chunk size up front, like we do for CDAT and other chunks with one fixed-size record per commit. The test case demonstrates the problem. It actually won't segfault, because we end up reading random data from the follow-on chunk (BDAT in this case), and the bounds checks added in the previous patch complain. But this is by no means assured, and you can craft a commit-graph file with BIDX at the end (or a smaller BDAT) that does segfault. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09commit-graph: check bounds when accessing BDAT chunkJeff King1-0/+10
When loading Bloom filters from a commit-graph file, we use the offset values in the BIDX chunk to index into the memory mapped for the BDAT chunk. But since we don't record how big the BDAT chunk is, we just trust that the BIDX offsets won't cause us to read outside of the chunk memory. A corrupted or malicious commit-graph file will cause us to segfault (in practice this isn't a very interesting attack, since commit-graph files are local-only, and the worst case is an out-of-bounds read). We can't fix this by checking the chunk size during parsing, since the data in the BDAT chunk doesn't have a fixed size (that's why we need the BIDX in the first place). So we'll fix it in two parts: 1. Record the BDAT chunk size during parsing, and then later check that the BIDX offsets we look up are within bounds. 2. Because the offsets are relative to the end of the BDAT header, we must also make sure that the BDAT chunk is at least as large as the expected header size. Otherwise, we overflow when trying to move past the header, even for an offset of "0". We can check this early, during the parsing stage. The error messages are rather verbose, but since this is not something you'd expect to see outside of severe bugs or corruption, it makes sense to err on the side of too many details. Sadly we can't mention the filename during the chunk-parsing stage, as we haven't set g->filename at this point, nor passed it down through the stack. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09commit-graph: bounds-check generation overflow chunkJeff King1-3/+7
If the generation entry in a commit-graph doesn't fit, we instead insert an offset into a generation overflow chunk. But since we don't record the size of the chunk, we may read outside the chunk if the offset we find on disk is malicious or corrupted. We can't check the size of the chunk up-front; it will vary based on how many entries need overflow. So instead, we'll do a bounds-check before accessing the chunk memory. Unfortunately there is no error-return from this function, so we'll just have to die(), which is what it does for other forms of corruption. As with other cases, we can drop the st_mult() call, since we know our bounds-checked value will fit within a size_t. Before this patch, the test here actually "works" because we read garbage data from the next chunk. And since that garbage data happens not to provide a generation number which changes the output, it appears to work. We could construct a case that actually segfaults or produces wrong output, but it would be a bit tricky. For our purposes its sufficient to check that we've detected the bounds error. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09commit-graph: check size of generations chunkJeff King1-2/+12
We neither check nor record the size of the generations chunk we parse from a commit-graph file. This should have one uint32_t for each commit in the file; if it is smaller (due to corruption, etc), we may read outside the mapped memory. The included test segfaults without this patch, as it shrinks the size considerably (and the chunk is near the end of the file, so we read off the end of the array rather than accidentally reading another chunk). We can fix this by checking the size up front (like we do for other fixed-size chunks, like CDAT). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09commit-graph: bounds-check base graphs chunkJeff King1-1/+7
When we are loading a commit-graph chain, we check that each slice of the chain points to the appropriate set of base graphs via its BASE chunk. But since we don't record the size of the chunk, we may access out-of-bounds memory if the file is corrupted. Since we know the number of entries we expect to find (based on the position within the commit-graph-chain file), we can just check the size up front. In theory this would also let us drop the st_mult() call a few lines later when we actually access the memory, since we know that the computed offset will fit in a size_t. But because the operands "g->hash_len" and "n" have types "unsigned char" and "int", we'd have to cast to size_t first. Leaving the st_mult() does that cast, and makes it more obvious that we don't have an overflow problem. Note that the test does not actually segfault before this patch, since it just reads garbage from the chunk after BASE (and indeed, it even rejects the file because that garbage does not have the expected hash value). You could construct a file with BASE at the end that did segfault, but corrupting the existing one is easy, and we can check stderr for the expected message. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09commit-graph: detect out-of-bounds extra-edges pointersJeff King1-6/+14
If an entry in a commit-graph file has more than 2 parents, the fixed-size parent fields instead point to an offset within an "extra edges" chunk. We blindly follow these, assuming that the chunk is present and sufficiently large; this can lead to an out-of-bounds read for a corrupt or malicious file. We can fix this by recording the size of the chunk and adding a bounds-check in fill_commit_in_graph(). There are a few tricky bits: 1. We'll switch from working with a pointer to an offset. This makes some corner cases just fall out naturally: a. If we did not find an EDGE chunk at all, our size will correctly be zero (so everything is "out of bounds"). b. Comparing "size / 4" lets us make sure we have at least 4 bytes to read, and we never compute a pointer more than one element past the end of the array (computing a larger pointer is probably OK in practice, but is technically undefined behavior). c. The current code casts to "uint32_t *". Replacing it with an offset avoids any comparison between different types of pointer (since the chunk is stored as "unsigned char *"). 2. This is the first case in which fill_commit_in_graph() may return anything but success. We need to make sure to roll back the "parsed" flag (and any parents we might have added before running out of buffer) so that the caller can cleanly fall back to loading the commit object itself. It's a little non-trivial to do this, and we might benefit from factoring it out. But we can wait on that until we actually see a second case where we return an error. As a bonus, this lets us drop the st_mult() call. Since we've already done a bounds check, we know there won't be any integer overflow (it would imply our buffer is larger than a size_t can hold). The included test does not actually segfault before this patch (though you could construct a case where it does). Instead, it reads garbage from the next chunk which results in it complaining about a bogus parent id. This is sufficient for our needs, though (we care that the fallback succeeds, and that stderr mentions the out-of-bounds read). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09commit-graph: check size of commit data chunkJeff King1-1/+11
We expect a commit-graph file to have a fixed-size data record for each commit in the file (and we know the number of commits to expct from the size of the lookup table). If we encounter a file where this is too small, we'll look past the end of the chunk (and possibly even off the mapped memory). We can fix this by checking the size up front when we record the pointer. The included test doesn't segfault, since it ends up reading bytes from another chunk. But it produces nonsense results, since the values it reads are garbage. Our test notices this by comparing the output to a non-corrupted run of the same command (and of course we also check that the expected error is printed to stderr). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09midx: enforce chunk alignment on readingJeff King1-1/+1
The midx reader assumes chunks are aligned to a 4-byte boundary: we treat the fanout chunk as an array of uint32_t, indexing it to feed the results to ntohl(). Without aligning the chunks, we may violate the CPU's alignment constraints. Though many platforms allow this, some do not. And certanily UBSan will complain, since it is undefined behavior. Even though most chunks are naturally 4-byte-aligned (because they are storing uint32_t or larger types), PNAM is not. It stores NUL-terminated pack names, so you can have a valid chunk with any length. The writing side handles this by 4-byte-aligning the chunk, introducing a few extra NULs as necessary. But since we don't check this on the reading side, we may end up with a misaligned fanout and trigger the undefined behavior. We have two options here: 1. Swap out ntohl(fanout[i]) for get_be32(fanout+i) everywhere. The latter handles alignment itself. It's possible that it's slightly slower (though in practice I'm not sure how true that is, especially for these code paths which then go on to do a binary search). 2. Enforce the alignment when reading the chunks. This is easy to do, since the table-of-contents reader can check it in one spot. I went with the second option here, just because it places less burden on maintenance going forward (it is OK to continue using ntohl), and we know it can't have any performance impact on the actual reads. The commit-graph code uses the same chunk API. It's usually also 4-byte aligned, but some chunks are not (like Bloom filter BDAT chunks). So we'll pass "1" here to allow any alignment. It doesn't suffer from the same problem as midx with its fanout because the fanout chunk is always the first (and the rest of the format dictates that the first chunk will start aligned). The new test shows the effect on a midx with a misaligned PNAM chunk. Note that the midx-reading code treats chunk-toc errors as soft, falling back to the non-midx path rather than calling die(), as we do for other parsing errors. Arguably we should make all of these behave the same, but that's out of scope for this patch. For now the test just expects the fallback behavior. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09commit-graph: check consistency of fanout tableJeff King1-0/+16
We use bsearch_hash() to look up items in the oid index of a commit-graph. It also has a fanout table to reduce the initial range in which we'll search. But since the fanout comes from the on-disk file, a corrupted or malicious file can cause us to look outside of the allocated index memory. One solution here would be to pass the total table size to bsearch_hash(), which could then bounds check the values it reads from the fanout. But there's an inexpensive up-front check we can do, and it's the same one used by the midx and pack idx code (both of which likewise have fanout tables and use bsearch_hash(), but are not affected by this bug): 1. We can check the value of the final fanout entry against the size of the table we got from the index chunk. These must always match, since the fanout is just slicing up the index. As a side note, the midx and pack idx code compute it the other way around: they use the final fanout value as the object count, and check the index size against it. Either is valid; if they disagree we cannot know which is wrong (a corrupted fanout value, or a too-small table of oids). 2. We can quickly scan the fanout table to make sure it is monotonically increasing. If it is, then we know that every value is less than or equal to the final value, and therefore less than or equal to the table size. It would also be sufficient to just check that each fanout value is smaller than the final one, but the midx and pack idx code both do a full monotonicity check. It's the same cost, and it catches some other corruptions (though not all; the checks done by "commit-graph verify" are more complete but more expensive, and our goal here is to be fast and memory-safe). There are two new tests. One just checks the final fanout value (this is the mirror image of the "too small oid lookup" case added for the midx in the previous commit; it's flipped here because commit-graph considers the oid lookup chunk to be the source of truth). The other actually creates a fanout with many out-of-bounds entries, and prior to this patch, it does cause the segfault you'd expect. But note that the error is not "your fanout entry is out-of-bounds", but rather "fanout value out of order". That's because we leave the final fanout value in place (to get past the table size check), making the index non-monotonic (the second-to-last entry is big, but the last one must remain small to match the actual table). We need adjustments to a few existing tests, as well: - an earlier test in t5318 corrupts the fanout and runs "commit-graph verify". Its message is now changed, since we catch the problem earlier (during the load step, rather than the careful validation step). - in t5324, we test that "commit-graph verify --shallow" does not do expensive verification on the base file of the chain. But the corruption it uses (munging a byte at offset 1000) happens to be in the middle of the fanout table. And now we detect that problem in the cheaper checks that are performed for every part of the graph. We'll push this back to offset 1500, which is only caught by the more expensive checksum validation. Likewise, there's a later test in t5324 which munges an offset 100 bytes into a file (also in the fanout table) that is referenced by an alternates file. So we now find that corruption during the load step, rather than the verification step. At the very least we need to change the error message (like the case above in t5318). But it is probably good to make sure we handle all parts of the verification even for alternate graph files. So let's likewise corrupt byte 1500 and make sure we found the invalid checksum. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09commit-graph: check size of oid fanout chunkJeff King1-2/+11
We load the oid fanout chunk with pair_chunk(), which means we never see the size of the chunk. We just assume the on-disk file uses the appropriate size, and if it's too small we'll access random memory. It's easy to check this up-front; the fanout always consists of 256 uint32's, since it is a fanout of the first byte of the hash pointing into the oid index. These parameters can't be changed without introducing a new chunk type. This matches the similar check in the midx OIDF chunk (but note that rather than checking for the error immediately, the graph code just leaves parts of the struct NULL and checks for required fields later). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09chunk-format: note that pair_chunk() is unsafeJeff King1-7/+7
The pair_chunk() function is provided as an easy helper for parsing chunks that just want a pointer to a set of bytes. But every caller has a hidden bug: because we return only the pointer without the matching chunk size, the callers have no clue how many bytes they are allowed to look at. And as a result, they may read off the end of the mmap'd data when the on-disk file does not match their expectations. Since chunk files are typically used for local-repository data like commit-graph files and midx's, the security implications here are pretty mild. The worst that can happen is that you hand somebody a corrupted repository tarball, and running Git on it does an out-of-bounds read and crashes. So it's worth being more defensive, but we don't need to drop everything and fix every caller immediately. I noticed the problem because the pair_chunk_fn() callback does not look at its chunk_size argument, and wanted to annotate it to silence -Wunused-parameter. We could do that now, but we'd lose the hint that this code should be audited and fixed. So instead, let's set ourselves up for going down that path: 1. Provide a pair_chunk() function that does return the size, which prepares us for fixing these cases. 2. Rename the existing function to pair_chunk_unsafe(). That gives us an easy way to grep for cases which still need to be fixed, and the name should cause anybody adding new calls to think twice before using it. There are no callers of the "safe" version yet, but we'll add some in subsequent patches. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03commit-graph: free write-context base_graph_name during cleanupJeff King1-0/+1
Commit 6c622f9f0b (commit-graph: write commit-graph chains, 2019-06-18) added a base_graph_name string to the write_commit_graph_context struct. But the end-of-function cleanup forgot to free it, causing a leak. This (presumably in combination with the preceding leak-fixes) lets us mark t5328 as leak-free. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03commit-graph: free write-context entries before overwritingJeff King1-0/+2
When writing a split graph file, we replace the final element of the commit_graph_hash_after and commit_graph_filenames_after arrays. But since these are allocated strings, we need to free them before overwriting to avoid leaking the old string. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03commit-graph: free graph struct that was not added to chainJeff King1-0/+2
When reading the graph chain file, we open (and allocate) each individual slice it mentions and then add them to a linked-list chain. But if adding to the chain fails (e.g., because the base-graph chunk it contains didn't match what we expected), we leave the function without freeing the graph struct that caused the failure, leaking it. We can fix it by calling free_graph_commit(). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03commit-graph: delay base_graph assignment in add_graph_to_chain()Jeff King1-2/+2
When adding a graph to a chain, we do some consistency checks and then if everything looks good, set g->base_graph to add a link to the chain. But when we added a new consistency check in 209250ef38 (commit-graph.c: prevent overflow in add_graph_to_chain(), 2023-07-12), it comes _after_ we've already set g->base_graph. So we might return failure, even though we actually added to the chain. This hasn't caused a bug yet, because after failing to add to the chain, we discard the failed graph struct completely, leaking it. But in order to fix that, it's important that the struct be in a consistent and predictable state after the failure. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03commit-graph: free all elements of graph chainJeff King1-18/+11
When running "commit-graph verify", we call free_commit_graph(). That's sufficient for the case of a single graph file, but if we loaded a chain of split graph files, they form a linked list via the base_graph pointers. We need to free all of them, or we leak all but the first struct. We can make this work by teaching free_commit_graph() to walk the base_graph pointers and free each element. This in turn lets us simplify close_commit_graph(), which does the same thing by recursion (we cannot just use close_commit_graph() in "commit-graph verify", as the function takes a pointer to an object store, and the verify command creates a single one-off graph struct). While indenting the code in free_commit_graph() for the loop, I noticed that setting g->data to NULL is rather pointless, as we free the struct a few lines later. So I cleaned that up while we're here. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03commit-graph: move slab-clearing to close_commit_graph()Jeff King1-1/+1
When closing and freeing a commit-graph, the main entry point is close_commit_graph(), which then uses close_commit_graph_one() to recurse through the base_graph links and free each one. Commit 957ba814bf (commit-graph: when closing the graph, also release the slab, 2021-09-08) put the call to clear the slab into the recursive function, but this is pointless: there's only a single global slab variable. It works OK in practice because clearing the slab is idempotent, but it makes the code harder to reason about and refactor. Move it into the parent function so it's only called once (and there are no other direct callers of the recursive close_commit_graph_one(), so we are not hurting them). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28commit-graph: report incomplete chains during verificationJeff King1-2/+5
The load_commit_graph_chain_fd_st() function will stop loading chains when it sees an error. But if it has loaded any graph slice at all, it will return it. This is a good thing for normal use (we use what data we can, and this is just an optimization). But it's a bad thing for "commit-graph verify", which should be careful about finding any irregularities. We do complain to stderr with a warning(), but the verify command still exits with a successful return code. The new tests here cover corruption of both the base and tip slices of the chain. The corruption of the base file already works (it is the first file we look at, so when we see the error we return NULL). The "tip" case is what is fixed by this patch (it complains to stderr but still returns the base slice). Likewise the existing tests for corruption of the commit-graph-chain file itself need to be updated. We already exited non-zero correctly for the "base" case, but the "tip" case can now do so, too. Note that this also causes us to adjust a test later in the file that similarly corrupts a tip (though confusingly the test script calls this "base"). It checks stderr but erroneously expects the whole "verify" command to exit with a successful code. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28commit-graph: tighten chain size checkJeff King1-2/+8
When we open a commit-graph-chain file, if it's smaller than a single entry, we just quietly treat that as ENOENT. That make some sense if the file is truly zero bytes, but it means that "commit-graph verify" will quietly ignore a file that contains garbage if that garbage happens to be short. Instead, let's only simulate ENOENT when the file is truly empty, and otherwise return EINVAL. The normal graph-loading routines don't care, but "commit-graph verify" will notice and complain about the difference. It's not entirely clear to me that the 0-is-ENOENT case actually happens in real life, so we could perhaps just eliminate this special-case altogether. But this is how we've always behaved, so I'm preserving it in the name of backwards compatibility (though again, it really only matters for "verify", as the regular routines are happy to load what they can). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28commit-graph: check mixed generation validation when loading chain fileJeff King1-27/+27
In read_commit_graph_one(), we call validate_mixed_generation_chain() after loading the graph. Even though we don't check the return value, this has the side effect of clearing the read_generation_data flag, which is important when working with mixed generation numbers. But doing this in load_commit_graph_chain_fd_st() makes more sense: 1. We are calling it even when we did not load a chain at all, which is pointless (you cannot have mixed generations in a single file). 2. For now, all callers load the graph via read_commit_graph_one(). But the point of factoring out the open/load in the previous commit was to let "commit-graph verify" call them separately. So it needs to trigger this function as part of the load. Without this patch, the mixed-generation tests in t5324 would start failing on "git commit-graph verify" calls, once we switch to using a separate open/load call there. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28commit-graph: factor out chain opening functionJeff King1-19/+39
The load_commit_graph_chain() function opens the chain file and all of the slices of graph that it points to. If there is no chain file (which is a totally normal condition), we return NULL. But if we run into errors with the chain file or loading the actual graph data, we also return NULL, and the caller cannot tell the difference. The caller can check for ENOENT for the unremarkable "no such file" case. But I'm hesitant to assume that the rest of the function would never accidentally set errno to ENOENT itself, since it is opening the slice files (and that would mean the caller fails to notice a real error). So let's break this into two functions: one to open the file, and one to actually load it. This matches the interface we provide for the non-chain graph file, which will also come in handy in a moment when we fix some bugs in the "git commit-graph verify" code. Some notes: - I've kept the "1 is good, 0 is bad" return convention (and the weird "fd" out-parameter) used by the matching open_commit_graph() function and other parts of the commit-graph code. This is unlike most of the rest of Git (which would just return the fd, with -1 for error), but it makes sense to stay consistent with the adjacent bits of the API here. - The existing chain loading function will quietly return if the file is too small to hold a single entry. I've retained that behavior (and explicitly set ENOENT in the opener function) for now, under the notion that it's probably valid (though I'd imagine unusual) to have an empty chain file. There are two small behavior changes here, but I think both are strictly positive: 1. The original blindly did a stat() before checking if fopen() succeeded, meaning we were making a pointless extra stat call. 2. We now use fstat() to check the file size. The previous code using a regular stat() on the pathname meant we could technically race with somebody updating the chain file, and end up with a size that does not match what we just opened with fopen(). I doubt anybody ever hit this in practice, but it may have caused an out-of-bounds read. We'll retain the load_commit_graph_chain() function which does both the open and reading steps (most existing callers do not care about seeing errors anyway, since loading commit-graphs is optimistic). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-07Merge branch 'jk/unused-post-2.42'Junio C Hamano1-4/+5
Unused parameters to functions are marked as such, and/or removed, in order to bring us closer to -Wunused-parameter clean. * jk/unused-post-2.42: (22 commits) update-ref: mark unused parameter in parser callbacks gc: mark unused descriptors in scheduler callbacks bundle-uri: mark unused parameters in callbacks fetch: mark unused parameter in ref_transaction callback credential: mark unused parameter in urlmatch callback grep: mark unused parmaeters in pcre fallbacks imap-send: mark unused parameters with NO_OPENSSL worktree: mark unused parameters in noop repair callback negotiator/noop: mark unused callback parameters add-interactive: mark unused callback parameters grep: mark unused parameter in output function test-trace2: mark unused argv/argc parameters trace2: mark unused config callback parameter trace2: mark unused us_elapsed_absolute parameters stash: mark unused parameter in diff callback ls-tree: mark unused parameter in callback commit-graph: mark unused data parameters in generation callbacks worktree: mark unused parameters in each_ref_fn callback pack-bitmap: mark unused parameters in show_object callback ref-filter: mark unused parameters in parser callbacks ...
2023-08-29commit-graph: mark unused data parameters in generation callbacksJeff King1-4/+5
The compute_generation_info code uses function pointers to abstract the get/set generation operations. Some callers don't need the extra void data pointer, which should be annotated to appease -Wunused-parameter. Note that we can drop the assignment of the "data" parameter in compute_generation_numbers(), as we've just shown that neither of the callbacks it uses will access it. This matches the caller in ensure_generations_valid(), which already does not bother to set "data". Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-21commit-graph: avoid repeated mixed generation number warningsTaylor Blau1-16/+13
When validating that a commit-graph has either all zero, or all non-zero generation numbers, we emit a warning on both the rising and falling edge of transitioning between the two. So if we are unfortunate enough to see a commit-graph which has a repeating sequence of zero, then non-zero generation numbers, we'll generate many warnings that contain more or less the same information. Avoid this by keeping track of a single example for a commit with zero- and non-zero generation, and emit a single warning at the end of verification if both are non-NULL. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-21commit-graph: verify swapped zero/non-zero generation casesJeff King1-3/+6
In verify_one_commit_graph(), we have code that complains when a commit is found with a generation number of zero, and then later with a non-zero number. It works like this: 1. When we see an entry with generation zero, we set the generation_zero flag to GENERATION_ZERO_EXISTS. 2. When we later see an entry with a non-zero generation, we complain if the flag is GENERATION_ZERO_EXISTS. There's a matching GENERATION_NUMBER_EXISTS value, which in theory would be used to find the case that we see the entries in the opposite order: 1. When we see an entry with a non-zero generation, we set the generation_zero flag to GENERATION_NUMBER_EXISTS. 2. When we later see an entry with a zero generation, we complain if the flag is GENERATION_NUMBER_EXISTS. But that doesn't work; step 2 is implemented, but there is no step 1. We never use NUMBER_EXISTS at all, and Coverity rightly complains that step 2 is dead code. We can fix that by implementing that step 1. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-21commit-graph: introduce `commit_graph_generation_from_graph()`Taylor Blau1-2/+12
In 2ee11f7261 (commit-graph: return generation from memory, 2023-03-20), the `commit_graph_generation()` function stopped returning zeros when asked to locate the generation number of a given commit. This was done at the time to prepare for a later change which set generation values in memory, meaning that we could no longer rely on `graph_pos` alone to tell us whether or not to trust the generation number returned by this function. In 2ee11f7261, it was noted that this change only impacted very old commit-graphs, which were written with all commits having generation number 0. Indeed, zero is not a valid generation number, so we should never expect to see that value outside of the aforementioned case. The test fallout in 2ee11f7261 indicated that we were no longer able to fsck a specific old case of commit-graph corruption, where we see a non-zero generation number after having seen a generation number of 0 earlier. Introduce a variant of `commit_graph_generation()` which behaves like that function did prior to 2ee11f7261, known as `commit_graph_generation_from_graph()`. Then use this function in the context of `verify_one_commit_graph()`, where we only want to trust the values from the graph. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-25Merge branch 'tb/object-access-overflow-protection'Junio C Hamano1-23/+40
Various offset computation in the code that accesses the packfiles and other data in the object layer has been hardened against arithmetic overflow, especially on 32-bit systems. * tb/object-access-overflow-protection: commit-graph.c: prevent overflow in `verify_commit_graph()` commit-graph.c: prevent overflow in `write_commit_graph()` commit-graph.c: prevent overflow in `merge_commit_graph()` commit-graph.c: prevent overflow in `split_graph_merge_strategy()` commit-graph.c: prevent overflow in `load_tree_for_commit()` commit-graph.c: prevent overflow in `fill_commit_in_graph()` commit-graph.c: prevent overflow in `fill_commit_graph_info()` commit-graph.c: prevent overflow in `load_oid_from_graph()` commit-graph.c: prevent overflow in add_graph_to_chain() commit-graph.c: prevent overflow in `write_commit_graph_file()` pack-bitmap.c: ensure that eindex lookups don't overflow midx.c: prevent overflow in `fill_included_packs_batch()` midx.c: prevent overflow in `write_midx_internal()` midx.c: store `nr`, `alloc` variables as `size_t`'s midx.c: prevent overflow in `nth_midxed_offset()` midx.c: prevent overflow in `nth_midxed_object_oid()` midx.c: use `size_t`'s for fanout nr and alloc packfile.c: use checked arithmetic in `nth_packed_object_offset()` packfile.c: prevent overflow in `load_idx()` packfile.c: prevent overflow in `nth_packed_object_id()`
2023-07-18Merge branch 'tb/fsck-no-progress'Junio C Hamano1-17/+34
"git fsck --no-progress" still spewed noise from the commit-graph subsystem, which has been corrected. * tb/fsck-no-progress: commit-graph.c: avoid duplicated progress output during `verify` commit-graph.c: pass progress to `verify_one_commit_graph()` commit-graph.c: iteratively verify commit-graph chains commit-graph.c: extract `verify_one_commit_graph()` fsck: suppress MIDX output with `--no-progress` fsck: suppress commit-graph output with `--no-progress`
2023-07-14commit-graph.c: prevent overflow in `verify_commit_graph()`Taylor Blau1-2/+2
In a similar spirit as previous commits, ensure that we don't overflow when trying to read an OID out of an existing commit-graph during verification. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14commit-graph.c: prevent overflow in `write_commit_graph()`Taylor Blau1-1/+1
In a similar spirit as previous commits, ensure that we don't overflow when trying to read an existing OID while writing a new commit-graph. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14commit-graph.c: prevent overflow in `merge_commit_graph()`Taylor Blau1-0/+5
When merging two commit graphs, ensure that we don't attempt to merge two graphs which, when combined, have more total commits than the 32-bit unsigned maximum. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14commit-graph.c: prevent overflow in `split_graph_merge_strategy()`Taylor Blau1-1/+6
In a similar spirit as previous commits, ensure that we don't overflow when choosing how to split and merge different layers of the commit-graph. In particular, avoid a potential overflow between `size_mult` and `num_commits`, as well as a potential overflow between the number of commits currently in the merged graph, and the number of commits in the graph about to be merged. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14commit-graph.c: prevent overflow in `load_tree_for_commit()`Taylor Blau1-1/+1
In a similar spirit as previous commits, ensure that we don't overflow when computing an offset into the commit_data chunk when the (relative) graph position exceeds 2^32-1/GRAPH_DATA_WIDTH. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14commit-graph.c: prevent overflow in `fill_commit_in_graph()`Taylor Blau1-2/+2
In a similar spirit as previous commits, ensure that we don't overflow when the lex_index of the commit we are trying to fill out exceeds 2^32-1/(g->hash_len+16). The other hunk touched in this patch is not susceptible to overflow, since an explicit cast is made to a 64-bit unsigned value. For clarity and consistency with the rest of the commits in this series, avoid a tricky to reason about cast, and use `st_mult()` directly. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14commit-graph.c: prevent overflow in `fill_commit_graph_info()`Taylor Blau1-3/+3
In a similar spirit as previous commits, ensure that we don't overflow in a few spots within `fill_commit_graph_info()`: - First, when computing an offset into the commit data chunk, which can occur when the `lex_index` of the item we're looking up exceeds 2^32-1/GRAPH_DATA_WIDTH. - A similar issue when computing the generation date offset for commits with `lex_index` greater than 2^32-1/4. Note that in practice this will never overflow, since the left-hand operand is from calling `sizeof(...)` and is thus already a `size_t`. But wrap that in an `st_mult()` to make it clear that we intend to perform this computation using 64-bit operands. - Finally, a nearly identical issue as above when computing an offset into the `generation_data_overflow` chunk. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14commit-graph.c: prevent overflow in `load_oid_from_graph()`Taylor Blau1-1/+1
In a similar spirit as previous commits, ensure that we don't overflow when trying to compute an offset into the `chunk_oid_lookup` table when the `lex_index` of the item we're trying to look up exceeds `2^32-1/g->hash_len`. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14commit-graph.c: prevent overflow in add_graph_to_chain()Taylor Blau1-2/+9
The commit-graph uses a fanout table with 4-byte entries to store the number of commits at each shard of the commit-graph. So it is OK to have a commit graph with as many as 2^32-1 stored commits. But we risk overflowing any computation which may exceed the 32-bit (unsigned) maximum when those computations are (incorrectly) performed using 32-bit operands. There are a couple of spots in `add_graph_to_chain()` where we could potentially overflow the result: - First, when comparing the list of existing entries in the commit-graph chain. It is unlikely that this should ever overflow, since it would require having roughly 2^32-1/g->hash_len commit-graphs in the chain. But let's guard that computation with a `st_mult()` just to be safe. - Second, when computing the number of commits in the graph added to the front of the chain. This value is also a 32-bit unsigned, but we should make sure that it does not grow beyond the maximum value. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14commit-graph.c: prevent overflow in `write_commit_graph_file()`Taylor Blau1-10/+10
When writing a commit-graph, we use the chunk-format API to write out each individual chunk of the commit-graph. Each chunk of the commit-graph is tracked via a call to `add_chunk()`, along with the expected size of that chunk. Similar to an earlier commit which handled the identical issue in the MIDX machinery, guard against overflow when dealing with a commit-graph with a large number of entries to avoid corrupting the contents of the commit-graph itself. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10commit-graph.c: avoid duplicated progress output during `verify`Taylor Blau1-10/+17
When `git commit-graph verify` was taught how to verify commit-graph chains in 3da4b609bb1 (commit-graph: verify chains with --shallow mode, 2019-06-18), it produced one line of progress per layer of the commit-graph chain. $ git.compile commit-graph verify Verifying commits in commit graph: 100% (4356/4356), done. Verifying commits in commit graph: 100% (131912/131912), done. This could be somewhat confusing to users, who may wonder why there are multiple occurrences of "Verifying commits in commit graph". There are likely good arguments on whether or not there should be one line of progress output per commit-graph layer. On the one hand, the existing output shows us verifying each individual layer of the chain. But on the other hand, the fact that a commit-graph may be stored among multiple layers is an implementation detail that the caller need not be aware of. Clarify this by showing a single progress meter regardless of the number of layers in the commit-graph chain. After this patch, the output reflects the logical contents of a commit-graph chain, instead of showing one line of output per commit-graph layer: $ git.compile commit-graph verify Verifying commits in commit graph: 100% (136268/136268), done. Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10commit-graph.c: pass progress to `verify_one_commit_graph()`Taylor Blau1-8/+9
This is the final step to prepare for consolidating the output of `git commit-graph verify`. Instead of having each call to `verify_one_commit_graph()` initialize its own progress struct, have the caller pass one in instead. This patch does not alter the output of `git commit-graph verify`, but the next commit will consolidate the output. Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10commit-graph.c: iteratively verify commit-graph chainsTaylor Blau1-4/+5
Now that we have a function which can verify a single layer of a commit-graph chain, implement `verify_commit_graph()` in terms of iterating over commit-graphs along their `->base_graph` pointers. This further prepares us to consolidate the progress output of `git commit-graph verify`. Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-10commit-graph.c: extract `verify_one_commit_graph()`Taylor Blau1-8/+16
When the `verify_commit_graph()` function was extended to support commit-graph chains via 3da4b609bb1 (commit-graph: verify chains with --shallow mode, 2019-06-18), it did so by recursively calling itself on each layer of the commit-graph chain. In practice this poses no issues, since commit-graph chains do not loop, and there are few enough of them that adding additional frames to the stack is not a problem. A future commit will consolidate the progress output from `git commit-graph verify` when verifying chained commit-graphs to print a single line instead of one progress meter per commit-graph layer. Prepare for this by extracting a routine to verify a single layer of a commit-graph. Note that `verify_commit_graph()` is still recursive after this patch, but this will change in the subsequent patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05treewide: remove unnecessary includes for wrapper.hCalvin Wan1-1/+0
Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-29Merge branch 'en/header-split-cache-h-part-3'Junio C Hamano1-1/+2
Header files cleanup. * en/header-split-cache-h-part-3: (28 commits) fsmonitor-ll.h: split this header out of fsmonitor.h hash-ll, hashmap: move oidhash() to hash-ll object-store-ll.h: split this header out of object-store.h khash: name the structs that khash declares merge-ll: rename from ll-merge git-compat-util.h: remove unneccessary include of wildmatch.h builtin.h: remove unneccessary includes list-objects-filter-options.h: remove unneccessary include diff.h: remove unnecessary include of oidset.h repository: remove unnecessary include of path.h log-tree: replace include of revision.h with simple forward declaration cache.h: remove this no-longer-used header read-cache*.h: move declarations for read-cache.c functions from cache.h repository.h: move declaration of the_index from cache.h merge.h: move declarations for merge.c from cache.h diff.h: move declaration for global in diff.c from cache.h preload-index.h: move declarations for preload-index.c from elsewhere sparse-index.h: move declarations for sparse-index.c from cache.h name-hash.h: move declarations for name-hash.c from cache.h run-command.h: move declarations for run-command.c from cache.h ...
2023-06-22Merge branch 'ds/disable-replace-refs'Junio C Hamano1-3/+1
Introduce a mechanism to disable replace refs globally and per repository. * ds/disable-replace-refs: repository: create read_replace_refs setting replace-objects: create wrapper around setting repository: create disable_replace_refs()
2023-06-21object-store-ll.h: split this header out of object-store.hElijah Newren1-1/+1
The vast majority of files including object-store.h did not need dir.h nor khash.h. Split the header into two files, and let most just depend upon object-store-ll.h, while letting the two callers that need it depend on the full object-store.h. After this patch: $ git grep -h include..object-store | sort | uniq -c 2 #include "object-store.h" 129 #include "object-store-ll.h" Diff best viewed with `--color-moved`. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21repository: remove unnecessary include of path.hElijah Newren1-0/+1
This also made it clear that several .c files that depended upon path.h were missing a #include for it; add the missing includes while at it. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-12replace-objects: create wrapper around settingDerrick Stolee1-3/+1
The 'read_replace_objects' constant is initialized by git_default_config (if core.useReplaceRefs is disabled) and within setup_git_env (if GIT_NO_REPLACE_OBJECTS) is set. To ensure that this variable cannot be set accidentally in other places, wrap it in a replace_refs_enabled() method. Since we still assign this global in config.c, we are not able to remove the global scope of this variable and make it a static within replace-object.c. This will happen in a later change which will also prevent the variable from being read before it is initialized. Centralizing read access to the variable is an important first step. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-05-09Merge branch 'en/header-split-cache-h-part-2'Junio C Hamano1-0/+1
More header clean-up. * en/header-split-cache-h-part-2: (22 commits) reftable: ensure git-compat-util.h is the first (indirect) include diff.h: reduce unnecessary includes object-store.h: reduce unnecessary includes commit.h: reduce unnecessary includes fsmonitor: reduce includes of cache.h cache.h: remove unnecessary headers treewide: remove cache.h inclusion due to previous changes cache,tree: move basic name compare functions from read-cache to tree cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c hash-ll.h: split out of hash.h to remove dependency on repository.h tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h dir.h: move DTYPE defines from cache.h versioncmp.h: move declarations for versioncmp.c functions from cache.h ws.h: move declarations for ws.c functions from cache.h match-trees.h: move declarations for match-trees.c functions from cache.h pkt-line.h: move declarations for pkt-line.c functions from cache.h base85.h: move declarations for base85.c functions from cache.h copy.h: move declarations for copy.c functions from cache.h server-info.h: move declarations for server-info.c functions from cache.h packfile.h: move pack_window and pack_entry from cache.h ...
2023-04-25Merge branch 'en/header-split-cache-h'Junio C Hamano1-1/+3
Header clean-up. * en/header-split-cache-h: (24 commits) protocol.h: move definition of DEFAULT_GIT_PORT from cache.h mailmap, quote: move declarations of global vars to correct unit treewide: reduce includes of cache.h in other headers treewide: remove double forward declaration of read_in_full cache.h: remove unnecessary includes treewide: remove cache.h inclusion due to pager.h changes pager.h: move declarations for pager.c functions from cache.h treewide: remove cache.h inclusion due to editor.h changes editor: move editor-related functions and declarations into common file treewide: remove cache.h inclusion due to object.h changes object.h: move some inline functions and defines from cache.h treewide: remove cache.h inclusion due to object-file.h changes object-file.h: move declarations for object-file.c functions from cache.h treewide: remove cache.h inclusion due to git-zlib changes git-zlib: move declarations for git-zlib functions from cache.h treewide: remove cache.h inclusion due to object-name.h changes object-name.h: move declarations for object-name.c functions from cache.h treewide: remove unnecessary cache.h inclusion treewide: be explicit about dependence on mem-pool.h treewide: be explicit about dependence on oid-array.h ...
2023-04-24commit.h: reduce unnecessary includesElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11treewide: remove cache.h inclusion due to object-file.h changesElijah Newren1-1/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11object-file.h: move declarations for object-file.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11treewide: be explicit about dependence on oid-array.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-06Merge branch 'en/header-split-cleanup'Junio C Hamano1-1/+3
Split key function and data structure definitions out of cache.h to new header files and adjust the users. * en/header-split-cleanup: csum-file.h: remove unnecessary inclusion of cache.h write-or-die.h: move declarations for write-or-die.c functions from cache.h treewide: remove cache.h inclusion due to setup.h changes setup.h: move declarations for setup.c functions from cache.h treewide: remove cache.h inclusion due to environment.h changes environment.h: move declarations for environment.c functions from cache.h treewide: remove unnecessary includes of cache.h wrapper.h: move declarations for wrapper.c functions from cache.h path.h: move function declarations for path.c functions from cache.h cache.h: remove expand_user_path() abspath.h: move absolute path functions from cache.h environment: move comment_line_char from cache.h treewide: remove unnecessary cache.h inclusion from several sources treewide: remove unnecessary inclusion of gettext.h treewide: be explicit about dependence on gettext.h treewide: remove unnecessary cache.h inclusion from a few headers
2023-04-06Merge branch 'ab/remove-implicit-use-of-the-repository'Junio C Hamano1-2/+2
Code clean-up around the use of the_repository. * ab/remove-implicit-use-of-the-repository: libs: use "struct repository *" argument, not "the_repository" post-cocci: adjust comments for recent repo_* migration cocci: apply the "revision.h" part of "the_repository.pending" cocci: apply the "rerere.h" part of "the_repository.pending" cocci: apply the "refs.h" part of "the_repository.pending" cocci: apply the "promisor-remote.h" part of "the_repository.pending" cocci: apply the "packfile.h" part of "the_repository.pending" cocci: apply the "pretty.h" part of "the_repository.pending" cocci: apply the "object-store.h" part of "the_repository.pending" cocci: apply the "diff.h" part of "the_repository.pending" cocci: apply the "commit.h" part of "the_repository.pending" cocci: apply the "commit-reach.h" part of "the_repository.pending" cocci: apply the "cache.h" part of "the_repository.pending" cocci: add missing "the_repository" macros to "pending" cocci: sort "the_repository" rules by header cocci: fix incorrect & verbose "the_repository" rules cocci: remove dead rule from "the_repository.pending.cocci"
2023-04-06Merge branch 'ps/ahead-behind-truncation-fix'Junio C Hamano1-1/+1
Fix unnecessary truncation of generation numbers used in-core. * ps/ahead-behind-truncation-fix: commit-graph: fix truncated generation numbers
2023-04-06Merge branch 'ds/ahead-behind'Junio C Hamano1-71/+136
"git for-each-ref" learns '%(ahead-behind:<base>)' that computes the distances from a single reference point in the history with bunch of commits in bulk. * ds/ahead-behind: commit-reach: add tips_reachable_from_bases() for-each-ref: add ahead-behind format atom commit-reach: implement ahead_behind() logic commit-graph: introduce `ensure_generations_valid()` commit-graph: return generation from memory commit-graph: simplify compute_generation_numbers() commit-graph: refactor compute_topological_levels() for-each-ref: explicitly test no matches for-each-ref: add --stdin option
2023-04-04Merge branch 'ab/remove-implicit-use-of-the-repository' into ↵Junio C Hamano1-2/+2
en/header-split-cache-h * ab/remove-implicit-use-of-the-repository: libs: use "struct repository *" argument, not "the_repository" post-cocci: adjust comments for recent repo_* migration cocci: apply the "revision.h" part of "the_repository.pending" cocci: apply the "rerere.h" part of "the_repository.pending" cocci: apply the "refs.h" part of "the_repository.pending" cocci: apply the "promisor-remote.h" part of "the_repository.pending" cocci: apply the "packfile.h" part of "the_repository.pending" cocci: apply the "pretty.h" part of "the_repository.pending" cocci: apply the "object-store.h" part of "the_repository.pending" cocci: apply the "diff.h" part of "the_repository.pending" cocci: apply the "commit.h" part of "the_repository.pending" cocci: apply the "commit-reach.h" part of "the_repository.pending" cocci: apply the "cache.h" part of "the_repository.pending" cocci: add missing "the_repository" macros to "pending" cocci: sort "the_repository" rules by header cocci: fix incorrect & verbose "the_repository" rules cocci: remove dead rule from "the_repository.pending.cocci"
2023-03-28libs: use "struct repository *" argument, not "the_repository"Ævar Arnfjörð Bjarmason1-1/+1
As can easily be seen from grepping in our sources, we had these uses of "the_repository" in various library code in cases where the function in question was already getting a "struct repository *" argument. Let's use that argument instead. Out of these changes only the changes to "cache-tree.c", "commit-reach.c", "shallow.c" and "upload-pack.c" would have cleanly applied before the migration away from the "repo_*()" wrapper macros in the preceding commits. The rest aren't new, as we'd previously implicitly refer to "the_repository", but it's now more obvious that we were doing the wrong thing all along, and should have used the parameter instead. The change to change "get_index_format_default(the_repository)" in "read-cache.c" to use the "r" variable instead should arguably have been part of [1], or in the subsequent cleanup in [2]. Let's do it here, as can be seen from the initial code in [3] it's not important that we use "the_repository" there, but would prefer to always use the current repository. This change excludes the "the_repository" use in "upload-pack.c"'s upload_pack_advertise(), as the in-flight [4] makes that change. 1. ee1f0c242ef (read-cache: add index.skipHash config option, 2023-01-06) 2. 6269f8eaad0 (treewide: always have a valid "index_state.repo" member, 2023-01-17) 3. 7211b9e7534 (repo-settings: consolidate some config settings, 2019-08-13) 4. <Y/hbUsGPVNAxTdmS@coredump.intra.peff.net> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-28cocci: apply the "packfile.h" part of "the_repository.pending"Ævar Arnfjörð Bjarmason1-1/+1
Apply the part of "the_repository.pending.cocci" pertaining to "packfile.h". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-28cocci: apply the "commit.h" part of "the_repository.pending"Ævar Arnfjörð Bjarmason1-1/+1
Apply the part of "the_repository.pending.cocci" pertaining to "commit.h". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-27commit-graph: fix truncated generation numbersPatrick Steinhardt1-1/+1
In 80c928d947 (commit-graph: simplify compute_generation_numbers(), 2023-03-20), the code to compute generation numbers was simplified to use the same infrastructure as is used to compute topological levels. This refactoring introduced a bug where the generation numbers are truncated when they exceed UINT32_MAX because we explicitly cast the computed generation number to `uint32_t`. This is not required though: both the computed value and the field of `struct commit_graph_data` are of the same type `timestamp_t` already, so casting to `uint32_t` will cause truncation. This cast can cause us to miscompute generation data overflows: 1. Given a commit with no parents and committer date `UINT32_MAX + 1`. 2. We compute its generation number as `UINT32_MAX + 1`, but truncate it to `1`. 3. We calculate the generation offset via `$generation - $date`, which is thus `1 - (UINT32_MAX + 1)`. The computation underflows and we thus end up with an offset that is bigger than the maximum allowed offset. As a result, we'd be writing generation data overflow information into the commit-graph that is bogus and ultimately not even required. Fix this bug by removing the needless cast. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21csum-file.h: remove unnecessary inclusion of cache.hElijah Newren1-1/+1
With the change in the last commit to move several functions to write-or-die.h, csum-file.h no longer needs to include cache.h. However, removing that include forces several other C files, which directly or indirectly dependend upon csum-file.h's inclusion of cache.h, to now be more explicit about their dependencies. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21wrapper.h: move declarations for wrapper.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21treewide: be explicit about dependence on gettext.hElijah Newren1-0/+1
Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20commit-graph: introduce `ensure_generations_valid()`Taylor Blau1-0/+29
Use the just-introduced compute_reachable_generation_numbers_1() to implement a function which dynamically computes topological levels (or corrected commit dates) for out-of-graph commits. This will be useful for the ahead-behind algorithm we are about to introduce, which needs accurate topological levels on _all_ commits reachable from the tips in order to avoid over-counting. Co-authored-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20commit-graph: return generation from memoryDerrick Stolee1-5/+3
The commit_graph_generation() method used to report a value of GENERATION_NUMBER_INFINITY if the commit_graph_data_slab had an instance for the given commit but the graph_pos indicated the commit was not in the commit-graph file. However, an upcoming change will introduce the ability to set generation values in-memory without writing the commit-graph file. Thus, we can no longer trust 'graph_pos' to indicate whether or not the generation member can be trusted. Instead, trust the 'generation' member if the commit has a value in the slab _and_ the 'generation' member is non-zero. Otherwise, treat it as GENERATION_NUMBER_INFINITY. This only makes a difference for a very old case for the commit-graph: the very first Git release to write commit-graph files wrote zeroes in the topological level positions. If we are parsing a commit-graph with all zeroes, those commits will now appear to have GENERATION_NUMBER_INFINITY (as if they were not parsed from the commit-graph). I attempted several variations to work around the need for providing an uninitialized 'generation' member, but this was the best one I found. It does require a change to a verification test in t5318 because it reports a different error than the one about non-zero generation numbers. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20commit-graph: simplify compute_generation_numbers()Derrick Stolee1-43/+21
The previous change introduced the generic algorithm compute_reachable_generation_numbers() and used it as the core functionality of compute_topological_levels(). Now, use it as the core functionality of compute_generation_numbers(). The main difference here is that we use generation version 2, which is used in to toggle the logic in compute_generation_from_max() for computing the corrected commit date based on the corrected commit dates of the parent commits (and the commit date of the current commit). It also uses different methods for (get|set)_generation in the vtable in order to store and access the value in the correct places. Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20commit-graph: refactor compute_topological_levels()Derrick Stolee1-23/+83
This patch extracts the common code used to compute topological levels and corrected committer dates into a common routine, compute_reachable_generation_numbers(). For ease of reading, it only modifies compute_topological_levels() to use this new routine, leaving compute_generation_numbers() to be modified in the next change. This new routine dispatches to call the necessary functions to get and set the generation number for a given commit through a vtable (the compute_generation_info struct). Computing the generation number itself is done in compute_generation_from_max(), which dispatches its implementation based on the generation version requested, or issuing a BUG() for unrecognized generation versions. This does not use a vtable because the logic depends only on the generation number version, not where the data is being loaded from or being stored to. This is a subtle point that will make more sense in a future change that modifies the in-memory generation values instead of just preparing values for writing to a commit-graph file. This change looks like it adds a lot of new code. However, two upcoming changes will be quite small due to the work being done in this change. Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23cache.h: remove dependence on hex.h; make other files include it explicitlyElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-09use DUP_ARRAYRené Scharfe1-2/+1
Add a semantic patch for replace ALLOC_ARRAY+COPY_ARRAY with DUP_ARRAY to reduce code duplication and apply its results. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-21Merge branch 'ml/commit-graph-expire-dir-leak-fix'Junio C Hamano1-0/+2
A result from opendir() was leaking in the commit-graph expiration codepath, which has been plugged. * ml/commit-graph-expire-dir-leak-fix: commit-graph: Fix missing closedir in expire_commit_graphs
2022-09-19commit-graph: Fix missing closedir in expire_commit_graphsMiaoqian Lin1-0/+2
The function calls opendir() but missing the corresponding closedir() before exit the function. Add missing closedir() to fix it. Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-14Merge branch 'ab/unused-annotation'Junio C Hamano1-2/+2
Undoes 'jk/unused-annotation' topic and redoes it to work around Coccinelle rules misfiring false positives in unrelated codepaths. * ab/unused-annotation: git-compat-util.h: use "deprecated" for UNUSED variables git-compat-util.h: use "UNUSED", not "UNUSED(var)"
2022-09-14Merge branch 'jk/unused-annotation'Junio C Hamano1-2/+2
Annotate function parameters that are not used (but cannot be removed for structural reasons), to prepare us to later compile with -Wunused warning turned on. * jk/unused-annotation: is_path_owned_by_current_uid(): mark "report" parameter as unused run-command: mark unused async callback parameters mark unused read_tree_recursive() callback parameters hashmap: mark unused callback parameters config: mark unused callback parameters streaming: mark unused virtual method parameters transport: mark bundle transport_options as unused refs: mark unused virtual method parameters refs: mark unused reflog callback parameters refs: mark unused each_ref_fn parameters git-compat-util: add UNUSED macro
2022-09-13Merge branch 'jk/rev-list-verify-objects-fix'Junio C Hamano1-1/+1
"git rev-list --verify-objects" ought to inspect the contents of objects and notice corrupted ones, but it didn't when the commit graph is in use, which has been corrected. * jk/rev-list-verify-objects-fix: rev-list: disable commit graph with --verify-objects lookup_commit_in_graph(): use prepare_commit_graph() to check for graph
2022-09-07lookup_commit_in_graph(): use prepare_commit_graph() to check for graphJeff King1-1/+1
We exit early from lookup_commit_in_graph() if the commit_graph pointer is NULL, under the assumption that we don't have a graph to look at. But the graph pointer is lazy-loaded; if no other code happens to have called prepare_commit_graph(), we'll incorrectly assume that one isn't available at all. This has a pretty small performance impact in practice, because the fallback will generally be to call parse_object() instead. That ends up in parse_commit_buffer(), which loads the graph data itself. So the first commit we see won't use the graph, but subsequent ones will. Since using the graph is just an optimization there's generally no user-visible difference, but if you instrument rev-list like so: diff --git a/revision.c b/revision.c index ee702e498a..63c488ffb6 100644 --- a/revision.c +++ b/revision.c @@ -381,6 +381,9 @@ static struct object *get_reference(struct rev_info *revs, const char *name, * parsing commit data from disk. */ commit = lookup_commit_in_graph(revs->repo, oid); + warning("%s %s in commit graph", + commit ? "found" : "did not find", + name); if (commit) object = &commit->object; else and run (in git.git): git commit-graph write --reachable git rev-list origin/master origin/next >/dev/null you'll see that we fail to find the first one: warning: did not find origin/master in commit graph warning: found origin/next in commit graph After this patch, you'll see that we find both: warning: found origin/master in commit graph warning: found origin/next in commit graph Even though the performance implication is small here, there are two important reasons to do this: - it's downright confusing if you are hunting a bug triggered by the use of the commit graph. It may or may not trigger depending on the number and ordering of tips you ask for. - prepare_commit_graph() has other policy logic, too. In particular, if we've loaded a commit graph and then disabled the graph via disable_commit_graph(), that should take precedence. I'm not sure if this can trigger bad behavior in practice. The only caller there is upload-pack's deepen_by_rev_list(), which should be avoiding the commit graph for its traversal tips, but probably wasn't before this patch. Whether you could come up with a case where that mattered is unclear. Still, this is obviously the right thing to be doing. Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-01git-compat-util.h: use "UNUSED", not "UNUSED(var)"Ævar Arnfjörð Bjarmason1-2/+2
As reported in [1] the "UNUSED(var)" macro introduced in 2174b8c75de (Merge branch 'jk/unused-annotation' into next, 2022-08-24) breaks coccinelle's parsing of our sources in files where it occurs. Let's instead partially go with the approach suggested in [2] of making this not take an argument. As noted in [1] "coccinelle" will ignore such tokens in argument lists that it doesn't know about, and it's less of a surprise to syntax highlighters. This undoes the "help us notice when a parameter marked as unused is actually use" part of 9b240347543 (git-compat-util: add UNUSED macro, 2022-08-19), a subsequent commit will further tweak the macro to implement a replacement for that functionality. 1. https://lore.kernel.org/git/220825.86ilmg4mil.gmgdl@evledraar.gmail.com/ 2. https://lore.kernel.org/git/220819.868rnk54ju.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-19refs: mark unused each_ref_fn parametersJeff King1-2/+2
Functions used with for_each_ref(), etc, need to conform to the each_ref_fn interface. But most of them don't need every parameter; let's annotate the unused ones to quiet -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-10Merge branch 'tb/commit-graph-genv2-upgrade-fix' into maintJunio C Hamano1-3/+9
There was a bug in the codepath to upgrade generation information in commit-graph from v1 to v2 format, which has been corrected. source: <cover.1657667404.git.me@ttaylorr.com> * tb/commit-graph-genv2-upgrade-fix: commit-graph: fix corrupt upgrade from generation v1 to v2 commit-graph: introduce `repo_find_commit_pos_in_graph()` t5318: demonstrate commit-graph generation v2 corruption
2022-08-10Merge branch 'hx/lookup-commit-in-graph-fix' into maintJunio C Hamano1-1/+1
A corner case bug where lazily fetching objects from a promisor remote resulted in infinite recursion has been corrected. source: <cover.1656593279.git.hanxin.hx@bytedance.com> * hx/lookup-commit-in-graph-fix: t5330: remove run_with_limited_processses() commit-graph.c: no lazy fetch in lookup_commit_in_graph()
2022-08-03Merge branch 'tb/commit-graph-genv2-upgrade-fix'Junio C Hamano1-3/+9
There was a bug in the codepath to upgrade generation information in commit-graph from v1 to v2 format, which has been corrected. * tb/commit-graph-genv2-upgrade-fix: commit-graph: fix corrupt upgrade from generation v1 to v2 commit-graph: introduce `repo_find_commit_pos_in_graph()` t5318: demonstrate commit-graph generation v2 corruption
2022-07-27Merge branch 'js/commit-graph-parsing-without-repo-settings'Junio C Hamano1-6/+5
API tweak to make it easier to run fuzz testing on commit-graph parser. * js/commit-graph-parsing-without-repo-settings: commit-graph: pass repo_settings instead of repository
2022-07-19Merge branch 'hx/lookup-commit-in-graph-fix'Junio C Hamano1-1/+1
A corner case bug where lazily fetching objects from a promisor remote resulted in infinite recursion has been corrected. * hx/lookup-commit-in-graph-fix: t5330: remove run_with_limited_processses() commit-graph.c: no lazy fetch in lookup_commit_in_graph()
2022-07-15commit-graph: introduce `repo_find_commit_pos_in_graph()`Taylor Blau1-3/+9
Low-level callers in systems that are adjacent to the commit-graph (like the changed-path Bloom filter code) could benefit from being able to call a function like `parse_commit_in_graph()` without modifying the corresponding commit slab data. This is useful in contexts where that slab data is being used to prepare for an upcoming commit-graph write, where Git must be careful to avoid clobbering any of that data during a read operation. Introduce a low-level variant of `parse_commit_in_graph()` which returns the graph position of a given commit only, without modifying any of the slab data. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-14commit-graph: pass repo_settings instead of repositoryTaylor Blau1-6/+5
The parse_commit_graph() function takes a 'struct repository *' pointer, but it only ever accesses config settings (either directly or through the .settings field of the repo struct). Move all relevant config settings into the repo_settings struct, and update parse_commit_graph() and its existing callers so that it takes 'struct repo_settings *' instead. Callers of parse_commit_graph() will now need to call prepare_repo_settings() themselves, or initialize a 'struct repo_settings' directly. Prior to ab14d0676c (commit-graph: pass a 'struct repository *' in more places, 2020-09-09), parsing a commit-graph was a pure function depending only on the contents of the commit-graph itself. Commit ab14d0676c introduced a dependency on a `struct repository` pointer, and later commits such as b66d84756f (commit-graph: respect 'commitGraph.readChangedPaths', 2020-09-09) added dependencies on config settings, which were accessed through the `settings` field of the repository pointer. This field was initialized via a call to `prepare_repo_settings()`. Additionally, this fixes an issue in fuzz-commit-graph: In 44c7e62 (2021-12-06, repo-settings:prepare_repo_settings only in git repos), prepare_repo_settings was changed to issue a BUG() if it is called by a process whose CWD is not a Git repository. The combination of commits mentioned above broke fuzz-commit-graph, which attempts to parse arbitrary fuzzing-engine-provided bytes as a commit graph file. Prior to this change, parse_commit_graph() called prepare_repo_settings(), but since we run the fuzz tests without a valid repository, we are hitting the BUG() from 44c7e62 for every test case. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Josh Steadmon <steadmon@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-30commit-graph.c: no lazy fetch in lookup_commit_in_graph()Han Xin1-1/+1
The commit-graph is used to opportunistically optimize accesses to certain pieces of information on commit objects, and lookup_commit_in_graph() tries to say "no" when the requested commit does not locally exist by returning NULL, in which case the caller can ask for (which may result in on-demand fetching from a promisor remote) and parse the commit object itself. However, it uses a wrong helper, repo_has_object_file(), to do so. This helper not only checks if an object is mmediately available in the local object store, but also tries to fetch from a promisor remote. But the fetch machinery calls lookup_commit_in_graph(), thus causing an infinite loop. We should make lookup_commit_in_graph() expect that a commit given to it can be legitimately missing from the local object store, by using the has_object_file() helper instead. Signed-off-by: Han Xin <hanxin.hx@bytedance.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-03Merge branch 'tb/cruft-packs'Junio C Hamano1-15/+3
A mechanism to pack unreachable objects into a "cruft pack", instead of ejecting them into loose form to be reclaimed later, has been introduced. * tb/cruft-packs: sha1-file.c: don't freshen cruft packs builtin/gc.c: conditionally avoid pruning objects via loose builtin/repack.c: add cruft packs to MIDX during geometric repack builtin/repack.c: use named flags for existing_packs builtin/repack.c: allow configuring cruft pack generation builtin/repack.c: support generating a cruft pack builtin/pack-objects.c: --cruft with expiration reachable: report precise timestamps from objects in cruft packs reachable: add options to add_unseen_recent_objects_to_traversal builtin/pack-objects.c: --cruft without expiration builtin/pack-objects.c: return from create_object_entry() t/helper: add 'pack-mtimes' test-tool pack-mtimes: support writing pack .mtimes files chunk-format.h: extract oid_version() pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' pack-mtimes: support reading .mtimes files Documentation/technical: add cruft-packs.txt
2022-05-26chunk-format.h: extract oid_version()Taylor Blau1-15/+3
There are three definitions of an identical function which converts `the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a copy of this function for writing both the commit-graph and multi-pack-index file, and another inline definition used to write the .rev header. Consolidate these into a single definition in chunk-format.h. It's not clear that this is the best header to define this function in, but it should do for now. (Worth noting, the .rev caller expects a 4-byte unsigned, but the other two callers work with a single unsigned byte. The consolidated version uses the latter type, and lets the compiler widen it when required). Another caller will be added in a subsequent patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-23Merge branch 'ab/valgrind-fixes'Junio C Hamano1-2/+4
A bit of test framework fixes with a few fixes to issues found by valgrind. * ab/valgrind-fixes: commit-graph.c: don't assume that stat() succeeds object-file: fix a unpack_loose_header() regression in 3b6a8db3b03 log test: skip a failing mkstemp() test under valgrind tests: using custom GIT_EXEC_PATH breaks --valgrind tests
2022-05-20Merge branch 'ep/maint-equals-null-cocci'Junio C Hamano1-2/+2
Introduce and apply coccinelle rule to discourage an explicit comparison between a pointer and NULL, and applies the clean-up to the maintenance track. * ep/maint-equals-null-cocci: tree-wide: apply equals-null.cocci tree-wide: apply equals-null.cocci contrib/coccinnelle: add equals-null.cocci
2022-05-12commit-graph.c: don't assume that stat() succeedsÆvar Arnfjörð Bjarmason1-2/+4
Fix code added in 8d84097f965 (commit-graph: expire commit-graph files, 2019-06-18) to check the return value of the stat() system call. Not doing so caused us to use uninitialized memory in the "Bloom generation is limited by --max-new-filters" test in t4216-log-bloom.sh: + rm -f trace.event + pwd + GIT_TRACE2_EVENT=[...]/t/trash directory.t4216-log-bloom/limits/trace.event git commit-graph write --reachable --split=replace --changed-paths --max-new-filters=2 ==24835== Syscall param utimensat(times[0].tv_sec) points to uninitialised byte(s) ==24835== at 0x499E65A: __utimensat64_helper (utimensat.c:34) ==24835== by 0x4999142: utime (utime.c:36) ==24835== by 0x552BE0: mark_commit_graphs (commit-graph.c:2213) ==24835== by 0x550822: write_commit_graph (commit-graph.c:2424) ==24835== by 0x54E3A0: write_commit_graph_reachable (commit-graph.c:1681) ==24835== by 0x4374BB: graph_write (commit-graph.c:269) ==24835== by 0x436F7D: cmd_commit_graph (commit-graph.c:326) ==24835== by 0x407B9A: run_builtin (git.c:465) ==24835== by 0x406651: handle_builtin (git.c:719) ==24835== by 0x407575: run_argv (git.c:786) ==24835== by 0x406410: cmd_main (git.c:917) ==24835== by 0x511F09: main (common-main.c:56) ==24835== Address 0x1ffeffde70 is on thread 1's stack ==24835== in frame #1, created by utime (utime.c:25) ==24835== Uninitialised value was created by a stack allocation ==24835== at 0x552B50: mark_commit_graphs (commit-graph.c:2201) ==24835== [...] error: last command exited with $?=126 not ok 137 - Bloom generation is limited by --max-new-filters This would happen as we stat'd the non-existing ".git/objects/info/commit-graph" file. Let's fix mark_commit_graphs() to check the stat()'s return value, and while we're at it fix another case added in the same commit to do the same. The caller in expire_commit_graphs() would have been less likely to run into this, as it's operating on files it just got from readdir(), but it could still happen due to a race with e.g. a concurrent "rm -rf" of the commit-graph files. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-02Merge branch 'ep/maint-equals-null-cocci' for maint-2.35Junio C Hamano1-2/+2
* ep/maint-equals-null-cocci: tree-wide: apply equals-null.cocci contrib/coccinnelle: add equals-null.cocci
2022-05-02tree-wide: apply equals-null.cocciJunio C Hamano1-2/+2
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-20commit-graph: close file before returning NULLKleber Tarcísio1-3/+6
There are two reasons that we could return NULL early within load_commit_graph_chain(): 1. The file does not exist, so the file pointer is NULL. 2. The file exists, but is too small to contain a single hash. These were grouped together when the function was first written in 5c84b3396 (commit-graph: load commit-graph chains, 2019-06-18) in order to simplify how the 'chain_name' string is freed. However, the current code leaves a narrow window where the file pointer is not closed when the file exists, but is rejected for being too small. Split out these cases separately to ensure we close the file in this case. Signed-off-by: Kleber Tarcísio <klebertarcisio@yahoo.com.br> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-25Merge branch 'ns/core-fsyncmethod'Junio C Hamano1-1/+2
Replace core.fsyncObjectFiles with two new configuration variables, core.fsync and core.fsyncMethod. * ns/core-fsyncmethod: core.fsync: documentation and user-friendly aggregate options core.fsync: new option to harden the index core.fsync: add configuration parsing core.fsync: introduce granular fsync control infrastructure core.fsyncmethod: add writeout-only mode wrapper: make inclusion of Windows csprng header tightly scoped
2022-03-16Merge branch 'ab/string-list-count-in-size-t'Junio C Hamano1-3/+3
Count string_list items in size_t, not "unsigned int". * ab/string-list-count-in-size-t: string-list API: change "nr" and "alloc" to "size_t" gettext API users: don't explicitly cast ngettext()'s "n"
2022-03-16Merge branch 'ds/commit-graph-gen-v2-fixes'Junio C Hamano1-6/+13
Fixes to the way generation number v2 in the commit-graph files are (not) handled. * ds/commit-graph-gen-v2-fixes: commit-graph: declare bankruptcy on GDAT chunks commit-graph: fix generation number v2 overflow values commit-graph: start parsing generation v2 (again) commit-graph: fix ordering bug in generation numbers t5318: extract helpers to lib-commit-graph.sh test-read-graph: include extra post-parse info
2022-03-10core.fsync: introduce granular fsync control infrastructureNeeraj Singh1-1/+2
This commit introduces the infrastructure for the core.fsync configuration knob. The repository components we want to sync are identified by flags so that we can turn on or off syncing for specific components. If core.fsyncObjectFiles is set and the core.fsync configuration also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any loose objects. This picks the strictest data integrity behavior if core.fsync and core.fsyncObjectFiles are set to conflicting values. This change introduces the currently unused fsync_component helper, which will be used by a later patch that adds fsyncing to the refs backend. Actual configuration and documentation of the fsync components list are in other patches in the series to separate review of the underlying mechanism from the policy of how it's configured. Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-07string-list API: change "nr" and "alloc" to "size_t"Ævar Arnfjörð Bjarmason1-3/+3
Change the "nr" and "alloc" members of "struct string_list" to use "size_t" instead of "nr". On some platforms the size of an "unsigned int" will be smaller than a "size_t", e.g. a 32 bit unsigned v.s. 64 bit unsigned. As "struct string_list" is a generic API we use in a lot of places this might cause overflows. As one example: code in "refs.c" keeps track of the number of refs with a "size_t", and auxiliary code in builtin/remote.c in get_ref_states() appends those to a "struct string_list". While we're at it split the "nr" and "alloc" in string-list.h across two lines, which is the case for most such struct member declarations (e.g. in "strbuf.h" and "strvec.h"). Changing e.g. "int i" to "size_t i" in run_and_feed_hook() isn't strictly necessary, and there are a lot more cases where we'll use a local "int", "unsigned int" etc. variable derived from the "nr" in the "struct string_list". But in that case as well as add_wrapped_shortlog_msg() in builtin/shortlog.c we need to adjust the printf format referring to "nr" anyway, so let's also change the other variables referring to it. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-07commit-graph: declare bankruptcy on GDAT chunksDerrick Stolee1-2/+2
The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks store corrected commit date offsets, used for generation number v2. Recent changes have demonstrated that previous versions of Git were incorrectly parsing data from these chunks, but might have also been writing them incorrectly. I asserted [1] that the previous fixes were sufficient because the known reasons for incorrectly writing generation number v2 data relied on parsing the information incorrectly out of a commit-graph file, but the previous versions of Git were not reading the generation number v2 data. However, Patrick demonstrated [2] a case where in split commit-graphs across an alternate boundary (and possibly some other special conditions) it was possible to have a commit-graph that was generated by a previous version of Git have incorrect generation number v2 data which results in errors like the following: commit-graph generation for commit <oid> is 1623273624 < 1623273710 [1] https://lore.kernel.org/git/f50e74f0-9ffa-f4f2-4663-269801495ed3@github.com/ [2] https://lore.kernel.org/git/Yh93vOkt2DkrGPh2@ncase/ Clearly, there is something else going on. The situation is not completely understood, but the errors do not reproduce if the commit-graphs are all generated by a Git version including these recent fixes. If we cannot trust the existing data in the GDAT and GDOV chunks, then we can alter the format to change the chunk IDs for these chunks. This causes the new version of Git to silently ignore the older chunks (and disabling generation number v2 in the process) while writing new commit-graph files with correct data in the GDA2 and GDO2 chunks. Update commit-graph-format.txt including a historical note about these deprecated chunks. Reported-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-04lockfile API users: simplify and don't leak "path"Ævar Arnfjörð Bjarmason1-0/+2
Fix a memory leak in code added in 6c622f9f0bb (commit-graph: write commit-graph chains, 2019-06-18). We needed to free the "lock_name" if we encounter errors, and the "graph_name" after we'd run unlink() on it. For the case of write_commit_graph_file() refactoring the code to free the "lock_name" after we were done using the "struct lock_file lk" would have made the control flow more complex. Luckily we can free the "lock_file" right after the hold_lock_file_for_update() call, if it makes use of "path" at all it'll have copied its contents to a "struct strbuf" of its own. While I'm at it let's fix code added in fb10ca5b543 (sparse-checkout: write using lockfile, 2019-11-21) in write_patterns_and_update() to avoid the same complexity that I thought I needed when I wrote the initial fix for write_commit_graph_file(). We can free the "sparse_filename" right after calling hold_lock_file_for_update(), we don't need to wait until we're exiting the function to do so. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-04commit-graph: stop fill_oids_from_packs() progress on error and free()Ævar Arnfjörð Bjarmason1-5/+7
Fix a bug in fill_oids_from_packs(), we should always stop_progress(), but did not do so if we returned an error here. This also plugs a memory leak in those cases by releasing the two "struct strbuf" variables the function uses. While I'm at it stop hardcoding "-1" here and just use the return value of error() instead, which happens to be "-1". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-04commit-graph: fix memory leak in misused string_list APIÆvar Arnfjörð Bjarmason1-2/+2
When this code was migrated to the string_list API in d88b14b3fd6 (commit-graph: use string-list API for input, 2018-06-27) it was made to use used both STRING_LIST_INIT_NODUP and a strbuf_detach() pattern. Those should not be used together if string_list_clear() is expected to free the memory, instead we need to either use STRING_LIST_INIT_DUP with a string_list_append_nodup(), or a STRING_LIST_INIT_NODUP and manually fiddle with the "strdup_strings" member before calling string_list_clear(). Let's do the former. Since "strdup_strings = 1" is set now other code might be broken by relying on "pack_indexes" not to duplicate it strings, but that doesn't happen. When we pass this down to write_commit_graph() that code uses the "struct string_list" without modifying it. Let's add a "const" to the variable to have the compiler enforce that assumption. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-01commit-graph: fix generation number v2 overflow valuesDerrick Stolee1-1/+1
The Generation Data Chunk was implemented and tested in e8b63005c (commit-graph: implement generation data chunk, 2021-01-16), but the test was carefully constructed to work on systems with 32-bit dates. Since the corrected commit date offsets still required more than 31 bits, this triggered writing the generation_data_overflow chunk. However, upon closer look, the write_graph_chunk_generation_data_overflow() method writes the offsets to the chunk (as dictated by the format) but fill_commit_graph_info() treats the value in the chunk as if it is the full corrected commit date (not an offset). For some reason, this does not cause an issue when using the FUTURE_DATE specified in t5318-commit-graph.sh, but it does show up as a failure in 'git commit-graph verify' if we increase that FUTURE_DATE to be above four billion. Fix this error and create a 64-bit timestamp version of the test so we can test these larger values. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-01commit-graph: start parsing generation v2 (again)Derrick Stolee1-0/+3
The 'read_generation_data' member of 'struct commit_graph' was introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire chain does, 2021-01-16). The intention was to avoid using corrected commit dates if not all layers of a commit-graph had that data stored. The logic in validate_mixed_generation_chain() at that point incorrectly initialized read_generation_data to 1 if and only if the tip commit-graph contained the Corrected Commit Date chunk. This was "fixed" in 448a39e65 (commit-graph: validate layers for generation data, 2021-02-02) to validate that read_generation_data was either non-zero for all layers, or it would set read_generation_data to zero for all layers. The problem here is that read_generation_data is not initialized to be non-zero anywhere! This change initializes read_generation_data immediately after the chunk is parsed, so each layer will have its value present as soon as possible. The read_generation_data member is used in fill_commit_graph_info() to determine if we should use the corrected commit date or the topological levels stored in the Commit Data chunk. Due to this bug, all previous versions of Git were defaulting to topological levels in all cases! This can be measured with some performance tests. Using the Linux kernel as a testbed, I generated a complete commit-graph containing corrected commit dates and tested the 'new' version against the previous, 'old' version. First, rev-list with --topo-order demonstrates a 26% improvement using corrected commit dates: hyperfine \ -n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \ -n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \ --warmup=10 Benchmark 1: old Time (mean ± σ): 57.1 ms ± 3.1 ms Range (min … max): 52.9 ms … 62.0 ms 55 runs Benchmark 2: new Time (mean ± σ): 45.5 ms ± 3.3 ms Range (min … max): 39.9 ms … 51.7 ms 59 runs Summary 'new' ran 1.26 ± 0.11 times faster than 'old' These performance improvements are due to the algorithmic improvements given by walking fewer commits due to the higher cutoffs from corrected commit dates. However, this comes at a cost. The additional I/O cost of parsing the corrected commit dates is visible in case of merge-base commands that do not reduce the overall number of walked commits. hyperfine \ -n "old" "$OLD_GIT merge-base v4.8 v4.9" \ -n "new" "$NEW_GIT merge-base v4.8 v4.9" \ --warmup=10 Benchmark 1: old Time (mean ± σ): 110.4 ms ± 6.4 ms Range (min … max): 96.0 ms … 118.3 ms 25 runs Benchmark 2: new Time (mean ± σ): 150.7 ms ± 1.1 ms Range (min … max): 149.3 ms … 153.4 ms 19 runs Summary 'old' ran 1.36 ± 0.08 times faster than 'new' Performance issues like this are what motivated 702110aac (commit-graph: use config to specify generation type, 2021-02-25). In the future, we could fix this performance problem by inserting the corrected commit date offsets into the Commit Date chunk instead of having that data in an extra chunk. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-01commit-graph: fix ordering bug in generation numbersDerrick Stolee1-3/+7
When computing the generation numbers for a commit-graph, we compute the corrected commit dates and then check if their offsets from the actual dates is too large to fit in the 32-bit Generation Data chunk. However, there is a problem with this approach: if we have parsed the generation data from the previous commit-graph, then we continue the loop because the corrected commit date is already computed. This causes an under-count in the number of overflow values. It is incorrect to add an increment to num_generation_data_overflows next to this 'continue' statement, because we might start double-counting commits that are computed because of the depth-first search walk from a commit with an earlier OID. Instead, iterate over the full commit list at the end, checking the offsets to see how many grow beyond the maximum value. Create a new t5328-commit-graph-64-bit-time.sh test script to handle special cases of testing 64-bit timestamps. This helps demonstrate this bug in more cases. It still won't hit all potential cases until the next change, which reenables reading generation numbers. Use the skip_all trick from 0a2bfccb9c8 (t0051: use "skip_all" under !MINGW in single-test file, 2022-02-04) to make the output clean when run on a 32-bit system. Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-06commit-graph: return if there is no git directoryLessley Dennington1-1/+4
Return early if git directory does not exist. This will protect against test failures in the upcoming change to BUG in prepare_repo_settings if no git directory exists. Signed-off-by: Lessley Dennington <lessleydennington@gmail.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-20Merge branch 'js/run-command-close-packs'Junio C Hamano1-0/+1
The run-command API has been updated so that the callers can easily ask the file descriptors open for packfiles to be closed immediately before spawning commands that may trigger auto-gc. * js/run-command-close-packs: Close object store closer to spawning child processes run_auto_maintenance(): implicitly close the object store run-command: offer to close the object store before running run-command: prettify the `RUN_COMMAND_*` flags pull: release packs before fetching commit-graph: when closing the graph, also release the slab
2021-09-20Merge branch 'ab/progress-users-adjust-counters'Junio C Hamano1-1/+1
The code to show progress indicator in a few code paths did not cover between 0-100%, which has been corrected. * ab/progress-users-adjust-counters: entry: show finer-grained counter in "Filtering content" progress line commit-graph: fix bogus counter in "Scanning merged commits" progress line
2021-09-09commit-graph: fix bogus counter in "Scanning merged commits" progress lineSZEDER Gábor1-1/+1
The final value of the counter of the "Scanning merged commits" progress line is always one less than its expected total, e.g.: Scanning merged commits: 83% (5/6), done. This happens because while iterating over an array the loop variable is passed to display_progress() as-is, but while C arrays (and thus the loop variable) start at 0 and end at N-1, the progress counter must end at N. Fix this by passing 'i + 1' to display_progress(), like most other callsites do. There's an RFC series to add a GIT_TEST_CHECK_PROGRESS=1 mode[1] which catches this issue in the 'fetch.writeCommitGraph' and 'fetch.writeCommitGraph with submodules' tests in 't5510-fetch.sh'. The GIT_TEST_CHECK_PROGRESS=1 mode is not part of this series, but future changes to progress.c may add it or similar assertions to catch this and similar bugs elsewhere. 1. https://lore.kernel.org/git/20210620200303.2328957-1-szeder.dev@gmail.com/ Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-08commit-graph: when closing the graph, also release the slabJohannes Schindelin1-0/+1
The slab has information about the commit graph. That means that it is meaningless (and even misleading) when the commit graph was closed. This seems not to matter currently, but we're about to fix a Windows-specific bug where `git pull` does not close the object store before fetching (risking that an implicit auto-gc fails to remove the now-obsolete pack file(s)), and once we have that bug fix in place, it does matter: after that bug fix, we will open the object store, do some stuff with it, then close it, fetch, and then open it again, and do more stuff. If we close the commit graph without releasing the corresponding slab, we're hit by a symptom like this in t5520.19: BUG: commit-reach.c:85: bad generation skip 9223372036854775807 > 3 at 5cd378271655d43a3b4477520014f02213ad1546 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-09revision: avoid hitting packfiles when commits are in commit-graphPatrick Steinhardt1-0/+24
When queueing references in git-rev-list(1), we try to optimize parsing of commits via the commit-graph. To do so, we first look up the object's type, and if it is a commit we call `repo_parse_commit()` instead of `parse_object()`. This is quite inefficient though given that we're always uncompressing the object header in order to determine the type. Instead, we can opportunistically search the commit-graph for the object ID: in case it's found, we know it's a commit and can directly fill in the commit object without having to uncompress the object header. Expose a new function `lookup_commit_in_graph()`, which tries to find a commit in the commit-graph by ID, and convert `get_reference()` to use this function. This provides a big performance win in cases where we load references in a repository with lots of references pointing to commits. The following has been executed in a real-world repository with about 2.2 million refs: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.458 s ± 0.044 s [User: 4.115 s, System: 0.342 s] Range (min … max): 4.409 s … 4.534 s 10 runs Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 3.089 s ± 0.015 s [User: 2.768 s, System: 0.321 s] Range (min … max): 3.061 s … 3.105 s 10 runs Summary 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran 1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-09commit-graph: split out function to search commit positionPatrick Steinhardt1-25/+30
The function `find_commit_in_graph()` assumes that the caller has passed an object which was already determined to be a commit given that it will access the commit's graph position, which is stored in a commit slab. In a subsequent patch, we want to search for an object ID though without knowing whether it is a commit or not, which is not currently possible. Split out the logic to search the commit graph for a given object ID to prepare for this change. This commit also renames the function to `find_commit_pos_in_graph()`, which more accurately reflects what this function does. Furthermore, in order to allow for the searched object ID to be const, we need to adjust `bsearch_graph()`'s signature to accept a constant object ID as input, too. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-28Merge branch 'ab/attribute-format'Junio C Hamano1-0/+1
Many "printf"-like helper functions we have have been annotated with __attribute__() to catch placeholder/parameter mismatches. * ab/attribute-format: advice.h: add missing __attribute__((format)) & fix usage *.h: add a few missing __attribute__((format)) *.c static functions: add missing __attribute__((format)) sequencer.c: move static function to avoid forward decl *.c static functions: don't forward-declare __attribute__
2021-07-13*.c static functions: add missing __attribute__((format))Ævar Arnfjörð Bjarmason1-0/+1
Add missing __attribute__((format)) function attributes to various "static" functions that take printf arguments. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-28commit-graph: rewrite to use checksum_valid()Taylor Blau1-8/+6
Rewrite an existing caller in `git commit-graph verify` to take advantage of checksum_valid(). Note that the replacement isn't a verbatim cut-and-paste, since the new function avoids using hashfile at all and instead talks to the_hash_algo directly, but it is functionally equivalent. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27commit-graph: don't store file hashes as struct object_idbrian m. carlson1-6/+7
The idea behind struct object_id is that it is supposed to represent the identifier of a standard Git object or a special pseudo-object like the all-zeros object ID. In this case, we have file hashes, which, while similar, are distinct from the identifiers of objects. Switch these code paths to use an unsigned char array. This is both more logically consistent and it means that we need not set the algorithm identifier for the struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27Always use oidread to read into struct object_idbrian m. carlson1-6/+6
In the future, we'll want oidread to automatically set the hash algorithm member for an object ID we read into it, so ensure we use oidread instead of hashcpy everywhere we're copying a hash value into a struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-22Merge branch 'ds/commit-graph-generation-config'Junio C Hamano1-11/+20
A new configuration variable has been introduced to allow choosing which version of the generation number gets used in the commit-graph file. * ds/commit-graph-generation-config: commit-graph: use config to specify generation type commit-graph: create local repository pointer
2021-03-13use CALLOC_ARRAYRené Scharfe1-2/+2
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-01Merge branch 'ds/chunked-file-api'Junio C Hamano1-200/+112
The common code to deal with "chunked file format" that is shared by the multi-pack-index and commit-graph files have been factored out, to help codepaths for both filetypes to become more robust. * ds/chunked-file-api: commit-graph.c: display correct number of chunks when writing chunk-format: add technical docs chunk-format: restore duplicate chunk checks midx: use 64-bit multiplication for chunk sizes midx: use chunk-format read API commit-graph: use chunk-format read API chunk-format: create read chunk API midx: use chunk-format API in write_midx_internal() midx: drop chunk progress during write midx: return success/failure in chunk write methods midx: add num_large_offsets to write_midx_context midx: add pack_perm to write_midx_context midx: add entries to write_midx_context midx: use context in write_midx_pack_names() midx: rename pack_info to write_midx_context commit-graph: use chunk-format write API chunk-format: create chunk format write API commit-graph: anonymize data in chunk_write_fn
2021-03-01Merge branch 'js/commit-graph-warning'Junio C Hamano1-11/+3
* js/commit-graph-warning: Revert "commit-graph: when incompatible with graphs, indicate why"
2021-03-01Revert "commit-graph: when incompatible with graphs, indicate why"Junio C Hamano1-11/+3
This reverts commit c85eec7fc37e1ca79072f263ae6ea1ee305ba38c, as it is a bit overzealous, we are in prerelease freeze, and we want to have enough time to get this right and cook in 'next'. cf. <8735xgkvuo.fsf@evledraar.gmail.com>
2021-02-25commit-graph: use config to specify generation typeDerrick Stolee1-7/+15
We have two established generation number versions: 1: topological levels 2: corrected commit dates The corrected commit dates are enabled by default, but they also write extra data in the GDAT and GDOV chunks. Services that host Git data might want to have more control over when this feature rolls out than just updating the Git binaries. Add a new "commitGraph.generationVersion" config option that specifies the intended generation number version. If this value is less than 2, then the GDAT chunk is never written _or read_ from an existing file. This can replace our use of the GIT_TEST_COMMIT_GRAPH_NO_GDAT environment variable in the test suite. Remove it. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-25commit-graph: create local repository pointerDerrick Stolee1-4/+5
The write_commit_graph() method uses 'the_repository' in a few places. A new need for a repository pointer is coming in the following change, so group these instances into a local variable 'r' that could eventually become part of the method signature, if so desired. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-24commit-graph.c: display correct number of chunks when writingTaylor Blau1-4/+3
When writing a commit-graph, a progress meter is shown which indicates the number of pieces of data to write (one per commit in each chunk). In 47410aa837 (commit-graph: use chunk-format write API, 2021-02-18), the number of chunks became tracked by the new chunk-format API. But a stray local variable was left behind from when write_commit_graph_file() used to keep track of the same. Since this was no longer updated after 47410aa837, the progress meter appeared broken: $ git commit-graph write --reachable Expanding reachable commits in commit graph: 837569, done. Writing out commit graph in 3 passes: 166% (4187845/2512707), done. Drop the local variable and rely instead on the chunk-format API to tell us the correct number of chunks. Reported-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-22commit-graph: avoid leaking topo_levels slab in write_commit_graph()Andrzej Hunt1-0/+1
write_commit_graph initialises topo_levels using init_topo_level_slab(), next it calls compute_topological_levels() which can cause the slab to grow, we therefore need to clear the slab again using clear_topo_level_slab() when we're done. First introduced in 72a2bfca (commit-graph: add a slab to store topological levels, 2021-01-16). LeakSanitizer output: ==1026==ERROR: LeakSanitizer: detected memory leaks Direct leak of 8 byte(s) in 1 object(s) allocated from: #0 0x498ae9 in realloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xafbed8 in xrealloc /src/git/wrapper.c:126:8 #2 0x7966d1 in topo_level_slab_at_peek /src/git/commit-graph.c:71:1 #3 0x7965e0 in topo_level_slab_at /src/git/commit-graph.c:71:1 #4 0x78fbf5 in compute_topological_levels /src/git/commit-graph.c:1472:12 #5 0x78c5c3 in write_commit_graph /src/git/commit-graph.c:2456:2 #6 0x535c5f in graph_write /src/git/builtin/commit-graph.c:299:6 #7 0x5350ca in cmd_commit_graph /src/git/builtin/commit-graph.c:337:11 #8 0x4cddb1 in run_builtin /src/git/git.c:453:11 #9 0x4cabe2 in handle_builtin /src/git/git.c:704:3 #10 0x4cd084 in run_argv /src/git/git.c:771:4 #11 0x4ca424 in cmd_main /src/git/git.c:902:19 #12 0x707fb6 in main /src/git/common-main.c:52:11 #13 0x7fee4249383f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f) Indirect leak of 524256 byte(s) in 1 object(s) allocated from: #0 0x498942 in calloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3 #1 0xafc088 in xcalloc /src/git/wrapper.c:140:8 #2 0x796870 in topo_level_slab_at_peek /src/git/commit-graph.c:71:1 #3 0x7965e0 in topo_level_slab_at /src/git/commit-graph.c:71:1 #4 0x78fbf5 in compute_topological_levels /src/git/commit-graph.c:1472:12 #5 0x78c5c3 in write_commit_graph /src/git/commit-graph.c:2456:2 #6 0x535c5f in graph_write /src/git/builtin/commit-graph.c:299:6 #7 0x5350ca in cmd_commit_graph /src/git/builtin/commit-graph.c:337:11 #8 0x4cddb1 in run_builtin /src/git/git.c:453:11 #9 0x4cabe2 in handle_builtin /src/git/git.c:704:3 #10 0x4cd084 in run_argv /src/git/git.c:771:4 #11 0x4ca424 in cmd_main /src/git/git.c:902:19 #12 0x707fb6 in main /src/git/common-main.c:52:11 #13 0x7fee4249383f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f) SUMMARY: AddressSanitizer: 524264 byte(s) leaked in 2 allocation(s). Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18commit-graph: use chunk-format read APIDerrick Stolee1-105/+54
Instead of parsing the table of contents directly, use the chunk-format API methods read_table_of_contents() and pair_chunk(). While the current implementation loses the duplicate-chunk detection, that will be added in a future change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18commit-graph: use chunk-format write APIDerrick Stolee1-82/+37
The commit-graph write logic is ready to make use of the chunk-format write API. Each chunk write method is already in the correct prototype. We only need to use the 'struct chunkfile' pointer and the correct API calls. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-17Merge branch 'js/commit-graph-warning'Junio C Hamano1-3/+11
When certain features (e.g. grafts) used in the repository are incompatible with the use of the commit-graph, we used to silently turned commit-graph off; we now tell the user what we are doing. * js/commit-graph-warning: commit-graph: when incompatible with graphs, indicate why
2021-02-17Merge branch 'ds/commit-graph-genno-fix'Junio C Hamano1-37/+101
Fix incremental update of commit-graph file around corrected commit date data. * ds/commit-graph-genno-fix: commit-graph: prepare commit graph commit-graph: be extra careful about mixed generations commit-graph: compute generations separately commit-graph: validate layers for generation data commit-graph: always parse before commit_graph_data_at() commit-graph: use repo_parse_commit
2021-02-17Merge branch 'ak/corrected-commit-date'Junio C Hamano1-47/+204
The commit-graph learned to use corrected commit dates instead of the generation number to help topological revision traversal. * ak/corrected-commit-date: doc: add corrected commit date info commit-reach: use corrected commit dates in paint_down_to_common() commit-graph: use generation v2 only if entire chain does commit-graph: implement generation data chunk commit-graph: implement corrected commit date commit-graph: return 64-bit generation number commit-graph: add a slab to store topological levels t6600-test-reach: generalize *_three_modes commit-graph: consolidate fill_commit_graph_info revision: parse parent in indegree_walk_step() commit-graph: fix regression when computing Bloom filters
2021-02-11commit-graph: when incompatible with graphs, indicate whyJohannes Schindelin1-3/+11
When `gc.writeCommitGraph = true`, it is possible that the commit-graph is _still_ not written: replace objects, grafts and shallow repositories are incompatible with the commit-graph feature. Under such circumstances, we need to indicate to the user why the commit-graph was not written instead of staying silent about it. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-10Merge branch 'jk/use-oid-pos'Junio C Hamano1-15/+15
Code clean-up to ensure our use of hashtables using object names as keys use the "struct object_id" objects, not the raw hash values. * jk/use-oid-pos: oid_pos(): access table through const pointers hash_pos(): convert to oid_pos() rerere: use strmap to store rerere directories rerere: tighten rr-cache dirname check rerere: check dirname format while iterating rr_cache directory commit_graft_pos(): take an oid instead of a bare hash
2021-02-05commit-graph: anonymize data in chunk_write_fnDerrick Stolee1-10/+19
In preparation for creating an API around file formats using chunks and tables of contents, prepare the commit-graph write code to use prototypes that will match this new API. Specifically, convert chunk_write_fn to take a "void *data" parameter instead of the commit-graph-specific "struct write_commit_graph_context" pointer. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-03Merge branch 'jk/peel-iterated-oid'Junio C Hamano1-1/+1
The peel_ref() API has been replaced with peel_iterated_oid(). * jk/peel-iterated-oid: refs: switch peel_ref() to peel_iterated_oid()
2021-02-01commit-graph: prepare commit graphDerrick Stolee1-8/+2
Before checking if the repository has a commit-graph loaded, be sure to run prepare_commit_graph(). This is necessary because otherwise the topo_levels slab is not initialized. As we compute topo_levels for the new commits, we iterate further into the lower layers since the first visit to each commit looks as though the topo_level is not populated. By properly initializing the topo_slab, we fix the previously broken case of a split commit graph where a base layer has the generation_data_overflow chunk. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01commit-graph: be extra careful about mixed generationsDerrick Stolee1-2/+12
When upgrading to a commit-graph with corrected commit dates from one without, there are a few things that need to be considered. When computing generation numbers for the new commit-graph file that expects to add the generation_data chunk with corrected commit dates, we need to ensure that the 'generation' member of the commit_graph_data struct is set to zero for these commits. Unfortunately, the fallback to use topological level for generation number when corrected commit dates are not available are causing us harm here: parsing commits notices that read_generation_data is false and populates 'generation' with the topological level. The solution is to iterate through the commits, parse the commits to populate initial values, then reset the generation values to zero to trigger recalculation. This loop only occurs when the existing commit-graph data has no corrected commit dates. While this improves our situation somewhat, we have not completely solved the issue for correctly computing generation numbers for mixed layers. That follows in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01commit-graph: compute generations separatelyDerrick Stolee1-14/+56
The compute_generation_numbers() method was introduced by 3258c663 (commit-graph: compute generation numbers, 2018-05-01) to compute what is now known as "topological levels". These are still stored in the commit-graph file for compatibility sake while c1a09119 (commit-graph: implement corrected commit date, 2021-01-16) updated the method to also compute the new version of generation numbers: corrected commit date. It makes sense why these are grouped. They perform very similar walks of the necessary commits and compute similar maximums over each parent. However, having these two together conflates them in subtle ways that is hard to separate. In particular, the topo_level slab is used to store the topological levels in all cases, but the commit_graph_data_at(c)->generation member stores different values depending on the state of the existing commit-graph file. * If the existing commit-graph file has a "GDAT" chunk, then these values represent corrected commit dates. * If the existing commit-graph file doesn't have a "GDAT" chunk, then these values are actually the topological levels. This issue only occurs only when upgrading an existing commit-graph file into one that has the "GDAT" chunk. The current change does not resolve this upgrade problem, but splitting the implementation into two pieces here helps with that process, which will follow in the next change. The important thing this helps with is the case where the num_generation_data_overflows was being incremented incorrectly, triggering a write of the overflow chunk. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01commit-graph: validate layers for generation dataDerrick Stolee1-6/+16
We need to be extra careful that we don't use corrected commit dates from any layer of a commit-graph chain if there is a single commit-graph file that is missing the generation_data chunk. Update validate_mixed_generation_chain() to correctly update each layer to ignore the generation_data chunk in this case. It now also returns 1 if all layers have a generation_data chunk. This return value will be used in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01commit-graph: always parse before commit_graph_data_at()Derrick Stolee1-4/+12
There is a subtle failure happening when computing corrected commit dates with --split enabled. It requires a base layer needing the generation_data_overflow chunk. Then, the next layer on top erroneously thinks it needs an overflow chunk due to a bug leading to recalculating all reachable generation numbers. The output of the failure is BUG: commit-graph.c:1912: expected to write 8 bytes to chunk 47444f56, but wrote 0 instead These "expected" 8 bytes are due to re-computing the corrected commit date for the lower layer but the new layer does not need any overflow. Add a test to t5318-commit-graph.sh that demonstrates this bug. However, it does not trigger consistently with the existing code. The generation number data is stored in a slab and accessed by commit_graph_data_at(). This data is initialized when parsing a commit, but is otherwise used assuming it has been populated. The loop in compute_generation_numbers() did not enforce that all reachable commits were parsed and had correct values. This could lead to some problems when writing a commit-graph with corrected commit dates based on a commit-graph without them. It has been difficult to identify the issue here because it was so hard to reproduce. It relies on this uninitialized data having a non-zero value, but also on specifically in a way that overwrites the existing data. This patch adds the extra parse to ensure the data is filled before we compute the generation number of a commit. This triggers the new test to fail because the generation number overflow count does not match between this computation and the write for that chunk. The actual fix will follow as the next few changes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01commit-graph: use repo_parse_commitDerrick Stolee1-5/+5
The write_commit_graph_context has a repository pointer, so use it. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-28oid_pos(): access table through const pointersJeff King1-2/+2
When we are looking up an oid in an array, we obviously don't need to write to the array. Let's mark it as const in the function interfaces, as well as in the local variables we use to derference the void pointer (note a few cases use pointers-to-pointers, so we mark everything const). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-28hash_pos(): convert to oid_pos()Jeff King1-14/+14
All of our callers are actually looking up an object_id, not a bare hash. Likewise, the arrays they are looking in are actual arrays of object_id (not just raw bytes of hashes, as we might find in a pack .idx; those are handled by bsearch_hash()). Using an object_id gives us more type safety, and makes the callers slightly shorter. It also gets rid of the word "sha1" from several access functions, though we could obviously also rename those with s/sha1/hash/. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-25Merge branch 'ma/more-opaque-lock-file'Junio C Hamano1-3/+3
Code clean-up. * ma/more-opaque-lock-file: read-cache: try not to peek into `struct {lock_,temp}file` refs/files-backend: don't peek into `struct lock_file` midx: don't peek into `struct lock_file` commit-graph: don't peek into `struct lock_file` builtin/gc: don't peek into `struct lock_file`
2021-01-21refs: switch peel_ref() to peel_iterated_oid()Jeff King1-1/+1
The peel_ref() interface is confusing and error-prone: - it's typically used by ref iteration callbacks that have both a refname and oid. But since they pass only the refname, we may load the ref value from the filesystem again. This is inefficient, but also means we are open to a race if somebody simultaneously updates the ref. E.g., this: int some_ref_cb(const char *refname, const struct object_id *oid, ...) { if (!peel_ref(refname, &peeled)) printf("%s peels to %s", oid_to_hex(oid), oid_to_hex(&peeled); } could print nonsense. It is correct to say "refname peels to..." (you may see the "before" value or the "after" value, either of which is consistent), but mentioning both oids may be mixing before/after values. Worse, whether this is possible depends on whether the optimization to read from the current iterator value kicks in. So it is actually not possible with: for_each_ref(some_ref_cb); but it _is_ possible with: head_ref(some_ref_cb); which does not use the iterator mechanism (though in practice, HEAD should never peel to anything, so this may not be triggerable). - it must take a fully-qualified refname for the read_ref_full() code path to work. Yet we routinely pass it partial refnames from callbacks to for_each_tag_ref(), etc. This happens to work when iterating because there we do not call read_ref_full() at all, and only use the passed refname to check if it is the same as the iterator. But the requirements for the function parameters are quite unclear. Instead of taking a refname, let's instead take an oid. That fixes both problems. It's a little funny for a "ref" function not to involve refs at all. The key thing is that it's optimizing under the hood based on having access to the ref iterator. So let's change the name to make it clear why you'd want this function versus just peel_object(). There are two other directions I considered but rejected: - we could pass the peel information into the each_ref_fn callback. However, we don't know if the caller actually wants it or not. For packed-refs, providing it is essentially free. But for loose refs, we actually have to peel the object, which would be wasteful in most cases. We could likewise pass in a flag to the callback indicating whether the peeled information is known, but that complicates those callbacks, as they then have to decide whether to manually peel themselves. Plus it requires changing the interface of every callback, whether they care about peeling or not, and there are many of them. - we could make a function to return the peeled value of the current iterated ref (computing it if necessary), and BUG() otherwise. I.e.: int peel_current_iterated_ref(struct object_id *out); Each of the current callers is an each_ref_fn callback, so they'd mostly be happy. But: - we use those callbacks with functions like head_ref(), which do not use the iteration code. So we'd need to handle the fallback case there, anyway. - it's possible that a caller would want to call into generic code that sometimes is used during iteration and sometimes not. This encapsulates the logic to do the fast thing when possible, and fallback when necessary. The implementation is mostly obvious, but I want to call out a few things in the patch: - the test-tool coverage for peel_ref() is now meaningless, as it all collapses to a single peel_object() call (arguably they were pretty uninteresting before; the tricky part of that function is the fast-path we see during iteration, but these calls didn't trigger that). I've just dropped it entirely, though note that some other tests relied on the tags we created; I've moved that creation to the tests where it matters. - we no longer need to take a ref_store parameter, since we'd never look up a ref now. We do still rely on a global "current iterator" variable which _could_ be kept per-ref-store. But in practice this is only useful if there are multiple recursive iterations, at which point the more appropriate solution is probably a stack of iterators. No caller used the actual ref-store parameter anyway (they all call the wrapper that passes the_repository). - the original only kicked in the optimization when the "refname" pointer matched (i.e., not string comparison). We do likewise with the "oid" parameter here, but fall back to doing an actual oideq() call. This in theory lets us kick in the optimization more often, though in practice no current caller cares. It should never be wrong, though (peeling is a property of an object, so two refs pointing to the same object would peel identically). - the original took care not to touch the peeled out-parameter unless we found something to put in it. But no caller cares about this, and anyway, it is enforced by peel_object() itself (and even in the optimized iterator case, that's where we eventually end up). We can shorten the code and avoid an extra copy by just passing the out-parameter through the stack. Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-reach: use corrected commit dates in paint_down_to_common()Abhishek Kumar1-0/+14
091f4cf (commit: don't use generation numbers if not needed, 2018-08-30) changed paint_down_to_common() to use commit dates instead of generation numbers v1 (topological levels) as the performance regressed on certain topologies. With generation number v2 (corrected commit dates) implemented, we no longer have to rely on commit dates and can use generation numbers. For example, the command `git merge-base v4.8 v4.9` on the Linux repository walks 167468 commits, taking 0.135s for committer date and 167496 commits, taking 0.157s for corrected committer date respectively. While using corrected commit dates, Git walks nearly the same number of commits as commit date, the process is slower as for each comparision we have to access a commit-slab (for corrected committer date) instead of accessing struct member (for committer date). This change incidentally broke the fragile t6404-recursive-merge test. t6404-recursive-merge sets up a unique repository where all commits have the same committer date without a well-defined merge-base. While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer date as a heuristic in paint_down_to_common(). 6404.1 'combined merge conflicts' merges commits in the order: - Merge C with B to form an intermediate commit. - Merge the intermediate commit with A. With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently use the corrected committer date, which changes the order in which commits are merged: - Merge A with B to form an intermediate commit. - Merge the intermediate commit with C. While resulting repositories are equivalent, 6404.4 'virtual trees were processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting different merge-bases and thus have different object ids for the intermediate commits. As this has already causes problems (as noted in 859fdc0 (commit-graph: define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within t6404-recursive-merge. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: use generation v2 only if entire chain doesAbhishek Kumar1-2/+28
Since there are released versions of Git that understand generation numbers in the commit-graph's CDAT chunk but do not understand the GDAT chunk, the following scenario is possible: 1. "New" Git writes a commit-graph with the GDAT chunk. 2. "Old" Git writes a split commit-graph on top without a GDAT chunk. If each layer of split commit-graph is treated independently, as it was the case before this commit, with Git inspecting only the current layer for chunk_generation_data pointer, commits in the lower layer (one with GDAT) whould have corrected commit date as their generation number, while commits in the upper layer would have topological levels as their generation. Corrected commit dates usually have much larger values than topological levels. This means that if we take two commits, one from the upper layer, and one reachable from it in the lower layer, then the expectation that the generation of a parent is smaller than the generation of a child would be violated. It is difficult to expose this issue in a test. Since we _start_ with artificially low generation numbers, any commit walk that prioritizes generation numbers will walk all of the commits with high generation number before walking the commits with low generation number. In all the cases I tried, the commit-graph layers themselves "protect" any incorrect behavior since none of the commits in the lower layer can reach the commits in the upper layer. This issue would manifest itself as a performance problem in this case, especially with something like "git log --graph" since the low generation numbers would cause the in-degree queue to walk all of the commits in the lower layer before allowing the topo-order queue to write anything to output (depending on the size of the upper layer). Therefore, When writing the new layer in split commit-graph, we write a GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees that if a layer has GDAT chunk, all lower layers must have a GDAT chunk as well. Rewriting layers follows similar approach: if the topmost layer below the set of layers being rewritten (in the split commit-graph chain) exists, and it does not contain GDAT chunk, then the result of rewrite does not have GDAT chunks either. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: implement generation data chunkAbhishek Kumar1-11/+103
As discovered by Ævar, we cannot increment graph version to distinguish between generation numbers v1 and v2 [1]. Thus, one of pre-requistes before implementing generation number v2 was to distinguish between graph versions in a backwards compatible manner. We are going to introduce a new chunk called Generation DATa chunk (or GDAT). GDAT will store corrected committer date offsets whereas CDAT will still store topological level. Old Git does not understand GDAT chunk and would ignore it, reading topological levels from CDAT. New Git can parse GDAT and take advantage of newer generation numbers, falling back to topological levels when GDAT chunk is missing (as it would happen with a commit-graph written by old Git). We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT' which forces commit-graph file to be written without generation data chunk to emulate a commit-graph file written by old Git. To minimize the space required to store corrrected commit date, Git stores corrected commit date offsets into the commit-graph file, instea of corrected commit dates. This saves us 4 bytes per commit, decreasing the GDAT chunk size by half, but it's possible for the offset to overflow the 4-bytes allocated for storage. As such overflows are and should be exceedingly rare, we use the following overflow management scheme: We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV') to store corrected commit dates for commits with offsets greater than GENERATION_NUMBER_V2_OFFSET_MAX. If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set the MSB of the offset and the other bits store the position of corrected commit date in GDOV chunk, similar to how Extra Edge List is maintained. We test the overflow-related code with the following repo history: F - N - U / \ U - N - U N \ / N - F - N Where the commits denoted by U have committer date of zero seconds since Unix epoch, the commits denoted by N have committer date of 1112354055 (default committer date for the test suite) seconds since Unix epoch and the commits denoted by F have committer date of (2 ^ 31 - 2) seconds since Unix epoch. The largest offset observed is 2 ^ 31, just large enough to overflow. [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/ Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: implement corrected commit dateAbhishek Kumar1-4/+17
With most of preparations done, let's implement corrected commit date. The corrected commit date for a commit is defined as: * A commit with no parents (a root commit) has corrected commit date equal to its committer date. * A commit with at least one parent has corrected commit date equal to the maximum of its commit date and one more than the largest corrected commit date among its parents. As a special case, a root commit with timestamp of zero (01.01.1970 00:00:00Z) has corrected commit date of one, to be able to distinguish from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date). To minimize the space required to store corrected commit date, Git stores corrected commit date offsets into the commit-graph file. The corrected commit date offset for a commit is defined as the difference between its corrected commit date and actual commit date. Storing corrected commit date requires sizeof(timestamp_t) bytes, which in most cases is 64 bits (uintmax_t). However, corrected commit date offsets can be safely stored using only 32-bits. This halves the size of GDAT chunk, which is a reduction of around 6% in the size of commit-graph file. However, using offsets be problematic if a commit is malformed but valid and has committer date of 0 Unix time, as the offset would be the same as corrected commit date and thus require 64-bits to be stored properly. While Git does not write out offsets at this stage, Git stores the corrected commit dates in member generation of struct commit_graph_data. It will begin writing commit date offsets with the introduction of generation data chunk. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: return 64-bit generation numberAbhishek Kumar1-11/+11
In a preparatory step for introducing corrected commit dates, let's return timestamp_t values from commit_graph_generation(), use timestamp_t for local variables and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead. We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to represent the largest topological level we can store in the commit data chunk. With corrected commit dates implemented, we will have two such *_MAX variables to denote the largest offset and largest topological level that can be stored. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: add a slab to store topological levelsAbhishek Kumar1-15/+30
In a later commit we will introduce corrected commit date as the generation number v2. Corrected commit dates will be stored in the new seperate Generation Data chunk. However, to ensure backwards compatibility with "Old" Git we need to continue to write generation number v1 (topological levels) to the commit data chunk. Thus, we need to compute and store both versions of generation numbers to write the commit-graph file. Therefore, let's introduce a commit-slab `topo_level_slab` to store topological levels; corrected commit date will be stored in the member `generation` of struct commit_graph_data. The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO` mark commits not in the commit-graph file and commits written by a version of Git that did not compute generation numbers respectively. Generation numbers are computed identically for both kinds of commits. A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit is not in the commit-graph file. However, since the slab is zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`). Thus, we no longer need to check if the topological level of a commit is `GENERATION_NUMBER_INFINITY`. We will add a pointer to the slab in `struct write_commit_graph_context` and `struct commit_graph` to populate the slab in `fill_commit_graph_info` if the commit has a pre-computed topological level as in case of split commit-graphs. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: consolidate fill_commit_graph_infoAbhishek Kumar1-17/+10
Both fill_commit_graph_info() and fill_commit_in_graph() parse information present in commit data chunk. Let's simplify the implementation by calling fill_commit_graph_info() within fill_commit_in_graph(). fill_commit_graph_info() used to not load committer data from commit data chunk. However, with the upcoming switch to using corrected committer date as generation number v2, we will have to load committer date to compute generation number value anyway. e51217e15 (t5000: test tar files that overflow ustar headers, 30-06-2016) introduced a test 'generate tar with future mtime' that creates a commit with committer date of (2^36 + 1) seconds since EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus committer time overflows into generation number (within CDAT chunk) and has undefined behavior. The test used to pass as fill_commit_graph_info() would not set struct member `date` of struct commit and load committer date from the object database, generating a tar file with the expected mtime. However, with corrected commit date, we will load the committer date from CDAT chunk (truncated to lower 34-bits to populate the generation number. Thus, Git sets date and generates tar file with the truncated mtime. The ustar format (the header format used by most modern tar programs) only has room for 11 (or 12, depending on some implementations) octal digits for the size and mtime of each file. As the CDAT chunk is overflow by 12-octal digits but not 11-octal digits, we split the existing tests to test both implementations separately and add a new explicit test for 11-digit implementation. To test the 11-octal digit implementation, we create a future commit with committer date of 2^34 - 1, which overflows 11-octal digits without overflowing 34-bits of the Commit Date chunks. To test the 12-octal digit implementation, the smallest committer date possible is 2^36 + 1, which overflows the CDAT chunk and thus commit-graph must be disabled for the test. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: fix regression when computing Bloom filtersAbhishek Kumar1-2/+6
Before computing Bloom filters, the commit-graph machinery uses commit_gen_cmp to sort commits by generation order for improved diff performance. 3d11275505 (commit-graph: examine commits by generation number, 2020-03-30) claims that this sort can reduce the time spent to compute Bloom filters by nearly half. But since c49c82aa4c (commit: move members graph_pos, generation to a slab, 2020-06-17), this optimization is broken, since asking for a 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY while writing. Not all hope is lost, though: 'commit_gen_cmp()' falls back to comparing commits by their date when they have equal generation number, and so since c49c82aa4c is purely a date comparison function. This heuristic is good enough that we don't seem to loose appreciable performance while computing Bloom filters. Applying this patch (compared with v2.30.0) speeds up computing Bloom filters by factors ranging from 0.40% to 5.19% on various repositories [1]. So, avoid the useless 'commit_graph_generation()' while writing by instead accessing the slab directly. This returns the newly-computed generation numbers, and allows us to avoid the heuristic by directly comparing generation numbers. [1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/ Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-06commit-graph: don't peek into `struct lock_file`Martin Ågren1-3/+3
Similar to the previous commit, avoid peeking into the `struct lock_file`. Use the lock file API instead. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04hash-lookup: rename from sha1-lookupMartin Ågren1-1/+1
Change all remnants of "sha1" in hash-lookup.c and .h and rename them to reflect that we're not just able to handle SHA-1 these days. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04sha1-lookup: rename `sha1_pos()` as `hash_pos()`Martin Ågren1-3/+3
Rename this function to reflect that we're not just able to handle SHA-1 these days. There are a few instances of "sha1" left in sha1-lookup.[ch] after this, but those will be addressed in the next commit. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-07commit-graph: use size_t for array allocation and indexingJeff King1-2/+2
Our packed_commit_list is an array of pointers to commit structs. We use "int" for the allocation, which is 32-bit even on 64-bit platforms. This isn't likely to overflow in practice (we're writing commit graphs, so you'd need to actually have billions of unique commits in the repository). But it's good practice to use size_t for allocations. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-07commit-graph: replace packed_oid_list with oid_arrayJeff King1-47/+15
Our custom packed_oid_list data structure is really just an oid_array in disguise. Let's switch to using the generic structure, which shortens and simplifies the code slightly. There's one slightly awkward part: in the old code we copied a hash straight from the mmap'd on-disk data into the final object_id. And now we'll copy to a temporary oid, which we'll then pass to oid_array_append(). But this is an operation we have to do all over the commit-graph code already, since it mostly uses object_id structs internally. I also measured "git commit-graph --append", which triggers this code path, and it showed no difference. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-07commit-graph: drop count_distinct_commits() functionJeff King1-41/+2
When writing a commit graph, we collect a list of object ids in an array, which we'll eventually copy into an array of "struct commit" pointers. Before we do that, though, we count the number of distinct commit entries. There's a subtle bug in this step, though. We eliminate not only duplicate oids, but also in split mode, any oids which are not commits or which are already in a graph file. However, the loop starts at index 1, always counting index 0 as distinct. And indeed it can't be a duplicate, since we check for those by comparing against the previous entry, and there isn't one for index 0. But it could be a commit that's already in a graph file, and we'd overcount the number of commits by 1 in that case. That turns out not to be a problem, though. The only things we do with the count are: - check if our count will overflow our data structures. But the limit there is 2^31 commits, so while this is a useful check, the off-by-one is not likely to matter. - pre-allocate the array of commit pointers. But over-allocating by one isn't a problem; we'll just waste a few extra bytes. The bug would be easy enough to fix, but we can observe that neither of those steps is necessary. After building the actual commit array, we'll likewise check its count for overflow. So the extra check of the distinct commit count here is redundant. And likewise we use ALLOC_GROW() when building the commit array, so there's no need to preallocate it (it's possible that doing so is slightly more efficient, but if we care we can just optimistically allocate one slot for each oid; I didn't bother here). So count_distinct_commits() isn't doing anything useful. Let's just get rid of that step. Note that a side effect of the function was that we sorted the list of oids, which we do rely on in copy_oids_to_commits(), since it must also skip the duplicates. So we'll move the qsort there. I didn't copy the "TODO" about adding more progress meters. It's actually quite hard to make a repository large enough for this qsort would take an appreciable amount of time, so this doesn't seem like a useful note. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-02Merge branch 'ds/commit-graph-merging-fix'Junio C Hamano1-3/+18
When "git commit-graph" detects the same commit recorded more than once while it is merging the layers, it used to die. The code now ignores all but one of them and continues. * ds/commit-graph-merging-fix: commit-graph: don't write commit-graph when disabled commit-graph: ignore duplicates when merging layers
2020-10-09commit-graph: don't write commit-graph when disabledDerrick Stolee1-0/+5
The core.commitGraph config setting can be set to 'false' to prevent parsing commits from the commit-graph file(s). This causes an issue when trying to write with "--split" which needs to distinguish between commits that are in the existing commit-graph layers and commits that are not. The existing mechanism uses parse_commit() and follows by checking if there is a 'graph_pos' that shows the commit was parsed from the commit-graph file. When core.commitGraph=false, we do not parse the commits from the commit-graph and 'graph_pos' indicates that no commits are in the existing file. The --split logic moves forward creating a new layer on top that holds all reachable commits, then possibly merges down into those layers, resulting in duplicate commits. The previous change makes that merging process more robust to such a situation in case it happens in the written commit-graph data. The easy answer here is to avoid writing a commit-graph if reading the commit-graph is disabled. Since the resulting commit-graph will would not be read by subsequent Git processes. This is more natural than forcing core.commitGraph to be true for the 'write' process. Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de> Helped-by: Jeff King <peff@peff.net> Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-09commit-graph: ignore duplicates when merging layersDerrick Stolee1-3/+13
Thomas reported [1] that a "git fetch" command was failing with an error saying "unexpected duplicate commit id". The root cause is that they had fetch.writeCommitGraph enabled which generates commit-graph chains, and this instance was merging two layers that both contained the same commit ID. [1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/ The initial assumption is that Git would not write a commit ID into a commit-graph layer if it already exists in a lower commit-graph layer. Somehow, this specific case did get into that situation, leading to this error. While unexpected, this isn't actually invalid (as long as the two layers agree on the metadata for the commit). When we parse a commit that does not have a graph_pos in the commit_graph_data_slab, we use binary search in the commit-graph layers to find the commit and set graph_pos. That position is never used again in this case. However, when we parse a commit from the commit-graph file, we load its parents from the commit-graph and assign graph_pos at that point. If those parents were already parsed from the commit-graph, then nothing needs to be done. Otherwise, this graph_pos is a valid position in the commit-graph so we can parse the parents, when necessary. Thus, this die() is too aggressive. The easiest thing to do would be to ignore the duplicates. If we only ignore the duplicates, then we will produce a commit-graph that has identical commit IDs listed in adjacent positions. This excess data will never be removed from the commit-graph, which could cascade into significantly bloated file sizes. Thankfully, we can collapse the list to erase the duplicate commit pointers. This allows us to get the end result we want without extra memory costs and minimal CPU time. The root cause is due to disabling core.commitGraph, which prevents parsing commits from the lower layers during a 'git commit-graph write --split' command. Since we use the 'graph_pos' value to determine whether a commit is in a lower layer, we never discover that those commits are already in the commit-graph chain and add them to the top layer. This layer is then merged down, creating duplicates. The test added in t5324-split-commit-graph.sh fails without this change. However, we still have not completely removed the need for this duplicate check. That will come in a follow-up change. Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de> Helped-by: Taylor Blau <me@ttaylorr.com> Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-29Merge branch 'tb/bloom-improvements'Junio C Hamano1-42/+99
"git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=<n>' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'
2020-09-25Merge branch 'ds/maintenance-part-1'Junio C Hamano1-4/+4
A "git gc"'s big brother has been introduced to take care of more repository maintenance tasks, not limited to the object database cleaning. * ds/maintenance-part-1: maintenance: add trace2 regions for task execution maintenance: add auto condition for commit-graph task maintenance: use pointers to check --auto maintenance: create maintenance.<task>.enabled config maintenance: take a lock on the objects directory maintenance: add --task option maintenance: add commit-graph task maintenance: initialize task array maintenance: replace run_auto_gc() maintenance: add --quiet option maintenance: create basic maintenance runner
2020-09-18builtin/commit-graph.c: introduce '--max-new-filters=<n>'Taylor Blau1-2/+7
Introduce a command-line flag to specify the maximum number of new Bloom filters that a 'git commit-graph write' is willing to compute from scratch. Prior to this patch, a commit-graph write with '--changed-paths' would compute Bloom filters for all selected commits which haven't already been computed (i.e., by a previous commit-graph write with '--split' such that a roll-up or replacement is performed). This behavior can cause prohibitively-long commit-graph writes for a variety of reasons: * There may be lots of filters whose diffs take a long time to generate (for example, they have close to the maximum number of changes, diffing itself takes a long time, etc). * Old-style commit-graphs (which encode filters with too many entries as not having been computed at all) cause us to waste time recomputing filters that appear to have not been computed only to discover that they are too-large. This can make the upper-bound of the time it takes for 'git commit-graph write --changed-paths' to be rather unpredictable. To make this command behave more predictably, introduce '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters from scratch. This lets "computing" already-known filters proceed quickly, while bounding the number of slow tasks that Git is willing to do. Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17commit-graph: rename 'split_commit_graph_opts'Taylor Blau1-20/+20
In the subsequent commit, additional options will be added to the commit-graph API which have nothing to do with splitting. Rename the 'split_commit_graph_opts' structure to the more-generic 'commit_graph_opts' to encompass both. Likewise, rename the 'flags' member to instead be 'split_flags' to clarify that it only has to do with the behavior implied by '--split'. Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17bloom: encode out-of-bounds filters as non-emptyTaylor Blau1-0/+5
When a changed-path Bloom filter has either zero, or more than a certain number (commonly 512) of entries, the commit-graph machinery encodes it as "missing". More specifically, it sets the indices adjacent in the BIDX chunk as equal to each other to indicate a "length 0" filter; that is, that the filter occupies zero bytes on disk. This has heretofore been fine, since the commit-graph machinery has no need to care about these filters with too few or too many changed paths. Both cases act like no filter has been generated at all, and so there is no need to store them. In a subsequent commit, however, the commit-graph machinery will learn to only compute Bloom filters for some commits in the current commit-graph layer. This is a change from the current implementation which computes Bloom filters for all commits that are in the layer being written. Critically for this patch, only computing some of the Bloom filters means adding a third state for length 0 Bloom filters: zero entries, too many entries, or "hasn't been computed". It will be important for that future patch to distinguish between "not representable" (i.e., zero or too-many changed paths), and "hasn't been computed". In particular, we don't want to waste time recomputing filters that have already been computed. To that end, change how we store Bloom filters in the "computed but not representable" category: - Bloom filters with no entries are stored as a single byte with all bits low (i.e., all queries to that Bloom filter will return "definitely not") - Bloom filters with too many entries are stored as a single byte with all bits set high (i.e., all queries to that Bloom filter will return "maybe"). These rules are sufficient to not incur a behavior change by changing the on-disk representation of these two classes. Likewise, no specification changes are necessary for the commit-graph format, either: - Filters that were previously empty will be recomputed and stored according to the new rules, and - old clients reading filters generated by new clients will interpret the filters correctly and be none the wiser to how they were generated. Clients will invoke the Bloom machinery in more cases than before, but this can be addressed by returning a NULL filter when all bits are set high. This can be addressed in a future patch. Note that this does increase the size of on-disk commit-graphs, but far less than other proposals. In particular, this is generally more efficient than storing a bitmap for which commits haven't computed their Bloom filters. Storing a bitmap incurs a penalty of one bit per commit, whereas storing explicit filters as above incurs a penalty of one byte per too-large or empty commit. In practice, these boundary commits likely occupy a small proportion of the overall number of commits, and so the size penalty is likely smaller than storing a bitmap for all commits. See, for example, these relative proportions of such boundary commits (collected by SZEDER Gábor): | Percentage of | commit-graph | | | commits modifying | file size | | ├────────┬──────────────┼───────────────────┤ pct. | | 0 path | >= 512 paths | before | after | change | ┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤ | android-base | 13.20% | 0.13% | 37.468M | 37.534M | +0.1741 % | | cmssw | 0.15% | 0.23% | 17.118M | 17.119M | +0.0091 % | | cpython | 3.07% | 0.01% | 7.967M | 7.971M | +0.0423 % | | elasticsearch | 0.70% | 1.00% | 8.833M | 8.835M | +0.0128 % | | gcc | 0.00% | 0.08% | 16.073M | 16.074M | +0.0030 % | | gecko-dev | 0.14% | 0.64% | 59.868M | 59.874M | +0.0105 % | | git | 0.11% | 0.02% | 3.895M | 3.895M | +0.0020 % | | glibc | 0.02% | 0.10% | 3.555M | 3.555M | +0.0021 % | | go | 0.00% | 0.07% | 3.186M | 3.186M | +0.0018 % | | homebrew-cask | 0.40% | 0.02% | 7.035M | 7.035M | +0.0065 % | | homebrew-core | 0.01% | 0.01% | 11.611M | 11.611M | +0.0002 % | | jdk | 0.26% | 5.64% | 5.537M | 5.540M | +0.0590 % | | linux | 0.01% | 0.51% | 63.735M | 63.740M | +0.0073 % | | llvm-project | 0.12% | 0.03% | 25.515M | 25.516M | +0.0050 % | | rails | 0.10% | 0.10% | 6.252M | 6.252M | +0.0027 % | | rust | 0.07% | 0.17% | 9.364M | 9.364M | +0.0033 % | | tensorflow | 0.09% | 1.02% | 7.009M | 7.010M | +0.0158 % | | webkit | 0.05% | 0.31% | 17.405M | 17.406M | +0.0047 % | (where the above increase is determined by computing a non-split commit-graph before and after this patch). Given that these projects are all "large" by commit count, the storage cost by writing these filters explicitly is negligible. In the most extreme example, android-base (which has 494,848 commits at the time of writing) would have its commit-graph increase by a modest 68.4 KB. Finally, a test to exercise filters which contain too many changed path entries will be introduced in a subsequent patch. Suggested-by: SZEDER Gábor <szeder.dev@gmail.com> Suggested-by: Jakub Narębski <jnareb@gmail.com> Helped-by: Derrick Stolee <dstolee@microsoft.com> Helped-by: SZEDER Gábor <szeder.dev@gmail.com> Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17maintenance: add commit-graph taskDerrick Stolee1-4/+4
The first new task in the 'git maintenance' builtin is the 'commit-graph' task. This updates the commit-graph file incrementally with the command git commit-graph write --reachable --split By writing an incremental commit-graph file using the "--split" option we minimize the disruption from this operation. The default behavior is to merge layers until the new "top" layer is less than half the size of the layer below. This provides quick writes most of the time, with the longer writes following a power law distribution. Most importantly, concurrent Git processes only look at the commit-graph-chain file for a very short amount of time, so they will verly likely not be holding a handle to the file when we try to replace it. (This only matters on Windows.) If a concurrent process reads the old commit-graph-chain file, but our job expires some of the .graph files before they can be read, then those processes will see a warning message (but not fail). This could be avoided by a future update to use the --expire-time argument when writing the commit-graph. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17bloom: use provided 'struct bloom_filter_settings'Taylor Blau1-11/+10
When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter from scratch, it looks to the default 'struct bloom_filter_settings' in order to determine the maximum number of changed paths, number of bits per entry, and so on. All of these values have so far been constant, and so there was no need to pass in a pointer from the caller (eg., the one that is stored in the 'struct write_commit_graph_context'). Start passing in a 'struct bloom_filter_settings *' instead of using the default values to respect graph-specific settings (eg., in the case of setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS'). In order to have an initialized value for these settings, move its initialization to earlier in the commit-graph write. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17bloom: split 'get_bloom_filter()' in twoTaylor Blau1-3/+31
'get_bloom_filter' takes a flag to control whether it will compute a Bloom filter if the requested one is missing. In the next patch, we'll add yet another parameter to this method, which would force all but one caller to specify an extra 'NULL' parameter at the end. Instead of doing this, split 'get_bloom_filter' into two functions: 'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only looks up a Bloom filter (and does not compute one if it's missing, thus dropping the 'compute_if_not_present' flag). The latter does compute missing Bloom filters, with an additional parameter to store whether or not it needed to do so. This simplifies many call-sites, since the majority of existing callers to 'get_bloom_filter' do not want missing Bloom filters to be computed (so they can drop the parameter entirely and use the simpler version of the function). While we're at it, instrument the new 'get_or_compute_bloom_filter()' with counters in the 'write_commit_graph_context' struct which store the number of filters that we did and didn't compute, as well as filters that were truncated. It would be nice to drop the 'compute_if_not_present' flag entirely, since all remaining callers of 'get_or_compute_bloom_filter' pass it as '1', but this will change in a future patch and hence cannot be removed. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17commit-graph.c: store maximum changed pathsTaylor Blau1-0/+4
For now, we assume that there is a fixed constant describing the maximum number of changed paths we are willing to store in a Bloom filter. Prepare for that to (at least partially) not be the case by making it a member of the 'struct bloom_filter_settings'. This will be helpful in the subsequent patches by reducing the size of test cases that exercise storing too many changed paths, as well as preparing for an eventual future in which this value might change. This patch alone does not cause newly generated Bloom filters to use a custom upper-bound on the maximum number of changed paths a single Bloom filter can hold, that will occur in a later patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09commit-graph: respect 'commitGraph.readChangedPaths'Taylor Blau1-2/+4
Git uses the 'core.commitGraph' configuration value to control whether or not the commit graph is used when parsing commits or performing a traversal. Now that commit-graphs can also contain a section for changed-path Bloom filters, administrators that already have commit-graphs may find it convenient to use those graphs without relying on their changed-path Bloom filters. This can happen, for example, during a staged roll-out, or in the event of an incident. Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom filters are read. Note that this configuration is independent from both: - 'core.commitGraph', to allow flexibility in using all parts of a commit-graph _except_ for its Bloom filters. - The '--changed-paths' option for 'git commit-graph write', to allow reading and writing Bloom filters to be controlled independently. When the variable is set, pretend as if no Bloom data was specified at all. This avoids adding additional special-casing outside of the commit-graph internals. Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09commit-graph: pass a 'struct repository *' in more placesTaylor Blau1-7/+10
In a future commit, some commit-graph internals will want access to 'r->settings', but we only have the 'struct object_directory *' corresponding to that repository. Add an additional parameter to pass the repository around in more places. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>