aboutsummaryrefslogtreecommitdiffstats
path: root/fsck.c
AgeCommit message (Collapse)AuthorFilesLines
2024-04-29Sync with 2.44.1Johannes Schindelin1-0/+56
* maint-2.44: (41 commits) Git 2.44.1 Git 2.43.4 Git 2.42.2 Git 2.41.1 Git 2.40.2 Git 2.39.4 fsck: warn about symlink pointing inside a gitdir core.hooksPath: add some protection while cloning init.templateDir: consider this config setting protected clone: prevent hooks from running during a clone Add a helper function to compare file contents init: refactor the template directory discovery into its own function find_hook(): refactor the `STRIP_EXTENSION` logic clone: when symbolic links collide with directories, keep the latter entry: report more colliding paths t5510: verify that D/F confusion cannot lead to an RCE submodule: require the submodule path to contain directories only clone_submodule: avoid using `access()` on directories submodules: submodule paths must not contain symlinks clone: prevent clashing git dirs when cloning submodule in parallel ...
2024-04-19Sync with 2.43.4Johannes Schindelin1-0/+56
* maint-2.43: (40 commits) Git 2.43.4 Git 2.42.2 Git 2.41.1 Git 2.40.2 Git 2.39.4 fsck: warn about symlink pointing inside a gitdir core.hooksPath: add some protection while cloning init.templateDir: consider this config setting protected clone: prevent hooks from running during a clone Add a helper function to compare file contents init: refactor the template directory discovery into its own function find_hook(): refactor the `STRIP_EXTENSION` logic clone: when symbolic links collide with directories, keep the latter entry: report more colliding paths t5510: verify that D/F confusion cannot lead to an RCE submodule: require the submodule path to contain directories only clone_submodule: avoid using `access()` on directories submodules: submodule paths must not contain symlinks clone: prevent clashing git dirs when cloning submodule in parallel t7423: add tests for symlinked submodule directories ...
2024-04-19Sync with 2.42.2Johannes Schindelin1-0/+56
* maint-2.42: (39 commits) Git 2.42.2 Git 2.41.1 Git 2.40.2 Git 2.39.4 fsck: warn about symlink pointing inside a gitdir core.hooksPath: add some protection while cloning init.templateDir: consider this config setting protected clone: prevent hooks from running during a clone Add a helper function to compare file contents init: refactor the template directory discovery into its own function find_hook(): refactor the `STRIP_EXTENSION` logic clone: when symbolic links collide with directories, keep the latter entry: report more colliding paths t5510: verify that D/F confusion cannot lead to an RCE submodule: require the submodule path to contain directories only clone_submodule: avoid using `access()` on directories submodules: submodule paths must not contain symlinks clone: prevent clashing git dirs when cloning submodule in parallel t7423: add tests for symlinked submodule directories has_dir_name(): do not get confused by characters < '/' ...
2024-04-19Sync with 2.41.1Johannes Schindelin1-0/+56
* maint-2.41: (38 commits) Git 2.41.1 Git 2.40.2 Git 2.39.4 fsck: warn about symlink pointing inside a gitdir core.hooksPath: add some protection while cloning init.templateDir: consider this config setting protected clone: prevent hooks from running during a clone Add a helper function to compare file contents init: refactor the template directory discovery into its own function find_hook(): refactor the `STRIP_EXTENSION` logic clone: when symbolic links collide with directories, keep the latter entry: report more colliding paths t5510: verify that D/F confusion cannot lead to an RCE submodule: require the submodule path to contain directories only clone_submodule: avoid using `access()` on directories submodules: submodule paths must not contain symlinks clone: prevent clashing git dirs when cloning submodule in parallel t7423: add tests for symlinked submodule directories has_dir_name(): do not get confused by characters < '/' docs: document security issues around untrusted .git dirs ...
2024-04-19Sync with 2.40.2Johannes Schindelin1-0/+56
* maint-2.40: (39 commits) Git 2.40.2 Git 2.39.4 fsck: warn about symlink pointing inside a gitdir core.hooksPath: add some protection while cloning init.templateDir: consider this config setting protected clone: prevent hooks from running during a clone Add a helper function to compare file contents init: refactor the template directory discovery into its own function find_hook(): refactor the `STRIP_EXTENSION` logic clone: when symbolic links collide with directories, keep the latter entry: report more colliding paths t5510: verify that D/F confusion cannot lead to an RCE submodule: require the submodule path to contain directories only clone_submodule: avoid using `access()` on directories submodules: submodule paths must not contain symlinks clone: prevent clashing git dirs when cloning submodule in parallel t7423: add tests for symlinked submodule directories has_dir_name(): do not get confused by characters < '/' docs: document security issues around untrusted .git dirs upload-pack: disable lazy-fetching by default ...
2024-04-19Sync with 2.39.4Johannes Schindelin1-0/+56
* maint-2.39: (38 commits) Git 2.39.4 fsck: warn about symlink pointing inside a gitdir core.hooksPath: add some protection while cloning init.templateDir: consider this config setting protected clone: prevent hooks from running during a clone Add a helper function to compare file contents init: refactor the template directory discovery into its own function find_hook(): refactor the `STRIP_EXTENSION` logic clone: when symbolic links collide with directories, keep the latter entry: report more colliding paths t5510: verify that D/F confusion cannot lead to an RCE submodule: require the submodule path to contain directories only clone_submodule: avoid using `access()` on directories submodules: submodule paths must not contain symlinks clone: prevent clashing git dirs when cloning submodule in parallel t7423: add tests for symlinked submodule directories has_dir_name(): do not get confused by characters < '/' docs: document security issues around untrusted .git dirs upload-pack: disable lazy-fetching by default fetch/clone: detect dubious ownership of local repositories ...
2024-04-19fsck: warn about symlink pointing inside a gitdirJohannes Schindelin1-0/+56
In the wake of fixing a vulnerability where `git clone` mistakenly followed a symbolic link that it had just written while checking out files, writing into a gitdir, let's add some defense-in-depth by teaching `git fsck` to report symbolic links stored in its trees that point inside `.git/`. Even though the Git project never made any promises about the exact shape of the `.git/` directory's contents, there are likely repositories out there containing symbolic links that point inside the gitdir. For that reason, let's only report these as warnings, not as errors. Security-conscious users are encouraged to configure `fsck.symlinkPointsToGitDir = error`. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-03-28Merge branch 'eb/hash-transition'Junio C Hamano1-2/+4
Work to support a repository that work with both SHA-1 and SHA-256 hash algorithms has started. * eb/hash-transition: (30 commits) t1016-compatObjectFormat: add tests to verify the conversion between objects t1006: test oid compatibility with cat-file t1006: rename sha1 to oid test-lib: compute the compatibility hash so tests may use it builtin/ls-tree: let the oid determine the output algorithm object-file: handle compat objects in check_object_signature tree-walk: init_tree_desc take an oid to get the hash algorithm builtin/cat-file: let the oid determine the output algorithm rev-parse: add an --output-object-format parameter repository: implement extensions.compatObjectFormat object-file: update object_info_extended to reencode objects object-file-convert: convert commits that embed signed tags object-file-convert: convert commit objects when writing object-file-convert: don't leak when converting tag objects object-file-convert: convert tag objects when writing object-file-convert: add a function to convert trees between algorithms object: factor out parse_mode out of fast-import and tree-walk into in object.h cache: add a function to read an OID of a specific algorithm tag: sign both hashes commit: export add_header_signature to support handling signatures on tags ...
2024-02-08Merge branch 'en/header-cleanup' into maint-2.43Junio C Hamano1-1/+0
Remove unused header "#include". * en/header-cleanup: treewide: remove unnecessary includes in source files treewide: add direct includes currently only pulled in transitively trace2/tr2_tls.h: remove unnecessary include submodule-config.h: remove unnecessary include pkt-line.h: remove unnecessary include line-log.h: remove unnecessary include http.h: remove unnecessary include fsmonitor--daemon.h: remove unnecessary includes blame.h: remove unnecessary includes archive.h: remove unnecessary include treewide: remove unnecessary includes in source files treewide: remove unnecessary includes from header files
2024-01-26Merge branch 'vd/fsck-submodule-url-test'Junio C Hamano1-133/+0
Tighten URL checks fsck makes in a URL recorded for submodules. * vd/fsck-submodule-url-test: submodule-config.c: strengthen URL fsck check t7450: test submodule urls test-submodule: remove command line handling for check-name submodule-config.h: move check_submodule_url
2024-01-18submodule-config.h: move check_submodule_urlVictoria Dye1-133/+0
Move 'check_submodule_url' out of 'fsck.c' and into 'submodule-config.h' as a public method, similar to 'check_submodule_name'. With the function now accessible outside of 'fsck', it can be used in a later commit to extend 'test-tool submodule' to check the validity of submodule URLs as it does with names in the 'check-name' subcommand. Other than its location, no changes are made to 'check_submodule_url' in this patch. Signed-off-by: Victoria Dye <vdye@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-08Merge branch 'en/header-cleanup'Junio C Hamano1-1/+0
Remove unused header "#include". * en/header-cleanup: treewide: remove unnecessary includes in source files treewide: add direct includes currently only pulled in transitively trace2/tr2_tls.h: remove unnecessary include submodule-config.h: remove unnecessary include pkt-line.h: remove unnecessary include line-log.h: remove unnecessary include http.h: remove unnecessary include fsmonitor--daemon.h: remove unnecessary includes blame.h: remove unnecessary includes archive.h: remove unnecessary include treewide: remove unnecessary includes in source files treewide: remove unnecessary includes from header files
2023-12-26treewide: remove unnecessary includes in source filesElijah Newren1-1/+0
Each of these were checked with gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE} to ensure that removing the direct inclusion of the header actually resulted in that header no longer being included at all (i.e. that no other header pulled it in transitively). ...except for a few cases where we verified that although the header was brought in transitively, nothing from it was directly used in that source file. These cases were: * builtin/credential-cache.c * builtin/pull.c * builtin/send-pack.c Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09fsck: handle NULL value when parsing message configJeff King1-2/+6
When parsing fsck.*, receive.fsck.*, or fetch.fsck.*, we don't check for an implicit bool. So any of: [fsck] badTree [receive "fsck"] badTree [fetch "fsck"] badTree will cause us to segfault. We can fix it with config_error_nonbool() in the usual way, but we have to make a few more changes to get good error messages. The problem is that all three spots do: if (skip_prefix(var, "fsck.", &var)) to match and parse the actual message id. But that means that "var" now just says "badTree" instead of "receive.fsck.badTree", making the resulting message confusing. We can fix that by storing the parsed message id in its own separate variable. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02tree-walk: init_tree_desc take an oid to get the hash algorithmEric W. Biederman1-2/+4
To make it possible for git ls-tree to display the tree encoded in the hash algorithm of the oid specified to git ls-tree, update init_tree_desc to take as a parameter the oid of the tree object. Update all callers of init_tree_desc and init_tree_desc_gently to pass the oid of the tree object. Use the oid of the tree object to discover the hash algorithm of the oid and store that hash algorithm in struct tree_desc. Use the hash algorithm in decode_tree_entry and update_tree_entry_internal to handle reading a tree object encoded in a hash algorithm that differs from the repositories hash algorithm. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-31fsck: detect very large tree pathnamesJeff King1-1/+23
In general, Git tries not to arbitrarily limit what it will store, and there are currently no limits at all on the size of the path we find in a tree. In theory you could have one that is gigabytes long. But in practice this freedom is not really helping anybody, and is potentially harmful: 1. Most operating systems have much lower limits for the size of a single pathname component (e.g., on Linux you'll generally get ENAMETOOLONG for anything over 255 bytes). And while you _can_ use Git in a way that never touches the filesystem (manipulating the index and trees directly), it's still probably not a good idea to have gigantic tree names. Many operations load and traverse them, so any clever Git-as-a-database scheme is likely to perform poorly in that case. 2. We still have a lot of code which assumes strings are reasonably sized, and I won't be at all surprised if you can trigger some interesting integer overflows with gigantic pathnames. Stopping malicious trees from entering the repository provides an extra line of defense, protecting downstream code. This patch implements an fsck check so that such trees can be rejected by transfer.fsckObjects. I've picked a reasonably high maximum depth here (4096) that hopefully should not bother anybody in practice. I've also made it configurable, as an escape hatch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-25Merge branch 'jk/unused-parameter'Junio C Hamano1-2/+2
Mark-up unused parameters in the code so that we can eventually enable -Wunused-parameter by default. * jk/unused-parameter: t/helper: mark unused callback void data parameters tag: mark unused parameters in each_tag_name_fn callbacks rev-parse: mark unused parameter in for_each_abbrev callback replace: mark unused parameter in each_mergetag_fn callback replace: mark unused parameter in ref callback merge-tree: mark unused parameter in traverse callback fsck: mark unused parameters in various fsck callbacks revisions: drop unused "opt" parameter in "tweak" callbacks count-objects: mark unused parameter in alternates callback am: mark unused keep_cr parameters http-push: mark unused parameter in xml callback http: mark unused parameters in curl callbacks do_for_each_ref_helper(): mark unused repository parameter test-ref-store: drop unimplemented reflog-expire command
2023-07-17Merge branch 'cw/compat-util-header-cleanup'Junio C Hamano1-1/+0
Further shuffling of declarations across header files to streamline file dependencies. * cw/compat-util-header-cleanup: git-compat-util: move alloc macros to git-compat-util.h treewide: remove unnecessary includes for wrapper.h kwset: move translation table from ctype sane-ctype.h: create header for sane-ctype macros git-compat-util: move wrapper.c funcs to its header git-compat-util: move strbuf.c funcs to its header
2023-07-13fsck: mark unused parameters in various fsck callbacksJeff King1-2/+2
There are a few callback functions which are used with the fsck code, but it's natural that not all callbacks need all parameters. For reporting, even something as obvious as "the oid of the object which had a problem" is not always used, as some callers are only checking a single object in the first place. And for both reporting and walking, things like void data pointers and the fsck_options aren't always necessary. But since each such parameter is used by _some_ callback, we have to keep them in the interface. Mark the unused ones in specific callbacks to avoid triggering -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-06Merge branch 'gc/config-context'Junio C Hamano1-4/+8
Reduce reliance on a global state in the config reading API. * gc/config-context: config: pass source to config_parser_event_fn_t config: add kvi.path, use it to evaluate includes config.c: remove config_reader from configsets config: pass kvi to die_bad_number() trace2: plumb config kvi config.c: pass ctx with CLI config config: pass ctx with config files config.c: pass ctx in configsets config: add ctx arg to config_fn_t urlmatch.h: use config_fn_t type config: inline git_color_default_config
2023-07-05git-compat-util: move alloc macros to git-compat-util.hCalvin Wan1-1/+0
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for dynamic array allocation. Moving these macros to git-compat-util.h with the other alloc macros focuses alloc.[ch] to allocation for Git objects and additionally allows us to remove inclusions to alloc.h from files that solely used the above macros. Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-28config: pass ctx with config filesGlen Choo1-1/+2
Pass config_context to config_callbacks when parsing config files. To provide the .kvi member, refactor out the configset logic that caches "struct config_source" and "enum config_scope" as a "struct key_value_info". Make the "enum config_scope" available to the config file machinery by plumbing an additional arg through git_config_from_file_with_options(). We do not exercise ctx yet because the remaining current_config_*() callers may be used with config_with_options(), which may read config from parameters, but parameters don't pass ctx yet. Signed-off-by: Glen Choo <chooglen@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-28config: add ctx arg to config_fn_tGlen Choo1-3/+6
Add a new "const struct config_context *ctx" arg to config_fn_t to hold additional information about the config iteration operation. config_context has a "struct key_value_info kvi" member that holds metadata about the config source being read (e.g. what kind of config source it is, the filename, etc). In this series, we're only interested in .kvi, so we could have just used "struct key_value_info" as an arg, but config_context makes it possible to add/adjust members in the future without changing the config_fn_t signature. We could also consider other ways of organizing the args (e.g. moving the config name and value into config_context or key_value_info), but in my experiments, the incremental benefit doesn't justify the added complexity (e.g. a config_fn_t will sometimes invoke another config_fn_t but with a different config value). In subsequent commits, the .kvi member will replace the global "struct config_reader" in config.c, making config iteration a global-free operation. It requires much more work for the machinery to provide meaningful values of .kvi, so for now, merely change the signature and call sites, pass NULL as a placeholder value, and don't rely on the arg in any meaningful way. Most of the changes are performed by contrib/coccinelle/config_fn_ctx.pending.cocci, which, for every config_fn_t: - Modifies the signature to accept "const struct config_context *ctx" - Passes "ctx" to any inner config_fn_t, if needed - Adds UNUSED attributes to "ctx", if needed Most config_fn_t instances are easily identified by seeing if they are called by the various config functions. Most of the remaining ones are manually named in the .cocci patch. Manual cleanups are still needed, but the majority of it is trivial; it's either adjusting config_fn_t that the .cocci patch didn't catch, or adding forward declarations of "struct config_context ctx" to make the signatures make sense. The non-trivial changes are in cases where we are invoking a config_fn_t outside of config machinery, and we now need to decide what value of "ctx" to pass. These cases are: - trace2/tr2_cfg.c:tr2_cfg_set_fl() This is indirectly called by git_config_set() so that the trace2 machinery can notice the new config values and update its settings using the tr2 config parsing function, i.e. tr2_cfg_cb(). - builtin/checkout.c:checkout_main() This calls git_xmerge_config() as a shorthand for parsing a CLI arg. This might be worth refactoring away in the future, since git_xmerge_config() can call git_default_config(), which can do much more than just parsing. Handle them by creating a KVI_INIT macro that initializes "struct key_value_info" to a reasonable default, and use that to construct the "ctx" arg. Signed-off-by: Glen Choo <chooglen@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21object-store-ll.h: split this header out of object-store.hElijah Newren1-1/+2
The vast majority of files including object-store.h did not need dir.h nor khash.h. Split the header into two files, and let most just depend upon object-store-ll.h, while letting the two callers that need it depend on the full object-store.h. After this patch: $ git grep -h include..object-store | sort | uniq -c 2 #include "object-store.h" 129 #include "object-store-ll.h" Diff best viewed with `--color-moved`. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21repository: remove unnecessary include of path.hElijah Newren1-0/+1
This also made it clear that several .c files that depended upon path.h were missing a #include for it; add the missing includes while at it. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-24commit.h: reduce unnecessary includesElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-11treewide: remove cache.h inclusion due to object.h changesElijah Newren1-1/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-04Merge branch 'ab/remove-implicit-use-of-the-repository' into ↵Junio C Hamano1-3/+3
en/header-split-cache-h * ab/remove-implicit-use-of-the-repository: libs: use "struct repository *" argument, not "the_repository" post-cocci: adjust comments for recent repo_* migration cocci: apply the "revision.h" part of "the_repository.pending" cocci: apply the "rerere.h" part of "the_repository.pending" cocci: apply the "refs.h" part of "the_repository.pending" cocci: apply the "promisor-remote.h" part of "the_repository.pending" cocci: apply the "packfile.h" part of "the_repository.pending" cocci: apply the "pretty.h" part of "the_repository.pending" cocci: apply the "object-store.h" part of "the_repository.pending" cocci: apply the "diff.h" part of "the_repository.pending" cocci: apply the "commit.h" part of "the_repository.pending" cocci: apply the "commit-reach.h" part of "the_repository.pending" cocci: apply the "cache.h" part of "the_repository.pending" cocci: add missing "the_repository" macros to "pending" cocci: sort "the_repository" rules by header cocci: fix incorrect & verbose "the_repository" rules cocci: remove dead rule from "the_repository.pending.cocci"
2023-03-28cocci: apply the "object-store.h" part of "the_repository.pending"Ævar Arnfjörð Bjarmason1-1/+1
Apply the part of "the_repository.pending.cocci" pertaining to "object-store.h". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-28cocci: apply the "commit.h" part of "the_repository.pending"Ævar Arnfjörð Bjarmason1-2/+2
Apply the part of "the_repository.pending.cocci" pertaining to "commit.h". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21treewide: remove unnecessary cache.h inclusion from a few headersElijah Newren1-1/+1
Ever since a64215b6cd ("object.h: stop depending on cache.h; make cache.h depend on object.h", 2023-02-24), we have a few headers that could have replaced their include of cache.h with an include of object.h. Make that change now. Some C files had to start including cache.h after this change (or some smaller header it had brought in), because the C files were depending on things from cache.h but were only formerly implicitly getting cache.h through one of these headers being modified in this patch. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23cache.h: remove dependence on hex.h; make other files include it explicitlyElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23alloc.h: move ALLOC_GROW() functions from cache.hElijah Newren1-1/+2
This allows us to replace includes of cache.h with includes of the much smaller alloc.h in many places. It does mean that we also need to add includes of alloc.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-19fsck: do not assume NUL-termination of buffersJeff King1-13/+54
The fsck code operates on an object buffer represented as a pointer/len combination. However, the parsing of commits and tags is a little bit loose; we mostly scan left-to-right through the buffer, without checking whether we've gone past the length we were given. This has traditionally been OK because the buffers we feed to fsck always have an extra NUL after the end of the object content, which ends any left-to-right scan. That has always been true for objects we read from the odb, and we made it true for incoming index-pack/unpack-objects checks in a1e920a0a7 (index-pack: terminate object buffers with NUL, 2014-12-08). However, we recently added an exception: hash-object asks index_fd() to do fsck checks. That _may_ have an extra NUL (if we read from a pipe into a strbuf), but it might not (if we read the contents from the file). Nor can we just teach it to always add a NUL. We may mmap the on-disk file, which will not have any extra bytes (if it's a multiple of the page size). Not to mention that this is a rather subtle assumption for the fsck code to make. Instead, let's make sure that the fsck parsers don't ever look past the size of the buffer they've been given. This _almost_ works already, thanks to earlier work in 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the buffer, 2014-09-11). The theory there is that we check up front whether we have the end of header double-newline separator. And then any left-to-right scanning we do is OK as long as it stops when it hits that boundary. However, we later softened that in 84d18c0bcf (fsck: it is OK for a tag and a commit to lack the body, 2015-06-28), which allows the double-newline header to be missing, but does require that the header ends in a newline. That was OK back then, because of the NUL-termination guarantees (including the one from a1e920a0a7 mentioned above). Because 84d18c0bcf guarantees that any header line does end in a newline, we are still OK with most of the left-to-right scanning. We only need to take care after completing a line, to check that there is another line (and we didn't run out of buffer). Most of these checks are just need to check "buffer < buffer_end" (where buffer is advanced as we parse) before scanning for the next header line. But here are a few notes: - we don't technically need to check for remaining buffer before parsing the very first line ("tree" for a commit, or "object" for a tag), because verify_headers() rejects a totally empty buffer. But we'll do so in the name of consistency and defensiveness. - there are some calls to strchr('\n'). These are actually OK by the "the final header line must end in a newline" guarantee from verify_headers(). They will always find that rather than run off the end of the buffer. Curiously, they do check for a NULL return and complain, but I believe that condition can never be reached. However, I converted them to use memchr() with a proper size and retained the NULL checks. Using memchr() is not much longer and makes it more obvious what is going on. Likewise, retaining the NULL checks serves as a defensive measure in case my analysis is wrong. - commit 9a1a3a4d4c (mktag: allow omitting the header/body \n separator, 2021-01-05), does check for the end-of-buffer condition, but does so with "!*buffer", relying explicitly on the NUL termination. We can accomplish the same thing with a pointer comparison. I also folded it into the follow-on conditional that checks the contents of the buffer, for consistency with the other checks. - fsck_ident() uses parse_timestamp(), which is based on strtoumax(). That function will happily skip past leading whitespace, including newlines, which makes it a risk. We can fix this by scanning to the first digit ourselves, and then using parse_timestamp() to do the actual numeric conversion. Note that as a side effect this fixes the fact that we missed zero-padded timestamps like "<email> 0123" (whereas we would complain about "<email> 0123"). I doubt anybody cares, but I mention it here for completeness. - fsck_tree() does not need any modifications. It relies on decode_tree_entry() to do the actual parsing, and that function checks both that there are enough bytes in the buffer to represent an entry, and that there is a NUL at the appropriate spot (one hash-length from the end; this may not be the NUL for the entry we are parsing, but we know that in the worst case, everything from our current position to that NUL is a filename, so we won't run out of bytes). In addition to fixing the code itself, we'd like to make sure our rather subtle assumptions are not violated in the future. So this patch does two more things: - add comments around verify_headers() documenting the link between what it checks and the memory safety of the callers. I don't expect this code to be modified frequently, but this may help somebody from accidentally breaking things. - add a thorough set of tests covering truncations at various key spots (e.g., for a "tree $oid" line, in the middle of the word "tree", right after it, after the space, in the middle of the $oid, and right at the end of the line. Most of these are fine already (it is only truncating right at the end of the line that is currently broken). And some of them are not even possible with the current code (we parse "tree " as a unit, so truncating before the space is equivalent). But I aimed here to consider the code a black box and look for any truncations that would be a problem for a left-to-right parser. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-18fsck: provide a function to fsck buffer without object structJeff King1-11/+18
The fsck code has been slowly moving away from requiring an object struct in commits like 103fb6d43b (fsck: accept an oid instead of a "struct tag" for fsck_tag(), 2019-10-18), c5b4269b57 (fsck: accept an oid instead of a "struct commit" for fsck_commit(), 2019-10-18), etc. However, the only external interface that fsck.c provides is fsck_object(), which requires an object struct, then promptly discards everything except its oid and type. Let's factor out the post-discard part of that function as fsck_buffer(), leaving fsck_object() as a thin wrapper around it. That will provide more flexibility for callers which may not have a struct. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-12-13Merge branch 'maint-2.36' into maint-2.37Junio C Hamano1-43/+94
2022-12-09fsck: implement checks for gitattributesPatrick Steinhardt1-1/+37
Recently, a vulnerability was reported that can lead to an out-of-bounds write when reading an unreasonably large gitattributes file. The root cause of this error are multiple integer overflows in different parts of the code when there are either too many lines, when paths are too long, when attribute names are too long, or when there are too many attributes declared for a pattern. As all of these are related to size, it seems reasonable to restrict the size of the gitattributes file via git-fsck(1). This allows us to both stop distributing known-vulnerable objects via common hosting platforms that have fsck enabled, and users to protect themselves by enabling the `fetch.fsckObjects` config. There are basically two checks: 1. We verify that size of the gitattributes file is smaller than 100MB. 2. We verify that the maximum line length does not exceed 2048 bytes. With the preceding commits, both of these conditions would cause us to either ignore the complete gitattributes file or blob in the first case, or the specific line in the second case. Now with these consistency checks added, we also grow the ability to stop distributing such files in the first place when `receive.fsckObjects` is enabled. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-12-09fsck: move checks for gitattributesPatrick Steinhardt1-5/+7
Move the checks for gitattributes so that they can be extended more readily. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-12-09fsck: pull out function to check a set of blobsPatrick Steinhardt1-12/+22
In `fsck_finish()` we check all blobs for consistency that we have found during the tree walk, but that haven't yet been checked. This is only required for gitmodules right now, but will also be required for a new check for gitattributes. Pull out a function `fsck_blobs()` that allows the caller to check a set of blobs for consistency. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-12-09fsck: refactor `fsck_blob()` to allow for more checksPatrick Steinhardt1-26/+29
In general, we don't need to validate blob contents as they are opaque blobs about whose content Git doesn't need to care about. There are some exceptions though when blobs are linked into trees so that they would be interpreted by Git. We only have a single such check right now though, which is the one for gitmodules that has been added in the context of CVE-2018-11235. Now we have found another vulnerability with gitattributes that can lead to out-of-bounds writes and reads. So let's refactor `fsck_blob()` so that it is more extensible and can check different types of blobs. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-10fsck: actually detect bad file modes in treesJeff King1-1/+1
We use the normal tree_desc code to iterate over trees in fsck, meaning we only see the canonicalized modes it returns. And hence we'd never see anything unexpected, since it will coerce literally any garbage into one of our normal and accepted modes. We can use the new RAW_MODES flag to see the real modes, and then use the existing code to actually analyze them. The existing code is written as allow-known-good, so there's not much point in testing a variety of breakages. The one tested here should be S_IFREG but with nonsense permissions. Do note that the error-reporting here isn't great. We don't mention the specific bad mode, but just that the tree has one or more broken modes. But when you go to look at it with "git ls-tree", we'll report the canonicalized mode! This isn't ideal, but given that this should come up rarely, and that any number of other tree corruptions might force you into looking at the binary bytes via "cat-file", it's not the end of the world. And it's something we can improve on top later if we choose. Reported-by: Xavier Morel <xavier.morel@masklinn.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-10tree-walk: add a mechanism for getting non-canonicalized modesJeff King1-2/+2
When using init_tree_desc() and tree_entry() to iterate over a tree, we always canonicalize the modes coming out of the tree. This is a good thing to prevent bugs or oddities in normal code paths, but it's counter-productive for tools like fsck that want to see the exact contents. We can address this by adding an option to avoid the extra canonicalization. A few notes on the implementation: - I've attached the new option to the tree_desc struct itself. The actual code change is in decode_tree_entry(), which is in turn called by the public update_tree_entry(), tree_entry(), and init_tree_desc() functions, plus their "gently" counterparts. By letting it ride along in the struct, we can avoid changing the signature of those functions, which are called many times. Plus it's conceptually simpler: you really want a particular iteration of a tree to be "raw" or not, rather than individual calls. - We still have to set the new option somewhere. The struct is initialized by init_tree_desc(). I added the new flags field only to the "gently" version. That avoids disturbing the much more numerous non-gentle callers, and it makes sense that anybody being careful about looking at raw modes would also be careful about bogus trees (i.e., the caller will be something like fsck in the first place). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-16dir API: add a generalized path_match_flags() functionÆvar Arnfjörð Bjarmason1-17/+6
Add a path_match_flags() function and have the two sets of starts_with_dot_{,dot_}slash() functions added in 63e95beb085 (submodule: port resolve_relative_url from shell to C, 2016-04-15) and a2b26ffb1a8 (fsck: convert gitmodules url to URL passed to curl, 2020-04-18) be thin wrappers for it. As the latter of those notes the fsck version was copied from the initial builtin/submodule--helper.c version. Since the code added in a2b26ffb1a8 was doing really doing the same as win32_is_dir_sep() added in 1cadad6f658 (git clone <url> C:\cygwin\home\USER\repo' is working (again), 2018-12-15) let's move the latter to git-compat-util.h is a is_xplatform_dir_sep(). We can then call either it or the platform-specific is_dir_sep() from this new function. Let's likewise change code in various other places that was hardcoding checks for "'/' || '\\'" with the new is_xplatform_dir_sep(). As can be seen in those callers some of them still concern themselves with ':' (Mac OS classic?), but let's leave the question of whether that should be consolidated for some other time. As we expect to make wider use of the "native" case in the future, define and use two starts_with_dot_{,dot_}slash_native() convenience wrappers. This makes the diff in builtin/submodule--helper.c much smaller. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-04fsck: warn about symlinked dotfiles we'll open with O_NOFOLLOWJeff King1-0/+18
In the commits merged in via 204333b015 (Merge branch 'jk/open-dotgitx-with-nofollow', 2021-03-22), we stopped following symbolic links for .gitattributes, .gitignore, and .mailmap files. Let's teach fsck to warn that these symlinks are not going to do anything. Note that this is just a warning, and won't block the objects via transfer.fsckObjects, since there are reported to be cases of this in the wild (and even once fixed, they will continue to exist in the commit history of those projects, but are not particularly dangerous). Note that we won't add these to the existing gitmodules block in the fsck code. The logic for gitmodules is a bit more complicated, as we also check the content of non-symlink instances we find. But for these new files, there is no content check; we're just looking at the name and mode of the tree entry (and we can avoid even the complicated name checks in the common case that the mode doesn't indicate a symlink). We can reuse the test helper function we defined for .gitmodules, though (it needs some slight adjustments for the fsck error code, and because we don't block these symlinks via verify_path()). Note that I didn't explicitly test the transfer.fsckObjects case here (nor does the existing .gitmodules test that it blocks a push). The translation of fsck severities to outcomes is covered in general in t5504. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-03fsck_tree(): wrap some long linesJeff King1-12/+36
Many calls to report() in fsck_tree() are kept on a single line and are quite long. Most were pretty big to begin with, but have gotten even longer over the years as we've added more parameters. Let's accept the churn of wrapping them in order to conform to our usual line limits. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-03fsck_tree(): fix shadowed variableJeff King1-20/+22
Commit b2f2039c2b (fsck: accept an oid instead of a "struct tree" for fsck_tree(), 2019-10-18) introduced a new "oid" parameter to fsck_tree(), and we pass it to the report() function when we find problems. However, that is shadowed within the tree-walking loop by the existing "oid" variable which we use to store the oid of each tree entry. As a result, we may report the wrong oid for some problems we detect within the loop (the entry oid, instead of the tree oid). Our tests didn't catch this because they checked only that we found the expected fsck problem, not that it was attached to the correct object. Let's rename both variables in the function to avoid confusion. This makes the diff a little noisy (e.g., all of the report() calls outside the loop were already correct but need to be touched), but makes sure we catch all cases and will avoid similar confusion in the future. And we can update the test to be a bit more specific and catch this problem. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fetch-pack: use new fsck API to printing dangling submodulesÆvar Arnfjörð Bjarmason1-5/+18
Refactor the check added in 5476e1efde (fetch-pack: print and use dangling .gitmodules, 2021-02-22) to make use of us now passing the "msg_id" to the user defined "error_func". We can now compare against the FSCK_MSG_GITMODULES_MISSING instead of parsing the generated message. Let's also replace register_found_gitmodules() with directly manipulating the "gitmodules_found" member. A recent commit moved it into "fsck_options" so we could do this here. I'm sticking this callback in fsck.c. Perhaps in the future we'd like to accumulate such callbacks into another file (maybe fsck-cb.c, similar to parse-options-cb.c?), but while we've got just the one let's just put it into fsck.c. A better alternative in this case would be some library some more obvious library shared by fetch-pack.c ad builtin/index-pack.c, but there isn't such a thing. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: move gitmodules_{found,done} into fsck_optionsÆvar Arnfjörð Bjarmason1-13/+10
Move the gitmodules_{found,done} static variables added in 159e7b080bf (fsck: detect gitmodules files, 2018-05-02) into the fsck_options struct. It makes sense to keep all the context in the same place. This requires changing the recently added register_found_gitmodules() function added in 5476e1efde (fetch-pack: print and use dangling .gitmodules, 2021-02-22) to take fsck_options. That function will be removed in a subsequent commit, but as it'll require the new gitmodules_found attribute of "fsck_options" we need this intermediate step first. An earlier version of this patch removed the small amount of duplication we now have between FSCK_OPTIONS_{DEFAULT,STRICT} with a FSCK_OPTIONS_COMMON macro. I don't think such de-duplication is worth it for this amount of copy/pasting. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: add an fsck_set_msg_type() API that takes enumsÆvar Arnfjörð Bjarmason1-10/+17
Change code I added in acf9de4c94e (mktag: use fsck instead of custom verify_tag(), 2021-01-05) to make use of a new API function that takes the fsck_msg_{id,type} types, instead of arbitrary strings that we'll (hopefully) parse into those types. At the time that the fsck_set_msg_type() API was introduced in 0282f4dced0 (fsck: offer a function to demote fsck errors to warnings, 2015-06-22) it was only intended to be used to parse user-supplied data. For things that are purely internal to the C code it makes sense to have the compiler check these arguments, and to skip the sanity checking of the data in fsck_set_msg_type() which is redundant to checks we get from the compiler. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: pass along the fsck_msg_id in the fsck_error callbackÆvar Arnfjörð Bjarmason1-2/+4
Change the fsck_error callback to also pass along the fsck_msg_id. Before this change the only way to get the message id was to parse it back out of the "message". Let's pass it down explicitly for the benefit of callers that might want to use it, as discussed in [1]. Passing the msg_type is now redundant, as you can always get it back from the msg_id, but I'm not changing that convention. It's really common to need the msg_type, and the report() function itself (which calls "fsck_error") needs to call fsck_msg_type() to discover it. Let's not needlessly re-do that work in the user callback. 1. https://lore.kernel.org/git/87blcja2ha.fsf@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.[ch]: move FOREACH_FSCK_MSG_ID & fsck_msg_id from *.c to *.hÆvar Arnfjörð Bjarmason1-66/+0
Move the FOREACH_FSCK_MSG_ID macro and the fsck_msg_id enum it helps define from fsck.c to fsck.h. This is in preparation for having non-static functions take the fsck_msg_id as an argument. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: give "FOREACH_MSG_ID" a more specific nameÆvar Arnfjörð Bjarmason1-3/+3
Rename the FOREACH_MSG_ID macro to FOREACH_FSCK_MSG_ID in preparation for moving it over to fsck.h. It's good convention to name macros in *.h files in such a way as to clearly not clash with any other names in other files. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: undefine temporary STR macro after useÆvar Arnfjörð Bjarmason1-0/+1
In f417eed8cde (fsck: provide a function to parse fsck message IDs, 2015-06-22) the "STR" macro was introduced, but that short macro name was not undefined after use as was done earlier in the same series for the MSG_ID macro in c99ba492f1c (fsck: introduce identifiers for fsck messages, 2015-06-22). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: call parse_msg_type() early in fsck_set_msg_type()Ævar Arnfjörð Bjarmason1-2/+1
There's no reason to defer the calling of parse_msg_type() until after we've checked if the "id < 0". This is not a hot codepath, and parse_msg_type() itself may die on invalid input. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.h: move FSCK_{FATAL,INFO,ERROR,WARN,IGNORE} into an enumÆvar Arnfjörð Bjarmason1-11/+10
Move the FSCK_{FATAL,INFO,ERROR,WARN,IGNORE} defines into a new fsck_msg_type enum. These defines were originally introduced in: - ba002f3b28a (builtin-fsck: move common object checking code to fsck.c, 2008-02-25) - f50c4407305 (fsck: disallow demoting grave fsck errors to warnings, 2015-06-22) - efaba7cc77f (fsck: optionally ignore specific fsck issues completely, 2015-06-22) - f27d05b1704 (fsck: allow upgrading fsck warnings to errors, 2015-06-22) The reason these were defined in two different places is because we use FSCK_{IGNORE,INFO,FATAL} only in fsck.c, but FSCK_{ERROR,WARN} are used by external callbacks. Untangling that would take some more work, since we expose the new "enum fsck_msg_type" to both. Similar to "enum object_type" it's not worth structuring the API in such a way that only those who need FSCK_{ERROR,WARN} pass around a different type. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: refactor fsck_msg_type() to limit scope of "int msg_type"Ævar Arnfjörð Bjarmason1-7/+5
Refactor "if options->msg_type" and other code added in 0282f4dced0 (fsck: offer a function to demote fsck errors to warnings, 2015-06-22) to reduce the scope of the "int msg_type" variable. This is in preparation for changing its type in a subsequent commit, only using it in the "!options->msg_type" scope makes that change This also brings the code in line with the fsck_set_msg_type() function (also added in 0282f4dced0), which does a similar check for "!options->msg_type". Another minor benefit is getting rid of the style violation of not having braces for the body of the "if". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: rename remaining fsck_msg_id "id" to "msg_id"Ævar Arnfjörð Bjarmason1-3/+3
Rename the remaining variables of type fsck_msg_id from "id" to "msg_id". This change is relatively small, and is worth the churn for a later change where we have different id's in the "report" function. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: remove (mostly) redundant append_msg_id() functionÆvar Arnfjörð Bjarmason1-19/+2
Remove the append_msg_id() function in favor of calling prepare_msg_ids(). We already have code to compute the camel-cased msg_id strings in msg_id_info, let's use it. When the append_msg_id() function was added in 71ab8fa840f (fsck: report the ID of the error/warning, 2015-06-22) the prepare_msg_ids() function didn't exist. When prepare_msg_ids() was added in a46baac61eb (fsck: factor out msg_id_info[] lazy initialization code, 2018-05-26) this code wasn't moved over to lazy initialization. This changes the behavior of the code to initialize all the messages instead of just camel-casing the one we need on the fly. Since the common case is that we're printing just one message this is mostly redundant work. But that's OK in this case, reporting this fsck issue to the user isn't performance-sensitive. If we were somehow doing so in a tight loop (in a hopelessly broken repository?) this would help, since we'd save ourselves from re-doing this work for identical messages, we could just grab the prepared string from msg_id_info after the first invocation. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28fsck.c: rename variables in fsck_set_msg_type() for less confusionÆvar Arnfjörð Bjarmason1-12/+12
Rename variables in a function added in 0282f4dced0 (fsck: offer a function to demote fsck errors to warnings, 2015-06-22). It was needlessly confusing that it took a "msg_type" argument, but then later declared another "msg_type" of a different type. Let's rename that to "severity", and rename "id" to "msg_id" and "msg_id" to "msg_id_str" etc. This will make a follow-up change smaller. While I'm at it properly indent the fsck_set_msg_type() argument list. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-17fsck.c: refactor and rename common config callbackÆvar Arnfjörð Bjarmason1-2/+2
Refactor code I recently changed in 1f3299fda9 (fsck: make fsck_config() re-usable, 2021-01-05) so that I could use fsck's config callback in mktag in 1f3299fda9 (fsck: make fsck_config() re-usable, 2021-01-05). I don't know what I was thinking in structuring the code this way, but it clearly makes no sense to have an fsck_config_internal() at all just so it can get a fsck_options when git_config() already supports passing along some void* data. Let's just make use of that instead, which gets us rid of the two wrapper functions, and brings fsck's common config callback in line with other such reusable config callbacks. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-01Merge branch 'jt/transfer-fsck-across-packs'Junio C Hamano1-0/+5
The approach to "fsck" the incoming objects in "index-pack" is attractive for performance reasons (we have them already in core, inflated and ready to be inspected), but fundamentally cannot be applied fully when we receive more than one pack stream, as a tree object in one pack may refer to a blob object in another pack as ".gitmodules", when we want to inspect blobs that are used as ".gitmodules" file, for example. Teach "index-pack" to emit objects that must be inspected later and check them in the calling "fetch-pack" process. * jt/transfer-fsck-across-packs: fetch-pack: print and use dangling .gitmodules fetch-pack: with packfile URIs, use index-pack arg http-fetch: allow custom index-pack args http: allow custom index-pack args
2021-02-22fetch-pack: print and use dangling .gitmodulesJonathan Tan1-0/+5
Teach index-pack to print dangling .gitmodules links after its "keep" or "pack" line instead of declaring an error, and teach fetch-pack to check such lines printed. This allows the tree side of the .gitmodules link to be in one packfile and the blob side to be in another without failing the fsck check, because it is now fetch-pack which checks such objects after all packfiles have been downloaded and indexed (and not index-pack on an individual packfile, as it is before this commit). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-17Merge branch 'js/fsck-name-objects-fix'Junio C Hamano1-0/+5
Fix "git fsck --name-objects" which apparently has not been used by anybody who is motivated enough to report breakage. * js/fsck-name-objects-fix: fsck --name-objects: be more careful parsing generation numbers t1450: robustify `remove_object()`
2021-02-10fsck --name-objects: be more careful parsing generation numbersJohannes Schindelin1-0/+5
In 7b35efd734e (fsck_walk(): optionally name objects on the go, 2016-07-17), the `fsck` machinery learned to optionally name the objects, so that it is easier to see what part of the repository is in a bad shape, say, when objects are missing. To save on complexity, this machinery uses a parser to determine the name of a parent given a commit's name: any `~<n>` suffix is parsed and the parent's name is formed from the prefix together with `~<n+1>`. However, this parser has a bug: if it finds a suffix `<n>` that is _not_ `~<n>`, it will mistake the empty string for the prefix and `<n>` for the generation number. In other words, it will generate a name of the form `~<bogus-number>`. Let's fix this. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-05Merge branch 'jk/forbid-lf-in-git-url' into maintJunio C Hamano1-1/+1
Newline characters in the host and path part of git:// URL are now forbidden. * jk/forbid-lf-in-git-url: fsck: reject .gitmodules git:// urls with newlines git_connect_git(): forbid newlines in host and path
2021-01-25Merge branch 'jk/forbid-lf-in-git-url'Junio C Hamano1-1/+1
Newline characters in the host and path part of git:// URL are now forbidden. * jk/forbid-lf-in-git-url: fsck: reject .gitmodules git:// urls with newlines git_connect_git(): forbid newlines in host and path
2021-01-07fsck: reject .gitmodules git:// urls with newlinesJeff King1-1/+1
The previous commit taught the clone/fetch client side to reject a git:// URL with a newline in it. Let's also catch these when fscking a .gitmodules file, which will give an earlier warning. Note that it would be simpler to just complain about newline in _any_ URL, but an earlier tightening for http/ftp made sure we kept allowing newlines for unknown protocols (and this is covered in the tests). So we'll stick to that precedent. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-05mktag: allow omitting the header/body \n separatorÆvar Arnfjörð Bjarmason1-0/+2
Change mktag's acceptance rules to accept an empty body without an empty line after the header again. This fixes an ancient unintended dregression in "mktag". When "mktag" was introduced in ec4465adb3 (Add "tag" objects that can be used to sign other objects., 2005-04-25) the input checks were much looser. When it was documented it 6cfec03680 (mktag: minimally update the description., 2007-06-10) it was clearly intended for this \n to be optional: The message, when [it] exists, is separated by a blank line from the header. But then in e0aaf781f6 (mktag.c: improve verification of tagger field and tests, 2008-03-27) this was made an error, seemingly by accident. It was just a result of the general header checks, and all the tests after that patch have a trailing empty line (but did not before). Let's allow this again, and tweak the test semantics changed in e0aaf781f6 to remove the redundant empty line. New tests added in previous commits of mine already added an explicit test for allowing the empty line between header and body. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-05fsck: make fsck_config() re-usableÆvar Arnfjörð Bjarmason1-0/+24
Move the fsck_config() function from builtin/fsck.c to fsck.[ch]. This allows for re-using it in other tools that expose fsck logic and want to support its configuration variables. A logical continuation of this change would be to use a common function for all of {fetch,receive}.fsck.* and fsck.*. See 5d477a334a6 (fsck (receive-pack): allow demoting errors to warnings, 2015-06-22) and my own 1362df0d413 (fetch: implement fetch.fsck.*, 2018-07-27) for the relevant code. However, those routines want to not parse the fsck.skipList into OIDs, but rather pass them along with the --strict option to another process. It would be possible to refactor that whole thing so we support e.g. a "fetch." prefix, then just keep track of the skiplist as a filename instead of parsing it, and learn to spew that all out from our internal structures into something we can append to the --strict option. But instead I'm planning to re-use this in "mktag", which'll just re-use these "fsck.*" variables as-is. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-05mktag: use fsck instead of custom verify_tag()Ævar Arnfjörð Bjarmason1-3/+29
Change the validation logic in "mktag" to use fsck's fsck_tag() instead of its own custom parser. Curiously the logic for both dates back to the same commit[1]. Let's unify them so we're not maintaining two sets functions to verify that a tag is OK. The behavior of fsck_tag() and the old "mktag" code being removed here is different in few aspects. I think it makes sense to remove some of those checks, namely: A. fsck only cares that the timezone matches [-+][0-9]{4}. The mktag code disallowed values larger than 1400. Yes there's currently no timezone with a greater offset[2], but since we allow any number of non-offical timezones (e.g. +1234) passing this through seems fine. Git also won't break in the future if e.g. French Polynesia decides it needs to outdo the Line Islands when it comes to timezone extravagance. B. fsck allows missing author names such as "tagger <email>", mktag wouldn't, but would allow e.g. "tagger [2 spaces] <email>" (but not "tagger [1 space] <email>"). Now we allow all of these. C. Like B, but "mktag" disallowed spaces in the <email> part, fsck allows it. In some ways fsck_tag() is stricter than "mktag" was, namely: D. fsck disallows zero-padded dates, but mktag didn't care. So e.g. the timestamp "0000000000 +0000" produces an error now. A test in "t1006-cat-file.sh" relied on this, it's been changed to use "hash-object" (without fsck) instead. There was one check I deemed worth keeping by porting it over to fsck_tag(): E. "mktag" did not allow any custom headers, and by extension (as an empty commit is allowed) also forbade an extra stray trailing newline after the headers it knew about. Add a new check in the "ignore" category to fsck and use it. This somewhat abuses the facility added in efaba7cc77f (fsck: optionally ignore specific fsck issues completely, 2015-06-22). This is somewhat of hack, but probably the least invasive change we can make here. The fsck command will shuffle these categories around, e.g. under --strict the "info" becomes a "warn" and "warn" becomes "error". Existing users of fsck's (and others, e.g. index-pack) --strict option rely on this. So we need to put something into a category that'll be ignored by all existing users of the API. Pretending that fsck.extraHeaderEntry=error ("ignore" by default) was set serves to do this for us. 1. ec4465adb38 (Add "tag" objects that can be used to sign other objects., 2005-04-25) 2. https://en.wikipedia.org/wiki/List_of_UTC_time_offsets Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08Merge branch 'rs/fsck-duplicate-names-in-trees'Junio C Hamano1-2/+2
The check in "git fsck" to ensure that the tree objects are sorted still had corner cases it missed unsorted entries. * rs/fsck-duplicate-names-in-trees: fsck: detect more in-tree d/f conflicts t1450: demonstrate undetected in-tree d/f conflict t1450: increase test coverage of in-tree d/f detection fsck: fix a typo in a comment
2020-05-21fsck: detect more in-tree d/f conflictsRené Scharfe1-1/+1
If the conflict candidate file name from the top of the stack is not a prefix of the current candiate directory then we can discard it as no matching directory can come up later. But we are not done checking the candidate directory -- the stack might still hold a matching file name, so stay in the loop and check the next candidate file name. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-21fsck: fix a typo in a commentRené Scharfe1-1/+1
Reported-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-14Merge branch 'rs/fsck-duplicate-names-in-trees'Junio C Hamano1-2/+70
"git fsck" ensures that the paths recorded in tree objects are sorted and without duplicates, but it failed to notice a case where a blob is followed by entries that sort before a tree with the same name. This has been corrected. * rs/fsck-duplicate-names-in-trees: fsck: report non-consecutive duplicate names in trees
2020-05-11fsck: report non-consecutive duplicate names in treesRené Scharfe1-2/+70
Tree entries are sorted in path order, meaning that directory names get a slash ('/') appended implicitly. Git fsck checks if trees contains consecutive duplicates, but due to that ordering there can be non-consecutive duplicates as well if one of them is a directory and the other one isn't. Such a tree cannot be fully checked out. Find these duplicates by recording candidate file names on a stack and check candidate directory names against that stack to find matches. Suggested-by: Brandon Williams <bwilliamseng@gmail.com> Original-test-by: Brandon Williams <bwilliamseng@gmail.com> Signed-off-by: René Scharfe <l.s.r@web.de> Reviewed-by: Luke Diamand <luke@diamand.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-28Merge branch 'jk/config-use-size-t'Junio C Hamano1-1/+1
The config API made mixed uses of int and size_t types to represent length of various pieces of text it parsed, which has been updated to use the correct type (i.e. size_t) throughout. * jk/config-use-size-t: config: reject parsing of files over INT_MAX config: use size_t to store parsed variable baselen git_config_parse_key(): return baselen as size_t config: drop useless length variable in write_pair() parse_config_key(): return subsection len as size_t remote: drop auto-strlen behavior of make_branch() and make_rewrite()
2020-04-19Git 2.25.4v2.25.4Jonathan Nieder1-5/+136
This merges up the security fix from v2.17.5. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
2020-04-19Git 2.23.3v2.23.3Jonathan Nieder1-5/+136
This merges up the security fix from v2.17.5. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
2020-04-19Git 2.22.4v2.22.4Jonathan Nieder1-5/+136
This merges up the security fix from v2.17.5. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
2020-04-19Git 2.21.3v2.21.3Jonathan Nieder1-5/+136
This merges up the security fix from v2.17.5. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
2020-04-19Git 2.20.4v2.20.4Jonathan Nieder1-5/+136
This merges up the security fix from v2.17.5. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
2020-04-19Git 2.19.5v2.19.5Jonathan Nieder1-5/+136
This merges up the security fix from v2.17.5. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
2020-04-19Git 2.18.4v2.18.4Jonathan Nieder1-5/+136
This merges up the security fix from v2.17.5. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
2020-04-19fsck: reject URL with empty host in .gitmodulesJonathan Nieder1-3/+7
Git's URL parser interprets https:///example.com/repo.git to have no host and a path of "example.com/repo.git". Curl, on the other hand, internally redirects it to https://example.com/repo.git. As a result, until "credential: parse URL without host as empty host, not unset", tricking a user into fetching from such a URL would cause Git to send credentials for another host to example.com. Teach fsck to block and detect .gitmodules files using such a URL to prevent sharing them with Git versions that are not yet protected. A relative URL in a .gitmodules file could also be used to trigger this. The relative URL resolver used for .gitmodules does not normalize sequences of slashes and can follow ".." components out of the path part and to the host part of a URL, meaning that such a relative URL can be used to traverse from a https://foo.example.com/innocent superproject to a https:///attacker.example.com/exploit submodule. Fortunately, redundant extra slashes in .gitmodules are rare, so we can catch this by detecting one after a leading sequence of "./" and "../" components. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Reviewed-by: Jeff King <peff@peff.net>
2020-04-19credential: treat URL without scheme as invalidJonathan Nieder1-2/+45
libcurl permits making requests without a URL scheme specified. In this case, it guesses the URL from the hostname, so I can run git ls-remote http::ftp.example.com/path/to/repo and it would make an FTP request. Any user intentionally using such a URL is likely to have made a typo. Unfortunately, credential_from_url is not able to determine the host and protocol in order to determine appropriate credentials to send, and until "credential: refuse to operate when missing host or protocol", this resulted in another host's credentials being leaked to the named host. Teach credential_from_url_gently to consider such a URL to be invalid so that fsck can detect and block gitmodules files with such URLs, allowing server operators to avoid serving them to downstream users running older versions of Git. This also means that when such URLs are passed on the command line, Git will print a clearer error so affected users can switch to the simpler URL that explicitly specifies the host and protocol they intend. One subtlety: .gitmodules files can contain relative URLs, representing a URL relative to the URL they were cloned from. The relative URL resolver used for .gitmodules can follow ".." components out of the path part and past the host part of a URL, meaning that such a relative URL can be used to traverse from a https://foo.example.com/innocent superproject to a https::attacker.example.com/exploit submodule. Fortunately a leading ':' in the first path component after a series of leading './' and '../' components is unlikely to show up in other contexts, so we can catch this by detecting that pattern. Reported-by: Jeff King <peff@peff.net> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Reviewed-by: Jeff King <peff@peff.net>
2020-04-19fsck: convert gitmodules url to URL passed to curlJonathan Nieder1-5/+89
In 07259e74ec1 (fsck: detect gitmodules URLs with embedded newlines, 2020-03-11), git fsck learned to check whether URLs in .gitmodules could be understood by the credential machinery when they are handled by git-remote-curl. However, the check is overbroad: it checks all URLs instead of only URLs that would be passed to git-remote-curl. In principle a git:// or file:/// URL does not need to follow the same conventions as an http:// URL; in particular, git:// and file:// protocols are not succeptible to issues in the credential API because they do not support attaching credentials. In the HTTP case, the URL in .gitmodules does not always match the URL that would be passed to git-remote-curl and the credential machinery: Git's URL syntax allows specifying a remote helper followed by a "::" delimiter and a URL to be passed to it, so that git ls-remote http::https://example.com/repo.git invokes git-remote-http with https://example.com/repo.git as its URL argument. With today's checks, that distinction does not make a difference, but for a check we are about to introduce (for empty URL schemes) it will matter. .gitmodules files also support relative URLs. To ensure coverage for the https based embedded-newline attack, urldecode and check them directly for embedded newlines. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Reviewed-by: Jeff King <peff@peff.net>
2020-04-10parse_config_key(): return subsection len as size_tJeff King1-1/+1
We return the length to a subset of a string using an "int *" out-parameter. This is fine most of the time, as we'd expect config keys to be relatively short, but it could behave oddly if we had a gigantic config key. A more appropriate type is size_t. Let's switch over, which lets our callers use size_t as appropriate (they are bound by our type because they must pass the out-parameter as a pointer). This is mostly just a cleanup to make it clear this code handles long strings correctly. In practice, our config parser already chokes on long key names (because of a similar int/size_t mixup!). When doing an int/size_t conversion, we have to be careful that nobody was trying to assign a negative value to the variable. I manually confirmed that for each case here. They tend to just feed the result to xmemdupz() or similar; in a few cases I adjusted the parameter types for helper functions to make sure the size_t is preserved. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-17Git 2.25.3v2.25.3Junio C Hamano1-1/+15
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-17Git 2.23.2v2.23.2Junio C Hamano1-1/+15
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-17Git 2.22.3v2.22.3Junio C Hamano1-1/+15
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-17Git 2.21.2v2.21.2Junio C Hamano1-1/+15
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-17Git 2.20.3v2.20.3Junio C Hamano1-1/+15
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-17Git 2.19.4v2.19.4Junio C Hamano1-1/+15
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-17Git 2.18.3v2.18.3Junio C Hamano1-1/+15
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-12fsck: detect gitmodules URLs with embedded newlinesJeff King1-1/+15
The credential protocol can't handle values with newlines. We already detect and block any such URLs from being used with credential helpers, but let's also add an fsck check to detect and block gitmodules files with such URLs. That will let us notice the problem earlier when transfer.fsckObjects is turned on. And in particular it will prevent bad objects from spreading, which may protect downstream users running older versions of Git. We'll file this under the existing gitmodulesUrl flag, which covers URLs with option injection. There's really no need to distinguish the exact flaw in the URL in this context. Likewise, I've expanded the description of t7416 to cover all types of bogus URLs.
2019-12-09Sync with Git 2.24.1Junio C Hamano1-1/+24
2019-12-06Sync with 2.22.2Johannes Schindelin1-1/+24
* maint-2.22: (43 commits) Git 2.22.2 Git 2.21.1 mingw: sh arguments need quoting in more circumstances mingw: fix quoting of empty arguments for `sh` mingw: use MSYS2 quoting even when spawning shell scripts mingw: detect when MSYS2's sh is to be spawned more robustly t7415: drop v2.20.x-specific work-around Git 2.20.2 t7415: adjust test for dubiously-nested submodule gitdirs for v2.20.x Git 2.19.3 Git 2.18.2 Git 2.17.3 Git 2.16.6 test-drop-caches: use `has_dos_drive_prefix()` Git 2.15.4 Git 2.14.6 mingw: handle `subst`-ed "DOS drives" mingw: refuse to access paths with trailing spaces or periods mingw: refuse to access paths with illegal characters unpack-trees: let merged_entry() pass through do_add_entry()'s errors ...
2019-12-06Sync with 2.21.1Johannes Schindelin1-1/+24
* maint-2.21: (42 commits) Git 2.21.1 mingw: sh arguments need quoting in more circumstances mingw: fix quoting of empty arguments for `sh` mingw: use MSYS2 quoting even when spawning shell scripts mingw: detect when MSYS2's sh is to be spawned more robustly t7415: drop v2.20.x-specific work-around Git 2.20.2 t7415: adjust test for dubiously-nested submodule gitdirs for v2.20.x Git 2.19.3 Git 2.18.2 Git 2.17.3 Git 2.16.6 test-drop-caches: use `has_dos_drive_prefix()` Git 2.15.4 Git 2.14.6 mingw: handle `subst`-ed "DOS drives" mingw: refuse to access paths with trailing spaces or periods mingw: refuse to access paths with illegal characters unpack-trees: let merged_entry() pass through do_add_entry()'s errors quote-stress-test: offer to test quoting arguments for MSYS2 sh ...
2019-12-06Sync with 2.20.2Johannes Schindelin1-1/+24
* maint-2.20: (36 commits) Git 2.20.2 t7415: adjust test for dubiously-nested submodule gitdirs for v2.20.x Git 2.19.3 Git 2.18.2 Git 2.17.3 Git 2.16.6 test-drop-caches: use `has_dos_drive_prefix()` Git 2.15.4 Git 2.14.6 mingw: handle `subst`-ed "DOS drives" mingw: refuse to access paths with trailing spaces or periods mingw: refuse to access paths with illegal characters unpack-trees: let merged_entry() pass through do_add_entry()'s errors quote-stress-test: offer to test quoting arguments for MSYS2 sh t6130/t9350: prepare for stringent Win32 path validation quote-stress-test: allow skipping some trials quote-stress-test: accept arguments to test via the command-line tests: add a helper to stress test argument quoting mingw: fix quoting of arguments Disallow dubiously-nested submodule git directories ...
2019-12-06Sync with 2.19.3Johannes Schindelin1-1/+24
* maint-2.19: (34 commits) Git 2.19.3 Git 2.18.2 Git 2.17.3 Git 2.16.6 test-drop-caches: use `has_dos_drive_prefix()` Git 2.15.4 Git 2.14.6 mingw: handle `subst`-ed "DOS drives" mingw: refuse to access paths with trailing spaces or periods mingw: refuse to access paths with illegal characters unpack-trees: let merged_entry() pass through do_add_entry()'s errors quote-stress-test: offer to test quoting arguments for MSYS2 sh t6130/t9350: prepare for stringent Win32 path validation quote-stress-test: allow skipping some trials quote-stress-test: accept arguments to test via the command-line tests: add a helper to stress test argument quoting mingw: fix quoting of arguments Disallow dubiously-nested submodule git directories protect_ntfs: turn on NTFS protection by default path: also guard `.gitmodules` against NTFS Alternate Data Streams ...
2019-12-06Sync with 2.18.2Johannes Schindelin1-1/+24
* maint-2.18: (33 commits) Git 2.18.2 Git 2.17.3 Git 2.16.6 test-drop-caches: use `has_dos_drive_prefix()` Git 2.15.4 Git 2.14.6 mingw: handle `subst`-ed "DOS drives" mingw: refuse to access paths with trailing spaces or periods mingw: refuse to access paths with illegal characters unpack-trees: let merged_entry() pass through do_add_entry()'s errors quote-stress-test: offer to test quoting arguments for MSYS2 sh t6130/t9350: prepare for stringent Win32 path validation quote-stress-test: allow skipping some trials quote-stress-test: accept arguments to test via the command-line tests: add a helper to stress test argument quoting mingw: fix quoting of arguments Disallow dubiously-nested submodule git directories protect_ntfs: turn on NTFS protection by default path: also guard `.gitmodules` against NTFS Alternate Data Streams is_ntfs_dotgit(): speed it up ...
2019-12-06Sync with 2.17.3Johannes Schindelin1-1/+24
* maint-2.17: (32 commits) Git 2.17.3 Git 2.16.6 test-drop-caches: use `has_dos_drive_prefix()` Git 2.15.4 Git 2.14.6 mingw: handle `subst`-ed "DOS drives" mingw: refuse to access paths with trailing spaces or periods mingw: refuse to access paths with illegal characters unpack-trees: let merged_entry() pass through do_add_entry()'s errors quote-stress-test: offer to test quoting arguments for MSYS2 sh t6130/t9350: prepare for stringent Win32 path validation quote-stress-test: allow skipping some trials quote-stress-test: accept arguments to test via the command-line tests: add a helper to stress test argument quoting mingw: fix quoting of arguments Disallow dubiously-nested submodule git directories protect_ntfs: turn on NTFS protection by default path: also guard `.gitmodules` against NTFS Alternate Data Streams is_ntfs_dotgit(): speed it up mingw: disallow backslash characters in tree objects' file names ...
2019-12-06fsck: reject submodule.update = !command in .gitmodulesJonathan Nieder1-0/+7
This allows hosting providers to detect whether they are being used to attack users using malicious 'update = !command' settings in .gitmodules. Since ac1fbbda2013 (submodule: do not copy unknown update mode from .gitmodules, 2013-12-02), in normal cases such settings have been treated as 'update = none', so forbidding them should not produce any collateral damage to legitimate uses. A quick search does not reveal any repositories making use of this construct, either. Reported-by: Joern Schneeweisz <jschneeweisz@gitlab.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2019-12-06Sync with 2.16.6Johannes Schindelin1-1/+17
* maint-2.16: (31 commits) Git 2.16.6 test-drop-caches: use `has_dos_drive_prefix()` Git 2.15.4 Git 2.14.6 mingw: handle `subst`-ed "DOS drives" mingw: refuse to access paths with trailing spaces or periods mingw: refuse to access paths with illegal characters unpack-trees: let merged_entry() pass through do_add_entry()'s errors quote-stress-test: offer to test quoting arguments for MSYS2 sh t6130/t9350: prepare for stringent Win32 path validation quote-stress-test: allow skipping some trials quote-stress-test: accept arguments to test via the command-line tests: add a helper to stress test argument quoting mingw: fix quoting of arguments Disallow dubiously-nested submodule git directories protect_ntfs: turn on NTFS protection by default path: also guard `.gitmodules` against NTFS Alternate Data Streams is_ntfs_dotgit(): speed it up mingw: disallow backslash characters in tree objects' file names path: safeguard `.git` against NTFS Alternate Streams Accesses ...
2019-12-06Sync with 2.14.6Johannes Schindelin1-1/+10
* maint-2.14: (28 commits) Git 2.14.6 mingw: handle `subst`-ed "DOS drives" mingw: refuse to access paths with trailing spaces or periods mingw: refuse to access paths with illegal characters unpack-trees: let merged_entry() pass through do_add_entry()'s errors quote-stress-test: offer to test quoting arguments for MSYS2 sh t6130/t9350: prepare for stringent Win32 path validation quote-stress-test: allow skipping some trials quote-stress-test: accept arguments to test via the command-line tests: add a helper to stress test argument quoting mingw: fix quoting of arguments Disallow dubiously-nested submodule git directories protect_ntfs: turn on NTFS protection by default path: also guard `.gitmodules` against NTFS Alternate Data Streams is_ntfs_dotgit(): speed it up mingw: disallow backslash characters in tree objects' file names path: safeguard `.git` against NTFS Alternate Streams Accesses clone --recurse-submodules: prevent name squatting on Windows is_ntfs_dotgit(): only verify the leading segment test-path-utils: offer to run a protectNTFS/protectHFS benchmark ...
2019-12-05is_ntfs_dotgit(): only verify the leading segmentJohannes Schindelin1-1/+10
The config setting `core.protectNTFS` is specifically designed to work not only on Windows, but anywhere, to allow for repositories hosted on, say, Linux servers to be protected against NTFS-specific attack vectors. As a consequence, `is_ntfs_dotgit()` manually splits backslash-separated paths (but does not do the same for paths separated by forward slashes), under the assumption that the backslash might not be a valid directory separator on the _current_ Operating System. However, the two callers, `verify_path()` and `fsck_tree()`, are supposed to feed only individual path segments to the `is_ntfs_dotgit()` function. This causes a lot of duplicate scanning (and very inefficient scanning, too, as the inner loop of `is_ntfs_dotgit()` was optimized for readability rather than for speed. Let's simplify the design of `is_ntfs_dotgit()` by putting the burden of splitting the paths by backslashes as directory separators on the callers of said function. Consequently, the `verify_path()` function, which already splits the path by directory separators, now treats backslashes as directory separators _explicitly_ when `core.protectNTFS` is turned on, even on platforms where the backslash is _not_ a directory separator. Note that we have to repeat some code in `verify_path()`: if the backslash is not a directory separator on the current Operating System, we want to allow file names like `\`, but we _do_ want to disallow paths that are clearly intended to cause harm when the repository is cloned on Windows. The `fsck_tree()` function (the other caller of `is_ntfs_dotgit()`) now needs to look for backslashes in tree entries' names specifically when `core.protectNTFS` is turned on. While it would be tempting to completely disallow backslashes in that case (much like `fsck` reports names containing forward slashes as "full paths"), this would be overzealous: when `core.protectNTFS` is turned on in a non-Windows setup, backslashes are perfectly valid characters in file names while we _still_ want to disallow tree entries that are clearly designed to exploit NTFS-specific behavior. This simplification will make subsequent changes easier to implement, such as turning `core.protectNTFS` on by default (not only on Windows) or protecting against attack vectors involving NTFS Alternate Data Streams. Incidentally, this change allows for catching malicious repositories that contain tree entries of the form `dir\.gitmodules` already on the server side rather than only on the client side (and previously only on Windows): in contrast to `is_ntfs_dotgit()`, the `is_ntfs_dotgitmodules()` function already expects the caller to split the paths by directory separators. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2019-10-28fsck: accept an oid instead of a "struct tree" for fsck_tree()Jeff King1-15/+15
We don't actually look at the tree struct in fsck_tree() beyond its oid and type (which is obviously OBJ_TREE). Just taking an oid gives our callers more flexibility to avoid creating a struct, and makes it clear that we are fscking just what is in the buffer, not any pre-parsed bits from the struct. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: accept an oid instead of a "struct commit" for fsck_commit()Jeff King1-15/+14
We don't actually look at the commit struct in fsck_commit() beyond its oid and type (which is obviously OBJ_COMMIT). Just taking an oid gives our callers more flexibility to avoid creating or parsing a struct, and makes it clear that we are fscking just what is in the buffer, not any pre-parsed bits from the struct. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: accept an oid instead of a "struct tag" for fsck_tag()Jeff King1-14/+13
We don't actually look at the tag struct in fsck_tag() beyond its oid and type (which is obviously OBJ_TAG). Just taking an oid gives our callers more flexibility to avoid creating or parsing a struct, and makes it clear that we are fscking just what is in the buffer, not any pre-parsed bits from the struct. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: rename vague "oid" local variablesJeff King1-4/+4
In fsck_commit() and fsck_tag(), we have local "oid" variables used for parsing parent and tagged-object oids. Let's give these more specific names in preparation for the functions taking an "oid" parameter for the object we're checking. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: don't require an object struct in verify_headers()Jeff King1-5/+7
We only need the oid and type to pass on to report(). Let's accept the broken-out parameters to give our callers more flexibility. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: don't require an object struct for fsck_ident()Jeff King1-14/+16
The only thing we do with the struct is pass its oid and type to report(). We can just take those explicitly, which gives our callers more flexibility. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: drop blob struct from fsck_finish()Jeff King1-14/+4
Since fsck_blob() no longer requires us to have a "struct blob", we don't need to create one. Which also means we don't need to worry about handling the case that lookup_blob() returns NULL (we'll still catch wrongly-identified blobs when we read the actual object contents and type from disk). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: accept an oid instead of a "struct blob" for fsck_blob()Jeff King1-13/+13
We don't actually need any information from the object struct except its oid (and the type, of course, but that's implicitly OBJ_BLOB). This gives our callers more flexibility to drop the object structs, too. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: don't require an object struct for report()Jeff King1-59/+69
The report() function really only cares about the oid and type of the object, not the full object struct. Let's convert it to take those two items separately, which gives our callers more flexibility. This makes some already-long lines even longer. I've mostly left them, as our eventual goal is to shrink these down as we continue refactoring (e.g., "&item->object" becomes "&item->object.oid, item->object.type", but will eventually shrink down to "oid, type"). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: only require an oid for skiplist functionsJeff King1-4/+5
The skiplist is inherently an oidset, so we don't need a full object struct. Let's take just the oid to give our callers more flexibility. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: only provide oid/type in fsck_error callbackJeff King1-4/+7
None of the callbacks actually care about having a "struct object"; they're happy with just the oid and type information. So let's give ourselves more flexibility to avoid having a "struct object" by just passing the broken-down fields. Note that the callback already takes a "type" field for the fsck message type. We'll rename that to "msg_type" (and use "object_type" for the object type) to make the distinction explicit. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: use oids rather than objects for object_name APIJeff King1-27/+34
We don't actually care about having object structs; we only need to look up decorations by oid. Let's accept this more limited form, which will give our callers more flexibility. Note that the decoration API we rely on uses object structs itself (even though it only looks at their oids). We can solve this by switching to a kh_oid_map (we could also use the hashmap oidmap, but it's more awkward for the simple case of just storing a void pointer). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck_describe_object(): build on our get_object_name() primitiveJeff King1-4/+1
This isolates the implementation detail of using the decoration code to our put/get functions. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: unify object-name codeJeff King1-31/+51
Commit 90cf590f53 (fsck: optionally show more helpful info for broken links, 2016-07-17) added a system for decorating objects with names. The code is split across builtin/fsck.c (which gives the initial names) and fsck.c (which adds to the names as it traverses the object graph). This leads to some duplication, where both sites have near-identical describe_object() functions (the difference being that the one in builtin/fsck.c uses a circular array of buffers to allow multiple calls in a single printf). Let's provide a unified object_name API for fsck. That lets us drop the duplication, as well as making the interface boundaries more clear (which will let us refactor the implementation more in a future patch). We'll leave describe_object() in builtin/fsck.c as a thin wrapper around the new API, as it relies on a static global to make its many callers a bit shorter. We'll also convert the bare add_decoration() calls in builtin/fsck.c to put_object_name(). This fixes two minor bugs: 1. We leak many small strings. add_decoration() has a last-one-wins approach: it updates the decoration to the new string and returns the old one. But we ignore the return value, leaking the old string. This is quite common to trigger, since we look at reflogs: the tip of any ref will be described both by looking at the actual ref, as well as the latest reflog entry. So we'd always end up leaking one of those strings. 2. The last-one-wins approach gives us lousy names. For instance, we first look at all of the refs, and then all of the reflogs. So rather than seeing "refs/heads/master", we're likely to overwrite it with "HEAD@{12345678}". We're generally better off using the first name we find. And indeed, the test in t1450 expects this ugly HEAD@{} name. After this patch, we've switched to using fsck_put_object_name()'s first-one-wins semantics, and we output the more human-friendly "refs/tags/julius" (and the test is updated accordingly). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: require an actual buffer for non-blobsJeff King1-42/+9
The fsck_object() function takes in a buffer, but also a "struct object". The rules for using these vary between types: - for a commit, we'll use the provided buffer; if it's NULL, we'll fall back to get_commit_buffer(), which loads from either an in-memory cache or from disk. If the latter fails, we'd die(), which is non-ideal for fsck. - for a tag, a NULL buffer will fall back to loading the object from disk (and failure would lead to an fsck error) - for a tree, we _never_ look at the provided buffer, and always use tree->buffer - for a blob, we usually don't look at the buffer at all, unless it has been marked as a .gitmodule file. In that case we check the buffer given to us, or assume a NULL buffer is a very large blob (and complain about it) This is much more complex than it needs to be. It turns out that nobody ever feeds a NULL buffer that isn't a blob: - git-fsck calls fsck_object() only from fsck_obj(). That in turn is called by one of: - fsck_obj_buffer(), which is a callback to verify_pack(), which unpacks everything except large blobs into a buffer (see pack-check.c, lines 131-141). - fsck_loose(), which hits a BUG() on non-blobs with a NULL buffer (builtin/fsck.c, lines 639-640) And in either case, we'll have just called parse_object_buffer() anyway, which would segfault on a NULL buffer for commits or tags (not for trees, but it would install a NULL tree->buffer which would later cause a segfault) - git-index-pack asserts that the buffer is non-NULL unless the object is a blob (see builtin/index-pack.c, line 832) - git-unpack-objects always writes a non-NULL buffer into its obj_buffer hash, which is then fed to fsck_object(). (There is actually a funny thing here where it does not store blob buffers at all, nor does it call fsck on them; it does check any needed blobs via fsck_finish() though). Let's make the rules simpler, which reduces the amount of code and gives us more flexibility in refactoring the fsck code. The new rules are: - only blobs are allowed to pass a NULL buffer - we always use the provided buffer, never pulling information from the object struct We don't have to adjust any callers, because they were already adhering to these. Note that we do drop a few fsck identifiers for missing tags, but that was all dead code (because nobody passed a NULL tag buffer). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: stop checking tag->taggedJeff King1-13/+2
Way back in 92d4c85d24 (fsck-cache: fix SIGSEGV on bad tag object, 2005-05-03), we added an fsck check that the "tagged" field of a tag struct isn't NULL. But that was mainly protecting the printing code for "--tags", and that code wasn't moved along with the check as part of ba002f3b28 (builtin-fsck: move common object checking code to fsck.c, 2008-02-25). It could also serve to detect type mismatch problems (where a tag points to object X as a commit, but really X is a blob), but it couldn't do so reliably (we'd call lookup_commit(X), but it will only notice the problem if we happen to have previously called lookup_blob(X) in the same process). And as of a commit earlier in this series, we'd consider that a parse error and complain about the object even before getting to this point anyway. So let's drop this "tag->tagged" check. It's not helping anything, and getting rid of it makes the function conceptually cleaner, as it really is just checking the buffer we feed it. In fact, we can get rid of our one-line wrapper and just unify fsck_tag() and fsck_tag_buffer(). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: stop checking commit->parent countsJeff King1-22/+1
In 4516338243 (builtin-fsck: reports missing parent commits, 2008-02-25), we added code to check that fsck found the same number of parents from parsing the commit itself as we see in the commit struct we got from parse_commit_buffer(). Back then the rationale was that the normal commit parser might skip some bad parents. But earlier in this series, we started treating that reliably as a parsing error, meaning that we'd complain about it before we even hit the code in fsck.c. Let's drop this code, which now makes fsck_commit_buffer() completely independent of any parsed values in the commit struct (that's conceptually cleaner, and also opens up more refactoring options). Note that we can also drop the MISSING_PARENT and MISSING_GRAFT fsck identifiers. This is no loss, as these would not trigger reliably anyway. We'd hit them only when lookup_commit() failed, which occurs only if we happen to have seen the object with another type already in the same process. In most cases, we'd actually run into the problem during the connectivity walk, not here. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28fsck: stop checking commit->tree valueJeff King1-5/+0
We check in fsck_commit_buffer() that commit->tree isn't NULL, which in turn generally comes from a previous parse by parse_commit(). But this isn't really accomplishing anything. The two things we might care about are: - was there a syntactically valid "tree <oid>" line in the object? But we've just done our own parse in fsck_commit_buffer() to check this. - does it point to a valid tree object? But checking the "tree" pointer here doesn't actually accomplish that; it just shows that lookup_tree() didn't return NULL, which only means that we haven't yet seen that oid as a non-tree in this process. A real connectivity check would exhaustively walk all graph links, and we do that already in a separate function. So this code isn't helping anything. And it makes the fsck code slightly more confusing and rigid (e.g., it requires that any commit structs have already been parsed). Let's drop it. As a bit of history, the presence of this code looks like a leftover from early fsck code (which did rely on parse_commit() to do most of the parsing). The check comes from ff5ebe39b0 (Port fsck-cache to use parsing functions, 2005-04-18), but we later added an explicit walk in 355885d531 (add generic, type aware object chain walker, 2008-02-25). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-19Merge branch 'br/blame-ignore'Junio C Hamano1-36/+1
"git blame" learned to "ignore" commits in the history, whose effects (as well as their presence) get ignored. * br/blame-ignore: t8014: remove unnecessary braces blame: drop some unused function parameters blame: add a test to cover blame_coalesce() blame: use the fingerprint heuristic to match ignored lines blame: add a fingerprint heuristic to match ignored lines blame: optionally track line fingerprints during fill_blame_origin() blame: add config options for the output of ignored or unblamable lines blame: add the ability to ignore commits and their changes blame: use a helper function in blame_chunk() Move oidset_parse_file() to oidset.c fsck: rename and touch up init_skiplist()
2019-06-20object: convert lookup_unknown_object() to use object_idJeff King1-1/+1
There are no callers left of lookup_unknown_object() that aren't just passing us the "hash" member of a "struct object_id". Let's take the whole struct, which gets us closer to removing all raw sha1 variables. It also matches the existing conversions of lookup_blob(), etc. The conversions of callers were done by hand, but they're all mechanical one-liners. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-16Move oidset_parse_file() to oidset.cBarret Rhoden1-35/+0
Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-16fsck: rename and touch up init_skiplist()Barret Rhoden1-9/+9
init_skiplist() took a file consisting of SHA-1s and comments and added the objects to an oidset. This functionality is useful for other commands and will be moved to oidset.c in a future commit. In preparation for that move, this commit renames it to oidset_parse_file() to reflect its more generic usage and cleans up a few of the names. Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-08Use 'unsigned short' for mode, like diff_filespec doesElijah Newren1-1/+1
struct diff_filespec defines mode to be an 'unsigned short'. Several other places in the API which we'd like to interact with using a diff_filespec used a plain unsigned (or unsigned int). This caused problems when taking addresses, so switch to unsigned short. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-15tree-walk: store object_id in a separate memberbrian m. carlson1-2/+2
When parsing a tree, we read the object ID directly out of the tree buffer. This is normally fine, but such an object ID cannot be used with oidcpy, which copies GIT_MAX_RAWSZ bytes, because if we are using SHA-1, there may not be that many bytes to copy. Instead, store the object ID in a separate struct member. Since we can no longer efficiently compute the path length, store that information as well in struct name_entry. Ensure we only copy the object ID into the new buffer if the path length is nonzero, as some callers will pass us an empty path with no object ID following it, and we will not want to read past the end of the buffer. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-30Merge branch 'jc/cocci-preincr'Junio C Hamano1-1/+1
Code cleanup. * jc/cocci-preincr: fsck: s/++i > 1/i++/ cocci: simplify "if (++u > 1)" to "if (u++)"
2018-10-24fsck: s/++i > 1/i++/Junio C Hamano1-1/+1
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-10Merge branch 'ab/fsck-skiplist'Junio C Hamano1-33/+27
Update fsck.skipList implementation and documentation. * ab/fsck-skiplist: fsck: support comments & empty lines in skipList fsck: use oidset instead of oid_array for skipList fsck: use strbuf_getline() to read skiplist file fsck: add a performance test for skipList fsck: add a performance test fsck: document that skipList input must be unabbreviated fsck: document and test commented & empty line skipList input fsck: document and test sorted skipList input fsck tests: add a test for no skipList input fsck tests: setup of bogus commit object
2018-09-27Sync with 2.18.1Junio C Hamano1-0/+14
* maint-2.18: Git 2.18.1 Git 2.17.2 fsck: detect submodule paths starting with dash fsck: detect submodule urls starting with dash Git 2.16.5 Git 2.15.3 Git 2.14.5 submodule-config: ban submodule paths that start with a dash submodule-config: ban submodule urls that start with dash submodule--helper: use "--" to signal end of clone options
2018-09-27Sync with 2.17.2Junio C Hamano1-0/+14
* maint-2.17: Git 2.17.2 fsck: detect submodule paths starting with dash fsck: detect submodule urls starting with dash Git 2.16.5 Git 2.15.3 Git 2.14.5 submodule-config: ban submodule paths that start with a dash submodule-config: ban submodule urls that start with dash submodule--helper: use "--" to signal end of clone options
2018-09-27fsck: detect submodule paths starting with dashJeff King1-0/+7
As with urls, submodule paths with dashes are ignored by git, but may end up confusing older versions. Detecting them via fsck lets us prevent modern versions of git from being a vector to spread broken .gitmodules to older versions. Compared to blocking leading-dash urls, though, this detection may be less of a good idea: 1. While such paths provide confusing and broken results, they don't seem to actually work as option injections against anything except "cd". In particular, the submodule code seems to canonicalize to an absolute path before running "git clone" (so it passes /your/clone/-sub). 2. It's more likely that we may one day make such names actually work correctly. Even after we revert this fsck check, it will continue to be a hassle until hosting servers are all updated. On the other hand, it's not entirely clear that the behavior in older versions is safe. And if we do want to eventually allow this, we may end up doing so with a special syntax anyway (e.g., writing "./-sub" in the .gitmodules file, and teaching the submodule code to canonicalize it when comparing). So on balance, this is probably a good protection. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-27fsck: detect submodule urls starting with dashJeff King1-0/+7
Urls with leading dashes can cause mischief on older versions of Git. We should detect them so that they can be rejected by receive.fsckObjects, preventing modern versions of git from being a vector by which attacks can spread. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-12fsck: support comments & empty lines in skipListÆvar Arnfjörð Bjarmason1-0/+14
It's annoying not to be able to put comments and empty lines in the skipList, when e.g. keeping a big central list of commits to skip in /etc/gitconfig, which was my motivation for 1362df0d41 ("fetch: implement fetch.fsck.*", 2018-07-27). Implement that, and document what version of Git this was changed in, since this on-disk format can be expected to be used by multiple versions of git. There is no notable performance impact from this change, using the test setup described a couple of commits back: Test HEAD~ HEAD ---------------------------------------------------------------------------------------- 1450.3: fsck with 0 skipped bad commits 7.69(7.27+0.42) 7.86(7.48+0.37) +2.2% 1450.5: fsck with 1 skipped bad commits 7.69(7.30+0.38) 7.83(7.47+0.36) +1.8% 1450.7: fsck with 10 skipped bad commits 7.76(7.38+0.38) 7.79(7.38+0.41) +0.4% 1450.9: fsck with 100 skipped bad commits 7.76(7.38+0.38) 7.74(7.36+0.38) -0.3% 1450.11: fsck with 1000 skipped bad commits 7.71(7.30+0.41) 7.72(7.34+0.38) +0.1% 1450.13: fsck with 10000 skipped bad commits 7.74(7.34+0.40) 7.72(7.34+0.38) -0.3% 1450.15: fsck with 100000 skipped bad commits 7.75(7.40+0.35) 7.70(7.29+0.40) -0.6% 1450.17: fsck with 1000000 skipped bad commits 7.12(6.86+0.26) 7.13(6.87+0.26) +0.1% Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-12fsck: use oidset instead of oid_array for skipListRené Scharfe1-21/+2
Change the implementation of the skipList feature to use oidset instead of oid_array to store SHA-1s for later lookup. This list is parsed once on startup by fsck, fetch-pack or receive-pack depending on the *.skipList config in use. I.e. only once per invocation, but note that for "clone --recurse-submodules" each submodule will re-parse the list, in addition to the main project, and it will be re-parsed when checking .gitmodules blobs, see fb16287719 ("fsck: check skiplist for object in fsck_blob()", 2018-06-27). Memory usage is a bit higher, but we don't need to keep track of the sort order anymore. Embed the oidset into struct fsck_options to make its ownership clear (no hidden sharing) and avoid unnecessary pointer indirection. The cumulative impact on performance of this & the preceding change, using the test setup described in the previous commit: Test HEAD~2 HEAD~ HEAD ---------------------------------------------------------------------------------------------------------------- 1450.3: fsck with 0 skipped bad commits 7.70(7.31+0.38) 7.72(7.33+0.38) +0.3% 7.70(7.30+0.40) +0.0% 1450.5: fsck with 1 skipped bad commits 7.84(7.47+0.37) 7.69(7.32+0.36) -1.9% 7.71(7.29+0.41) -1.7% 1450.7: fsck with 10 skipped bad commits 7.81(7.40+0.40) 7.94(7.57+0.36) +1.7% 7.92(7.55+0.37) +1.4% 1450.9: fsck with 100 skipped bad commits 7.81(7.42+0.38) 7.95(7.53+0.41) +1.8% 7.83(7.42+0.41) +0.3% 1450.11: fsck with 1000 skipped bad commits 7.99(7.62+0.36) 7.90(7.50+0.40) -1.1% 7.86(7.49+0.37) -1.6% 1450.13: fsck with 10000 skipped bad commits 7.98(7.57+0.40) 7.94(7.53+0.40) -0.5% 7.90(7.45+0.44) -1.0% 1450.15: fsck with 100000 skipped bad commits 7.97(7.57+0.39) 8.03(7.67+0.36) +0.8% 7.84(7.43+0.41) -1.6% 1450.17: fsck with 1000000 skipped bad commits 7.72(7.22+0.50) 7.28(7.07+0.20) -5.7% 7.13(6.87+0.25) -7.6% Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-12fsck: use strbuf_getline() to read skiplist fileRené Scharfe1-13/+12
The buffer is unlikely to contain a NUL character, so printing its contents using %s in a die() format is unsafe (detected with ASan). Use an idiomatic strbuf_getline() loop instead, which ensures the buffer is always NUL-terminated, supports CRLF files as well, accepts files without a newline after the last line, supports any hash length automatically, and is shorter. This fixes a bug where emitting an error about an invalid line on say line 1 would continue printing subsequent lines, and usually continue into uninitialized memory. The performance impact of this, on a CentOS 7 box with RedHat GCC 4.8.5-28: $ GIT_PERF_REPEAT_COUNT=5 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run HEAD~ HEAD p1451-fsck-skip-list.sh Test HEAD~ HEAD ---------------------------------------------------------------------------------------- 1450.3: fsck with 0 skipped bad commits 7.75(7.39+0.35) 7.68(7.29+0.39) -0.9% 1450.5: fsck with 1 skipped bad commits 7.70(7.30+0.40) 7.80(7.42+0.37) +1.3% 1450.7: fsck with 10 skipped bad commits 7.77(7.37+0.40) 7.87(7.47+0.40) +1.3% 1450.9: fsck with 100 skipped bad commits 7.82(7.41+0.40) 7.88(7.43+0.44) +0.8% 1450.11: fsck with 1000 skipped bad commits 7.88(7.49+0.39) 7.84(7.43+0.40) -0.5% 1450.13: fsck with 10000 skipped bad commits 8.02(7.63+0.39) 8.07(7.67+0.39) +0.6% 1450.15: fsck with 100000 skipped bad commits 8.01(7.60+0.41) 8.08(7.70+0.38) +0.9% 1450.17: fsck with 1000000 skipped bad commits 7.60(7.10+0.50) 7.37(7.18+0.19) -3.0% Helped-by: Jeff King <peff@peff.net> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-02Merge branch 'sb/object-store-lookup'Junio C Hamano1-4/+5
lookup_commit_reference() and friends have been updated to find in-core object for a specific in-core repository instance. * sb/object-store-lookup: (32 commits) commit.c: allow lookup_commit_reference to handle arbitrary repositories commit.c: allow lookup_commit_reference_gently to handle arbitrary repositories tag.c: allow deref_tag to handle arbitrary repositories object.c: allow parse_object to handle arbitrary repositories object.c: allow parse_object_buffer to handle arbitrary repositories commit.c: allow get_cached_commit_buffer to handle arbitrary repositories commit.c: allow set_commit_buffer to handle arbitrary repositories commit.c: migrate the commit buffer to the parsed object store commit-slabs: remove realloc counter outside of slab struct commit.c: allow parse_commit_buffer to handle arbitrary repositories tag: allow parse_tag_buffer to handle arbitrary repositories tag: allow lookup_tag to handle arbitrary repositories commit: allow lookup_commit to handle arbitrary repositories tree: allow lookup_tree to handle arbitrary repositories blob: allow lookup_blob to handle arbitrary repositories object: allow lookup_object to handle arbitrary repositories object: allow object_as_type to handle arbitrary repositories tag: add repository argument to deref_tag tag: add repository argument to parse_tag_buffer tag: add repository argument to lookup_tag ...
2018-08-02Merge branch 'jk/fsck-gitmodules-gently'Junio C Hamano1-3/+6
Recent "security fix" to pay attention to contents of ".gitmodules" while accepting "git push" was a bit overly strict than necessary, which has been adjusted. * jk/fsck-gitmodules-gently: fsck: downgrade gitmodulesParse default to "info" fsck: split ".gitmodules too large" error from parse failure fsck: silence stderr when parsing .gitmodules config: add options parameter to git_config_from_mem config: add CONFIG_ERROR_SILENT handler config: turn die_on_error into caller-facing enum
2018-07-24Merge branch 'rj/submodule-fsck-skip'Junio C Hamano1-2/+11
"fsck.skipList" did not prevent a blob object listed there from being inspected for is contents (e.g. we recently started to inspect the contents of ".gitmodules" for certain malicious patterns), which has been corrected. * rj/submodule-fsck-skip: fsck: check skiplist for object in fsck_blob()
2018-07-18Merge branch 'sb/object-store-grafts'Junio C Hamano1-1/+2
The conversion to pass "the_repository" and then "a_repository" throughout the object access API continues. * sb/object-store-grafts: commit: allow lookup_commit_graft to handle arbitrary repositories commit: allow prepare_commit_graft to handle arbitrary repositories shallow: migrate shallow information into the object parser path.c: migrate global git_path_* to take a repository argument cache: convert get_graft_file to handle arbitrary repositories commit: convert read_graft_file to handle arbitrary repositories commit: convert register_commit_graft to handle arbitrary repositories commit: convert commit_graft_pos() to handle arbitrary repositories shallow: add repository argument to is_repository_shallow shallow: add repository argument to check_shallow_file_for_update shallow: add repository argument to register_shallow shallow: add repository argument to set_alternate_shallow_file commit: add repository argument to lookup_commit_graft commit: add repository argument to prepare_commit_graft commit: add repository argument to read_graft_file commit: add repository argument to register_commit_graft commit: add repository argument to commit_graft_pos object: move grafts to object parser object-store: move object access functions to object-store.h
2018-07-16fsck: downgrade gitmodulesParse default to "info"Jeff King1-1/+1
We added an fsck check in ed8b10f631 (fsck: check .gitmodules content, 2018-05-02) as a defense against the vulnerability from 0383bbb901 (submodule-config: verify submodule names as paths, 2018-04-30). With the idea that up-to-date hosting sites could protect downstream unpatched clients that fetch from them. As part of that defense, we reject any ".gitmodules" entry that is not syntactically valid. The theory is that if we cannot even parse the file, we cannot accurately check it for vulnerabilities. And anybody with a broken .gitmodules file would eventually want to know anyway. But there are a few reasons this is a bad tradeoff in practice: - for this particular vulnerability, the client has to be able to parse the file. So you cannot sneak an attack through using a broken file, assuming the config parsers for the process running fsck and the eventual victim are functionally equivalent. - a broken .gitmodules file is not necessarily a problem. Our fsck check detects .gitmodules in _any_ tree, not just at the root. And the presence of a .gitmodules file does not necessarily mean it will be used; you'd have to also have gitlinks in the tree. The cgit repository, for example, has a file named .gitmodules from a pre-submodule attempt at sharing code, but does not actually have any gitlinks. - when the fsck check is used to reject a push, it's often hard to work around. The pusher may not have full control over the destination repository (e.g., if it's on a hosting server, they may need to contact the hosting site's support). And the broken .gitmodules may be too far back in history for rewriting to be feasible (again, this is an issue for cgit). So we're being unnecessarily restrictive without actually improving the security in a meaningful way. It would be more convenient to downgrade this check to "info", which means we'd still comment on it, but not reject a push. Site admins can already do this via config, but we should ship sensible defaults. There are a few counterpoints to consider in favor of keeping the check as an error: - the first point above assumes that the config parsers for the victim and the fsck process are equivalent. This is pretty true now, but as time goes on will become less so. Hosting sites are likely to upgrade their version of Git, whereas vulnerable clients will be stagnant (if they did upgrade, they'd cease to be vulnerable!). So in theory we may see drift over time between what two config parsers will accept. In practice, this is probably OK. The config format is pretty established at this point and shouldn't change a lot. And the farther we get from the announcement of the vulnerability, the less interesting this extra layer of protection becomes. I.e., it was _most_ valuable on day 0, when everybody's client was still vulnerable and hosting sites could protect people. But as time goes on and people upgrade, the population of vulnerable clients becomes smaller and smaller. - In theory this could protect us from other vulnerabilities in the future. E.g., .gitmodules are the only way for a malicious repository to feed data to the config parser, so this check could similarly protect clients from a future (to-be-found) bug there. But that's trading a hypothetical case for real-world pain today. If we do find such a bug, the hosting site would need to be updated to fix it, too. At which point we could figure out whether it's possible to detect _just_ the malicious case without hurting existing broken-but-not-evil cases. - Until recently, we hadn't made any restrictions on .gitmodules content. So now in tightening that we're hitting cases where certain things used to work, but don't anymore. There's some moderate pain now. But as time goes on, we'll see more (and more varied) cases that will make tightening harder in the future. So there's some argument for putting rules in place _now_, before users grow more cases that violate them. Again, this is trading pain now for hypothetical benefit in the future. And if we try hard in the future to keep our tightening to a minimum (i.e., rejecting true maliciousness without hurting broken-but-not-evil repos), then that reduces even the hypothetical benefit. Considering both sets of arguments, it makes sense to loosen this check for now. Note that we have to tweak the test in t7415 since fsck will no longer consider this a fatal error. But we still check that it reports the warning, and that we don't get the spurious error from the config code. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-16fsck: split ".gitmodules too large" error from parse failureJeff King1-1/+2
Since ed8b10f631 (fsck: check .gitmodules content, 2018-05-02), we'll report a gitmodulesParse error for two conditions: - a .gitmodules entry is not syntactically valid - a .gitmodules entry is larger than core.bigFileThreshold with the intent that we can detect malicious files and protect downstream clients. E.g., from the issue in 0383bbb901 (submodule-config: verify submodule names as paths, 2018-04-30). But these conditions are actually quite different with respect to that bug: - a syntactically invalid file cannot trigger the problem, as the victim would barf before hitting the problematic code - a too-big .gitmodules _can_ trigger the problem. Even though it is obviously silly to have a 500MB .gitmodules file, the submodule code will happily parse it if you have enough memory. So it may be reasonable to configure their severity separately. Let's add a new class for the "too large" case to allow that. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-03fsck: check skiplist for object in fsck_blob()Ramsay Jones1-2/+11
Since commit ed8b10f631 ("fsck: check .gitmodules content", 2018-05-02), fsck will issue an error message for '.gitmodules' content that cannot be parsed correctly. This is the case, even when the corresponding blob object has been included on the skiplist. For example, using the cgit repository, we see the following: $ git fsck Checking object directories: 100% (256/256), done. error: bad config line 5 in blob .gitmodules error in blob 51dd1eff1edc663674df9ab85d2786a40f7ae3a5: gitmodulesParse: could not parse gitmodules blob Checking objects: 100% (6626/6626), done. $ $ git config fsck.skiplist '.git/skip' $ echo 51dd1eff1edc663674df9ab85d2786a40f7ae3a5 >.git/skip $ $ git fsck Checking object directories: 100% (256/256), done. error: bad config line 5 in blob .gitmodules Checking objects: 100% (6626/6626), done. $ Note that the error message issued by the config parser is still present, despite adding the object-id of the blob to the skiplist. One solution would be to provide a means of suppressing the messages issued by the config parser. However, given that (logically) we are asking fsck to ignore this object, a simpler approach is to just not call the config parser if the object is to be skipped. Add a check to the 'fsck_blob()' processing function, to determine if the object is on the skiplist and, if so, exit the function early. Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-03fsck: silence stderr when parsing .gitmodulesJeff King1-1/+3
If there's a parsing error we'll already report it via the usual fsck report() function (or not, if the user has asked to skip this object or warning type). The error message from the config parser just adds confusion. Let's suppress it. Note that we didn't test this case at all, so I've added coverage in t7415. We may end up toning down or removing this fsck check in the future. So take this test as checking what happens now with a focus on stderr, and not any ironclad guarantee that we must detect and report parse failures in the future. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-03config: add options parameter to git_config_from_memJeff King1-1/+1
The underlying config parser knows how to handle a config_options struct, but git_config_from_mem() always passes NULL. Let's allow our callers to specify the options struct. We could add a "_with_options" variant, but since there are only a handful of callers, let's just update them to pass NULL. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-29tree: add repository argument to lookup_treeStefan Beller1-1/+1
Add a repository argument to allow the callers of lookup_tree to be more specific about which repository to act on. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-29blob: add repository argument to lookup_blobStefan Beller1-2/+2
Add a repository argument to allow the callers of lookup_blob to be more specific about which repository to act on. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-29object: add repository argument to parse_objectStefan Beller1-1/+2
Add a repository argument to allow the callers of parse_object to be more specific about which repository to act on. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-29Merge branch 'sb/object-store-grafts' into sb/object-store-lookupJunio C Hamano1-1/+2
* sb/object-store-grafts: commit: allow lookup_commit_graft to handle arbitrary repositories commit: allow prepare_commit_graft to handle arbitrary repositories shallow: migrate shallow information into the object parser path.c: migrate global git_path_* to take a repository argument cache: convert get_graft_file to handle arbitrary repositories commit: convert read_graft_file to handle arbitrary repositories commit: convert register_commit_graft to handle arbitrary repositories commit: convert commit_graft_pos() to handle arbitrary repositories shallow: add repository argument to is_repository_shallow shallow: add repository argument to check_shallow_file_for_update shallow: add repository argument to register_shallow shallow: add repository argument to set_alternate_shallow_file commit: add repository argument to lookup_commit_graft commit: add repository argument to prepare_commit_graft commit: add repository argument to read_graft_file commit: add repository argument to register_commit_graft commit: add repository argument to commit_graft_pos object: move grafts to object parser object-store: move object access functions to object-store.h
2018-06-25Merge branch 'nd/complete-config-vars'Junio C Hamano1-17/+51
Continuing with the idea to programatically enumerate various pieces of data required for command line completion, teach the codebase to report the list of configuration variables subcommands care about to help complete them. * nd/complete-config-vars: completion: complete general config vars in two steps log-tree: allow to customize 'grafted' color completion: support case-insensitive config vars completion: keep other config var completion in camelCase completion: drop the hard coded list of config vars am: move advice.amWorkDir parsing back to advice.c advice: keep config name in camelCase in advice_config[] fsck: produce camelCase config key names help: add --config to list all available config fsck: factor out msg_id_info[] lazy initialization code grep: keep all colors in an array Add and use generic name->id mapping code for color slot parsing
2018-06-13Merge branch 'jk/submodule-fsck-loose-fixup'Junio C Hamano1-1/+2
Finishing touches to a topic that already is in 'maint'. * jk/submodule-fsck-loose-fixup: fsck: avoid looking at NULL blob->object t7415: don't bother creating commit for symlink test
2018-06-11fsck: avoid looking at NULL blob->objectJeff King1-1/+2
Commit 159e7b080b (fsck: detect gitmodules files, 2018-05-02) taught fsck to look at the content of .gitmodules files. If the object turns out not to be a blob at all, we just complain and punt on checking the content. And since this was such an obvious and trivial code path, I didn't even bother to add a test. Except it _does_ do one non-trivial thing, which is call the report() function, which wants us to pass a pointer to a "struct object". Which we don't have (we have only a "struct object_id"). So we erroneously pass a NULL object to report(), which gets dereferenced and causes a segfault. It seems like we could refactor report() to just take the object_id itself. But we pass the object pointer along to a callback function, and indeed this ends up in builtin/fsck.c's objreport() which does want to look at other parts of the object (like the type). So instead, let's just use lookup_unknown_object() to get the real "struct object", and pass that. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-30Merge branch 'bc/object-id'Junio C Hamano1-9/+11
Conversion from uchar[20] to struct object_id continues. * bc/object-id: (42 commits) merge-one-file: compute empty blob object ID add--interactive: compute the empty tree value Update shell scripts to compute empty tree object ID sha1_file: only expose empty object constants through git_hash_algo dir: use the_hash_algo for empty blob object ID sequencer: use the_hash_algo for empty tree object ID cache-tree: use is_empty_tree_oid sha1_file: convert cached object code to struct object_id builtin/reset: convert use of EMPTY_TREE_SHA1_BIN builtin/receive-pack: convert one use of EMPTY_TREE_SHA1_HEX wt-status: convert two uses of EMPTY_TREE_SHA1_HEX submodule: convert several uses of EMPTY_TREE_SHA1_HEX sequencer: convert one use of EMPTY_TREE_SHA1_HEX merge: convert empty tree constant to the_hash_algo builtin/merge: switch tree functions to use object_id builtin/am: convert uses of EMPTY_TREE_SHA1_BIN to the_hash_algo sha1-file: add functions for hex empty tree and blob OIDs builtin/receive-pack: avoid hard-coded constants for push certs diff: specify abbreviation size in terms of the_hash_algo upload-pack: replace use of several hard-coded constants ...
2018-05-29Sync with Git 2.17.1Junio C Hamano1-4/+134
* maint: (25 commits) Git 2.17.1 Git 2.16.4 Git 2.15.2 Git 2.14.4 Git 2.13.7 fsck: complain when .gitmodules is a symlink index-pack: check .gitmodules files with --strict unpack-objects: call fsck_finish() after fscking objects fsck: call fsck_finish() after fscking objects fsck: check .gitmodules content fsck: handle promisor objects in .gitmodules check fsck: detect gitmodules files fsck: actually fsck blob data fsck: simplify ".git" check index-pack: make fsck error message more specific verify_path: disallow symlinks in .gitmodules update-index: stat updated files earlier verify_dotfile: mention case-insensitivity in comment verify_path: drop clever fallthrough skip_prefix: add case-insensitive variant ...
2018-05-29fsck: produce camelCase config key namesNguyễn Thái Ngọc Duy1-4/+18
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-29help: add --config to list all available configNguyễn Thái Ngọc Duy1-0/+12
Sometimes it helps to list all available config vars so the user can search for something they want. The config man page can also be used but it's harder to search if you want to focus on the variable name, for example. This is not the best way to collect the available config since it's not precise. Ideally we should have a centralized list of config in C code (pretty much like 'struct option'), but that's a lot more work. This will do for now. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-29fsck: factor out msg_id_info[] lazy initialization codeNguyễn Thái Ngọc Duy1-16/+24
This array will be used by some other function than parse_msg_id() in the following commit. Factor out this prep code so it could be called from that one. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-23Merge branch 'ds/lazy-load-trees'Junio C Hamano1-3/+5
The code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit to avoid opening and parsing the commit object when it makes sense to do so. * ds/lazy-load-trees: coccinelle: avoid wrong transformation suggestions from commit.cocci commit-graph: lazy-load trees for commits treewide: replace maybe_tree with accessor methods commit: create get_commit_tree() method treewide: rename tree to maybe_tree
2018-05-21fsck: complain when .gitmodules is a symlinkJeff King1-2/+9
We've recently forbidden .gitmodules to be a symlink in verify_path(). And it's an easy way to circumvent our fsck checks for .gitmodules content. So let's complain when we see it. Signed-off-by: Jeff King <peff@peff.net>
2018-05-21fsck: check .gitmodules contentJeff King1-1/+59
This patch detects and blocks submodule names which do not match the policy set forth in submodule-config. These should already be caught by the submodule code itself, but putting the check here means that newer versions of Git can protect older ones from malicious entries (e.g., a server with receive.fsckObjects will block the objects, protecting clients which fetch from it). As a side effect, this means fsck will also complain about .gitmodules files that cannot be parsed (or were larger than core.bigFileThreshold). Signed-off-by: Jeff King <peff@peff.net>
2018-05-21fsck: handle promisor objects in .gitmodules checkJeff King1-0/+3
If we have a tree that points to a .gitmodules blob but don't have that blob, we can't check its contents. This produces an fsck error when we encounter it. But in the case of a promisor object, this absence is expected, and we must not complain. Note that this can technically circumvent our transfer.fsckObjects check. Imagine a client fetches a tree, but not the matching .gitmodules blob. An fsck of the incoming objects will show that we don't have enough information. Later, we do fetch the actual blob. But we have no idea that it's a .gitmodules file. The only ways to get around this would be to re-scan all of the existing trees whenever new ones enter (which is expensive), or to somehow persist the gitmodules_found set between fsck runs (which is complicated). In practice, it's probably OK to ignore the problem. Any repository which has all of the objects (including the one serving the promisor packs) can perform the checks. Since promisor packs are inherently about a hierarchical topology in which clients rely on upstream repositories, those upstream repositories can protect all of their downstream clients from broken objects. Signed-off-by: Jeff King <peff@peff.net>
2018-05-21fsck: detect gitmodules filesJeff King1-0/+58
In preparation for performing fsck checks on .gitmodules files, this commit plumbs in the actual detection of the files. Note that unlike most other fsck checks, this cannot be a property of a single object: we must know that the object is found at a ".gitmodules" path at the root tree of a commit. Since the fsck code only sees one object at a time, we have to mark the related objects to fit the puzzle together. When we see a commit we mark its tree as a root tree, and when we see a root tree with a .gitmodules file, we mark the corresponding blob to be checked. In an ideal world, we'd check the objects in topological order: commits followed by trees followed by blobs. In that case we can avoid ever loading an object twice, since all markings would be complete by the time we get to the marked objects. And indeed, if we are checking a single packfile, this is the order in which Git will generally write the objects. But we can't count on that: 1. git-fsck may show us the objects in arbitrary order (loose objects are fed in sha1 order, but we may also have multiple packs, and we process each pack fully in sequence). 2. The type ordering is just what git-pack-objects happens to write now. The pack format does not require a specific order, and it's possible that future versions of Git (or a custom version trying to fool official Git's fsck checks!) may order it differently. 3. We may not even be fscking all of the relevant objects at once. Consider pushing with transfer.fsckObjects, where one push adds a blob at path "foo", and then a second push adds the same blob at path ".gitmodules". The blob is not part of the second push at all, but we need to mark and check it. So in the general case, we need to make up to three passes over the objects: once to make sure we've seen all commits, then once to cover any trees we might have missed, and then a final pass to cover any .gitmodules blobs we found in the second pass. We can simplify things a bit by loosening the requirement that we find .gitmodules only at root trees. Technically a file like "subdir/.gitmodules" is not parsed by Git, but it's not unreasonable for us to declare that Git is aware of all ".gitmodules" files and make them eligible for checking. That lets us drop the root-tree requirement, which eliminates one pass entirely. And it makes our worst case much better: instead of potentially queueing every root tree to be re-examined, the worst case is that we queue each unique .gitmodules blob for a second look. This patch just adds the boilerplate to find .gitmodules files. The actual content checks will come in a subsequent commit. Signed-off-by: Jeff King <peff@peff.net>
2018-05-21fsck: actually fsck blob dataJeff King1-1/+7
Because fscking a blob has always been a noop, we didn't bother passing around the blob data. In preparation for content-level checks, let's fix up a few things: 1. The fsck_object() function just returns success for any blob. Let's a noop fsck_blob(), which we can fill in with actual logic later. 2. The fsck_loose() function in builtin/fsck.c just threw away blob content after loading it. Let's hold onto it until after we've called fsck_object(). The easiest way to do this is to just drop the parse_loose_object() helper entirely. Incidentally, this also fixes a memory leak: if we successfully loaded the object data but did not parse it, we would have left the function without freeing it. 3. When fsck_loose() loads the object data, it does so with a custom read_loose_object() helper. This function streams any blobs, regardless of size, under the assumption that we're only checking the sha1. Instead, let's actually load blobs smaller than big_file_threshold, as the normal object-reading code-paths would do. This lets us fsck small files, and a NULL return is an indication that the blob was so big that it needed to be streamed, and we can pass that information along to fsck_blob(). Signed-off-by: Jeff King <peff@peff.net>
2018-05-21fsck: simplify ".git" checkJeff King1-3/+1
There's no need for us to manually check for ".git"; it's a subset of the other filesystem-specific tests. Dropping it makes our code slightly shorter. More importantly, the existing code may make a reader wonder why ".GIT" is not covered here, and whether that is a bug (it isn't, as it's also covered in the filesystem-specific tests). Signed-off-by: Jeff King <peff@peff.net>
2018-05-18commit: add repository argument to lookup_commit_graftJonathan Nieder1-1/+1
Add a repository argument to allow callers of lookup_commit_graft to be more specific about which repository to handle. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-16object-store: move object access functions to object-store.hStefan Beller1-0/+1
This should make these functions easier to find and cache.h less overwhelming to read. In particular, this moves: - read_object_file - oid_object_info - write_object_file As a result, most of the codebase needs to #include object-store.h. In this patch the #include is only added to files that would fail to compile otherwise. It would be better to #include wherever identifiers from the header are used. That can happen later when we have better tooling for it. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-02fsck: convert static functions to struct object_idbrian m. carlson1-9/+11
Convert two static functions to use struct object_id and parse_oid_hex, instead of relying on harcoded 20 and 40-based constants. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11treewide: replace maybe_tree with accessor methodsDerrick Stolee1-3/+5
In anticipation of making trees load lazily, create a Coccinelle script (contrib/coccinelle/commit.cocci) to ensure that all references to the 'maybe_tree' member of struct commit are either mutations or accesses through get_commit_tree() or get_commit_tree_oid(). Apply the Coccinelle script to create the rest of the patch. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11treewide: rename tree to maybe_treeDerrick Stolee1-3/+3
Using the commit-graph file to walk commit history removes the large cost of parsing commits during the walk. This exposes a performance issue: lookup_tree() takes a large portion of the computation time, even when Git never uses those trees. In anticipation of lazy-loading these trees, rename the 'tree' member of struct commit to 'maybe_tree'. This serves two purposes: it hints at the future role of possibly being NULL even if the commit has a valid tree, and it allows for unambiguous transformation from simple member access (i.e. commit->maybe_tree) to method access. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-14sha1_file: convert read_sha1_file to struct object_idbrian m. carlson1-1/+1
Convert read_sha1_file to take a pointer to struct object_id and rename it read_object_file. Do the same for read_sha1_file_extended. Convert one use in grep.c to use the new function without any other code change, since the pointer being passed is a void pointer that is already initialized with a pointer to struct object_id. Update the declaration and definitions of the modified functions, and apply the following semantic patch to convert the remaining callers: @@ expression E1, E2, E3; @@ - read_sha1_file(E1.hash, E2, E3) + read_object_file(&E1, E2, E3) @@ expression E1, E2, E3; @@ - read_sha1_file(E1->hash, E2, E3) + read_object_file(E1, E2, E3) @@ expression E1, E2, E3, E4; @@ - read_sha1_file_extended(E1.hash, E2, E3, E4) + read_object_file_extended(&E1, E2, E3, E4) @@ expression E1, E2, E3, E4; @@ - read_sha1_file_extended(E1->hash, E2, E3, E4) + read_object_file_extended(E1, E2, E3, E4) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-14object: rename function 'typename' to 'type_name'Brandon Williams1-1/+1
Rename C++ keyword in order to bring the codebase closer to being able to be compiled with a C++ compiler. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-11Merge branch 'rs/fsck-null-return-from-lookup'Junio C Hamano1-4/+4
Improve behaviour of "git fsck" upon finding a missing object. * rs/fsck-null-return-from-lookup: fsck: handle NULL return of lookup_blob() and lookup_tree()
2017-10-06fsck: handle NULL return of lookup_blob() and lookup_tree()René Scharfe1-4/+4
lookup_blob() and lookup_tree() can return NULL if they find an object of an unexpected type. Accessing the object member is undefined in that case. Cast the result to a struct object pointer instead; we can do that because object is the first member of all object types. This trick is already used in other places in the code. An error message is already shown by object_as_type(), which is called by the lookup functions. The walk callback functions are expected to handle NULL object pointers passed to them, but put_object_name() needs a valid object, so avoid calling it without one. Suggested-by: SZEDER Gábor <szeder.dev@gmail.com> Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-22consistently use "fallthrough" comments in switchesJeff King1-0/+1
Gcc 7 adds -Wimplicit-fallthrough, which can warn when a switch case falls through to the next case. The general idea is that the compiler can't tell if this was intentional or not, so you should annotate any intentional fall-throughs as such, leaving it to complain about any unannotated ones. There's a GNU __attribute__ which can be used for annotation, but of course we'd have to #ifdef it away on non-gcc compilers. Gcc will also recognize specially-formatted comments, which matches our current practice. Let's extend that practice to all of the unannotated sites (which I did look over and verify that they were behaving as intended). Ideally in each case we'd actually give some reasons in the comment about why we're falling through, or what we're falling through to. And gcc does support that with -Wimplicit-fallthrough=2, which relaxes the comment pattern matching to anything that contains "fallthrough" (or a variety of spelling variants). However, this isn't the default for -Wimplicit-fallthrough, nor for -Wextra. In the name of simplicity, it's probably better for us to support the default level, which requires "fallthrough" to be the only thing in the comment (modulo some window dressing like "else" and some punctuation; see the gcc manual for the complete set of patterns). This patch suppresses all warnings due to -Wimplicit-fallthrough. We might eventually want to add that to the DEVELOPER Makefile knob, but we should probably wait until gcc 7 is more widely adopted (since earlier versions will complain about the unknown warning type). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-13commit: convert lookup_commit_graft to struct object_idStefan Beller1-1/+1
With this patch, commit.h doesn't contain the string 'sha1' any more. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-29Merge branch 'bc/object-id'Junio C Hamano1-3/+3
Conversion from uchar[20] to struct object_id continues. * bc/object-id: (53 commits) object: convert parse_object* to take struct object_id tree: convert parse_tree_indirect to struct object_id sequencer: convert do_recursive_merge to struct object_id diff-lib: convert do_diff_cache to struct object_id builtin/ls-tree: convert to struct object_id merge: convert checkout_fast_forward to struct object_id sequencer: convert fast_forward_to to struct object_id builtin/ls-files: convert overlay_tree_on_cache to object_id builtin/read-tree: convert to struct object_id sha1_name: convert internals of peel_onion to object_id upload-pack: convert remaining parse_object callers to object_id revision: convert remaining parse_object callers to object_id revision: rename add_pending_sha1 to add_pending_oid http-push: convert process_ls_object and descendants to object_id refs/files-backend: convert many internals to struct object_id refs: convert struct ref_update to use struct object_id ref-filter: convert some static functions to struct object_id Convert struct ref_array_item to struct object_id Convert the verify_pack callback to struct object_id Convert lookup_tag to struct object_id ...
2017-05-08object: convert parse_object* to take struct object_idbrian m. carlson1-1/+1
Make parse_object, parse_object_or_die, and parse_object_buffer take a pointer to struct object_id. Remove the temporary variables inserted earlier, since they are no longer necessary. Transform all of the callers using the following semantic patch: @@ expression E1; @@ - parse_object(E1.hash) + parse_object(&E1) @@ expression E1; @@ - parse_object(E1->hash) + parse_object(E1) @@ expression E1, E2; @@ - parse_object_or_die(E1.hash, E2) + parse_object_or_die(&E1, E2) @@ expression E1, E2; @@ - parse_object_or_die(E1->hash, E2) + parse_object_or_die(E1, E2) @@ expression E1, E2, E3, E4, E5; @@ - parse_object_buffer(E1.hash, E2, E3, E4, E5) + parse_object_buffer(&E1, E2, E3, E4, E5) @@ expression E1, E2, E3, E4, E5; @@ - parse_object_buffer(E1->hash, E2, E3, E4, E5) + parse_object_buffer(E1, E2, E3, E4, E5) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-08Convert lookup_tree to struct object_idbrian m. carlson1-1/+1
Convert the lookup_tree function to take a pointer to struct object_id. The commit was created with manual changes to tree.c, tree.h, and object.c, plus the following semantic patch: @@ @@ - lookup_tree(EMPTY_TREE_SHA1_BIN) + lookup_tree(&empty_tree_oid) @@ expression E1; @@ - lookup_tree(E1.hash) + lookup_tree(&E1) @@ expression E1; @@ - lookup_tree(E1->hash) + lookup_tree(E1) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-08Convert lookup_blob to struct object_idbrian m. carlson1-1/+1
Convert lookup_blob to take a pointer to struct object_id. The commit was created with manual changes to blob.c and blob.h, plus the following semantic patch: @@ expression E1; @@ - lookup_blob(E1.hash) + lookup_blob(&E1) @@ expression E1; @@ - lookup_blob(E1->hash) + lookup_blob(E1) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-04-23parse_timestamp(): specify explicitly where we parse timestampsJohannes Schindelin1-1/+1
Currently, Git's source code represents all timestamps as `unsigned long`. In preparation for using a more appropriate data type, let's introduce a symbol `parse_timestamp` (currently being defined to `strtoul`) where appropriate, so that we can later easily switch to, say, use `strtoull()` instead. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-31Rename sha1_array to oid_arraybrian m. carlson1-3/+3
Since this structure handles an array of object IDs, rename it to struct oid_array. Also rename the accessor functions and the initialization constant. This commit was produced mechanically by providing non-Documentation files to the following Perl one-liners: perl -pi -E 's/struct sha1_array/struct oid_array/g' perl -pi -E 's/\bsha1_array_/oid_array_/g' perl -pi -E 's/SHA1_ARRAY_INIT/OID_ARRAY_INIT/g' Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-31Convert sha1_array_lookup to take struct object_idbrian m. carlson1-1/+1
Convert this function by changing the declaration and definition and applying the following semantic patch to update the callers: @@ expression E1, E2; @@ - sha1_array_lookup(E1, E2.hash) + sha1_array_lookup(E1, &E2) @@ expression E1, E2; @@ - sha1_array_lookup(E1, E2->hash) + sha1_array_lookup(E1, E2) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-31Make sha1_array_append take a struct object_id *brian m. carlson1-1/+1
Convert the callers to pass struct object_id by changing the function declaration and definition and applying the following semantic patch: @@ expression E1, E2; @@ - sha1_array_append(E1, E2.hash) + sha1_array_append(E1, &E2) @@ expression E1, E2; @@ - sha1_array_append(E1, E2->hash) + sha1_array_append(E1, E2) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-28sha1-array: convert internal storage for struct sha1_array to object_idbrian m. carlson1-2/+2
Make the internal storage for struct sha1_array use an array of struct object_id internally. Update the users of this struct which inspect its internals. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-28fsck: convert init_skiplist to struct object_idbrian m. carlson1-5/+6
Convert a hardcoded constant buffer size to a use of GIT_MAX_HEXSZ, and use parse_oid_hex to reduce the dependency on the size of the hash. This function is a caller of sha1_array_append, which will be converted later. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-26fsck: lazily load types under --connectivity-onlyJeff King1-0/+4
The recent fixes to "fsck --connectivity-only" load all of the objects with their correct types. This keeps the connectivity-only code path close to the regular one, but it also introduces some unnecessary inefficiency. While getting the type of an object is cheap compared to actually opening and parsing the object (as the non-connectivity-only case would do), it's still not free. For reachable non-blob objects, we end up having to parse them later anyway (to see what they point to), making our type lookup here redundant. For unreachable objects, we might never hit them at all in the reachability traversal, making the lookup completely wasted. And in some cases, we might have quite a few unreachable objects (e.g., when alternates are used for shared object storage between repositories, it's normal for there to be objects reachable from other repositories but not the one running fsck). The comment in mark_object_for_connectivity() claims two benefits to getting the type up front: 1. We need to know the types during fsck_walk(). (And not explicitly mentioned, but we also need them when printing the types of broken or dangling commits). We can address this by lazy-loading the types as necessary. Most objects never need this lazy-load at all, because they fall into one of these categories: a. Reachable from our tips, and are coerced into the correct type as we traverse (e.g., a parent link will call lookup_commit(), which converts OBJ_NONE to OBJ_COMMIT). b. Unreachable, but not at the tip of a chunk of unreachable history. We only mention the tips as "dangling", so an unreachable commit which links to hundreds of other objects needs only report the type of the tip commit. 2. It serves as a cross-check that the coercion in (1a) is correct (i.e., we'll complain about a parent link that points to a blob). But we get most of this for free already, because right after coercing, we'll parse any non-blob objects. So we'd notice then if we expected a commit and got a blob. The one exception is when we expect a blob, in which case we never actually read the object contents. So this is a slight weakening, but given that the whole point of --connectivity-only is to sacrifice some data integrity checks for speed, this seems like an acceptable tradeoff. Here are before and after timings for an extreme case with ~5M reachable objects and another ~12M unreachable (it's the torvalds/linux repository on GitHub, connected to shared storage for all of the other kernel forks): [before] $ time git fsck --no-dangling --connectivity-only real 3m4.323s user 1m25.121s sys 1m38.710s [after] $ time git fsck --no-dangling --connectivity-only real 0m51.497s user 0m49.575s sys 0m1.776s Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-27fsck: handle bad trees like other errorsDavid Turner1-6/+12
Instead of dying when fsck hits a malformed tree object, log the error like any other and continue. Now fsck can tell the user which tree is bad, too. Signed-off-by: David Turner <dturner@twosigma.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-18fsck: optionally show more helpful info for broken linksJohannes Schindelin1-4/+17
When reporting broken links between commits/trees/blobs, it would be quite helpful at times if the user would be told how the object is supposed to be reachable. With the new --name-objects option, git-fsck will try to do exactly that: name the objects in a way that shows how they are reachable. For example, when some reflog got corrupted and a blob is missing that should not be, the user might want to remove the corresponding reflog entry. This option helps them find that entry: `git fsck` will now report something like this: broken link from tree b5eb6ff... (refs/stash@{<date>}~37:) to blob ec5cf80... Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-18fsck: give the error function a chance to see the fsck_optionsJohannes Schindelin1-2/+3
We will need this in the next commit, where fsck will be taught to optionally name the objects when reporting issues about them. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-18fsck_walk(): optionally name objects on the goJohannes Schindelin1-4/+83
If fsck_options->name_objects is initialized, and if it already has name(s) for the object(s) that are to be the starting point(s) for fsck_walk(), then that function will now add names for the objects that were walked. This will be highly useful for teaching git-fsck to identify root causes for broken links, which is the task for the next patch in this series. Note that this patch opts for decorating the objects with plain strings instead of full-blown structs (à la `struct rev_name` in the code of the `git name-rev` command), for several reasons: - the code is much simpler than if it had to work with structs that describe arbitrarily long names such as "master~14^2~5:builtin/am.c", - the string processing is actually quite light-weight compared to the rest of fsck's operation, - the caller of fsck_walk() is expected to provide names for the starting points, and using plain and simple strings is just the easiest way to do that. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-05-17Merge branch 'jc/fsck-nul-in-commit'Junio C Hamano1-3/+13
"git fsck" learned to catch NUL byte in a commit object as potential error and warn. * jc/fsck-nul-in-commit: fsck: detect and warn a commit with embedded NUL fsck_commit_buffer(): do not special case the last validation
2016-05-10fsck: detect and warn a commit with embedded NULJunio C Hamano1-0/+8
Even though a Git commit object is designed to be capable of storing any binary data as its payload, in practice people use it to describe the changes in textual form, and tools like "git log" are designed to treat the payload as text. Detect and warn when we see any commit object with a NUL byte in it. Note that a NUL byte in the header part is already detected as a grave error. This change is purely about the message part. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-04-25tree-walk: convert tree_entry_extract() to use struct object_idbrian m. carlson1-3/+3
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-04-25struct name_entry: use struct object_id instead of unsigned char sha1[20]brian m. carlson1-2/+2
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-04-14fsck_commit_buffer(): do not special case the last validationJunio C Hamano1-3/+5
The pattern taken by all the validations in this function is: if (notice a violation exists) { err = report(... VIOLATION_KIND ...); if (err) return err; } where report() returns zero if specified kind of violation is set to be ignored, and otherwise shows an error message and returns non-zero. The last validation in the function immediately before the function returns 0 to declare "all good" can cheat and directly return the return value from report(), and the current code does so, i.e. if (notice a violation exists) return report(... VIOLATION_KIND ...); return 0; But that is a selfish code that declares it is the ultimate and final form of the function, never to be enhanced later. To allow and invite future enhancements, make the last test follow the same pattern. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-22convert trivial cases to ALLOC_ARRAYJeff King1-1/+2
Each of these cases can be converted to use ALLOC_ARRAY or REALLOC_ARRAY, which has two advantages: 1. It automatically checks the array-size multiplication for overflow. 2. It always uses sizeof(*array) for the element-size, so that it can never go out of sync with the declared type of the array. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>