Silencing 404s on gvfs/config #504
Conversation
Signed-off-by: Kevin Willford <[email protected]>
Hydrate missing loose objects in check_and_freshen() when running virtualized. Add test cases to verify read-object hook works when running virtualized. This hook is called in check_and_freshen() rather than check_and_freshen_local() to make the hook work also with alternates. Helped-by: Kevin Willford <[email protected]> Signed-off-by: Ben Peart <[email protected]>
This code change makes sure that the config value for core_gvfs is always loaded before checking it. Signed-off-by: Kevin Willford <[email protected]>
If we are going to write an object there is no use in calling the read object hook to get an object from a potentially remote source. We would rather just write out the object and avoid the potential round trip for an object that doesn't exist. This change adds a flag to the check_and_freshen() and freshen_loose_object() functions' signatures so that the hook is bypassed when the functions are called before writing loose objects. The check for a local object is still performed so we don't overwrite something that has already been written to one of the objects directories. Based on a patch by Kevin Willford. Signed-off-by: Johannes Schindelin <[email protected]>
On index load, clear/set the skip worktree bits based on the virtual file system data. Use virtual file system data to update skip-worktree bit in unpack-trees. Use virtual file system data to exclude files and folders not explicitly requested. Signed-off-by: Ben Peart <[email protected]>
String formatting can be a performance issue when there are hundreds of thousands of trees. Stop using strbuf_addf() and add the strings or characters individually. There are a limited number of modes, so add a switch for the known ones and a default case for anything that is not a known git mode. In one scenario involving a huge worktree, this reduces the time required for a `git checkout <branch>` from 44 seconds to 38 seconds, i.e. it is a non-negligible performance improvement. Signed-off-by: Kevin Willford <[email protected]>
…x has been redirected Fixes microsoft#13 Some git commands spawn helpers and redirect the index to a different location. These include "difftool -d" and the sequencer (i.e. `git rebase -i`, `git cherry-pick` and `git revert`) and others. In those instances we don't want to update their temporary index with our virtualization data. Helped-by: Johannes Schindelin <[email protected]> Signed-off-by: Ben Peart <[email protected]>
The following commands and options are not currently supported when working in a GVFS repo. Add code to detect and block these commands from executing. 1) fsck 2) gc 3) prune 4) repack 5) submodule 6) update-index --split-index 7) update-index --index-version (other than 4) 8) update-index --[no-]skip-worktree 9) worktree Signed-off-by: Ben Peart <[email protected]>
The virtual file system code incorrectly treated symlinks as directories instead of regular files. This meant symlinks were not included even if they were listed in the list of files returned by the core.virtualFilesystem hook proc. Fixes microsoft#25 Signed-off-by: Ben Peart <[email protected]>
We found a user who had set "core.gvfs = false" in their global config. This should not have been necessary, but it also should not have caused a problem. However, it did. The reason is that gvfs_load_config_value() is called from config.c when reading config key/value pairs from all the config files. The local config should override the global config, and this is done by config.c reading the global config first then reading the local config. However, our logic only allowed writing the core_gvfs variable once. Put the guards against multiple assignments of core_gvfs into gvfs_config_is_set() instead, because that will fix the problem _and_ keep multiple calls to gvfs_config_is_set() from slowing things down. Signed-off-by: Derrick Stolee <[email protected]>
Add check to see if a directory is included in the virtualfilesystem before checking the directory hashmap. This allows a directory entry like foo/ to find all untracked files in subdirectories.
Teach STATUS to optionally serialize the results of a status computation to a file. Teach STATUS to optionally read an existing serialization file and simply print the results, rather than actually scanning. This is intended for immediate status results on extremely large repos and assumes the use of a service/daemon to maintain a fresh current status snapshot. 2021-10-30: packet_read() changed its prototype in ec9a37d (pkt-line.[ch]: remove unused packet_read_line_buf(), 2021-10-14). 2021-10-30: sscanf() now does an extra check that "%d" goes into an "int" and complains about "uint32_t". Replacing with "%u" fixes the compile-time error. 2021-10-30: string_list_init() was removed by abf897b (string-list.[ch]: remove string_list_init() compatibility function, 2021-09-28), so we need to initialize manually. Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
Teach status serialization to take an optional pathname on the command line to direct that cache data be written there rather than to stdout. When used this way, normal status results will still be written to stdout. When no path is given, only binary serialization data is written to stdout. Usage: git status --serialize[=<path>] Signed-off-by: Jeff Hostetler <[email protected]>
Teach status deserialize code to reject status cache when printing in porcelain V2 and there are unresolved conflicts in the cache file. A follow-on task might extend the cache format to include this additional data. See code for longer explanation. Signed-off-by: Jeff Hostetler <[email protected]>
Fix "git status --deserialize" to correctly report both pathnames for renames. Add a test case to verify. A change was made upstream that added an additional "rename_status" field to the "struct wt_status_change_data" structure. It is used during the various print routines to decide if 2 pathnames need to be printed. 5134ccd wt-status.c: rename rename-related fields in wt_status_change_data The fix here is to add that field to the status cache data. Signed-off-by: Jeff Hostetler <[email protected]>
Changes to the global or repo-local excludes files can change the results returned by "git status" for untracked files. Therefore, it is important that the exclude-file values used during serialization are still current at the time of deserialization. Teach "git status --serialize" to report metadata on the user's global exclude file (which defaults to "$XDG_HOME/git/ignore") and for the repo-local excludes file (which is in ".git/info/excludes"). Serialize will record the pathnames and mtimes for these files in the serialization header (next to the mtime data for the .git/index file). Teach "git status --deserialize" to validate this new metadata. If either exclude file has changed since the serialization-cache-file was written, then deserialize will reject the cache file and force a full/normal status run. Signed-off-by: Jeff Hostetler <[email protected]>
Teach `git status --deserialize` to either wait indefinitely or immediately fail if the status serialization cache file is stale. Signed-off-by: Jeff Hostetler <[email protected]>
It took this developer more than a moment to verify that was_dirty() really returns 0 (i.e. "false") if the file was not even tracked. In other words, the `dirty` variable that was initialized to 1 (i.e. "true") and then negated to be returned was not helping readability. The same holds for the final return: rather than assigning the value to return to `dirty` and then *immediately* returning that, we can simplify it to a single statement.
It took this developer quite a good while to understand why the current code cannot get a `NULL` returned by `index_file_exists()`. To un-confuse readers (and future-proof the code), let's just be safe and check before we dereference the returned pointer.
The idea of the virtual file system really is to tell Git to avoid accessing certain paths. This fixes the case where a given path is not yet included in the virtual file system and we are about to write a conflicted version of it.
The vfs does not correctly handle the case when there is a file that begins with the same prefix as a directory. For example, the following setup encounters this issue: a directory contains a file named `dir1.sln` and a directory named `dir1/`; the directory `dir1` contains other files; and the directory `dir1` is in the virtual file system list. The contents of `dir1` should be in the virtual file system, but they are not: the entries in that directory do not have the skip-worktree bit cleared as expected. The problem is in the `apply_virtualfilesystem(...)` function, which does not include the trailing slash of the directory name when looking up the position in the index at which to start clearing the skip-worktree bit. The fix is to include the trailing slash when finding the first index entry from `index_name_pos(...)`.
With the "--untracked-files=complete" option status computes a superset of the untracked files. We use this when writing the status cache. If subsequent deserialize commands ask for either the complete set or one of the "no", "normal", or "all" subsets, it can still use the cache file because of filtering in the deserialize parser. When running status with the "-uno" option, the long format status would print a "(use -u to show untracked files)" hint. When deserializing with the "-uno" option and using a cache computed with "-ucomplete", the "nothing to commit, working tree clean" message would be printed instead of the hint. It was easy to miss because the correct hint message was printed if the cache was rejected for any reason (and status did the full fallback). The "struct wt_status des" structure was initialized with the content of the status cache (and thus defaulted to "complete"). This change sets "des.show_untracked_files" to the requested subset from the command-line or config. This allows the long format to print the hint. Signed-off-by: Jeff Hostetler <[email protected]>
When our patches to support that hook were upstreamed, the hook's name was eliciting some reviewer suggestions, and it was renamed to `post-index-change`. These patches (with the new name) made it into v2.22.0. However, VFSforGit users may very well have checkouts with that hook installed under the original name. To support this, let's just introduce a hack where we look a bit more closely when we just failed to find the `post-index-change` hook, and allow any `post-indexchanged` hook to run instead (if it exists).
Add trace2 region around read_object_process to collect time spent waiting for missing objects to be dynamically fetched. Signed-off-by: Jeff Hostetler <[email protected]>
Add trace2 region and data events describing attempts to deserialize status data using a status cache. A category:status, label:deserialize region is pushed around the deserialize code. Deserialization results when reading from a file are: category:status, path = <path> category:status, polled = <number_of_attempts> category:status, result = "ok" | "reject" When reading from STDIN are: category:status, path = "STDIN" category:status, result = "ok" | "reject" Status will fallback and run a normal status scan when a "reject" is reported (unless "--deserialize-wait=fail"). If "ok" is reported, status was able to use the status cache and avoid scanning the workdir. Additionally, a cmd_mode is emitted for each step: collection, deserialization, and serialization. For example, if deserialization is attempted and fails and status falls back to actually computing the status, a cmd_mode message containing "deserialize" is issued and then a cmd_mode for "collect" is issued. Also, if deserialization fails, a data message containing the rejection reason is emitted. Signed-off-by: Jeff Hostetler <[email protected]>
Add trace information around status serialization. Signed-off-by: Jeff Hostetler <[email protected]>
Report virtual filesystem summary data. Signed-off-by: Jeff Hostetler <[email protected]>
sparse-checkout: avoid crash when switching between cone and non-cone
Update the `git archive` tree-ish argument from `HEAD^{tree}` to `HEAD`. By using a commit (rather than tree) reference, the commit hash will be stored as an extended pax header, extractable via `git get-tar-commit-id`. The intended use-case for this change is building `git` from the output of `make dist` - in combination with the ability to specify a fallback `GIT_BUILT_FROM_COMMIT`, a user can extract the commit ID used to build the archive and set it as `GIT_BUILT_FROM_COMMIT`. The result is fully-populated information for the commit hash in `git version --build-options`. Signed-off-by: Victoria Dye <[email protected]>
The 'git show' command can take an input to request the state of an object in the index. This can lead to parsing the index in order to load a specific file entry. Without the change presented here, a sparse index would expand to a full one, taking much longer than usual to access a simple file. There is one behavioral change that happens here, though: we now can find a sparse directory entry within the index! Commands that previously failed because we could not find an entry in the worktree or index now succeed because we _do_ find an entry in the index. There might be more work to do to make other situations succeed when looking for an indexed tree, perhaps by looking at or updating the cache-tree extension as needed. These situations include having a full index or asking for a directory that is within the sparse-checkout cone (and hence is not a sparse directory entry in the index). Signed-off-by: Derrick Stolee <[email protected]>
…s, fix loose-objects task The maintenance.lock file exists to prevent concurrent maintenance processes from writing to a repository at the same time. However, it has the downside of causing maintenance to start failing without recovery if there is any reason why the maintenance command failed without cleaning up the lock-file. This change makes it such that maintenance will delete a lock file that was modified over 6 hours ago. This will auto-heal repositories that are stuck with failed maintenance (and maybe it will fail again, but we will get a message other than the lock file exists). The intention here is to get users out of a bad state without weakening the maintenance lock _too_ much. I'm open to other thoughts, including expanding the timeframe from 6 to 24 hours. --- The `fixup!` commit fixes the use of the `gvfs.sharedcache` config value around the loose-objects task. Specifically, the loose-objects step might fail because it is calling `git pack-objects {garbage}/pack/loose` which fails.
Set the `GIT_BUILT_FROM_COMMIT` based on the version specified in the `make dist` output archive header. This ensures the commit hash is shown in `git version --build-options`. Signed-off-by: Victoria Dye <[email protected]>
…how` As we roll out the sparse index feature, here is a command that I see being used more frequently than other non-integrated commands. (For example, `git mv` is used much less frequently.) Since the index expansion only happens when using `git show :<path>` to indicate that we are looking for a cached value, this is very likely being used by a tool and not directly by users. However, the performance slowdown is around 10x (0.4s to 4.1s at p50). The change here is simple, and we can catch up the performance in the next release. There is a slightly odd behavior change when asking for a directory that now "works" but did not work at all before. See the test change in the second commit for details.
Fixes for MacOS release build & build options
The link was broken because of an extra dot.
We have been using the default issue template from git-for-windows/git, but we should ask different questions than Git for Windows. Update the issue template to ask these helpful questions. Signed-off-by: Derrick Stolee <[email protected]>
We have long inherited the pull request template from git-for-windows/git, but we should probably do a better job of specifying the need for why a PR in microsoft/git exists instead of an upstream contribution. Signed-off-by: Derrick Stolee <[email protected]>
This fork inherited the issue and pull request templates from `git-for-windows/git`, but `microsoft/git` could use different templates to help external contributors.
During the 2.35.0 rebase, we ejected 570f64b (Fix reset when using the sparse-checkout feature., 2017-03-15) because of a similar change upstream that actually works with the expected behavior of sparse-checkout. That commit only ever existed in microsoft/git, but when it was considered for upstream we realized that it behaved strangely for a sparse-checkout scenario. The root problem is that during a mixed reset, 'git reset <commit>' updates the index to agree with <commit> but leaves the worktree the same as it was before. The issue with sparse-checkout is that some files might not be in the worktree and thus the information from those files would be "lost". The upstream decision was to leave these files as ignored, because that's what the SKIP_WORKTREE bit means: don't put these files in the worktree and ignore their contents. If there already were files in the worktree, then Git does not change them. The case for "losing" data is if a committed change outside of the sparse-checkout was in the previous HEAD position. However, this information could be recovered from the reflog. The case where this is different is in a virtualized filesystem. The virtualization is projecting the index contents onto the filesystem, so we need to do something different here. In a virtual environment, every file is considered "important" and we abuse the SKIP_WORKTREE bit to indicate that Git does not need to process a projected file. When a file is populated, the virtual filesystem hook provides the information for removing the SKIP_WORKTREE bit. In the case of these mixed resets, we have the issue where we change the projection of the worktree for these cache entries that change. If a file is populated in the worktree, then the populated file will persist and appear in a follow-up 'git status'. However, if the file is not populated and only projected, we change the projection from the current value to the new value, leaving a clean 'git status'.
The previous version of this commit includes a call to checkout_entry(), which populates the file. This causes the file to be actually in the working tree and no longer projected. To make this work with the upstream changes, stop setting the skip-worktree bit for the new cache entry. This seemed to work fine without this change, but it's likely due to some indirection with the virtual filesystem. Better to do the best-possible thing here so we don't hide a corner-case bug by accident. Helped-by: Victoria Dye <[email protected]> Signed-off-by: Kevin Willford <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
…irtual filesystem This replaces microsoft#493 (can't reopen a PR after a force-push...). I updated this commit with a more firm version of the fix. This hopefully answers Victoria's excellent concerns with the previous approach. I did not manage to get an automated test for this, but I did carefully verify this manually with a few commits in a VFS for Git enlistment (with different files every time). I updated the commit message with more details about why this works. --- This fork contains changes specific to monorepo scenarios. If you are an external contributor, then please detail your reason for submitting to this fork: * [X] This change only applies to the virtualization hook and VFS for Git. Resolves microsoft#490.
Use https
Thank you for working on this!
As mentioned in my comment, we need to identify the call chain where gvfs/config is accessed in `scalar clone`, suppress the warning only in that scenario, and keep complaining about 404s in all other instances.

BTW could you please re-target this PR to vfs-2.36.1? This should also address the failures we see, as that target branch seems to have no problems.
@@ -3600,7 +3600,7 @@ static enum gh__error_code do_sub_cmd__config(int argc, const char **argv)

 	if (ec == GH__ERROR_CODE__OK)
 		printf("%s\n", config_data.buf);
-	else
+	else if (ec != GH__ERROR_CODE__HTTP_404)
 		error("config: %s", status.error_message.buf);
We have to add a flag to allow us to suppress 404 error message, and then pass that flag only from the caller where a 404 should be handled silently.
@@ -3629,7 +3629,7 @@ static enum gh__error_code do_sub_cmd__endpoint(int argc, const char **argv)

 	if (ec == GH__ERROR_CODE__OK)
 		printf("%s\n", data.buf);
-	else
+	else if (ec != GH__ERROR_CODE__HTTP_404)
In most cases, we do care about 404s. It's just when we call `scalar clone` and target a remote repository that isn't on Azure Repos that we want to be silent about it.

Therefore, I think, we need to identify the exact call chain that leads to said unwanted error message, and ensure that the error message is suppressed only in that call chain.
> In most cases, we do care about 404s.

That does sound quite obvious when you put it like that.

> exact call chain that leads to said unwanted error message

I'm going to look for the caller in `scalar clone` …

> ensure that the error message is suppressed only in that call chain.

… and then try to pass in that flag when we want to be silent about it.

> target a remote repository that isn't on Azure Repos

How would we detect that it's trying to access gvfs/config?
Update:

Line 663 in 34f7708:

	0 : error(_("Could not access gvfs/config endpoint"));

I'm pretty sure this is the culprit.
Lines 615 to 625 in 34f7708:

static int supports_gvfs_protocol(const char *url, char **cache_server_url)
{
	struct child_process cp = CHILD_PROCESS_INIT;
	struct strbuf out = STRBUF_INIT;

	/*
	 * The GVFS protocol is only supported via https://; For testing, we
	 * also allow http://.
	 */
	if (!can_url_support_gvfs(url))
		return 0;
Checking out that function:
Lines 602 to 607 in 34f7708:

static int can_url_support_gvfs(const char *url)
{
	return starts_with(url, "https://") ||
		(git_env_bool("GIT_TEST_ALLOW_GVFS_VIA_HTTP", 0) &&
		 starts_with(url, "http://"));
}
How do you suggest modifying this?
Update:

I've noticed that Azure Repos URLs either start with "dev.azure.com" or have ".visualstudio.com" in the domain. I think `starts_with(url, "https://")` is too generous. We could restrict it to `starts_with(url, "https://dev.azure.com")`, but that would miss the ones with ".visualstudio.com". So we could probably split the URL on `/`, so that the following example URL

https://o365exchange.visualstudio.com/O365%20Core/_git/Griffin

would be split into

["https:", "", "o365exchange.visualstudio.com", "O365%20Core", "_git", "Griffin"]

Then we could OR in a third condition that checks whether the third entry of that array ends with "visualstudio.com". But I'm wondering if there's a simpler way to parse for that. If we split the same URL on `.`, we get:

["https://o365exchange", "visualstudio", "com/O365%20Core/_git/Griffin"]

This seems a lot simpler to me: just check if arr[1] == "visualstudio".
You are totally correct: Since nobody but Azure Repos will even implement the GVFS protocol on the server side (the core Git project rejected the VFS for Git idea and we have partial clones instead, which in turn probably won't be supported by Azure Repos), we can totally hard-code the respective host name requirement.
Instead of splitting and exact string comparisons, we could also simply use a regular expression. And we could first strip the protocol so that it is easier to validate the host name. I have something like this in mind:
static int can_url_support_gvfs(const char *url)
{
	regex_t vs;
	int ret;

	if (!skip_prefix(url, "https://", &url) &&
	    !(git_env_bool("GIT_TEST_ALLOW_GVFS_VIA_HTTP", 0) &&
	      skip_prefix(url, "http://", &url)))
		return 0;

	/*
	 * Only Azure Repos supports the `gvfs/config` endpoint.
	 *
	 * The URLs are of the form https://dev.azure.com/... or
	 * https://<account>.visualstudio.com/...
	 */
	if (starts_with(url, "dev.azure.com/"))
		return 1;

	if (regcomp(&vs, "^[^/]*\\.visualstudio\\.com/", REG_NOSUB))
		BUG("could not compile `.visualstudio.com` regex");
	ret = regexec(&vs, url, 0, NULL, 0) != REG_NOMATCH;
	regfree(&vs);
	return ret;
}
Wait a minute... regarding hard-coding URL requirements, there is a problem with on-prem solutions: Their URLs will not have dev.azure.com nor *.visualstudio.com domains.
We cannot hard-code the domain name because BitBucket Server has a plug-in to support the GVFS Protocol. (In addition to on-prem Azure DevOps Server instances, yes.)
Okay. So what we need to do is to start a `git clone` in gdb, set a breakpoint on the code producing the error message, then look at the backtrace (command: `bt`) to identify the call chain, then add a flag to silence this warning if called via the `clone` call chain.
I've changed the base branch. I might try doing that if the bot still complains about failures.

Yes, please rebase your branch. We are also going to release our rebase soon.

Ack. I'll rebase.

@abdurraheemali: We have completed our updates to Git 2.37.0. Please rebase your change.

Ack. I'll also try to get the call chain today.

Update: was too busy today to start, unfortunately. Trying again tomorrow.

I can put a few hours into it now! Struggling with the general GitHub workflow since I'm more used to ADO. Fumbling around in that fun way you do when trying to do something new for the first time, but being patient with myself. I used Codespaces to make this PR, but you can't really debug with it, so I cloned the repo locally. But I can't push (403 Forbidden) since I don't have the authority to create new branches. So I went over to my fork https://github.com/abdurraheemali/git, but then the "fetch upstream" is grayed out. That happens because the vfs-2.35.1 branch doesn't have any commits, but I can't use the UX itself to select vfs-2.37.1 since that branch didn't exist when I made my fork two months ago. The most expedient solution would be to delete my fork and create a new one, but that seems hacky. Thankfully, Stack Overflow to the rescue: https://stackoverflow.com/questions/33488085/how-to-get-a-new-branch-in-my-fork-from-the-original-repository. My local clone is from microsoft/git, so I'll delete that source code and clone from abdurraheemali/git, and then execute the steps from that link to get the new branch into my fork. Then I'll rebase the commits from the old branch onto the new one.

A few inches of progress before bed. Decided to cherry-pick the commit to my branch and start a new PR #516.
Meant to address #384
Thanks for taking the time to contribute to Git!

This fork contains changes specific to monorepo scenarios. If you are an external contributor, then please detail your reason for submitting to this fork: GVFS Protocol.