
snapshotter: use syncfs system call #2816

Merged 1 commit on Oct 30, 2023

Conversation

@zhouhaibing089 (Contributor) commented Oct 26, 2023

The sync system call triggers a full page-cache writeback, which may not always work, especially in a Kubernetes environment where it is easy to be interfered with by other workloads. I have seen several cases where a broken NFS mount blocks kaniko from doing its job.

With syncfs, only the cache for the filesystem kaniko is actually using is written back to disk, which should be more reliable.

Here are the details I captured recently, from the kaniko executor logs:

/workspace # executor --no-push
INFO[0000] Retrieving image manifest ubuntu:22.04
INFO[0000] Retrieving image ubuntu:22.04 from registry index.docker.io
INFO[0001] Built cross stage deps: map[]
INFO[0001] Retrieving image manifest ubuntu:22.04
INFO[0001] Returning cached image manifest
INFO[0001] Executing 0 build triggers
INFO[0001] Building stage 'ubuntu:22.04' [idx: '0', base-idx: '-1']
INFO[0001] Unpacking rootfs as cmd RUN apt-get update requires it.
INFO[0003] RUN apt-get update
INFO[0003] Initializing snapshotter ...
INFO[0003] Taking snapshot of full filesystem...

It hangs there forever at "Taking snapshot of full filesystem" (see #1333 as well).

I looked at the stack of the kaniko process and saw this:

root@<my-host-name>:/proc/180716/task# cat 182585/stack
[<0>] sync_inodes_sb+0x10b/0x2b0
[<0>] sync_inodes_one_sb+0x15/0x20
[<0>] iterate_supers+0xa0/0x100
[<0>] ksys_sync+0x42/0xb0
[<0>] __do_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x59/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

where 180716 is the PID of the kaniko executor, and one of its subtasks is waiting in ksys_sync forever.

At that point, running sync (the command) on the host also hung forever. We checked further and identified that the system call was waiting for one of the NFS mounts to flush its superblock (which, due to connectivity issues, could not complete).

root@<my-host-name>:/# dmesg -T | grep nfs
[Thu Oct 26 04:05:18 2023] FS-Cache: Netfs 'nfs' registered for caching
[Thu Oct 26 04:12:00 2023]  ? nfs_ctx_key_to_expire+0x69/0xf0 [nfs]
[Thu Oct 26 04:12:00 2023]  nfs_wb_all+0x28/0xf0 [nfs]
[Thu Oct 26 04:12:00 2023]  nfs4_file_flush+0x73/0xb0 [nfsv4]
[Thu Oct 26 04:12:28 2023] nfs: server <nfs-server-host-name> not responding, still trying
[Thu Oct 26 04:12:43 2023] nfs: server <nfs-server-host-name> not responding, still trying
[Thu Oct 26 04:14:00 2023]  ? nfs_ctx_key_to_expire+0x69/0xf0 [nfs]
[Thu Oct 26 04:14:00 2023]  nfs_wb_all+0x28/0xf0 [nfs]
[Thu Oct 26 04:14:00 2023]  nfs4_file_flush+0x73/0xb0 [nfsv4]

While this is technically not related to kaniko, switching to syncfs with an fd from the current directory syncs only the filesystem that holds that fd, so it is unaffected by external filesystems.

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

  • Includes unit tests
  • Adds integration tests if needed.

See the contribution guide for more details.

Reviewer Notes

  • The code flow looks good.
  • Unit tests and/or integration tests added.

@JeromeJu JeromeJu self-assigned this Oct 30, 2023
@aaron-prindle (Collaborator)

Thanks for the PR here @zhouhaibing089, appreciate the fix! Left one nit comment; once that is addressed we can get this merged.

@zhouhaibing089 zhouhaibing089 force-pushed the sync-fs branch 4 times, most recently from 71c6800 to 2f5ac5d Compare October 30, 2023 22:20

@aaron-prindle aaron-prindle left a comment


LGTM! Thanks @zhouhaibing089!

@aaron-prindle aaron-prindle merged commit e65bce1 into GoogleContainerTools:main Oct 30, 2023
10 checks passed
@zhouhaibing089 zhouhaibing089 deleted the sync-fs branch October 31, 2023 16:27
@rayunstop

Great job, this helps a lot. Thanks.
