Requesting Temporary ignoring of default.go oci-runtime-validation test #721

YJDoc2 · 2022-02-19T19:49:34Z

After two weeks of testing and a lot of digging of git commit history, I think I have finally found what has been causing CI failure for #694 's OCI validation test (there are some other failing tests, but this is not related to them).

The surprising reason behind that test failing is a knock-on effect of the way -

Our CI is set up (I have partial blame in this 😓 )
We recently adding these capabilities
runtime-tools not validating there tests on runc (?)

If you see in commet I have mentioned that the CI is faioing on youki for these three capabilities, although if you check the integration test CI of any other PR there is no issue. The reason behind this is :

We added caching mechanism to the CI for integration tests on 7th Sep Here. This caches the build artifacts of the runtime-tests builds, which is the compiled tests, and has a very good reason of doing so, as without it the CI takes 7-8 mins and with caching it takes about 3-4 mins. The downside is that the tests are not built each time, and we always use the version that was cached on 7th Sep.
runtime-tools itself updated one of their dependencies go-capabilities on 26th of October here
The go-capabilities itself was updated on Aug 5 here to include the three new capabilities, and thus the runtime tools itself as well inherited these updates.
runc released version 1.1 here on 18th Jan , which included the support for these capabilities. I think we also have only recently implemented these as well.
As our CI was using the cached version of tests, they did not check for these and thus tests were passed. But when I moved the git module in the refactor, the cache was invalidated, and youki was tested with new version and failed the tests.
The probable reason is that the CI runner's kernel might not be updated or compiled with correct flags or has permissions to set these capabilities, so runc cannot use them OR the oci-tests has bug in it, (as far as I checked the do not validate the tests on runc or some other runtime)

It is too late here, so I'm not sure what approach might be best, my suggestion is to either comment the default.go test and open an issue if not yet opened on oci-runtime-tools

I do not have runc 1.1 yet, so cannot verify if the tests are also failing on it as well , but I would guess that it is so. I will try to get the 1.1 and check on it if I can.

@Furisto @utam0k Please take a look. Thank you.

utam0k · 2022-02-20T05:38:19Z

@YJDoc2
um... we got into trouble 😭
First of all thanks for the awesome research. If it was for you, I probably wouldn't have noticed it.
I don't want to remove as many tests as possible because default includes many tests. it gets a little weird, but how about hacking from stdout only when default.go run? e.g. ignore only the relevant test cans in default.go.

YJDoc2 · 2022-02-20T08:39:04Z

Hey @utam0k
I agree that removing the test should be the very last option, as we have already commented few tests as they do not work, and removing tests will make more bugs slip through.

Currently my suggested approach is this :

Build runc 1.1 locally and run oci tests on it
If runc also fail the tests, then the tests themselves need to update :
then I will open an issue on the runtime-tools repo notifying this, and wait for few days to see if they reply.
If they do, then we can wait till they update, and then use that new version for us, till then we can keep using the cached version, as even though capabilities tests are old, other seems to have no major changes since our cache, and are currently working
In case they do not reply, or won't be fixing the issue in near future, we can work on the way you suggested, and use grep to ignore these three specific capabilities related test failures, until we find some other good way to deal with it.
In case runc does pass the tests, that would mean that it is our implementation that has some errors, and then we can fix them on our side, which would mean that the CI is not broken

I'll update here once I have ran the runc check.

@utam0k What do you think?

YJDoc2 · 2022-02-20T08:51:46Z

Hmm I cloned and built runc from main branch, and it seems that it is passing the default.go tests, but failing the delete.go tests 🤔 🤔 🤔
That should not be happening right? If you get time, can you also get runc 1.1 and verify this on your end as well?

If that is right, it would mean that is is some bug from our end, likely that we need to check the kernel version and enable the BPF and other features according to that.

~~EDIT : ok, I ran the tests on my local version youki once again to make sure that I was getting error here as well, but~~
~~apparently it has the same behavior, where default.go is passing but delete.go is failing, 😭 not sure why, investigating.~~

Update : So I have run this several times, and with different combinations of options, and what I'm almost certain of is :

runc 1.1 / 1.0 both are passing default.go, but failing delete.go for some reasons (likely due to some setup issue of my system?)
youki is failing default.go

YJDoc2 · 2022-02-20T10:19:06Z

I think I figured out what is causing the issue, and I think solution is simple if I understand it correctly :
As far as I know, we currently support CAP_PERFMON, CAP_CHECKPOINT_RESTORE, and CAP_BPF but in the capabilities code, we still ignore them when setting capabilities, Here : https://github.com/containers/youki/blob/main/crates/libcontainer/src/syscall/linux.rs#L136-L140

That is why I think even though youki reports that they are supported, they are ignored above when setting capabilities, and thus the error occurs.

I tried commenting that part, and unconditionally dropping the cap (the _ arm), and that seemed to pass the default and delete tests as well.

@Furisto if possible can you confirm what I am thinking is correct or not? This seemed to slip by originally due to the CI cahcing I have mentioned.

@utam0k What do you think?

Furisto · 2022-02-20T18:52:52Z

@YJDoc2 Makes sense to me. We should support these capabilities in Youki as well.

(see youki-dev#721)

utam0k · 2022-02-23T04:54:51Z

@YJDoc2 Sounds good to me to support them. But If we will support these capabilities, we have to check the linux kernel version. That's the only thing I want to pay attention to.

yihuaf · 2023-07-21T05:05:29Z

@YJDoc2 is there still pending action items for this issue?

YJDoc2 · 2023-07-21T12:45:40Z

No I think everything mentioned here was covered in some or other PR/issue, so we can close this one out. Thanks for pinging!

YJDoc2 added a commit to YJDoc2/youki that referenced this issue Feb 21, 2022

update CI to not cache oci tests

e46d0f2

(see youki-dev#721)

YJDoc2 mentioned this issue Feb 21, 2022

Remove caching of OCI tests in CI #727

Merged

YJDoc2 closed this as completed Jul 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Requesting Temporary ignoring of default.go oci-runtime-validation test #721

Requesting Temporary ignoring of default.go oci-runtime-validation test #721

YJDoc2 commented Feb 19, 2022

utam0k commented Feb 20, 2022 •

edited

Loading

YJDoc2 commented Feb 20, 2022

YJDoc2 commented Feb 20, 2022 •

edited

Loading

YJDoc2 commented Feb 20, 2022

Furisto commented Feb 20, 2022

utam0k commented Feb 23, 2022

yihuaf commented Jul 21, 2023

YJDoc2 commented Jul 21, 2023

Requesting Temporary ignoring of default.go oci-runtime-validation test #721

Requesting Temporary ignoring of default.go oci-runtime-validation test #721

Comments

YJDoc2 commented Feb 19, 2022

utam0k commented Feb 20, 2022 • edited Loading

YJDoc2 commented Feb 20, 2022

YJDoc2 commented Feb 20, 2022 • edited Loading

YJDoc2 commented Feb 20, 2022

Furisto commented Feb 20, 2022

utam0k commented Feb 23, 2022

yihuaf commented Jul 21, 2023

YJDoc2 commented Jul 21, 2023

utam0k commented Feb 20, 2022 •

edited

Loading

YJDoc2 commented Feb 20, 2022 •

edited

Loading