-
Notifications
You must be signed in to change notification settings - Fork 671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue with Mellanox CX4 seemingly caused by 8b14652 #1091
Comments
The tested revert was specifically fbc3b4a |
This sounds like a possible duplicate of #1048 which should be fixed in current master. Can you verify which commit you have checked out? |
I may not have tested on the latest master. Thank you for the pointer. |
Based on your |
My test fails on the latest commit of master (98dd25a)
|
What is the card connected to, and what do you see on the wire? |
Relates to: tinkerbell#115, ipxe/ipxe#1091 This would update iPXE to the latest master plus a revert of a commit identified to break Mellanox NICs. This is a temporary measure that might need to be repeated on future updates of iPXE until ipxe/ipxe#1091 has been fixed correctly.
Relates to: tinkerbell#115, ipxe/ipxe#1091 This would update iPXE to the latest master plus a revert of a commit identified to break Mellanox NICs. This is a temporary measure that might need to be repeated on future updates of iPXE until ipxe/ipxe#1091 has been fixed correctly. Signed-off-by: Nahum Shalman <[email protected]>
Relates to: tinkerbell#115, ipxe/ipxe#1091 This would update iPXE to the latest master plus a revert of a commit identified to break Mellanox NICs. This is a temporary measure that might need to be repeated on future updates of iPXE until ipxe/ipxe#1091 has been fixed correctly. Signed-off-by: Nahum Shalman <[email protected]>
Relates to: tinkerbell#115, ipxe/ipxe#1091 This would update iPXE to the latest master plus a revert of a commit identified to break Mellanox NICs. This is a temporary measure that might need to be repeated on future updates of iPXE until ipxe/ipxe#1091 has been fixed correctly. Signed-off-by: Nahum Shalman <[email protected]>
Can confirm, that 8b14652 breaks it also for The NICs are connected through 100GBASE-CR4 QSFP28 cables through LAG to the switch. tcpdump done on the switch: Switch A
Switch B:
hope that helps |
## Description This would update iPXE to the latest master plus a revert of a commit identified to break Mellanox NICs. This is a temporary measure that might need to be repeated on future updates of iPXE until ipxe/ipxe#1091 has been fixed correctly. ## Why is this needed Relates to: #115, ipxe/ipxe#1091 ## How Has This Been Tested? I did repeated testing of iPXE binaries from my branches [mainline-ipxe](https://github.com/nshalman/ipxedust/tree/mainline-ipxe) [test-revert](https://github.com/nshalman/ipxedust/tree/test-revert) to confirm that the issue exists and is addressed by the revert. ## How are existing users impacted? What migration steps/scripts do we need? This would unbreak users of certain Mellanox NICs (Definitely Mellanox CX4, possibly others too) ## Checklist: I have: - [ ] updated the documentation and/or roadmap (if required) - [ ] added unit or e2e tests - [ ] provided instructions on how to upgrade
I can also verify that reverting that commit fixes failure to boot with Mellanox CX5 nics as well. In my tests, I'm booting using snponly.efi, FWIW |
I tested the latest version again, as I saw some more commits related to eapol went in earlier today, but this is still broken. It appears that the code in eapol.c, where it says "Ignore non-EAPol devices" isn't ignoring these Mellanox cards, because if I just add another unconditional "return 0;" before the "Initialize structure" comment, then my hosts w/ Mellanox boot interfaces will work. |
Hello, I'm working for a relatively big hosting company and we also noticed that iPXE is broken for a while on Mellanox cards. As an example we have new HP RL300 ARM servers and these chassis have an onboard Mellanox card. This issue is not limited to this specific model, we also have 25GbE+ Mellanox cards that are acting in the same way. We are still on commit |
To add a data point to the reports: We had the same issue with |
Ran into this issue with Mellanox CX5 and CX6, rebasing fbc3b4a this onto main got them booting again. |
@Smithx10 can you try the workaround suggested by @Cornelicorn and report back if it helped as it's a much less invasive workaround to tweak a define than backing the code out entirely. I haven't had a chance to test for myself.
|
From the
I as non mellanox hardware owner, are with the mellanox hardware owners: Somebody else should provide a merge request |
I don't think my revert commit is a good solution. I believe @mcb30 is working on a better solution. I am going to update my description of this bug to include the suggestion that folks attempt #1091 (comment) before patching the source. As I have said before, I haven't had time to test that myself, but it seems very likely to me that it is a much simpler workaround. |
Adding a comment so I can watch. I have a downstream and was planning on updating with |
Resolves tinkerbell#117 Relates to tinkerbell#115 This release includes ipxe/ipxe#1152 which resolves tinkerbell#117 I've also included the workaround for tinkerbell#115 as specified from ipxe/ipxe#1091 (comment)
Resolves tinkerbell#117 Relates to tinkerbell#115 This release includes ipxe/ipxe#1152 which resolves tinkerbell#117 I've also included the workaround for tinkerbell#115 as specified from ipxe/ipxe#1091 (comment) Signed-off-by: Nahum Shalman <[email protected]>
## Description Update to the latest iPXE This release includes ipxe/ipxe#1152 which resolves #117. I've also included the workaround for #115 as specified from ipxe/ipxe#1091 (comment) to keep Mellanox cards working. ## Why is this needed Fixes: #117 Relates to: #115 ## How Has This Been Tested? I am drafting this PR as the first step towards testing. The plan is: - [x] Test on an affected Mellanox card - [x] Confirm that LetsEncrypt is working
Fixing the issue ipxe#1091 Fix Eth inline header size to 14 bytes instead of 18 bytes. because the eapol packet is 18 bytes and all the packet inserted to the inline header, so it appears as empty packet and the driver cant handle it.
Could we get confirmation if this is fixed by the merge of #1174, thanks |
I'm going to ask our ops team to try it out on an affected box. We have, in the interim, removed EAPOL support from Triton's downstream of ipxe, since we don't use it anyway currently. See TritonDataCenter/ipxe#25 . |
Still failed for me when building with this latest change and eapol enabled
again, booting from snponly.
…On Sun, Mar 17, 2024 at 7:18 PM Christian I. Nilsson < ***@***.***> wrote:
Could we get confirmation if this is fixed by the merge of #1174
<#1174>, thanks
—
Reply to this email directly, view it on GitHub
<#1091 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AZIKENFOFU6L32BLNASJWRDYYYQEBAVCNFSM6AAAAABADIPO2WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMBSGY2TGNBZGI>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
If you are using snponly then there's a good chance that the underlying SNP driver provided by Mellanox has the same bug, since Mellanox uses a shared driver codebase for both iPXE and their UEFI SNP driver. There's nothing we can do about the bug being present in the underlying SNP driver. Please try using |
With ipxe.efi this does work for my test host, though we can’t use that
bootloader due to issues that occur when many different nics are installed
in a system.
|
Thank you for testing. Your result indicates that the issue is fixed in iPXE, so I will close this issue now. If you want to continue using snponly.efi, you will need to contact your UEFI BIOS vendor to get a BIOS update that includes the equivalent fix in the BIOS-provided SNP driver. You can also open a separate issue to cover whatever problem you are seeing that prevents you from using ipxe.efi when many different NICs are installed. |
Triton Data Center needs snponly.efi (and undionly.kpxe for BIOS) as well, so our testing would likely fail as well. (To that our, our downstream will maintain excluding EAPOL for now.) |
The 18-byte packet will be zero-padded to 60 bytes on the wire anyway (64 bytes including the Ethernet FCS), since that is the minimum length Ethernet packet. We could possibly work around the underlying SNP driver bug by pointlessly zero-padding the packet to 60 bytes ourselves. That would be sufficient to avoid the underlying bug in the SNP driver (assuming that it is using code identical to that fixed in commit c11734eee). @ech68 could you please retest snponly.efi built from #1177 ? |
On Mon, Mar 18, 2024 at 11:40 AM Michael Brown ***@***.***> wrote:
@ech68 <https://github.com/ech68> could you please retest snponly.efi
built from #1177 <#1177> ?
Same failure mode unfortunately.
|
Thanks for testing. Does |
On Mon, Mar 18, 2024 at 12:41 PM Michael Brown ***@***.***> wrote:
Thanks for testing. Does ifstat report the driver as SNP or NII when you
are using snponly.efi?
NII
… |
Thanks. I've generalised the PR to cover both SNP and NII, and force-pushed PR #1177. Could you please retest with this commit? |
On Mon, Mar 18, 2024 at 1:13 PM Michael Brown ***@***.***> wrote:
Fantastic, thank you! Could you let me know your name and email for the
commit log testing credit?
Eric Hagberg, ***@***.***
… |
I think there may be some kind of automated censorship system at work here. 🙃 |
In https://github.com/ipxe/ipxe/pull/1177/commits is an email address ( |
Merged with the Tested-by credit: thank you! |
Hi, The As per DHCP server logs, the DHCP request is never received (guessing it is never sent out of machine). Setting Would it be possible to have similar fix in Here are the iPXE logs when compiling with
Shall I open a new issue about the UNDI driver case ? |
Your best fix might be to get updated firmware, or use ipxe.pxe instead. |
For an updated workaround suggestion, please see #1091 (comment)
which suggest that you add
#undef NET_PROTO_EAPOL
to your build config.If that works for you, please add an appropriate reaction to the corresponding comment (I recommend 👍)
On the other hand, if you absolutely need EAPOL support and have affected Mellanox cards, please let us know in a reply so that you can help test a better fix in the future.
I was running into weird boot failures on Mellanox Connect-X cards using the latest builds of iPXE including issues using netboot.xyz.
I have bisected the issue down to 8b14652
Reverting that commit seems to fix the issue.
I'm happy to help further debug the issue including providing an environment for testing any proposed fixes.
Output captured from the console of a machine demonstrating the issue
Verbose report from the Dell BIOS
The text was updated successfully, but these errors were encountered: