Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shim occasionally crashes in _relocate() on AArch64 #371

Open
dannf opened this issue May 13, 2021 · 23 comments
Open

shim occasionally crashes in _relocate() on AArch64 #371

dannf opened this issue May 13, 2021 · 23 comments

Comments

@dannf
Copy link

dannf commented May 13, 2021

I've been trying to track down an issue with Ubuntu's shim, that I've now found is reproducible with latest upstream shim (as of commit 4068fd4) as well.

If I put a guest in a reboot loop, it will eventually crash:

BdsDxe: starting Boot0003 "ubuntu" from HD(15,GPT,F3395D88-1F07-48B3-AF35-4BF4BC88021F,0x800,0x31801)/\EFI\ubuntu\shimaa64.efi


Synchronous Exception at 0x00000000BB993498


Synchronous Exception at 0x00000000BB993498
PC 0x0000BB993498
PC 0x0000BB92F024
PC 0x0000BF56D8A4 (0x0000BF566000+0x000078A4) [ 1] DxeCore.dll
[...]

After it has crashed once, it will enter a watchdog reset/shim crash infinite loop.

I attached gdb to the QEMU process, and I'm pretty confident that it is crashing in _relocate() - specifically here:

			case R_AARCH64_RELATIVE:
				addr = (unsigned long *)
					(ldbase + rel->r_offset); 
				*addr = ldbase + rel->r_addend; <<< HERE
				break;

			default:

Object file layouts are out of my wheelhouse, but I did try to see if I could understand what is going on. First off, it seems odd that _relocate() is finding any relocations to apply in the first place. If I understand this correctly, this is just a gnu-efi's generic loader for EFI binaries, and executables shouldn't have relocations. Indeed, objdump shows no relocations in the ELF shim binary (I don't know how to scan the PE/COFF one).

The faulting address appears to be 0x00000000BB9752C8, which should fall within the .text section. If I understand this correctly, it's setting up the attributes for the text and data sections:

SetUefiImageMemoryAttributes - 0x00000000BB92E000 - 0x0000000000001000 (0x0000000000004008)
SetUefiImageMemoryAttributes - 0x00000000BB92F000 - 0x0000000000065000 (0x0000000000020008)
SetUefiImageMemoryAttributes - 0x00000000BB994000 - 0x0000000000065000 (0x0000000000004008)

The text section would be the 2nd range there, which has the READONLY attribute set. So, that explains why we're getting the exception. But if relocations need to occur, seems like this would need to be writeable though, again, it doesn't seem like an executable would need relocations...

@dannf
Copy link
Author

dannf commented May 13, 2021

console log of a linux reboot -> no crash -> linux reboot -> crash -> watchdog reboot -> crash -> watchdog reboot -> crash

@lcp
Copy link
Collaborator

lcp commented Jun 9, 2021

I also experienced the crash of AArch64 build. Based on your finding, I looked into crt0-efi-aarch64.S which invokes _relocate():
https://github.com/rhboot/gnu-efi/blob/f0f98248649b4b219764bd46854697bcec858081/gnuefi/crt0-efi-aarch64.S#L149-L163

When checking _DYNAMIC, it's mentioned in the gnuefi document:
https://github.com/rhboot/gnu-efi/blob/f0f98248649b4b219764bd46854697bcec858081/README.gnuefi#L300-L357

Per the document, _DYNAMIC is supposed to be the offset to .dynamic. Unlike x86_64, objcopy doesn't support AArch64 EFI, so those pe-coff section header are crafted manually in crt0-efi-aarch64.S, and .dynamic was merged into .data. To track _DYNAMIC, I first checked objdump -x shimaa64.so | grep _DYNAMIC:

000000000007a910 l O *ABS* 0000000000000000 _DYNAMIC

and then dumped 0x07a910 of shimaa64.so:

0007a910  ff 00 00 00 0b 00 00 00  44 20 0a 00 00 00 00 00  |........D ......|
0007a920  00 00 00 00 00 00 00 00  58 48 0a 00 00 00 00 00  |........XH......|
0007a930  58 48 0a 00 00 00 00 00  00 01 00 00 0b 00 00 00  |XH..............|
0007a940  4f 20 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |O ..............|
0007a950  76 52 0a 00 00 00 00 00  79 48 0a 00 00 00 00 00  |vR......yH......|
0007a960  01 01 00 00 08 00 00 00  5a 20 0a 00 00 00 00 00  |........Z ......|
0007a970  00 00 00 00 00 00 00 00  7d 48 0a 00 00 00 00 00  |........}H......|

On the other hand, the same content in shimaa64.efi was in a different offset:

0006a910  ff 00 00 00 0b 00 00 00  44 20 0a 00 00 00 00 00  |........D ......|
0006a920  00 00 00 00 00 00 00 00  58 48 0a 00 00 00 00 00  |........XH......|
0006a930  58 48 0a 00 00 00 00 00  00 01 00 00 0b 00 00 00  |XH..............|
0006a940  4f 20 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |O ..............|
0006a950  76 52 0a 00 00 00 00 00  79 48 0a 00 00 00 00 00  |vR......yH......|
0006a960  01 01 00 00 08 00 00 00  5a 20 0a 00 00 00 00 00  |........Z ......|
0006a970  00 00 00 00 00 00 00 00  7d 48 0a 00 00 00 00 00  |........}H......|

But, there is no 10 a9 06 00 00 00 00 00 in shimaa64.efi but 10 a9 07 00 00 00 00 00.

0007a200  00 00 00 00 00 00 00 00  10 a9 07 00 00 00 00 00  |................|
0007a210  e0 37 07 00 00 00 00 00  70 57 07 00 00 00 00 00  |.7......pW......|
0007a220  b8 22 07 00 00 00 00 00  58 89 07 00 00 00 00 00  |."......X.......|
0007a230  00 c0 07 00 00 00 00 00  b0 61 07 00 00 00 00 00  |.........a......|
0007a240  d8 21 07 00 00 00 00 00  48 b7 05 00 00 00 00 00  |.!......H.......|
0007a250  48 5f 07 00 00 00 00 00  28 60 07 00 00 00 00 00  |H_......(`......|
0007a260  60 70 06 00 00 00 00 00  08 56 07 00 00 00 00 00  |`p.......V......|

The same content in shimaa64.so:

0008a200  00 00 00 00 00 00 00 00  10 a9 07 00 00 00 00 00  |................|
0008a210  e0 37 07 00 00 00 00 00  70 57 07 00 00 00 00 00  |.7......pW......|
0008a220  b8 22 07 00 00 00 00 00  58 89 07 00 00 00 00 00  |."......X.......|
0008a230  00 c0 07 00 00 00 00 00  b0 61 07 00 00 00 00 00  |.........a......|
0008a240  d8 21 07 00 00 00 00 00  48 b7 05 00 00 00 00 00  |.!......H.......|
0008a250  48 5f 07 00 00 00 00 00  28 60 07 00 00 00 00 00  |H_......(`......|
0008a260  60 70 06 00 00 00 00 00  08 56 07 00 00 00 00 00  |`p.......V......|

So it seems there is a gap of 0x10000 in shimaa64.efi, and maybe this caused the crash.

@lcp
Copy link
Collaborator

lcp commented Jun 9, 2021

The first 0x10000 data of shimaa64.so is the elf header and we use objcopy -O binary to strip it. This explains the 0x10000 gap between shimaa64.so and shimaa64.efi.

@dannf
Copy link
Author

dannf commented Jun 9, 2021

Nice find @lcp . I had added some debug Print() statements yesterday to _relocate() and it thinks it is working through a relsz of 0x4f000. If my assumption is correct that there should not /be/ any relocations, then it definitely looks like it is just processing garbage.

@lcp
Copy link
Collaborator

lcp commented Jun 10, 2021

Disassembled _start in shimaa64.so:

0000000000001000 <_start>:
    1000:       a9be7bfd        stp     x29, x30, [sp, #-32]!
    1004:       910003fd        mov     x29, sp
    1008:       a90107e0        stp     x0, x1, [sp, #16]
    100c:       aa0003e2        mov     x2, x0
    1010:       aa0103e3        mov     x3, x1
    1014:       10ff7f60        adr     x0, 0 <ImageBase>
    1018:       b00003c1        adrp    x1, 7a000 <ErrorCodeTable+0x60>
    101c:       91244021        add     x1, x1, #0x910
    1020:       94019311        bl      65c64 <_relocate>
    1024:       b5000060        cbnz    x0, 1030 <_start+0x30>
    1028:       a94107e0        ldp     x0, x1, [sp, #16]
    102c:       94000a38        bl      390c <efi_main>
    1030:       a8c27bfd        ldp     x29, x30, [sp], #32
    1034:       d65f03c0        ret

_DYNAMIC is the offset mentioned at '1018': 0x7a000.
The content of shimaa64.so at 0x7a000+0x1018+0x10000:

0008a010  07 00 00 00 00 00 00 80  50 f8 0a 00 00 00 00 00  |........P.......|
0008a020  08 00 00 00 00 00 00 80  6a f8 0a 00 00 00 00 00  |........j.......|
0008a030  09 00 00 00 00 00 00 80  8a f8 0a 00 00 00 00 00  |................|
0008a040  0a 00 00 00 00 00 00 80  ac f8 0a 00 00 00 00 00  |................|
0008a050  0b 00 00 00 00 00 00 80  ca f8 0a 00 00 00 00 00  |................|
0008a060  0c 00 00 00 00 00 00 80  e2 f8 0a 00 00 00 00 00  |................|
0008a070  0d 00 00 00 00 00 00 80  f4 f8 0a 00 00 00 00 00  |................|
0008a080  0e 00 00 00 00 00 00 80  10 f9 0a 00 00 00 00 00  |................|
0008a090  0f 00 00 00 00 00 00 80  24 f9 0a 00 00 00 00 00  |........$.......|
0008a0a0  10 00 00 00 00 00 00 80  40 f9 0a 00 00 00 00 00  |........@.......|

Disassembled _start in shimaa64.efi:

0000000000001000 .text:
    1000: fd 7b be a9                   stp     x29, x30, [sp, #-32]!
    1004: fd 03 00 91                   mov     x29, sp
    1008: e0 07 01 a9                   stp     x0, x1, [sp, #16]
    100c: e2 03 00 aa                   mov     x2, x0
    1010: e3 03 01 aa                   mov     x3, x1
    1014: 60 7f ff 10                   adr     x0, #-4116
    1018: c1 03 00 b0                   adrp    x1, #495616
    101c: 21 40 24 91                   add     x1, x1, #2320
    1020: 11 93 01 94                   bl      #412740 <.text+0x64c64>
    1024: 60 00 00 b5                   cbnz    x0, #12 <.text+0x30>
    1028: e0 07 41 a9                   ldp     x0, x1, [sp, #16]
    102c: 38 0a 00 94                   bl      #10464 <.text+0x290c>
    1030: fd 7b c2 a8                   ldp     x29, x30, [sp], #32
    1034: c0 03 5f d6                   ret

In this case, the offset to _DYNAMIC is 495616=0x79000.
The content of shimaa64.so at 0x79000+0x1018:

0007a010  07 00 00 00 00 00 00 80  50 f8 0a 00 00 00 00 00  |........P.......|
0007a020  08 00 00 00 00 00 00 80  6a f8 0a 00 00 00 00 00  |........j.......|
0007a030  09 00 00 00 00 00 00 80  8a f8 0a 00 00 00 00 00  |................|
0007a040  0a 00 00 00 00 00 00 80  ac f8 0a 00 00 00 00 00  |................|
0007a050  0b 00 00 00 00 00 00 80  ca f8 0a 00 00 00 00 00  |................|
0007a060  0c 00 00 00 00 00 00 80  e2 f8 0a 00 00 00 00 00  |................|
0007a070  0d 00 00 00 00 00 00 80  f4 f8 0a 00 00 00 00 00  |................|
0007a080  0e 00 00 00 00 00 00 80  10 f9 0a 00 00 00 00 00  |................|
0007a090  0f 00 00 00 00 00 00 80  24 f9 0a 00 00 00 00 00  |........$.......|
0007a0a0  10 00 00 00 00 00 00 80  40 f9 0a 00 00 00 00 00  |........@.......|

So at least the offset to _DYNAMIC is correct.

@lcp
Copy link
Collaborator

lcp commented Jun 10, 2021

Nice find @lcp . I had added some debug Print() statements yesterday to _relocate() and it thinks it is working through a relsz of 0x4f000. If my assumption is correct that there should not /be/ any relocations, then it definitely looks like it is just processing garbage.

Indeed, I also found that a huge relsz (0x4e000) in my AArch64 shim build:

0007a970  07 00 00 00 00 00 00 00  00 d0 07 00 00 00 00 00  |................|
0007a980  08 00 00 00 00 00 00 00  00 e0 04 00 00 00 00 00  |................|
0007a990  09 00 00 00 00 00 00 00  18 00 00 00 00 00 00 00  |................|

DT_RELA: 0x7d000
DT_RELASZ: 0x4e000
DT_RELAENT: 0x18

In my case, rela is 0x7d000 and the total image file size is 0xcb000. relsz is 0x4e000 = 0xcb000-0x7d000, so that means that the relocation table covers the whole image below 0x7d000. Then I checked further:

00099790  03 04 00 00 00 00 00 00  54 88 0a 00 00 00 00 00  |........T.......|
000997a0  80 97 07 00 00 00 00 00  03 04 00 00 00 00 00 00  |................|
000997b0  40 71 07 00 00 00 00 00  53 00 48 00 49 00 4d 00  |@q......S.H.I.M.|
000997c0  5f 00 44 00 45 00 42 00  55 00 47 00 00 00 00 00  |_.D.E.B.U.G.....|
000997d0  61 00 64 00 64 00 2d 00  73 00 79 00 6d 00 62 00  |a.d.d.-.s.y.m.b.|
000997e0  6f 00 6c 00 2d 00 66 00  69 00 6c 00 65 00 20 00  |o.l.-.f.i.l.e. .|

R_AARCH64_RELATIVE is 1027(0x403), so the entries containing 03 04 00 00 00 00 00 00 are supposed to be struct Elf64_Rela. However, there are strings in the image after 0x997b8, and those are clearly not struct Elf64_Rela. So relsz is way too large than expected...

@dannf
Copy link
Author

dannf commented Jun 10, 2021

Keeping in mind I've little experience w/ binary formats, I wonder what it means that readelf claims there is no dynamic section:

$ readelf -a shimaa64.so  2> /dev/null| grep -i DYNAMIC
There is no dynamic section in this file.
  1889: 00000000000798e8     0 OBJECT  LOCAL  DEFAULT  ABS _DYNAMIC

Is it possible that there simply is no .dynamic array, and the value of _DYNAMIC's is undefined? gnu-efi seems to always assume that there will be a .dynamic section - and that it will contain at least a terminating DT_NULL entry, but maybe it shouldn't?

@lcp
Copy link
Collaborator

lcp commented Jun 11, 2021

The dynamic section is merged into the data section:
https://github.com/rhboot/shim/blob/main/elf_aarch64_efi.lds#L30

Ideally, those elf sections would be converted to pe-coff sections by objcopy, but it wasn't implemented for AArch64 yet, so shim currently merges those sections into text, data, or rodata sections and crafts the section headers with assembly.

@dannf
Copy link
Author

dannf commented Jun 11, 2021

Yeah, I tried moving the .dynamic section out of .data and into it's own section in the linker script to see if that would make a .dynamic section appear in the elf .so, but it didn't. That's what made me think it may just not exist. Note that none of the .o files that are being linked together have a .dynamic section, and I wondered if that would make *(.dynamic) NULL.

EDIT: I was doing it wrong, I do see a .dynamic section appear in the elf if I create one at the end.

@lcp
Copy link
Collaborator

lcp commented Jun 16, 2021

I started to realize that the relocation is necessary. For example, a function pointer is stored in .data section and its content points to a function in .text section. When EFI image is loaded, those function pointers needs to be relocated from the original addresses (rel->r_addend) to the actual running addresses (ImageBase + rel->r_addend).
It seems to me that objcopy tends to assign the size of the whole .rodata section to .rela, and this caused _relocate() to go through the whole .rodata section. We have to fix the size of .rela to avoid the potential crash.

Correction: It's ld not objcopy to write the section size.

lcp added a commit to lcp/shim that referenced this issue Jun 16, 2021
The previous commit(*) merged .rel* and .dyn* into .rodata. However,
this makes ld to generate the wrong section size for .rel* to cover
other unrelated sections. This commit moves .rel* and .dyn* out of
.rodata in the ld script. However, since the related variables, such
as _evrodata, _rodata_size, and _rodata_vsize, are moved, so the
crafted pe-coff section header for .rodata still covers our new
.rela and .dyn sections.

(*) 212ba30 ("arm/aa64 targets: put .rel* and .dyn* in .rodata")

Issue: rhboot#371

Signed-off-by: Gary Lin <[email protected]>
@lcp
Copy link
Collaborator

lcp commented Jun 16, 2021

I wrote a testing patch for elf_aarch64_efi.lds:
lcp@686ff99

In the .dynamic section of patched shimaa64.efi:

0007a970  07 00 00 00 00 00 00 00  00 b0 09 00 00 00 00 00  |................|
0007a980  08 00 00 00 00 00 00 00  b8 c7 01 00 00 00 00 00  |................|
0007a990  09 00 00 00 00 00 00 00  18 00 00 00 00 00 00 00  |................|

DT_RELA: 0x9b000
DT_RELASZ: 0x1c7b8
DT_RELAENT: 0x18

The size looks much reasonable now :)

@dannf
Copy link
Author

dannf commented Jun 16, 2021

Oh, nice @lcp. Yeah, with your patch I also see a more reasonable relsz: 0x1c7a0, and it is working for me. I've put my guest in a reboot loop to see if it hits any other issues - at #71 so far w/o a problem.

lcp added a commit to lcp/shim that referenced this issue Jun 17, 2021
The previous commit(*) merged .rel* and .dyn* into .rodata, and this
made ld to generate the wrong size for .rela* sections that covered
other unrelated sections. When the EFI image was loaded, _relocate()
went through the unexpected data and may cause unexpected crash.
This commit moves .rel* and .dyn* out of .rodata in the ld script but
also moves the related variables, such as _evrodata, _rodata_size,
and _rodata_vsize, to the end of the new .dyn section, so that the
crafted pe-coff section header for .rodata still covers our new
.rela and .dyn sections.

(*) 212ba30 ("arm/aa64 targets: put .rel* and .dyn* in .rodata")

Fix issue: rhboot#371

Signed-off-by: Gary Lin <[email protected]>
@dannf
Copy link
Author

dannf commented Jun 17, 2021

Survived > 1600 reboots before I stopped it, so with this arm64 booting seems pretty robust :)

@steve-mcintyre
Copy link
Collaborator

Yay, good news. Although I'm definitely feeling we should be focussing on getting proper toolchain support which would make this moot... :-)

@vielmetti
Copy link

@steve-mcintyre

Is there a concise technical description of the missing pieces in the toolchain? I know some of the Arm folks working in that area and would be happy to convey a detailed request for assistance.

@lcp
Copy link
Collaborator

lcp commented Jun 25, 2021

@vielmetti Since objcopy doesn't support PE/COFF for AArch64, shim/gnu-efi has to craft the PE/COFF header like this:

https://github.com/rhboot/gnu-efi/blob/f0f98248649b4b219764bd46854697bcec858081/gnuefi/crt0-efi-aarch64.S#L19-L146

To make the header work properly, we have to tweak some fields in the section table such as VirtualSize and the variables in the ld script. If objcopy could support pe-aarch64 target, the we can get rid of those workarounds and just convert the elf sections into the result EFI binary.

@lcp
Copy link
Collaborator

lcp commented Jun 25, 2021

The binutils bug entry:
https://sourceware.org/bugzilla/show_bug.cgi?id=26206

@steve-mcintyre
Copy link
Collaborator

Cool, I hadn't found the bug report yet and was just thinking about adding one myself

@steve-mcintyre
Copy link
Collaborator

@steve-mcintyre

Is there a concise technical description of the missing pieces in the toolchain? I know some of the Arm folks working in that area and would be happy to convey a detailed request for assistance.

Hi @vielmetti LTNS! I've just mailed the linaro-toolchain list about this as well, with a reference to the binutils bug that @lcp referenced. Waiting on that showing up in the archives...

@steve-mcintyre
Copy link
Collaborator

@vielmetti
Copy link

Thanks @steve-mcintyre - relayed this to some Arm folks who may be able to get some additional eyes on task.

vathpela pushed a commit that referenced this issue Jul 20, 2021
The previous commit(*) merged .rel* and .dyn* into .rodata, and this
made ld to generate the wrong size for .rela* sections that covered
other unrelated sections. When the EFI image was loaded, _relocate()
went through the unexpected data and may cause unexpected crash.
This commit moves .rel* and .dyn* out of .rodata in the ld script but
also moves the related variables, such as _evrodata, _rodata_size,
and _rodata_vsize, to the end of the new .dyn section, so that the
crafted pe-coff section header for .rodata still covers our new
.rela and .dyn sections.

(*) 212ba30 ("arm/aa64 targets: put .rel* and .dyn* in .rodata")

Fix issue: #371

Signed-off-by: Gary Lin <[email protected]>
@vielmetti
Copy link

@dannf @steve-mcintyre et al - I'm coming back to this problem after about a month away - there has clearly been some progress in resolving this, but I'm not sure where the current discussion lives (or if it's been summarized recently).

My goal is to provide direction for our team that builds Ubuntu and Debian images to make sure they pick up the right versions of the right components so that GRUB works correctly on Ubuntu 20.04 / Debian 10 on arm64 server systems (specifically the Ampere systems at Equinix Metal).

@dannf
Copy link
Author

dannf commented Jul 27, 2021

@dannf @steve-mcintyre et al - I'm coming back to this problem after about a month away - there has clearly been some progress in resolving this, but I'm not sure where the current discussion lives (or if it's been summarized recently).

My goal is to provide direction for our team that builds Ubuntu and Debian images to make sure they pick up the right versions of the right components so that GRUB works correctly on Ubuntu 20.04 / Debian 10 on arm64 server systems (specifically the Ampere systems at Equinix Metal).

For Ubuntu this has been resolved in shim 15.4-0ubuntu7 - see https://launchpad.net/bugs/1928010.

pmhahn pushed a commit to univention/shim that referenced this issue Sep 20, 2021
pmhahn pushed a commit to univention/shim that referenced this issue Feb 10, 2022
The previous commit(*) merged .rel* and .dyn* into .rodata, and this
made ld to generate the wrong size for .rela* sections that covered
other unrelated sections. When the EFI image was loaded, _relocate()
went through the unexpected data and may cause unexpected crash.
This commit moves .rel* and .dyn* out of .rodata in the ld script but
also moves the related variables, such as _evrodata, _rodata_size,
and _rodata_vsize, to the end of the new .dyn section, so that the
crafted pe-coff section header for .rodata still covers our new
.rela and .dyn sections.

(*) 212ba30 ("arm/aa64 targets: put .rel* and .dyn* in .rodata")

Fix issue: rhboot#371

Signed-off-by: Gary Lin <[email protected]>
(cherry picked from commit 34e3ef2)
pmhahn added a commit to univention/shim that referenced this issue Feb 10, 2022
dbnicholson pushed a commit to endlessm/shim that referenced this issue Apr 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants