Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve 32bit ld/st addressing mode propagation #3421

Merged
merged 1 commit into from
Mar 11, 2024

Conversation

pmatos
Copy link
Collaborator

@pmatos pmatos commented Feb 12, 2024

Signed-off-by: Paulo Matos [email protected]

@pmatos pmatos changed the title AddressingModes32 Improve 32bit ld/st addressing mode propagation Feb 12, 2024
@pmatos
Copy link
Collaborator Author

pmatos commented Feb 12, 2024

Still looking to see if I can improve other types of addressing modes, so this is still marked as draft.

@pmatos
Copy link
Collaborator Author

pmatos commented Feb 12, 2024

Was going to say there are no asm_tests failures so it looks like the current change is ok, although I am adding access in 32bits to a block that has been 64 bits only. However, I just noticed glibc tests failed so I will look at those.

@pmatos
Copy link
Collaborator Author

pmatos commented Feb 20, 2024

This is still failing but I am gaining an understanding of the complexities of this optimization. It's currently failing.

My last optimization was a result of fixing this:
lea eax, [0x10000040]
mov eax, [eax-0x40]

which was generating a LoadMem between a register (repr eax) and the two complement of 0x40, but failing to sign extend it.

The current issue is in :

add     dword [eax], edi

Here we are loading eax, adding eax and edi and then storing to eax. But when we create the LoadMem, and simplify it we are sign extending the incorrect base. Unsure if there's a sure-fire way to know which of the registers is the base and which is the offset, or if it makes sense to ask this question even. WIP

@pmatos pmatos marked this pull request as ready for review February 21, 2024 18:05
@pmatos pmatos requested a review from Sonicadvance1 February 21, 2024 18:05
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 21, 2024
Spurred on by FEX-Emu#3421. To ensure that applications don't take advantage of
small address wrap around, allocate the first 4GB in the 64-bit space.

Some context. Linux always reserves the first 16KB of virtual address
space (unless you tinker with some settings which nobody should do).

Example of 32-bit code:
lea eax, [0xffff_0000]
mov ebx, [eax + 0x1_0000]

The address calculated by the mov will wrap around to 0x0 which will
result in SIGSEGV. If FEX messes up zero extensions then it would try to
access 0x1_0000_0000 instead.

This could result in a 32-bit application potentially accessing some FEX
memory instead of crashing.
Add this safety net which will still SIGSEGV and we will be able to see
the crash.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 22, 2024
Spurred on by FEX-Emu#3421

This does a bunch of GPR and vector loads to showcase addressing
limitations between ARM and x86.

Tests:
- 8/16/32/64-bit GPR loads
   - Both 32-bit and 64-bit addressing modes
- 32/64/128-bit Vector loads
   - Both 32-bit and 64-bit addressing modes
- Duplicate the tests for 32-bit addressing mode with a 32-bit process
   - Since it should change behaviour.

Untested:
- 8/16-bit vector loadstores since those don't exist on x86
   - 16-bit x87 integer load exists but that doesn't go through a vector
     load in FEX.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 22, 2024
…2-bit

Spurred on by FEX-Emu#3421. To ensure that applications don't take advantage of
small address wrap around, allocate the second 4GB of virtual memory.

Some context. Linux always reserves the first 16KB of virtual address
space (unless you tinker with some settings which nobody should do).

Example of 32-bit code:
lea eax, [0xffff_0000]
mov ebx, [eax + 0x1_0000]

The address calculated by the mov will wrap around to 0x0 which will
result in SIGSEGV. If FEX messes up zero extensions then it would try to
access 0x1_0000_0000 instead.

This could result in a 32-bit application potentially accessing some FEX
memory instead of crashing.
Add this safety net which will still SIGSEGV and we will be able to see
the crash.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 24, 2024
…2-bit

Spurred on by FEX-Emu#3421. To ensure that applications don't take advantage of
small address wrap around, allocate the second 4GB of virtual memory.

Some context. Linux always reserves the first 16KB of virtual address
space (unless you tinker with some settings which nobody should do).

Example of 32-bit code:
lea eax, [0xffff_0000]
mov ebx, [eax + 0x1_0000]

The address calculated by the mov will wrap around to 0x0 which will
result in SIGSEGV. If FEX messes up zero extensions then it would try to
access 0x1_0000_0000 instead.

This could result in a 32-bit application potentially accessing some FEX
memory instead of crashing.
Add this safety net which will still SIGSEGV and we will be able to see
the crash.
@Sonicadvance1
Copy link
Member

This'll need a rebase so it can pick up the new instcountci changes before it can get merged

@pmatos
Copy link
Collaborator Author

pmatos commented Feb 25, 2024

This'll need a rebase so it can pick up the new instcountci changes before it can get merged

done - thx.

@alyssarosenzweig
Copy link
Collaborator

IsInlineConstant is your new friend

@pmatos
Copy link
Collaborator Author

pmatos commented Feb 25, 2024

IsInlineConstant is your new friend

That's awesome - thanks. Friends are never enough. :)

@pmatos
Copy link
Collaborator Author

pmatos commented Feb 25, 2024

IsInlineConstant is your new friend

That's awesome - thanks. Friends are never enough. :)

Not as straightforward as calling the method. It's in ARM64JITCore which we don't have access to in ConstProp.cpp. It seems to me like this is more of an IR helper function anyway, and indeed it only depends on IR so we probably could put it elsewhere. I will see if I can refactor it.

@alyssarosenzweig
Copy link
Collaborator

alyssarosenzweig commented Feb 25, 2024

IsInlineConstant is your new friend

That's awesome - thanks. Friends are never enough. :)

Not as straightforward as calling the method. It's in ARM64JITCore which we don't have access to in ConstProp.cpp. It seems to me like this is more of an IR helper function anyway, and indeed it only depends on IR so we probably could put it elsewhere. I will see if I can refactor it.

Oops, my bad! I meant IsValueConstant .. sorry for the mixup

@pmatos
Copy link
Collaborator Author

pmatos commented Feb 25, 2024

IsInlineConstant is your new friend

That's awesome - thanks. Friends are never enough. :)

Not as straightforward as calling the method. It's in ARM64JITCore which we don't have access to in ConstProp.cpp. It seems to me like this is more of an IR helper function anyway, and indeed it only depends on IR so we probably could put it elsewhere. I will see if I can refactor it.

Oops, my bad! I meant IsValueConstant .. sorry for the mixup

Ah, yes, I was wondering the different between a constant and an inline constant. :)

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 26, 2024
When the source arguments for LoadMem/StoreMem have bit 31 set then they
are incorrectly sign extending in some instances.

Detected this when testing FEX-Emu#3421 but I don't have a proper fix for it.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 26, 2024
When the source arguments for LoadMem/StoreMem have bit 31 set then they
are incorrectly sign extending in some instances.

Detected this when testing FEX-Emu#3421 but I don't have a proper fix for it.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 26, 2024
When the source arguments for LoadMem/StoreMem have bit 31 set then they
are incorrectly sign extending in some instances.

Detected this when testing FEX-Emu#3421 but I don't have a proper fix for it.
@pmatos
Copy link
Collaborator Author

pmatos commented Feb 27, 2024

I just made some sort of mistake pushing the last patch - sorry. Fixing.

@pmatos
Copy link
Collaborator Author

pmatos commented Feb 27, 2024

Thank you git reflog show! :)
Anyway, I have now fixed the issue from #3460.
I have assumed the displacement can be sign extended and that it's always in Arg1. I don't think it's crazy to assume this, since when the LoadSource create the Add for the memory address it always put the displacement in Arg1.

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 28, 2024
…u#3421

Doesn't quite match the libc code directly because it uses `[gs:eax]`
with both having the sign bit set and we can't deal with that with ASM
tests. So match the behaviour in a different way.
@pmatos pmatos force-pushed the AddressingModes32 branch 2 times, most recently from b7f7aa9 to f4b3870 Compare February 28, 2024 13:41
@pmatos
Copy link
Collaborator Author

pmatos commented Feb 28, 2024

Rebased new patch on main - which should now include the #3466 test.

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 29, 2024
This time found in MGRR. It flips the problem space on its head.
bylaws pushed a commit to bylaws/FEX that referenced this pull request Feb 29, 2024
Spurred on by FEX-Emu#3421

This does a bunch of GPR and vector loads to showcase addressing
limitations between ARM and x86.

Tests:
- 8/16/32/64-bit GPR loads
   - Both 32-bit and 64-bit addressing modes
- 32/64/128-bit Vector loads
   - Both 32-bit and 64-bit addressing modes
- Duplicate the tests for 32-bit addressing mode with a 32-bit process
   - Since it should change behaviour.

Untested:
- 8/16-bit vector loadstores since those don't exist on x86
   - 16-bit x87 integer load exists but that doesn't go through a vector
     load in FEX.
bylaws pushed a commit to bylaws/FEX that referenced this pull request Feb 29, 2024
…2-bit

Spurred on by FEX-Emu#3421. To ensure that applications don't take advantage of
small address wrap around, allocate the second 4GB of virtual memory.

Some context. Linux always reserves the first 16KB of virtual address
space (unless you tinker with some settings which nobody should do).

Example of 32-bit code:
lea eax, [0xffff_0000]
mov ebx, [eax + 0x1_0000]

The address calculated by the mov will wrap around to 0x0 which will
result in SIGSEGV. If FEX messes up zero extensions then it would try to
access 0x1_0000_0000 instead.

This could result in a 32-bit application potentially accessing some FEX
memory instead of crashing.
Add this safety net which will still SIGSEGV and we will be able to see
the crash.
bylaws pushed a commit to bylaws/FEX that referenced this pull request Feb 29, 2024
When the source arguments for LoadMem/StoreMem have bit 31 set then they
are incorrectly sign extending in some instances.

Detected this when testing FEX-Emu#3421 but I don't have a proper fix for it.
bylaws pushed a commit to bylaws/FEX that referenced this pull request Feb 29, 2024
Spurred on by FEX-Emu#3421

This does a bunch of GPR and vector loads to showcase addressing
limitations between ARM and x86.

Tests:
- 8/16/32/64-bit GPR loads
   - Both 32-bit and 64-bit addressing modes
- 32/64/128-bit Vector loads
   - Both 32-bit and 64-bit addressing modes
- Duplicate the tests for 32-bit addressing mode with a 32-bit process
   - Since it should change behaviour.

Untested:
- 8/16-bit vector loadstores since those don't exist on x86
   - 16-bit x87 integer load exists but that doesn't go through a vector
     load in FEX.
bylaws pushed a commit to bylaws/FEX that referenced this pull request Feb 29, 2024
…2-bit

Spurred on by FEX-Emu#3421. To ensure that applications don't take advantage of
small address wrap around, allocate the second 4GB of virtual memory.

Some context. Linux always reserves the first 16KB of virtual address
space (unless you tinker with some settings which nobody should do).

Example of 32-bit code:
lea eax, [0xffff_0000]
mov ebx, [eax + 0x1_0000]

The address calculated by the mov will wrap around to 0x0 which will
result in SIGSEGV. If FEX messes up zero extensions then it would try to
access 0x1_0000_0000 instead.

This could result in a 32-bit application potentially accessing some FEX
memory instead of crashing.
Add this safety net which will still SIGSEGV and we will be able to see
the crash.
bylaws pushed a commit to bylaws/FEX that referenced this pull request Feb 29, 2024
When the source arguments for LoadMem/StoreMem have bit 31 set then they
are incorrectly sign extending in some instances.

Detected this when testing FEX-Emu#3421 but I don't have a proper fix for it.
bylaws pushed a commit to bylaws/FEX that referenced this pull request Feb 29, 2024
…u#3421

Doesn't quite match the libc code directly because it uses `[gs:eax]`
with both having the sign bit set and we can't deal with that with ASM
tests. So match the behaviour in a different way.
alyssarosenzweig added a commit that referenced this pull request Mar 1, 2024
@pmatos
Copy link
Collaborator Author

pmatos commented Mar 5, 2024

@Sonicadvance1 The bug you found in #3471 did feel like a nail in the coffin of this optimization. Problem being that anything that wraps around in 32bits cannot be faithfully represented in a single 64bit ld/st instructions. After a couple of iterations, I thought just reg+reg was problematic, but after your last bug, even reg+const is problematic. If the value in reg is high enough we wrap around and cannot fold the addition into the memory addressing.

In other words, you can't generally simplify:

add wX, wX, const
ld/st wX, [xX]

into ld/st wX, [xX, const].

For the past few days I have been playing with possible solutions including something like mapping the first 4G of pages and the second 4G of pages to the same physical pages. Something like:
https://lo.calho.st/posts/black-magic-buffer/

However, I think I found something that's easier to integrate into FEX. The use of userfaultfd available at least since kernel 4.11.
I have written a prototype that allocated a structure in the first 4G at and then tried to access it +0x1_0000_0000. The region 0x1_0000_0000 - 0x1_FFFF_FFFF is under userfaultfd, so access to that region triggers a handler that copies the pages from the first 4G to the second 4G and proceeds as if nothing happened. See:

https://gist.github.com/pmatos/677b1e3a3390c00f5c046fbc9bca7f68

What do you think about this?
I will try to integrate a mechanism like this locally in FEX and see how it does, but wanted to understand how you feel about having something like this in FEX to enable these kind of optimizations?

@alyssarosenzweig
Copy link
Collaborator

@Sonicadvance1 The bug you found in #3471 did feel like a nail in the coffin of this optimization. Problem being that anything that wraps around in 32bits cannot be faithfully represented in a single 64bit ld/st instructions. After a couple of iterations, I thought just reg+reg was problematic, but after your last bug, even reg+const is problematic. If the value in reg is high enough we wrap around and cannot fold the addition into the memory addressing.

In other words, you can't generally simplify:

add wX, wX, const
ld/st wX, [xX]

into ld/st wX, [xX, const].

For the past few days I have been playing with possible solutions including something like mapping the first 4G of pages and the second 4G of pages to the same physical pages. Something like: https://lo.calho.st/posts/black-magic-buffer/

However, I think I found something that's easier to integrate into FEX. The use of userfaultfd available at least since kernel 4.11. I have written a prototype that allocated a structure in the first 4G at and then tried to access it +0x1_0000_0000. The region 0x1_0000_0000 - 0x1_FFFF_FFFF is under userfaultfd, so access to that region triggers a handler that copies the pages from the first 4G to the second 4G and proceeds as if nothing happened. See:

https://gist.github.com/pmatos/677b1e3a3390c00f5c046fbc9bca7f68

What do you think about this? I will try to integrate a mechanism like this locally in FEX and see how it does, but wanted to understand how you feel about having something like this in FEX to enable these kind of optimizations?

This is exactly what #3439 is for. In particular the remark:

Linux always reserves the first 16KB of virtual address space (unless you tinker with some settings which nobody should do).

That means that (ignoring sign-extension at least), if the constant is < 16384, the fold is legal. Since either:

  1. There is no overflow, equivalently there is no 32-bit wrap around.
  2. There is 32-bit overflow. Since the address is 32-bit and the constant < 16384, wraps around to something in the first 16KB. This is an illegal memory access and should crash. With 64-bit, this overflows to an address in the first 16KB after the 4GB marker, which FEX has specifically reserved to ensure it will crash in the same way.

@pmatos pmatos force-pushed the AddressingModes32 branch 3 times, most recently from ecfd88f to 9c16415 Compare March 5, 2024 13:55
Folds reg+const memory address into addressing mode,
if the constant is within 16Kb.
Update instcountci files.
Add test 32Bit_ASM/FEX_bugs/SubAddrBug.asm
@pmatos pmatos force-pushed the AddressingModes32 branch from 9c16415 to a86f2d3 Compare March 5, 2024 14:01
@pmatos
Copy link
Collaborator Author

pmatos commented Mar 5, 2024

@Sonicadvance1 The bug you found in #3471 did feel like a nail in the coffin of this optimization. Problem being that anything that wraps around in 32bits cannot be faithfully represented in a single 64bit ld/st instructions. After a couple of iterations, I thought just reg+reg was problematic, but after your last bug, even reg+const is problematic. If the value in reg is high enough we wrap around and cannot fold the addition into the memory addressing.
In other words, you can't generally simplify:

add wX, wX, const
ld/st wX, [xX]

into ld/st wX, [xX, const].
For the past few days I have been playing with possible solutions including something like mapping the first 4G of pages and the second 4G of pages to the same physical pages. Something like: https://lo.calho.st/posts/black-magic-buffer/
However, I think I found something that's easier to integrate into FEX. The use of userfaultfd available at least since kernel 4.11. I have written a prototype that allocated a structure in the first 4G at and then tried to access it +0x1_0000_0000. The region 0x1_0000_0000 - 0x1_FFFF_FFFF is under userfaultfd, so access to that region triggers a handler that copies the pages from the first 4G to the second 4G and proceeds as if nothing happened. See:
https://gist.github.com/pmatos/677b1e3a3390c00f5c046fbc9bca7f68
What do you think about this? I will try to integrate a mechanism like this locally in FEX and see how it does, but wanted to understand how you feel about having something like this in FEX to enable these kind of optimizations?

This is exactly what #3439 is for. In particular the remark:

Linux always reserves the first 16KB of virtual address space (unless you tinker with some settings which nobody should do).

That means that (ignoring sign-extension at least), if the constant is < 16384, the fold is legal. Since either:

  1. There is no overflow, equivalently there is no 32-bit wrap around.
  2. There is 32-bit overflow. Since the address is 32-bit and the constant < 16384, wraps around to something in the first 16KB. This is an illegal memory access and should crash. With 64-bit, this overflows to an address in the first 16KB after the 4GB marker, which FEX has specifically reserved to ensure it will crash in the same way.

After that explanation the 16Kb comments makes even more sense. Which means that ok - without any other handling - we can indeed deal with 16Kb positive and negative offsets. I have pushed a patch to that effect. I am both upset and happy that I didn't see this coming. On the one hand, I spent a ton of time looking into userfaultfd which was not strictly necessary for this. On the other hand, looking into userfaultfd did show me a couple of cool tricks that could open the door for further optimization.

@Sonicadvance1
Copy link
Member

Fantastic research in to finding out the limits of 32-bit addressing. I knew it would have problems but I wasn't sure where the limits were at. Known unknowns and whatnot.

userfaultfd doesn't necessarily work because it will do copying and accesses that split the ranges will get decoupled. Additionally device mapped memory will have problems.

Good to know these problems now and this optimization must be limited to constants <16KB

@pmatos
Copy link
Collaborator Author

pmatos commented Mar 11, 2024

@Sonicadvance1 ping - what do you think about merging this now?

Copy link
Member

@Sonicadvance1 Sonicadvance1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Gave this some testing and I don't see any regressions from the more limited scope of this now.
Thanks for pushing this through and finding the issues.

@Sonicadvance1 Sonicadvance1 merged commit ff0c763 into FEX-Emu:main Mar 11, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants