Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM: FPU: using Unshared FP Services mode can still result in corrupted floating point registers #29590

Closed
mniestroj opened this issue Oct 27, 2020 · 24 comments · Fixed by #31772
Assignees
Labels
area: ARM ARM (32-bit) Architecture area: Toolchains Toolchains bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@mniestroj
Copy link
Member

mniestroj commented Oct 27, 2020

Describe the bug
Unshared FP Services mode is supposed to work properly when a single thread uses the FPU. However, it looks like GCC (version 9 and above) generates code that temporarily stores integer data to FPU registers, even if no FPU math operations are carried out within a function. The values in FPU registers can be overwritten by other threads, then such modified values are loaded into original function and cause program corruption. This happens because under unshared FP registers mode, the Cortex-M (and perhaps other architectures) do not save/restore the FP context during thread context-switch or even during exception entry and return.

A straightforward workaround to this problem is to force Shared FP Registers mode for Cortex-M whenever CONFIG_FPU is enabled. This could be done with the following work-around:

  1. Force CONFIG_FPU_SHARING=y. This will be enabled FP context preservation.
  2. Force K_FP_REGS for all threads, so when HW_STACK_PROTECTION is enabled, privileged stack overflows can be detected. Alternatively, the flag could be ignored for Cortex-M (it is only used to set up the right stack guard for privileged threads)

To Reproduce
An out of tree closed application code was used on nRF52840 with CONFIG_FPU=y and CONFIG_FPU_SHARING=n. Issue reproduced with some toolchains only, look at "Environment" below.

Additionally, the issue is reproduced by in-tree code, see #31472

Expected behavior
Using Unshared FP Services mode should be either safe to use (assuming only one thread operates on floats) or disallowed (not being possible to select by Kconfig) if not supported.
(As described above, this is not sufficient: we need to ensure FP Sharing mode is also working as expected.)

Impact
Using Unshared FP Services mode with ARM Cortex-M4F (and possibly others) results in undefined behavior.

Code listing
Here is the function that stores temporarily into FPU register (vmov s16, r3) and then loads from it later (vmov r1, s16).

00028014 <at_send>:
{
   28014:	e92d 4ff0 	stmdb	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
   28018:	ed2d 8b02 	vpush	{d8}
   2801c:	b087      	sub	sp, #28
   2801e:	4680      	mov	r8, r0
   28020:	ee08 3a10 	vmov	s16, r3
   28024:	468a      	mov	sl, r1
   28026:	9203      	str	r2, [sp, #12]
	int64_t ts = k_uptime_get();
   28028:	f022 ffb4 	bl	4af94 <k_uptime_get>
	int64_t remaining_timeout = timeout;
   2802c:	9c12      	ldr	r4, [sp, #72]	; 0x48
	int64_t ts = k_uptime_get();
   2802e:	4606      	mov	r6, r0
   28030:	460f      	mov	r7, r1
	int64_t remaining_timeout = timeout;
   28032:	17e5      	asrs	r5, r4, #31
	while (k_msgq_get(&modem->at.msgq, &c,
   28034:	f508 7ba4 	add.w	fp, r8, #328	; 0x148
	return z_impl_k_msgq_get(msgq, data, timeout);
   28038:	f640 42cd 	movw	r2, #3277	; 0xccd
   2803c:	2300      	movs	r3, #0
   2803e:	f10d 0117 	add.w	r1, sp, #23
   28042:	4658      	mov	r0, fp
   28044:	f00b fafc 	bl	33640 <z_impl_k_msgq_get>
   28048:	2800      	cmp	r0, #0
   2804a:	d052      	beq.n	280f2 <at_send+0xde>
	uart_push_at(modem->uart, request);
   2804c:	f8d8 0008 	ldr.w	r0, [r8, #8]
	uart_push_str(uart_dev, str);
   28050:	9001      	str	r0, [sp, #4]
   28052:	4651      	mov	r1, sl
   28054:	f022 ff41 	bl	4aeda <uart_push_str>
	uart_push_str(uart_dev, "\r");
   28058:	493a      	ldr	r1, [pc, #232]	; (28144 <at_send+0x130>)
   2805a:	9801      	ldr	r0, [sp, #4]
   2805c:	f022 ff3d 	bl	4aeda <uart_push_str>
   28060:	4b39      	ldr	r3, [pc, #228]	; (28148 <at_send+0x134>)
   28062:	4a3a      	ldr	r2, [pc, #232]	; (2814c <at_send+0x138>)
   28064:	1a9b      	subs	r3, r3, r2
	uint8_t *recv_ptr = modem->at.recv_buf;
   28066:	f508 79b8 	add.w	r9, r8, #368	; 0x170
   2806a:	08db      	lsrs	r3, r3, #3
   2806c:	9302      	str	r3, [sp, #8]
   2806e:	46c8      	mov	r8, r9
	uptime = k_uptime_get();
   28070:	f022 ff90 	bl	4af94 <k_uptime_get>
   28074:	19a6      	adds	r6, r4, r6
   28076:	eb45 0707 	adc.w	r7, r5, r7
		remaining_timeout -= k_uptime_delta(&ts);
   2807a:	1a34      	subs	r4, r6, r0
   2807c:	eb67 0501 	sbc.w	r5, r7, r1
		if (remaining_timeout < 0) {
   28080:	2c00      	cmp	r4, #0
   28082:	f175 0300 	sbcs.w	r3, r5, #0
   28086:	9001      	str	r0, [sp, #4]
   28088:	468a      	mov	sl, r1
   2808a:	db54      	blt.n	28136 <at_send+0x122>
   2808c:	03e9      	lsls	r1, r5, #15
   2808e:	03e0      	lsls	r0, r4, #15
   28090:	f240 36e7 	movw	r6, #999	; 0x3e7
   28094:	1980      	adds	r0, r0, r6
   28096:	ea41 4154 	orr.w	r1, r1, r4, lsr #17
   2809a:	f44f 727a 	mov.w	r2, #1000	; 0x3e8
   2809e:	f04f 0300 	mov.w	r3, #0
   280a2:	f141 0100 	adc.w	r1, r1, #0
   280a6:	f7d8 fd31 	bl	b0c <__aeabi_uldivmod>
   280aa:	4602      	mov	r2, r0
   280ac:	460b      	mov	r3, r1
   280ae:	f10d 0117 	add.w	r1, sp, #23
   280b2:	4658      	mov	r0, fp
   280b4:	f00b fac4 	bl	33640 <z_impl_k_msgq_get>
		if (k_msgq_get(&modem->at.msgq, &c,
   280b8:	2800      	cmp	r0, #0
   280ba:	d13c      	bne.n	28136 <at_send+0x122>
		if (c == '\r' || c == '\n' || c == '\0') {
   280bc:	f89d 2017 	ldrb.w	r2, [sp, #23]
   280c0:	2a0d      	cmp	r2, #13
   280c2:	d825      	bhi.n	28110 <at_send+0xfc>
   280c4:	f242 4301 	movw	r3, #9217	; 0x2401
   280c8:	40d3      	lsrs	r3, r2
   280ca:	43db      	mvns	r3, r3
   280cc:	f013 0301 	ands.w	r3, r3, #1
   280d0:	d11e      	bne.n	28110 <at_send+0xfc>
			if (recv_ptr == modem->at.recv_buf) {
   280d2:	45c8      	cmp	r8, r9
   280d4:	d00a      	beq.n	280ec <at_send+0xd8>
			*recv_ptr = '\0';
   280d6:	f888 3000 	strb.w	r3, [r8]
			ret = line_cb(modem->at.recv_buf, user_data);
   280da:	ee18 1a10 	vmov	r1, s16
   280de:	9b03      	ldr	r3, [sp, #12]
   280e0:	4648      	mov	r0, r9
   280e2:	4798      	blx	r3
			if (ret != -EAGAIN) {
   280e4:	f110 0f0b 	cmn.w	r0, #11
   280e8:	d127      	bne.n	2813a <at_send+0x126>
	uint8_t *recv_ptr = modem->at.recv_buf;
   280ea:	46c8      	mov	r8, r9
   280ec:	9e01      	ldr	r6, [sp, #4]
   280ee:	4657      	mov	r7, sl
   280f0:	e7be      	b.n	28070 <at_send+0x5c>
   280f2:	f022 ff4f 	bl	4af94 <k_uptime_get>
		remaining_timeout -= k_uptime_delta(&ts);
   280f6:	1a36      	subs	r6, r6, r0
   280f8:	eb67 0701 	sbc.w	r7, r7, r1
   280fc:	19a4      	adds	r4, r4, r6
   280fe:	eb47 0505 	adc.w	r5, r7, r5
		if (remaining_timeout < 0) {
   28102:	2c00      	cmp	r4, #0
   28104:	f175 0300 	sbcs.w	r3, r5, #0
   28108:	db15      	blt.n	28136 <at_send+0x122>
	*reftime = uptime;
   2810a:	4606      	mov	r6, r0
   2810c:	460f      	mov	r7, r1
   2810e:	e793      	b.n	28038 <at_send+0x24>
			*recv_ptr = c;
   28110:	f808 2b01 	strb.w	r2, [r8], #1
			if (recv_ptr - modem->at.recv_buf >=
   28114:	eba8 0309 	sub.w	r3, r8, r9
   28118:	2b3f      	cmp	r3, #63	; 0x3f
   2811a:	dde7      	ble.n	280ec <at_send+0xd8>
				LOG_WRN("AT response overflow");
   2811c:	4b0c      	ldr	r3, [pc, #48]	; (28150 <at_send+0x13c>)
   2811e:	681b      	ldr	r3, [r3, #0]
   28120:	f013 0f06 	tst.w	r3, #6
   28124:	d0e1      	beq.n	280ea <at_send+0xd6>
   28126:	9b02      	ldr	r3, [sp, #8]
   28128:	480a      	ldr	r0, [pc, #40]	; (28154 <at_send+0x140>)
   2812a:	0199      	lsls	r1, r3, #6
   2812c:	f041 0102 	orr.w	r1, r1, #2
   28130:	f01a fba9 	bl	42886 <log_0>
   28134:	e7d9      	b.n	280ea <at_send+0xd6>
			return -ETIMEDOUT;
   28136:	f06f 0073 	mvn.w	r0, #115	; 0x73
}
   2813a:	b007      	add	sp, #28
   2813c:	ecbd 8b02 	vpop	{d8}
   28140:	e8bd 8ff0 	ldmia.w	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}
   28144:	00060bb6 	.word	0x00060bb6
   28148:	0005183c 	.word	0x0005183c
   2814c:	000516ac 	.word	0x000516ac
   28150:	20002ecc 	.word	0x20002ecc
   28154:	0006523e 	.word	0x0006523e

Environment

  • OS: Linux
  • Toolchain: Zephyr SDK 0.11.2, 0.12.0 Beta1, GNU ARM Embedded 9.2 2019 Q4 major (but not reproducible with GNU ARM Embedded 8.3 2019 Q3 update)
  • Commit SHA or Version used: 966015f

Additional context
Similar issue was reported here: https://answers.launchpad.net/gcc-arm-embedded/+question/691604

@mniestroj mniestroj added bug The issue is a bug, or the PR is fixing a bug area: ARM ARM (32-bit) Architecture area: Toolchains Toolchains labels Oct 27, 2020
@MaureenHelm MaureenHelm added the priority: medium Medium impact/importance bug label Nov 3, 2020
@github-actions
Copy link

github-actions bot commented Jan 3, 2021

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@ioannisg
Copy link
Member

Thanks @mniestroj

An out of tree closed application code was used on nRF52840 with CONFIG_FPU=y and CONFIG_FPU_SHARING=n. Issue reproduced with some toolchains only, look at "Environment" below.

could you share some code example that can reproduce this issue?

@ioannisg
Copy link
Member

Problem can be workarounded by setting CONFIG_FPU_SHARING=y.

Well, this comes at some cost, however, so I'd better not apply this if there's a cleaner solution.

@mniestroj
Copy link
Member Author

mniestroj commented Jan 25, 2021

#31472 is a good example how to reproduce that issue with upstream code. I've added

CONFIG_MAIN_STACK_SIZE=4096
CONFIG_FPU=y

to tests/kernel/poll/prj.conf and executed scripts/twister -N --device-testing --device-serial /dev/ttyACM1 -p nrf52840dk_nrf52840 -T tests/kernel/poll/. It fails that way, but works when CONFIG_FPU_SHARING=y is added as well. Looking at zephyr.lst the pattern is the same as I have noticed originally (in this Issue description) - s16 register is used to store (cache) values in a function, but at the same time it gets overwritten by another thread doing the same, thus overwriting s16 before using it.

Problem can be workarounded by setting CONFIG_FPU_SHARING=y.

Well, this comes at some cost, however, so I'd better not apply this if there's a cleaner solution.

When I have done initial investigation it turned out that compiler (here GCC) should prevent using FPU registers unless there are FPU operations involved (e.g. float multiplication). GCC allows to prevent all FPU registers from being used with -mgeneral-regs-only commandline option, but then no operations on float will work. So my conclusion was that GCC does not support CONFIG_FPU_SHARING=n, at least based on the output for ARM Cortex-M4F. So the only solution in my opinion is to enable CONFIG_FPU_SHARING and maybe allow user to disable it if he is sure that application won't be affected. However I am not an expert in compiler area, so maybe I am missing something.

@ioannisg
Copy link
Member

I've read the same thing here: https://gcc.gnu.org/pipermail/gcc-help/2020-July/139112.html, which aligns with your investigation.

The problem is that we have made assumptions based on the K_FP_REGS thread-create flag, that is, the thread won't use the FP registers so the stack frame will never need to contain the FP registers, but if this is not the case, we might have to do more things than just enable FPU_SHARING together with FPU.

@mniestroj
Copy link
Member Author

I agree, all threads need to be treated like K_FP_REGS would be set for them. As a short term solution that flag could be set automatically in arch_new_thread() for affected <compiler, arch> pair (combination that produces code with FP registers for temporary storage).

@ioannisg
Copy link
Member

I agree, all threads need to be treated like K_FP_REGS would be set for them. As a short term solution that flag could be set automatically in arch_new_thread() for affected <compiler, arch> pair (combination that produces code with FP registers for temporary storage).

So in your understanding, this is just a subset of GCC versions that do this, or all GCC compilers (beyond some release) behave like this consistently?

@mniestroj
Copy link
Member Author

So in your understanding, this is just a subset of GCC versions that do this, or all GCC compilers (beyond some release) behave like this consistently?

I would suggest to treat all GCC versions the same and assume that it might produce code that utilizes FP registers. This is because I haven't found any option for configuring such behavior in GCC codebase and I haven't found any statement in GCC's documentation about one behavior or the other. In summary it is kind of undefined whether FP (or any other than general purpose) registers are used. Issue described here was not reproducible with -O0, but bisecting optimization flags didn't result in one responsible for using FP registers.

I was not able to reproduce this Issue with GCC 8, but it was reproducible with GCC 9 and 10. My suggestion with <compiler, arch> pair was to treat other compilers (like clang) differently from GCC if needed.

@ioannisg
Copy link
Member

@mniestroj thanks for your input here

I agree, all threads need to be treated like K_FP_REGS would be set for them. As a short term solution that flag could be set automatically in arch_new_thread() for affected <compiler, arch> pair (combination that produces code with FP registers for temporary storage).

this comes at a high cost, though.
Wondering:

Does GCC behave as described only under a certain -O configuration? If so then at least we can limit the problematic cases to a subset of optimization configurations.

@mniestroj
Copy link
Member Author

this comes at a high cost, though.

Unfortunately. As I see for ARM K_FP_FLAGS "only" affects stack size/usage, but situation is worse for other archs. I am afraid we cannot do much without explicit compiler support for "do not use FP registers for non-FPU math operations".

Does GCC behave as described only under a certain -O configuration? If so then at least we can limit the problematic cases to a subset of optimization configurations.

-O0 did not use FP
-O1 (probably) used FP - I don't remember for sure now
-O2 and -Os - used FP

@ABOSTM
Copy link
Collaborator

ABOSTM commented Jan 26, 2021

-O0 did not use FP

Warning: it did not use FP register, but doesn't it means it will never do?
We need guarantee it will not use them whatever the code compiled.

@tejlmand tejlmand self-assigned this Jan 26, 2021
@mniestroj
Copy link
Member Author

Warning: it did not use FP register, but doesn't it means it will never do?
We need guarantee it will not use them whatever the code compiled.

I agree, we cannot trust that, it might change in new GCC versions or even when compiling slightly different code. Besides, optimizing FP usage in -O0 doesn't make sense...

@ioannisg
Copy link
Member

So, I am still thinking around this issue. I am involving @tejlmand (Toolchain responsible and build-system maintainer) to get his view.

What I can tell about the workaround is:

  1. We can enable FPU_SHARING unconditionally, when we use FPU (this means we enforce sharing registers mode). This enables FP register stacking when needed. This is not enough though: we need to treat all threads as having the K_FP_REGS flag set. This adds some extra memory overhead, cause the stack area wasted for the stack guard is 128 instead of 32 bytes.
  2. If ISRs use the FP registers, arm context stacking and unstacking should do the work.
    Am I missing something @mniestroj @ABOSTM ?

@ABOSTM
Copy link
Collaborator

ABOSTM commented Jan 26, 2021

This is not enough though: we need to treat all threads as having the K_FP_REGS flag set.

There is already the following code in function prepare_multithreading() in kernel/init.c:

#if defined(CONFIG_FPU) && defined(CONFIG_FPU_SHARING)
	/* Enable FPU in main thread */
	opt |= K_FP_REGS;
#endif

but I am not zephyr thread expert so I don't know whether it is enough ?

Otherwise I am inline with your statement.
About extra memory, I can already tell that we need to enlarge CONFIG_MAIN_STACK_SIZE (see #31472), otherwise some tests will fail.
Note: that CONFIG_MAIN_STACK_SIZE enlargement is also required for stm32f3_disco for some other tests (tests/kernel/threads/tls/, tests/kernel/fatal/exception/, tests/kernel/common/) independently of FPU. But I was waiting for the conclusion/fix on this issue to kill two birds with one stone 😃
And thus proposal would be to update from 512 to 768 in kernel/Kconfig

config MAIN_STACK_SIZE
	...
	default 768 if ZTEST && !(RISCV || X86)

@ioannisg
Copy link
Member

Thanks @ABOSTM , I am not worried on how to implement the workaround, if we decide, finally, to do it.

Another option would be to consider the solution presented in https://gcc.gnu.org/pipermail/gcc-help/2020-July/139112.html, that is, to apply the gen-registers-only GCC directive on code that is not supposed to use the FP Registers.

This falls on the user to do it, throughout the code base, though, so I am reluctant...

@stephanosio
Copy link
Member

It seems the only reasonable solution at this time is to disallow CONFIG_FPU_SHARING=n for ARM targets when building with GCC, until the GCC guys decide to add -mno-implicit-float or something equivalent.

@ioannisg
Copy link
Member

It seems the only reasonable solution at this time is to disallow CONFIG_FPU_SHARING=n for ARM targets when building with GCC, until the GCC guys decide to add -mno-implicit-float or something equivalent.

@stephanosio thanks so much for your input.
I started re-writing the issue description according to the discussion we are having here.

Just to clarify, do you think this affects AARC32 other than the -M profile?

@stephanosio
Copy link
Member

It seems the only reasonable solution at this time is to disallow CONFIG_FPU_SHARING=n for ARM targets when building with GCC, until the GCC guys decide to add -mno-implicit-float or something equivalent.

@stephanosio thanks so much for your input.
I started re-writing the issue description according to the discussion we are having here.

Just to clarify, do you think this affects AARC32 other than the -M profile?

@ioannisg As far as I can see, this applies to all ARM architecture variants. It is, however, not an immediate problem for the non-M profile variants at the moment because the relevant Zephyr arch ports do not support hardware FPU (and therefore do not allow enabling FP instructions).

@ioannisg
Copy link
Member

@stephanosio thanks. Have you been able to dig deep into this? I wonder if GCC allows only callee-saved FP registers to be accessed by non-floating point calculations (i.e. s16 and above), or is it caller-saved too?

@ioannisg
Copy link
Member

@katsuster could you check if this issue is relevant for Risc-V? For Cortex-M we have considered this to be a serious issue.

@ioannisg
Copy link
Member

@katsuster could you check if this issue is relevant for Risc-V? For Cortex-M we have considered this to be a serious issue.

@dcpleung @abrodkin FYI - might worth checking if this is affecting x86 and ARC, respectively.

@katsuster
Copy link
Member

@ioannisg It seems that RISC-V GCC does not use FPU implicitly. Current GCC does not support SIMD of RISC-V instructions (packed-simd extension is not stable now).

Of course, I'm not sure about future events:

  • RISC-V committee will decide to overlay SIMD regs on FPU regs or not
  • GCC will decide to use SIMD registers for integer implicitly or not

@dcpleung
Copy link
Member

@ioannisg x86 does not seem to be affected by this. FPU/SSE have to be explicitly enabled before those registers can be used on 32-bit. On 64-bit, FPU/SSE are always available so we are saving those registers anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: ARM ARM (32-bit) Architecture area: Toolchains Toolchains bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants