
Hal volatile warnings #3222

Open
BsAtHome opened this issue Dec 12, 2024 · 17 comments

Comments

@BsAtHome
Contributor

There are numerous compiler warnings about using 'volatile' where it is inappropriate and makes no sense. A suggestion was made simply to remove volatile completely and see if things work.

Well, I did so and applied the patch below, which also replaces the odd define with a typedef.

It works flawlessly. Running scripts/rip_environment runtests runs all tests without any problem. There is no change in behavior. This seems to confirm that the volatile qualifier is void on all of these types. It also removes all volatile warnings from the compile. There are still other places where volatile is used, but this change would be a very good start.

Please, if others can confirm my findings, then we can get rid of the invalid volatile declarations and fix the compiler warnings.

diff --git a/src/hal/hal.h b/src/hal/hal.h
index c86ee55234..0478650c94 100644
--- a/src/hal/hal.h
+++ b/src/hal/hal.h
@@ -315,17 +315,16 @@ typedef enum {
    individual bytes in a machine word. */
 #include <rtapi_bool.h>
 #include <rtapi_stdint.h>
-typedef volatile bool hal_bit_t;
-typedef volatile rtapi_u32 hal_u32_t;
-typedef volatile rtapi_s32 hal_s32_t;
-typedef volatile rtapi_u64 hal_u64_t;
-typedef volatile rtapi_s64 hal_s64_t;
-typedef volatile int hal_port_t;
+typedef bool hal_bit_t;
+typedef rtapi_u32 hal_u32_t;
+typedef rtapi_s32 hal_s32_t;
+typedef rtapi_u64 hal_u64_t;
+typedef rtapi_s64 hal_s64_t;
+typedef int hal_port_t;
 typedef double real_t __attribute__((aligned(8)));
 typedef rtapi_u64 ireal_t __attribute__((aligned(8))); // integral type as wide as real_t / hal_float_t
+typedef real_t hal_float_t __attribute__((aligned(8)));
 
-#define hal_float_t volatile real_t
-       
 /** HAL "data union" structure
  ** This structure may hold any type of hal data
 */
@rmu75
Contributor

rmu75 commented Dec 13, 2024

runtests seems to run OK on my dev setup (in sim).

@BsAtHome
Contributor Author

It also runs fine on a RPi5 with a Mesa 7C80 card, running axis over an ssh-tunneled X session with halscope probing some x and y position pins.

@andypugh
Collaborator

Can you test it with RTAI (kernel mode) realtime? Is it possible that the "volatile" modifier is relevant there?
I can also see cases where it might be relevant when the variables are stored in HAL shared memory, or where the servo thread may be interrupted by the base thread.
Just because it appears to work, and the tests pass, does not mean that there are no cases where different timings might cause an issue.
It seems to me that "volatile" in our application is quite likely to be necessary in ways that the compiler is not aware of.

@BsAtHome
Contributor Author

Can you test it with RTAI (kernel mode) realtime? Is it possible that the "volatile" modifier is relevant there?

I don't have an RTAI kernel here or a setup to test it on. However, the volatile modifier is never relevant in multi-threading (see email list reply).

There is absolutely no difference between user-space or kernel-space for this issue. They address the same memory using the same code in the same way. So if one fails the other fails. If one succeeds, the other succeeds.

I can also see cases where it might be relevant when the variables are stored in HAL shared memory, or where the servo thread may be interrupted by the base thread. Just because it appears to work, and the tests pass, does not mean that there are no cases where different timings might cause an issue. It seems to me that "volatile" in our application is quite likely to be necessary in ways that the compiler is not aware of.

You describe a race condition, which is not solved by using volatile. You need atomicity to fix any race between threads, regardless of whether you run in kernel- or user-space.
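As a hedged illustration (not LinuxCNC code; the names are invented): volatile forces the compiler to emit every load and store, but a read-modify-write through a volatile variable is still not atomic, so two threads can still lose an update:

// Illustration only: 'volatile' does not make the increment atomic.
// Two threads may both load the same old value; one update is then lost.
volatile int shared_count;

void worker(void)
{
    shared_count++; // compiles to load, add, store -- interleavable
}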

@rmu75
Contributor

rmu75 commented Dec 14, 2024

It seems to me that "volatile" in our application is quite likely to be necessary in ways that the compiler is not aware of.

It can hide bugs. That is, because something is declared volatile, the compiler won't act on (optimize out, eliminate) code that relies on undefined behaviour. Such code is clearly a bug: no code should rely on undefined behaviour, because compilers are allowed to (and do) change how they deal with undefined behaviour (like overflowing signed ints).

As is done in the Linux kernel, volatile should appear only at the access, not in interfaces, and not in data structures. The exception is memory-mapped hardware registers, though even there it is debatable, IMHO, how and where to use volatile.
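A minimal sketch of what access-level volatile looks like, in the spirit of the kernel's READ_ONCE/WRITE_ONCE (the macro names here are made up): the qualifier lives in the cast at the access, so the declarations and interfaces stay plain.

// Hypothetical helpers: volatile applies only at the point of access.
#define U32_READ_ONCE(x)     (*(const volatile rtapi_u32 *)&(x))
#define U32_WRITE_ONCE(x, v) (*(volatile rtapi_u32 *)&(x) = (v))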

I will try it on real hardware (rpi4 with mesa 7c80 and PC with 7i92) next week.

@BsAtHome
Contributor Author

It can hide bugs. ...

And, it is probably hiding bugs.

I've been experimenting a bit and changed the hal_X_t typedefs to _Atomic and std::atomic versions. Many accesses turned out not to be atomic, and some were not proper. Uses of the typedefs were also detected in places where they do not belong.

It will take some more digging. Fixing the underlying issues will take some work and many small fixes. There is also the analysis of which kind of memory synchronization level is required where. Most of hal's values are currently ad-hoc synchronized and should all work with the relaxed model. However, the defaults between the C and C++ accesses differ, where C defaults to relaxed and C++ defaults to seq_cst. All accesses need to be made explicit with atomic_{load,store}_explicit if consistent access models are to be enforced between C and C++. That is doable, but would require more work.
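For example, a sketch of fully explicit access in C (illustrative only; the names are invented), where the memory order is spelled out instead of relying on the language default:

#include <stdatomic.h>

_Atomic(unsigned) value;

unsigned value_get(void)
{
    return atomic_load_explicit(&value, memory_order_relaxed);
}

void value_set(unsigned v)
{
    atomic_store_explicit(&value, v, memory_order_relaxed);
}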

The hal administrative data access is protected with a properly synchronized mutex, so that works out fine.

@BsAtHome
Contributor Author

BsAtHome commented Dec 19, 2024

And it is much worse...

Fixing one line in src/rtapi/rtapi_atomic.h:21, where the __STDC_VERSION__ macro is misspelled, results in many problems compiling with clang/clang++ (see: 04fe8ab).

Clang requires the use of _Atomic on variables when using atomic_{load,store}_explicit() and just this little fact causes a cascade of follow-up compile errors.

@andypugh
Collaborator

One thing that might help is that I am fairly sure that realtime HAL modules (which includes the realtime motion) run sequentially, and probably single-threaded. Even if not single-threaded I think that the way that realtime functions are called ensures that only one module is executing at any one time. (The compiler has no way at all to thread multiple components in parallel, as the HAL structure is defined only at configuration load-time)
So we only really need to worry about interactions with shared data from user-space components, GUIs etc.

Thanks for looking into this, it's way outside my area of expertise.

@BsAtHome
Contributor Author

One thing that might help is that I am fairly sure that realtime HAL modules (which includes the realtime motion) run sequentially, and probably single-threaded.
[snip]
So we only really need to worry about interactions with shared data from user-space components, GUIs etc.

Well, the problem always comes into play with shared access once you have more than one thread/process accessing the same memory. It doesn't matter that one of the threads does things sequentially. You simply cannot know when one thread reads the same memory the other writes. Therefore, it must be fixed for all accesses.

Thanks for looking into this, it's way outside my area of expertise.

Yeah, I was just fixing all those annoying warnings in the compile.

A compiler warning is a bug that (hopefully) has not yet come back to haunt you. Now some of them are beginning to haunt and it is time to start using the proton-packs and store the warnings securely in the ghost traps. ;-)

@rmu75
Contributor

rmu75 commented Dec 20, 2024

I wouldn't be so sure about realtime HAL threads running sequentially. I have no idea about the RTAI situation, but on preempt RT, with no isolcpus or more than one core reserved for realtime, I would be surprised if a stepgen thread did not run concurrently with the servo thread, or possibly preempt it.

But with a usual HAL setup, what matters is writing consistent values (32/64 bit at once), i.e. atomicity, and that should be guaranteed implicitly for aligned values on relevant platforms. Maybe except for double on AArch64.

Memory ordering shouldn't matter; nobody cares if the DRO shows some values from the previous servo period. A memory barrier at the beginning or end of the realtime thread loop should suffice.

Of course you can do stupid/funny things in HAL, like Shannon's ultimate machine in two threads, but there is no way to prevent that.

@BsAtHome
Contributor Author

[snip]

But with a usual HAL setup, what matters is writing consistent values (32/64 bit at once), i.e. atomicity, and that should be guaranteed implicitly for aligned values on relevant platforms. Maybe except for double on AArch64.
[snip]

You have been asking the atomicity question about double on aarch64 several times, so I decided to test it. The question is whether the CPU architecture can load/store without using extra locks (i.e. is lock-free). This is easy to test because the C/C++ atomics API provides a function/method to ask.

Attached is a small test program in C and C++ with a script running it with gcc, g++, clang and clang++. The results are very interesting. atomictest.zip
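The essence of such a probe is just a few calls to atomic_is_lock_free(); a minimal C sketch (the attached program is more elaborate):

// Minimal C11 sketch of the lock-free probe described above.
#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
    _Atomic double    ad; // 8 bytes, the interesting case
    _Atomic long long al; // 8 bytes
    printf("double    lock-free: %d\n", atomic_is_lock_free(&ad));
    printf("long long lock-free: %d\n", atomic_is_lock_free(&al));
    return 0;
}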

Tests performed on:

  • RPi3b+, Aarch64 (Armv8)
  • RPi5, Aarch64 (Armv8)
  • Laptop, X86_64 (Intel i7-8650U)
  • PC, amd64 (Athlon(tm) II X3 405e)

Results:

  • X86_64 (amd64) all types of sizes 1, 2, 4 and 8 are lock-free (that includes double)
  • Aarch64 all types of sizes 1, 2, 4 and 8 are lock-free (that includes double)

There is a difference between architectures and compilers. Gcc/g++ gives the same results on both architectures. However, clang is able to do 128-bit lock-free load/store on Aarch64, but clang++ is not. It should be noted that the X86_64 tests ran on old hardware; maybe newer CPUs will allow lock-free 128-bit types. Either way, the largest type used in LCNC is 64-bit, so supporting larger types is academic.

Note: default clang++ 14.0.6 on RPi3 with Raspbian crashes with a segfault. Installing clang-16 fixes that. The LCNC RPi5 image used on the RPi5 does not have any problems with clang++.

@rmu75
Contributor

rmu75 commented Dec 22, 2024

Yes, that makes sense. I don't remember where I got the idea that double could be problematic on arm -- I think I misread something talking about "double-width stores needing a loop on arm64", meaning double int-width (128 bit) rather than the width of double.

@andypugh
Collaborator

Just to clarify, as perhaps I did not make myself clear.
The base thread can interrupt the servo thread, yes. Hopefully in cases where this matters the individual function pairs take this into account.
Servo thread functions should only ever run in strict sequence, so if you addf a pid comp after an encoder comp, the encoder will have finished its calculations before the pid reads the position HAL pin.

@BsAtHome
Contributor Author

In the meantime,...
I've been trying an experiment by changing the definition of the hal_{bit,float,u32,s32,u64,s64}_t types to something similar to what the kernel uses for atomics:

// In rtapi/rtapi_atomic.h
typedef _Atomic(rtapi_u32) rtapi_atomic_u32; // if C
typedef _Atomic(rtapi_s32) rtapi_atomic_s32;
// ...
typedef std::atomic<rtapi_u32> rtapi_atomic_u32; // if C++
typedef std::atomic<rtapi_s32> rtapi_atomic_s32;
// ... and repeat for u64, s64, bit (i.e. bool) and float

// Then in hal/hal.h
typedef rtapi_u32 hal_u32_ut; // the underlying type
typedef rtapi_s32 hal_s32_ut;
// ...
typedef struct { rtapi_atomic_u32 _a; } hal_u32_t; // the atomic type
typedef struct { rtapi_atomic_s32 _a; } hal_s32_t;
// ... etc. for the others

Such a change has no impact on the actual storage sizes of the types or on their alignment in shared memory. It just hides direct access to the variables so that you are forced to use access functions. The access functions are inline atomic load, store, inc, dec, add and sub (except for float, which has no atomic add/sub/inc/dec). The access functions are completely eliminated by the compiler's optimizer and become the right CPU instruction(s).
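A hedged sketch of what such access functions could look like in C (the names hal_u32_get/hal_u32_set are hypothetical; hal_u32_t is the struct typedef from the block above):

// Hypothetical inline accessors; the optimizer reduces each one to a
// single load or store instruction on the relevant platforms.
static inline rtapi_u32 hal_u32_get(const hal_u32_t *p)
{
    return atomic_load_explicit(&p->_a, memory_order_relaxed);
}

static inline void hal_u32_set(hal_u32_t *p, rtapi_u32 v)
{
    atomic_store_explicit(&p->_a, v, memory_order_relaxed);
}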

Some findings:

  • It catches a lot of abuse of the hal_X_t types in almost all components and drivers. The abuse stems from the assumption that you can do as you like: locals are treated as shared memory and vice versa.
  • The same atomic variables are read multiple times in one statement or block (f.ex. for atomic variable 'a': if(a) { b=a; }). This is an obvious race condition and a very bad thing when consistency is required.
  • Hal pins declared input are written to. This essentially pushes the signal in the wrong direction.
  • Many components have never been updated to support the s64 and u64 types, even though they do distinguish all the "older" types. This suggests some bit-rot may be present.

There are a couple of ways to go forward:

  1. use the current strategy of variable access but do so with atomic access functions (and keep the races);
  2. copy the shared memory variables when the function is called and store back the results in the output when done with the function;
  3. do nothing and wait until things break completely.

Option 2 is IMO the only viable one. It is easy for halcompile-generated code. It will also eliminate the races within the functions (though the inter-function-call races will remain). It costs more shared memory space (about double), but that is a small price to pay.
Anyway, trying to buy a computer with almost no memory is a fantasy these days; even the RPi4 has at least 1 GByte of RAM with plenty to spare. If the copy overhead worries you, consider that computers have become substantially faster in the 20+ years since you needed to squeeze out every last byte and cycle.

Note that option 3 will cause a real problem in the future because CPUs will become less ordered to improve speed. It will expose races not yet seen today because ordering changes. It should be noted that using (weakly ordered) ARM arch CPUs already poses a greater risk in that respect than the (strongly ordered) X86 arch.

More discussion input welcome.

@andypugh
Collaborator

I am not sure that it is abuse to use the hal_X_t types; they only end up using shared memory if you hal_malloc them there.
Many of the HAL components do do this: they create a structure that contains both HAL pins and local storage and hal_malloc the whole lot. I think this is because it's easier this way. I got pedantic about this here: https://github.com/LinuxCNC/linuxcnc/blob/master/src/hal/drivers/mesa-hostmot2/modbus/mesa_modbus.c.tmpl#L200 and it got really quite complicated (and might not be right because of this).
The existing modules could be fixed, but it would be time-consuming and is complex enough to risk breaking things.

Is "if (a) then b = a;" a danger only in the context of HAL shared memory, or in general in any multi-threaded application? I would assume that the compiler is clever enough to avoid running "if (a)" and "b=a" in separate threads that might get a different "a"?
If we are talking specifically about HAL pins then I think that, whilst there could be a problem, there generally is unlikely to be.
The base thread can interrupt the servo thread, but typically the base-thread components do not write to HAL pins that are read by servo thread components. For example the stepgen step and dir pins are only usefully read by other base-thread functions, and these will run in strict sequence.
The base-thread encoder counter writes to the accumulator that it shares with the servo-thread encoder calculations function, but I think that the accumulator read is carefully managed to avoid problems.
In point of fact the base thread is only really used on parallel port setups anyway; anyone using Mesa, Pico, EtherCAT etc. won't have a base thread (or will have an empty base thread), so the servo-thread functions won't be interrupted.

This still leaves the possibility of the user-space code writing to a HAL pin at the same time as it is read by the servo thread. A common case is setting the index-enable during homing. I don't think that causes a problem, but I would need to pore over the code to be sure.

Writing to HAL input pins is OK and necessary during init, to set default values to pins that are only optionally connected. I would be interested to see if you have found cases outside this scenario.

I think that introducing S64 and U64 HAL pins was a mistake, we should have just made all integer pins 64 bits.
I plan to remove the 32 bit pins, but I don't know the timescale for that. (There are some internal variables that need attention too, in fact that's part of the motivation, to make sure that the encoders and stepgens can never "wrap")

@rmu75
Contributor

rmu75 commented Dec 29, 2024

HAL and linuxcnc are designed in a way that no real concurrent access to critical data is done. The G-Code interpreter communicates with realtime via its own dedicated exclusive queue.

Of course you can mess with stuff and configure stupid things that won't work. If you follow the examples and keep the structure of base and servo thread, everything is OK because no concurrent access happens.

I would not bother trying to fix anything because it is unfixable in principle. Non-realtime code cannot lock data structures and prevent realtime from accessing them; that would immediately lead to missed deadlines. So everything has to be done in a way that races are impossible or inconsequential. With clear "ownership" and separation of hal pins into "inputs" and "outputs", that should be guaranteed. Doing read-modify-write of the same value from 2 threads, or from non-realtime and realtime, is probably a bug and should not happen (with exceptions like homing, but that is carefully done so it works).

Keeping stuff like it is won't become a bigger problem with newer CPU architectures because concurrency is limited.

Reading the same shared memory location multiple times in a function may be a problem; that depends. Userspace (meaning non-realtime) setting something that is read by realtime can't really be that time-critical (else that setting would have to happen in realtime to begin with), so whether the changed value is dealt with one realtime period earlier or later can't make a difference; it would be a bug otherwise. Communicating a set of values that need to be changed atomically is not possible in this way, but it is not needed anywhere AFAIK, and it would not be possible with the current HAL infra.

BTW writing into "input" pins is done on purpose, some pins are bidirectional -- see homing on encoder index pulse for example.

@andypugh the compiler is not "clever enough" to do or avoid anything regarding threads. It is completely unaware that something like threads exists; it doesn't schedule, distribute, lock or anything else, it just generates a stream of instructions that are supposed to be executed linearly. Possible interruptions and continuations somewhere else ("threads"), and real concurrency on SMP machines (on a single CPU real concurrency can't happen, of course), all have to be handled by the program; the compiler doesn't really know about that.

@BsAtHome
Contributor Author

I am not sure that it is abuse to use the hal_X_t types; they only end up using shared memory if you hal_malloc them there.

The point is that the hal_X_t types, or effectively the pins, live in shared memory. They are currently declared volatile and we have been discussing to make them proper atomic types.

In either case, these types come with attributes for the compiler that are not compatible with pass-by-value argument types in functions, nor can they be used as value return types of functions. The attributes are also useless (and detrimental) in local variable types. Using the hal_X_t types in these contexts is inappropriate and is the reason for the compiler warnings.

You have to distinguish between the variable type living in shared memory and the value type it represents. That is why you need two types to describe what is happening:

  • the type living in shared memory for accessing shared memory
  • the type representing the value (the underlying type) for manipulating the value

The fix in PR #3240 makes exactly the above described distinction. Changing the hal_port_t argument to a reference is an example of that. The hal_port_t is a completely opaque entity and should never be passed by value. Also, the pendant driver is changed to use the underlying type to pass and return values.

Many of the HAL components do do this: they create a structure that contains both HAL pins and local storage and hal_malloc the whole lot. I think this is because it's easier this way.

The fact that it is done does not necessarily make it right... However, using shared memory for local data is IMO not too big a problem as long as you keep the size in check.

What bothers me is (ab)using the hal_X_t types for locals. That is an error because these local variables are most certainly not volatile, and such use provokes the compiler warnings. With volatile in use, it also makes the code run significantly slower because it prevents the compiler from applying a large set of code optimizations.

I got pedantic about this here: https://github.com/LinuxCNC/linuxcnc/blob/master/src/hal/drivers/mesa-hostmot2/modbus/mesa_modbus.c.tmpl#L200 and it got really quite complicated (and might not be right because of this) The existing modules could be fixed, but it would be time-consuming and is complex enough to risk breaking things.

I do appreciate that effort because it shows that you are aware of what is private and what is shared. And, yes, it gets complicated very easily.

It will be a time-consuming effort, but it does need to be addressed in one or another form and way.

@rmu75 limited concurrency is still concurrent with all the problems associated with it. The main problem is memory (and cache) synchronization between CPUs. That will get more tricky with more advanced CPUs.

Is "if (a) then b = a;" a danger only in the context of HAL shared memory, or in general in any multi-threaded application? I would assume that the compiler is clever enough to avoid running "if (a)" and "b=a" in separate threads that might get a different "a"?

It is not about whether "if(a)" and "b=a" run in separate threads. They do not and never will.

The point is that you read the value of 'a' twice. That means, when 'a' is a shared variable, there is a non-zero probability that the second read returns a different value! It may have been changed by another thread between the reads. And you actually cannot say when the change occurred, because that detail is up to the CPU hardware. It may have been changed a "long" time ago, but the coherency flush just picked a "bad" spot for your application.

Remember, the ordering of reads and writes is not up to the programmer unless the compiler and CPU are explicitly told to do things in a very specific way. The compiler and CPU are normally free to shuffle code within a thread to optimize any way they like, including which memory synchronizes and when.
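To make the hazard concrete, a small illustration (the names are invented):

// 'a' is shared; another thread may store to it between the two reads,
// so 'b' can receive a value the if-test never saw.
extern _Atomic(int) a;
int b;

void racy(void)
{
    if (a)        // first read of 'a'
        b = a;    // second read of 'a' -- may differ from the first
}

void snapshotted(void)
{
    int snap = a; // single read
    if (snap)
        b = snap; // guaranteed to be the value that was tested
}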

If we are talking specifically about HAL pins then I think that, whilst there could be a problem, there generally is unlikely to be. The base thread can interrupt the servo thread, but typically the base-thread components do not write to HAL pins that are read by servo thread components.

When there is only one RT thread, everything runs sequentially within that thread. As @rmu75 notes, the input from gcode goes through a special queue, and the input/output drivers to hardware run in the same thread. Once multiple threads run and interact, just like your encoder example, you need to be careful. That is all fine and has apparently been tuned to work properly.

However, the scenario I am worried about is where interactions cause data to change while an RT function runs. The functions added to an RT thread run in sequence, but their pin values are not fixed at the time they run.

When a RT function is called, it should run in the following way:

  1. add a memory barrier
  2. make a snapshot of all the pins (used in the function)
  3. run the function on the snapshot data
  4. copy output-, io- and r/w param pins from the snapshot back to shared mem
  5. (add a memory barrier)

Then you no longer have to worry about multiple reads of the same variable, as in 'if(a) b=a', because 'a' will always be the snapshot value.
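A hedged sketch of steps 1-5, reusing the hypothetical hal_u32_get/hal_u32_set accessors and the hal_u32_t struct typedef sketched earlier (all names invented):

#include <stdatomic.h>

typedef struct {
    hal_u32_t *count_pin;  // pin living in HAL shared memory
    rtapi_u32  count_snap; // snapshot used while the function runs
} comp_data_t;

static void comp_func(comp_data_t *d)
{
    atomic_thread_fence(memory_order_acquire);  // 1. memory barrier
    d->count_snap = hal_u32_get(d->count_pin);  // 2. snapshot the pins
    d->count_snap += 1;                         // 3. work on the snapshot
    hal_u32_set(d->count_pin, d->count_snap);   // 4. copy outputs back
    atomic_thread_fence(memory_order_release);  // 5. memory barrier
}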

Please note that it does not matter that creating the snapshot is not atomic. Reading pins in sequence isn't atomic today either, and it is not necessary as long as some guarantees about execution order are maintained.

The above change prevents inconsistencies while the function runs. That is more important than having the set of pins be atomic. Therefore, no locks are required.

Also, the snapshot can simply live in shared memory, next to the actual pins (in the function's data structures). The snapshot is always local to the thread, but putting the data next to each other makes handling and administration easier.

Writing to HAL input pins is OK and necessary during init, to set default values to pins that are only optionally connected. I would be interested to see if you have found cases outside this scenario.

IMO it is never OK to write to input pins. Input pins should be declared const; that way you can catch wrong-direction writes at compile time. All pins' initial values should be part of the pin-creation function call, regardless of direction.
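A sketch of how const could enforce pin direction at compile time (illustrative struct and field names):

typedef struct {
    const hal_bit_t *enable; // input: the component may only read it
    hal_bit_t       *active; // output: the component writes it
} comp_pins_t;

// '*pins->enable = 1;' would now fail to compile, catching the
// wrong-direction write at build time instead of at run time.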

@rmu75 io-pins are, strictly speaking, not input pins; they are writable pins. However, io-pins are a special case, and I am not sure they should exist at all.

I think that introducing S64 and U64 HAL pins was a mistake, we should have just made all integer pins 64 bits.

That could be one way to look at it. It surely isn't a question of memory usage. But changing it is also a big change and may break things. But that will be another discussion.
