Hal volatile warnings #3222
runtests seems to run OK on my dev setup (in sim).
It also runs fine on an RPi5 with a Mesa 7C80 card, while running axis over an ssh-tunneled X session and having halscope probe some x and y position pins.
Can you test it with RTAI (kernel mode) realtime? Is it possible that the "volatile" modifier is relevant there?
I don't have an RTAI kernel here or a setup to test it on. However, the volatile modifier is never relevant in multi-threading (see email list reply). There is absolutely no difference between user-space and kernel-space for this issue. They address the same memory using the same code in the same way. So if one fails, the other fails. If one succeeds, the other succeeds.
You describe a race condition, which is not solved by using volatile. You need atomicity to fix any race between threads, regardless of whether you run in kernel- or user-space.
It can hide bugs. I.e., the compiler won't act on (optimize out, eliminate) code that depends on undefined behaviour just because something is declared volatile. That is clearly a bug; no code should rely on undefined behaviour, because compilers are allowed to (and do) change how they deal with undefined behaviour (like overflowing signed ints). As in the Linux kernel, volatile should be used only at the access, not in interfaces, and not in data structures. The exception is memory-mapped hardware registers, but even there it is IMHO debatable how and where to use volatile. I will try it on real hardware (RPi4 with Mesa 7C80 and PC with 7i92) next week.
And, it is probably hiding bugs. I've been experimenting a bit and changed the hal_X_t typedefs to _Atomic and std::atomic versions. Many accesses turned up that were not atomic, and some were not proper. Uses of the typedefs were also detected where they have no place. It will take some more digging. Fixing the underlying issues will take some work and many small fixes. There is also the analysis of where which kind of memory synchronization level is required. Most of hal's values are currently ad-hoc synchronized and should all work with the relaxed model. However, the defaults between the C and C++ accesses differ, where C defaults to relaxed and C++ defaults to seq_cst. All accesses therefore need to be done with an explicit memory order. The hal administrative data access is protected by a properly synchronized mutex, so that works out fine.
And it is much worse... Fixing one line in src/rtapi/rtapi_atomic.h:21, where the __STDC_VERSION__ macro is misspelled, results in many problems compiling with clang/clang++ (see: 04fe8ab). Clang requires the use of _Atomic on variables when using
One thing that might help is that I am fairly sure that realtime HAL modules (which include the realtime motion) run sequentially, and probably single-threaded. Even if not single-threaded, I think that the way realtime functions are called ensures that only one module is executing at any one time. (The compiler has no way at all to thread multiple components in parallel, as the HAL structure is defined only at configuration load time.) Thanks for looking into this; it's way outside my area of expertise.
Well, the problem always comes into play with shared access once you have more than one thread/process accessing the same memory. It doesn't matter that one of the threads does things sequentially. You simply cannot know when one thread reads the same memory the other writes. Therefore, it must be fixed for all accesses.
Yeah, I was just fixing all those annoying warnings in the compile. A compiler warning is a bug that (hopefully) has not yet come back to haunt you. Now some of them are beginning to haunt, and it is time to get out the proton packs and store the warnings securely in the ghost traps. ;-)
I wouldn't be so sure about realtime HAL threads running sequentially. I have no idea about the RTAI situation, but on preempt-RT, with no isolcpus or with more than one core reserved for realtime, I would be surprised if a stepgen thread did not run concurrently with the servo thread, or possibly preempt it. But with a usual HAL setup, what matters is writing consistent values (32/64 bit at once), i.e. atomicity, and that should be guaranteed implicitly for aligned values on the relevant platforms; maybe except for double on AArch64. Memory ordering shouldn't matter; nobody cares if the DRO shows some values from the previous servo period. A memory barrier at the beginning or end of the realtime thread loop should suffice. Of course you can do stupid/funny things in HAL, like Shannon's ultimate machine in two threads, but there is no way to prevent that.
[snip]
You have been asking the atomicity question about double on aarch64 several times, so I decided to test it. The question is whether or not the CPU arch can load/store without using extra locks (i.e. being lock-free). This can easily be tested because the C/C++ atomics API provides a function/method for asking. Attached is a small test program in C and C++ with a script running it with gcc, g++, clang and clang++. The results are very interesting. atomictest.zip Tests performed on:
Results:
There is a difference between architectures and compilers. Gcc/g++ behaves the same on both architectures. However, clang is able to do 128-bit lock-free load/store on Aarch64, but clang++ is not able to do so. It should be noted that this is on old X86_64 hardware; maybe newer CPUs will allow lock-free 128-bit types. Either way, the largest type used in LCNC is 64-bit, so supporting larger types is academic. Note: the default clang++ 14.0.6 on an RPi3 with Raspbian crashes with a segfault; installing clang-16 fixes that. The LCNC RPi5 image used on the RPi5 does not have any problems with clang++.
Yes, that makes sense. I don't remember where I got the idea double could be problematic on arm; I think I misread something talking about "double-width stores needing a loop on arm64", meaning double int-width (128 bit) and not the width of double.
Just to clarify, as perhaps I did not make myself clear.
In the meantime, a sketch of the proposed typedefs:

```c
// In rtapi/rtapi_atomic.h
typedef _Atomic(rtapi_u32) rtapi_atomic_u32;    // if C
typedef _Atomic(rtapi_s32) rtapi_atomic_s32;
// ...
typedef std::atomic<rtapi_u32> rtapi_atomic_u32;    // if C++
typedef std::atomic<rtapi_s32> rtapi_atomic_s32;
// ... and repeat for u64, s64, bit (i.e. bool) and float

// Then in hal/hal.h
typedef rtapi_u32 hal_u32_ut;    // The underlying type
typedef rtapi_s32 hal_s32_ut;
// ...
typedef struct { rtapi_atomic_u32 _a; } hal_u32_t;    // The atomic type
typedef struct { rtapi_atomic_s32 _a; } hal_s32_t;
// ... etc. for the others
```

Such a change has no impact on the actual storage sizes of the types or their alignment in shared memory. It just hides access to the variables so that you are forced to use access functions. The access functions are inline atomic load, store, inc, dec, add and sub (except for float, which has no atomic add/sub/inc/dec). The access functions are completely eliminated by the compiler optimizer and become the right CPU instruction(s). Some findings:
There are a couple of ways to go forward:
Option 2 is IMO the only viable one. It is easy for halcompile-generated code. It will also eliminate the races within the functions (though the inter-function-call races will remain). It costs more shared memory space (about double), but that is a small price to pay. Note that option 3 will cause a real problem in the future, because CPUs will become less ordered to improve speed. It will expose races not seen today because the ordering changes. It should be noted that using (weakly ordered) ARM arch CPUs already poses a greater risk in that respect than the (strongly ordered) X86 arch. More discussion input welcome.
I am not sure that it is abuse to use the hal_X_t types; they only end up using shared memory if you hal_malloc them there. Is "if (a) then b = a;" a danger only in the context of HAL shared memory, or in general in any multi-threaded application? I would assume that the compiler is clever enough to avoid running "if (a)" and "b=a" in separate threads that might get a different "a"? This still leaves the possibility of user-space code writing to a HAL pin at the same time as it is read by the servo thread. A common case is setting index-enable during homing. I don't think that causes a problem, but I would need to pore over the code to be sure. Writing to HAL input pins is OK and necessary during init, to set default values for pins that are only optionally connected. I would be interested to see if you have found cases outside this scenario. I think that introducing S64 and U64 HAL pins was a mistake; we should have just made all integer pins 64 bits.
HAL and linuxcnc are designed in a way that no real concurrent access to critical data happens. The G-code interpreter communicates with realtime via its own dedicated exclusive queue. Of course you can mess with stuff and configure stupid things that won't work, but if you follow examples and keep the structure of base and servo thread, everything is OK because no concurrent access happens.

I would not bother trying to fix anything, because it is unfixable in principle. Non-realtime code cannot lock data structures and prevent realtime from accessing them; that would immediately lead to missed deadlines. So everything has to be done in a way that races are impossible or inconsequential. With clear "ownership" and separation of hal pins into "inputs" and "outputs", that should be guaranteed. Doing read-modify-write of the same value from two threads, or from non-realtime and realtime, is probably a bug and should not happen (with exceptions like homing, but that is carefully done so it works). Keeping things as they are won't become a bigger problem with newer CPU architectures, because concurrency is limited. Reading the same shared memory location multiple times in a function may be a problem; that depends.

Userspace (meaning non-realtime) setting something that is read by realtime can't really be that time-critical (else that setting would have to happen in realtime to begin with), so whether the changed value is dealt with one realtime period earlier or later can't make a difference; it would be a bug otherwise. Communicating a set of values that need to change atomically is not possible this way, but it is not needed anywhere AFAIK, and it would not be possible with the current HAL infra. BTW, writing into "input" pins is done on purpose; some pins are bidirectional -- see homing on the encoder index pulse for example.
@andypugh the compiler is not "clever enough" to do or avoid anything regarding threads. It is completely unaware that something like threads exists; it doesn't schedule, distribute, lock or anything else, it just generates a stream of instructions that are supposed to execute linearly. Possible interruptions and continuations somewhere else ("threads"), and on SMP machines real concurrency (on a single CPU real concurrency can't happen, of course), all have to be handled by the program; the compiler doesn't really know about any of it.
The point is that the hal_X_t types, or effectively the pins, live in shared memory. They are currently declared volatile and we have been discussing making them proper atomic types. In either case, these types carry attributes for the compiler that are not compatible with pass-by-value argument types in functions, nor can they be used as value return types of functions. The attributes are also useless (and detrimental) in local variable types. Using the hal_X_t types in these contexts is the reason for the compiler warnings and is therefore inappropriate. You have to distinguish between the variable type living in shared memory and the value type it represents. That is why you need two types to describe what is happening:
The fix in PR #3240 makes exactly the distinction described above. Changing the hal_port_t argument to a reference is an example of that. The hal_port_t is a completely opaque entity and should never be passed by value. Also, the pendant driver is changed to use the underlying type to pass and return values.
The fact that it is done does not necessarily make it right... However, using shared memory for local data is IMO not too big a problem as long as you keep the size in check. What bothers me is (ab)using the hal_X_t types for locals. That is an error, because these local variables are most certainly not volatile, and such use is prone to compiler warnings. With volatile in use, it also makes the code run significantly slower, because it prevents the compiler from applying a large set of code optimizations.
I do appreciate that effort because it shows that you are aware of what is private and what is shared. And, yes, it gets complicated very easily. It will be a time-consuming effort, but it does need to be addressed in one form or another. @rmu75 limited concurrency is still concurrency, with all the problems associated with it. The main problem is memory (and cache) synchronization between CPUs. That will get trickier with more advanced CPUs.
It is not about whether "if(a)" and "b=a" run in separate threads. They do not and never will. The point is that you read the value of 'a' twice. That means, when 'a' is a shared variable, there is a non-zero chance that your second read sees a different value! It may have been changed by another thread between the reads. And, actually, you cannot say when the change occurred, because that is up to the CPU hardware. It may have been changed a "long" time ago, but the coherency flush just picked a "bad" spot for your application. Remember, the ordering of reads and writes is not up to the programmer if the compiler and CPU are not explicitly told to do things in a very specific way. The compiler and CPU are normally free to shuffle code within a thread to optimize any way they like, including which memory synchronizes and when.
When there is only one RT thread, all runs sequentially within that thread. As @rmu75 noted, the input from G-code goes through a special queue, and the input/output drivers to hardware run in the same thread. Once multiple threads run and interact, just like your encoder example, then you need to be careful. That is all fine and it has apparently been tuned to work properly. However, the scenario I am worried about is where interactions cause data to change while an RT function runs. The functions added to an RT thread run in sequence, but their pin values are not fixed at the time they run. When an RT function is called, it should run in the following way: first take a snapshot of its input pins, then run the function logic against that snapshot, and finally write its outputs.
Then you no longer have to worry about multiple reads of the same variable, as in 'if(a) b=a', because 'a' will always be the snapshot value. Please note that it does not matter that creating the snapshot is not atomic. Reading pins in sequence isn't atomic today either, and atomicity is not necessary as long as some guarantees about execution order are maintained. The above change prevents inconsistencies while running the function. That is more important than having the set of pins be atomic. Therefore, no locks are required. Also, the snapshot can simply live in shared memory, next to the actual pins (in the function's data structures). The snapshot is always local to the thread, but putting the data next to each other makes handling and administration easier.
IMO it is never OK to write to input pins. Input pins should be declared read-only. @rmu75 io-pins are strictly not input pins; they are writable pins. However, io-pins are a special case, and I am not sure they should exist at all.
That could be one way to look at it. It surely isn't a question of memory usage. But changing it is also a big change and may break things. That will be another discussion.
There are numerous compiler warnings about using 'volatile' where it is inappropriate and makes no sense. A suggestion was made to simply remove volatile completely and see if things still work.
Well, I did so and applied the patch below, which also replaces the odd define with a typedef.
It works flawlessly. Running

```
scripts/rip_environment runtests
```

runs all tests without any problem. There is no change in behavior. This seems to confirm that the volatile classification is void on all of these types. It also removes all volatile warnings from the compile. There are still other places where volatile is used, but this change would be a very good start. Please, if others can confirm my findings, then we can get rid of the invalid volatile declarations and fix the compiler warnings.