-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LinuxCNC and musl C library (possible?) #3185
Comments
|
For the pthread_attr_setaffinity_np error, this commit from AWS might offer some insight: |
struct timeval is declared in sys/time.h. It looks like it is not included. Maybe it gets included indirectly with glibc and not with musl. Try to add #include <sys/time.h> at the top of hal/user_comps/xhc-whb04b-6/usb.h where the other includes are. |
Getting closer! That solved that! Thanks! |
Shining light on basename error: https://bugzilla.mozilla.org/attachment.cgi?id=8516179&action=diff |
Just squashed libnml/buffer/tcpmem.cc by adding: #include <sys/types.h> To the file. WOO!! |
The declaration of "extern FILE *const stderr;" is not standard, but it is not wrong. I suggest to change/fix the code to use freopen(3), without reassigning to stderr. [edit]The return value may be used to check for failure, but the reassignment is done implicitly by freopen.[edit] |
Now that I don't know how to properly do. |
change the lines (src/emc/task/backtrace.cc:43): Description of the change: There is no need to close "old" stderr before reassigning it. The close operation is implicit in the call to freopen (man page says: The original stream (if it exists) is closed.). The assignment to stderr is the problem that freopen is supposed to solve(*). So, just use freopen and drop the assignment. (*) There is some irony in the original code. The assignment to stderr and friends is the exactly the problem that freopen is supposed to solve and then you see a piece of code using freopen and assigning to stderr... Gotta love it. ;-) |
Thanks, I'm not much of a coder! I'll try it in a bit! |
Things are looking REALLY good! Only two files left; emc/rs274ngc/interp_o_word.cc and rtapi/uspace_rtapi_app.cc I made a musl branch with the respective changes: As for M_TRIM_THRESHOLD and M_MMAP_MAX, I spoke to a few people in the musl IRC channel and someone didn't see why 32MB of pre-faulted memory would be any different with or without mmap. It could not be doing anything on glibc builds either, i.e. changing the trim threshold modifies when free() returns memory to the OS, but as this is a real-time application, it pre-allocates and faults in all it's memory. As a result, from the perspective of the allocator, the program doesn't allocate anything as it's only done once on init so M_TRIM_THRESHOLD and M_MMAP_MAX could be doing nothing. |
The mmap problem comes into existence when you allocate large blocks. The allocator may choose to use mmap instead of sbrk. This has been explicitly disabled by setting M_MMAP_MAX to zero. I agree that the problem might be academic, but there may be many corner cases. For musl, it may not apply for their allocator. That can only be answered by investigating the allocator code. For this case, it may be necessary to add a configure test to check mallopt() and the existense of the mmap and threshold flags. Then it can be coded in a preprocessor conditional. |
Nice! |
Yes it is! |
What's the problem here, exactly? mmap on Linux behaves the same as brk.
How? munmap and brk reduction don't care about regions being locked. "The memory lock on an address range is automatically removed if
FWIW musl's malloc always uses mmap. |
The point of The code in question allocates a 32 MByte (PRE_ALLOC_SIZE) buffer using malloc() and touches every page. Then it actually does a free() on that same buffer. A call to mlockall() precedes to lock the pages, which makes them resident and unswapable. This is where the two mallopt() calls come into play. They are, apparently, glibc specific on linux, to inform the allocator never to release memory back to the OS while running, preserving the pages to be resident in memory, regardless whether they are malloc'ed or not. The mallopt() calls are already in a #ifdef __linux__ block. When musl always uses mmap, then there must be a way to guarantee never to release touched pages back to the OS, even if they are not currently used by the allocator. Remember, the code does a malloc()/free() combi on that large block. So, if musl were to munmap the pages after the call to free(), then musl will break real-time for the process and re-introduce page-faults because the memory would no longer be resident at the next call to malloc. The question is: does musl's allocator break the resident pages requirement used for real-time by possibly releasing memory back to the OS? If so, is there a way to make musl preserve the resident pages and make it refrain from releasing memory back to the OS? |
Yeah.... this doesn't actually work :) mlock doesn't stop you from getting page faults, merely stops you from getting expensive page faults (swapping, re-faulting in from a file, etc). Things like page migration can come in and ruin your day.
Yeah... not only will this not work, it might not even work for glibc (in the future, or maybe now, idk). It seems to expect that your PRE_ALLOC_SIZE allocation is reused for all of your heap needs? This isn't actually written down anywhere.
Yes, it will release memory back to the OS in certain situations, and no, you can't stop it from releasing memory. But it does sound like what you really want is your own custom allocator if you really need all of those things (and mlock, again, will not stop you from getting a free trip to the kernel via page fault). |
By the way: it's very possible your prefaulting loop has been compiling into nothing for a while (https://godbolt.org/z/379d57x4v). The last gcc godbolt has that doesn't straight up optimize this example out is gcc 4.4.7 :) |
Clang seems to cut it out... |
It isn't supposed to prevent all page-faults. It is supposed to prevent all page-faults on the locked memory because the locked pages are kept resident after being touched. That does work, in principle. Posix says that locked pages must be memory resident. The notes in linux' mlock(2)/mlockall(2) state:
Assuming that free()'d memory is reused by the allocator algorithm may not hold now or in the future. That is very true. However, I think it still holds for glibc. Another problem is that it is missing a call So, you are absolutely right, the current implementation is not adequate. [snip]
Indeed, a custom allocator would be required to do it right. Anyhow |
Just tested on:
Each compiler/glibc combi with (-Wall -O3) would allocate memory by setting brk, never reduced brk after free and malloc'ing a small piece of memory after the freeing the large block would occupy the same (locked) memory. And I never added the M_MMAP_THRESHOLD option as indicated earlier. It appears not to be required. I guess it works as intended with glibc. |
I just want to say this is all over my head :) |
Mine too, pretty much. I understand the problems being described, but I was unaware of them unil now, and have no idea what fixes might be required. |
The basename() problem is rather peculiar. There are two versions in glibc. The Posix version may alter its argument, whereas the GNU version never alters the argument. Now, the GNU version is used because libgen.h is not included. The easiest fix is to include the code in the file to prevent any future problems. So, add this function above Interp::control_back_to() (line 540): // Prevent any possibility of the Posix version of basename(3) being used
// (possibly altering the argument). Therefore, local reimplementation
// of the *GNU version* of basename(3).
static char *thebasename(const char *filename)
{
char *p = strrchr(filename, '/');
return p ? p + 1 : (char *)filename;
} And change all occurrences of basename() to thebasename() (lines 548, 549, 612 and 613). The code originates from glibc (https://sourceware.org/git/?p=glibc.git;a=blob;f=string/basename.c). This small piece of code is rather trivial. Regardless, the licenses are compatible (LGPL 2.1+ in glibc and GPL 2 for the interpreter). |
is basename only used in the c++ part? if so would it make sense to use this opportunity and replace this with |
I'll let BsAtHome be the answer for that. After basename is sorted, we're just left with 'pthread_attr_setaffinity_np' vs. 'pthread_setaffinity_np'. |
Yes, it seems only be used in this specific place. Yes, you can argue to use the c++ equivalent function. It is slightly more expensive to use, which would suggest to cache the filename part. It is also a filesystem-generic solution to the problem of finding the basename. Does the c++ version give you the same results as the current version? Not sure if that should matter, but the question is whether it would be prudent to change. I'm not sure we need the extra complexity for a generic solution. Or are the O-words' filenames possibly very complex? If so, then we'd seen problems already, which we haven't. So my guess is that a "simple" fix should suffice. Attached is a suggested diff to fix the basename problem with cached result. |
To me it's more a matter of principle to incrementally modernize the codebase, e.g. replace #defines with functions or enum class, avoid C-style string handling etc... I'm pretty sure nothing using basename is in any way performance critical. |
It is not about being performance critical. Calling the same function giving the same result is just Bad CodeTM in my view... And, yes, it may be good to change the code to become real c++ instead of c-with-classes. But for the glibc->musl thing it may be too much asked in this instance. Changing c-strings into real c++ strings, while a Good ThingTM to do, is a very large task and making just this one basename change is not going to make the code more c++-like. It would be an admirable goal to make the code to conform to (modern) c++ standards. But then you should start a separate project to do so because it will be a very long (and probably bumpy) road. |
src/emc/rs274ngc/interp_o_word.cc built! Only two left:
|
Actually, it is just one problem giving two errors. pthread_attr_setaffinity_np() does not exist in musl. The function is appended "_np", meaning non-portable. Musl does have a pthread_setaffinity_np() implementation, which could be used after the call to create the thread. However, my pthread portability wisdom is not up-to-date and can not say for sure what to do here. But a guess would be to, preserving the current status quo for glibc:
If the status quo for glibc may change, then drop using the attribute for scheduling affinity and always use pthread_setaffinity_np(). Then the same code should work for glibc and musl. Please, someone with current pthread knowledge should confirm this or point out the flaws in my reasoning :-) As a test-case, see code below. Note that it cannot fail in the same way because the thread was already created. The choice is to kill the thread and return with an error or live with failing to set scheduling affinity. The glibc implementation actually starts the new thread in a "stopped" mode and kills it if setting the affinity fails. If the glibc behavior is required, then an additional trampoline while creating the thread is needed to emulate glibc behavior. ... remove previous if(nprocs > 1) {}...
... pthread_create()...
if(nprocs > 1) {
const static int rt_cpu_number = find_rt_cpu_number();
if(rt_cpu_number != -1) {
#ifdef __FreeBSD__
cpuset_t cpuset;
#else
cpu_set_t cpuset;
#endif
CPU_ZERO(&cpuset);
CPU_SET(rt_cpu_number, &cpuset);
if(0 != pthread_setaffinity_np(task->thr, sizeof(cpuset), &cpuset))
rtapi_print("ERROR: failed to set thread scheduling affinity (%s)", strerror(errno));
}
} |
They do not, but their reasoning IMO makes a lot of sense: https://www.openwall.com/lists/musl/2013/03/29/13 Instead of using Also, not sure if you saw my post earlier: Tried your change and it compiled! |
That would indeed be the way. But I will not be writing autoconf tests. Someone else will need to fight configure.ac. |
You've done more than enough and I greatly appreciate all your help, you've been fantastic! LinuxCNC at least now compiles against musl for the very first time! The LinuxCNC devs can now review the changes and hopefully get some of these merged, or fix up whatever isn't exactly right. Should I open a pull request? |
You could and it can be discussed in the pr. I don't think it will be accepted as a whole, but it gives a good indication of what needs to change. You can use the feedback to distill viable parts. |
Here are the steps I follow to reproduce the issue:
This is what I expected to happen:
A proper compilation.
This is what happened instead:
A few errors, maybe an easy fix?
It worked properly before this:
Clang 16 and glibc built fine when I last tested this.
Information about my hardware and software:
I am using Gentoo
I am using this kernel version (shown by
uname -a
): Linux ntu-gentoo 6.12.0-rc7-BUILD SMP Mon Nov 11 10:20:31 CST 2024 x86_64 AMD Ryzen 7 2700X Eight-Core Processor AuthenticAMD GNU/LinuxI am running ...
I am using this LinuxCNC version (shown in package manager or, for git versions,
scripts/get-version-from-git
): latest 2.9 git checkoutThe text was updated successfully, but these errors were encountered: