-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enqueuecommand() - buffer overflow? #2911
Comments
Yikes! And I'm sure you saw this thread with a serial communication corruption happening: We better figure this out. But I'm wondering if it is OK just because BUFSIZE says how many commands we have buffer space to hold. So, it might be OK without the -1 ??? UPDATE: I think it is OK. cmd_queue_index_w never gets to a value of BUFSIZE because of the mod operation at the end of this code. And commands_in_queue gets set to 1 when the first command comes in and gets stored at command_queue[0].
|
Of course i've seen #2890 - that's the reason for searching for hints. At the moment i'm also quite sure that in the case where In between i know for sure that the |
That may be the case. Without some extra space, a lot of the LCD Panel control is going to fail because they enqueuecommand()'s all the time. |
The other problem is that the host are counting "ok"s. |
If the hosts are counting OK's that are generated by enqueuecommand() calls done by the LCD Panel, aren't they going to get out of sync? |
@Roxy-3DPrintBoard enqueuecommands_P() in not only used by the menu system. Some g-codes use it. For example G29 with Z_PROBE_END_SCRIPT, or G29 (Mesh), or M428, FILAMENT_RUNOUT_SCRIPT, ... That is no problem as long as they are not used in a script/program, but ... |
If you are interested in my tests you could try to read AnHardt#14 commit by commit. Preliminary Result: |
I vaguely recall there was a detailed conversation around OK responses when the |
All the threads I read, where about the case where an "ok" was missed. Non of them took in account there could be an extra "ok". Meanwhile i discovered a (the?) problem with the MarlinSerial. Searching at https://github.com/arduino/Arduino/pulls for |
Ah, before i forget. |
@AnHardt Can you paste the defective lines of code into a post so we can talk about them. Is the problem with a 16 bit pointer that it isn't being incremented autonomously? |
@Roxy-3DPrintBoard Code? Everyehere where |
So what is the right answer? I guess there are a couple of possibilities. We could turn off interrupts while the pointer is being updated (or checked). Another possibility is we could just use single byte indexes into a doubly dimensioned array. |
@Roxy-3DPrintBoard I'm not convinced my PR is the complete solution for the problem, but if you feel the difference (and i do), it's definitely a PR you want. Just try it. |
Rereading #1922 #2000 Some claim increasing BUFSIZE would help to avoid errors. They are right, if they are the ones using enqueuecommand() or other commands producing extra "ok"s, intensively. In normal PingPong (receive ok - send next line) mode these buffers are rarely (never?) used. |
Try to run:
on a unmodified Marlin. |
Thank you for the links. I've read them both quickly. I will need to go back and re-read them again.
I'm not comfortable 'decreasing' the chances of a failure. (I know you were not suggesting this.) I think we want to eliminate any chance of problems. I understand and agree with that analysis.
I'm not in a position to try it right now. We may be able to get somebody else here to try it and confirm the improvement.
Is this true? Have we seen an increased number of communication bug reports from people with Arduino Mega's? Even if we don't know the answer to this question, let's keep this in mind in the future so we can get more detail when somebody complains.
Agreed.
After reading those threads, a number of thoughts come to mind. 1 We probably should reduce the indexes to the serial buffers to 8 bit values to avoid non-atomic operations. 2 We could (should?) disable the interrupts around any rx_buffer.head and rx_buffer.tail usage. If we could look at the generated code from gcc with -ii files we could verify where read modify write operations are not happening and maybe not do that extra work everywhere since above we have already reduced the indexes to 8 bit values. 3 We could add a 'guard byte' at the end of every serial buffer line that is set to a never used value. When we go process the line, we could check that the 'guard byte' has not been corrupted and perhaps we can safely infer that the given line is intact and has not over flowed the buffer space. This would not protect us from a character getting mangled and placed in the buffer. But it would help us know the buffer did not get over filled. 4 This is much more work, but it kind of feels to me like we should consider working with Repetier and PronterFace to put the ability to do line by line checksums and error correction. |
Would it make sense to use the current serial library instead of the older one that we are using? Some of these bug fixes are in the new library. |
How to get rid of MarlinSerial and replace it with Arduinos HardwareSerial i showed in #2945. Like everything this has pros and cons.
For me MarlinSerial is the winner. |
Hmmmm.... I guess my initial reaction is different than yours. Obviously, you know more about the serial stuff. But my thinking is:
An extra 1.5Kb of program memory and 122 bytes of RAM to make the communications solid don't seem that bad.
If I remember right, you actually proved the 125000 baud rate was better in many cases because it didn't flood the processor with as many interrupts and cause it to stutter. Do many printers have the ability to fully utilize the 250,000 baud rate? If so, can we move the hack to the HardwareSerial code?
Wouldn't a Transmit Buffer be a good thing? While that is happening we can start processing the next command and get a little bit of a head start. Right now it 'Busy Waits'. That is a waste of processing power on an already over burdened microprocessor. Do you know of any cases where switching to an interrupt driven transmit buffer would break things? |
The hardware does not one interrupt per baud rate tick, not one interrupt per received bit, but one interrupt per received byte. That means the load depends on the amount of data the host sends - that is independent from the baud rate. Only the time between the (possible) interrupts decreases. If that time is shorter than the higher priority stepper interrupt lasts, we need the "hack", to ask during the stepper interrupt for new incoming input. Lets calculate. We'll take a HR_Yoda_Bust (http://www.thingiverse.com/thing:10650)(these Yoda because i was not able to print him without reducing the faces). Slice it with CuraEngine (Default, Fast). We get a file size 13,204,071byte in 404,023 lines. Repetier claimes it will print in 1h:22m:48s = 4,968 seconds. 14 bytes/line (Linenumber and checksum). 13,204,071 + 404,023*14 = 18,860,393bytes. 10bits/byte (parity stop). 18,860,3930bit/ 4,968 = 37,964baud in average. Tried to test that but failed. When doing a lot of fast short moves - the planner-buffer decreases fast - and there would bee a need to refill the buffer - we have not enough remaining processor power (or an other blockage) to parse the next lines to fill the buffer - so we do not send "ok"s - don't get new input. Even with extra large buffers and prefilled B-buffer the p-buffer runs empty.
When it matters we send "ok\r\n" = about 40 bits. We don't have to wait for the first byte - so 30 bits. That means we wait about 0.26 ms @115200bd, 0.12 ms @250000bd
I don't think it was me.
I think so - if not overloaded by moving.
I have not tried yet. But it can't be that difficult. But that would also lessen the beauty of having no own hardware serial code. Experiments will go on. |
Doesn't a Transmit Buffer help us because we put stuff into the buffer and when the transmitter is empty it interrupts and grabs the next character. If that is what would happen, that lets us get a head start on what ever else has to be done while the buffer is being sent. |
|
Anyone an idea how to measure the impact of a write-buffer? |
On memory space? I suspect you mean with regard to how much CPU time it saves. Here is an idea that would warrant a whole new MCode. We could write a function that stored the current time and then kept track of how much idle() time it accumulated. We would have idle() mark the start and end of its loop. We would be able to measure if an interrupt driven transmit buffer saved time. I'm pretty sure it would. But with a new MCode, we could measure it. |
Now i have added a TX-buffer to MarlinSerial. (AnHardt#22) The use of memory, compared to RCBugFix is much better than using the original Arduino HardwareSerial.
Still looking for a way to measure how much time the TX-buffer saves. On the other side for +28 byte we can be less shy than for +1.5k. |
Having a counter that continues to be incremented as the processor 'Busy Waits' for the transmitter to be Ready would get you most of the information. You would know the clock cycle count of the loop that 'Busy Waits'. So, if you had a way to dump this counter 10 or 15 minutes into a print, you would know how many clock cycles were spent unproductively (doing nothing) waiting for that Transmitter Ready bit to be in the right state. (Probably, this counter would be a 4 byte long integer and it may be necessary to add some extra NOP instructions to the busy loop just to keep the counter from over flowing.) And then combined with the information of how many bytes were transmitted (probably another counter that gets dumped at the same time), you would know the average number of clock cycles per byte spent busy waiting. If this number is more than the number of clock cycles it takes to enter the interrupt routine, fetch the character, give it to the USART, and return from interrupt, you would know it helped and by how much. UPDATE: More creeping elegance. You could also double check these numbers by adding yet another counter in the idle() loop. If the Tx buffer interrupts are saving processor cycles, you you should see it show up as an increase in idle() time. (But in this case you would have to compare different environments. You would be comparing idle() loops with and without the TX buffer operating.) |
Answering #2994 (comment) here. #2994
I'd cal it evolution. #2981 handled one aspect of a special kind of errors on the Serial and gave enough improvement to justify publishing it. #2994 includes #2981 and adds an other part of the puzzle. My AnHardt#22 includes #2981 and #2994. They all work on the same two files. The later are basing on the older. The order of the changes is: Do less dangerous changes with high impact first, and more risky changes with lower potential for improvement last. Normally i prefer to make PRs you could merge in each order, like #2983 #2984; dealing with the same problem - to many ok's - but fighting different causes. Here this was impossible. I could do all that changes in one commit but than this one would be so complex nobody could see the consequences. The most people here around seem to give up reading a commit if more than 20 lines of code are affected. or a two liner you get 10 highly philosophic comments - for a 200 line, nothing. (exceptions prove the rule) Here is what I'm thinking: In 2890 alexrj (developer of Slc3r) is complaining about errors in Marlins communication. foosel (develpoer of OctoPrint) complains since last summer about problems with Marlins communication. It has rumor about repetier does not care about Marlin anymore - because of errors in the communication. If i'd be the owner of Marlin's root repository, i'd take this very serious and would try to improve the situation as soon as possible. At least i'd answer alexri's offer for help. (#2890 (comment)). |
Summary:
If everything is repaired so far:
This errors can be reduced by:
This errors could be completely avoided if all needed read-modify-write operations would be protected against interruption. This problem is also known for the Arduino HardwareSerial, but not fixed until now. Other observations:
|
Thank You for taking the time to write all that up. I did not understand the full extent of the problem until now.
I think I would argue we want to detect it immediately. If we wait until the line numbers are incorrect, can we recover and say we want the previous line again? And, we have already tried to execute a line with an error in it, right?
Why are these commands sending extra OK's ? This doesn't make sense. Did all of these commands get added by the same person and that person did not understand something? It would seem we can correct these 4 commands by just deleting the line of code:
in them.
What is your suggestion on this? My initial thinking is we need to make enqueuecommand() mark each command it puts in the buffer so that when it is taken out of the list of things to do we can look at it and realize it was generated internally and not something that was sent from the host. If it is something generated internally, we don't want to send an OK. |
We can detect if the ringbuffer is full when we try to write to it But all we can say, is: we skipped a char - nothing else. |
My apologies. I missed all of these. (I was travelling, but I should have seen them when I got back home.)
Instead of sending an OK, what is the NAK (Negative Acknowledgement Sequence) ? Can we send that immediately to get the line re-transmitted? I don't know anything about the GCode protocol, but it kind of looks like this is trying to recover from lost characters. Bad CheckSums cause this code to be executed:
|
We do not now what line! |
Why? Is it because the line number got corrupted? It appears we always know the last successful line number:
And that code up above re-pasted here:
Flushes the recieve buffer and requests a resend starting at line gcode_LastN+1
Can you help me understand the scenario where missing some characters causes a failure we don't recover from? |
Because the skipping occurs at the write end of the buffer and we are reading at the other end. We have no clue about what is in the buffer. |
Yes, but lines up to that line should be OK. We can keep processing lines until we get to the line that is missing some characters. And we know the line number of the last good (and correct) line. So if the code flushes the receive buffer at that point and requests a resend starting with the bad line (that was missing characters), doesn't it recover? (Mostly, I'm trying to figure out what is left unfixed and still needs attention.) |
Yes, if we have read the line and have analysed it and we find an error we exactly know what the last correct line was.
We can detect it immediately when we want to write to the buffer and it's full. So, what to do with the information we skipped? Wait until we know what line to resend. That's what we already do. The information does not help us - except when we are debugging and want to know why this damn char got lost. |
Thank You for your patients! I have a much better understanding now.
My wording was sloppy. I really meant "We want to detect and respond as soon as we start analyzing the corrupted line." And it appears that does happen. I don't like the simple 8 bit XOR checksum. That seems very weak. But it is what it is... I agree that we need #2983, #2984 and #2994 ASAP. But don't we also want these |
@thinkyhead Can you get these changes merged into RCBugFix? |
Are part of my test documentation (AnHardt#14). Not very interesting if you don't want to make your own tests. |
#3012 is my proposed solution for the corrupted command queue. It simply adds a single line buffer for the serial stream. It introduces one extra |
Related to the original post in MarlinFirmware#2911. The comparison of `commands_in_queue` to `BUFSIZE` in `enqueuecommand` is actually correct. It is the comparison to `BUFSIZE - 1` that is incorrect. This commit patches that error.
Thank you for your interest making Marlin better and reporting this issue but this topic has been open for a long period of time without any further development. Marlin has been under heavy development for the past couple of months and moving to it's last mile to finish the RC cycle and release Marlin v1.1.0. We suggest you to try out the latest RCBugfix branch and reopening this issue if required. |
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Could please someone check if
should better be
I could be wrong, but at other places
is used.
The text was updated successfully, but these errors were encountered: