Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enqueuecommand() - buffer overflow? #2911

Closed
AnHardt opened this issue Jan 18, 2016 · 45 comments
Closed

enqueuecommand() - buffer overflow? #2911

AnHardt opened this issue Jan 18, 2016 · 45 comments

Comments

@AnHardt
Copy link
Contributor

AnHardt commented Jan 18, 2016

Could please someone check if

bool enqueuecommand(const char* cmd) {
  if (*cmd == ';' || commands_in_queue >= BUFSIZE) return false;

should better be

bool enqueuecommand(const char* cmd) {
  if (*cmd == ';' || commands_in_queue >= BUFSIZE - 1 ) return false;

I could be wrong, but at other places

if (commands_in_queue < BUFSIZE - 1) get_command();
  if (commands_in_queue < BUFSIZE - 1) get_command();

is used.

@Roxy-3D
Copy link
Member

Roxy-3D commented Jan 19, 2016

Yikes! And I'm sure you saw this thread with a serial communication corruption happening:

#2890

We better figure this out. But I'm wondering if it is OK just because BUFSIZE says how many commands we have buffer space to hold. So, it might be OK without the -1 ???

UPDATE: I think it is OK. cmd_queue_index_w never gets to a value of BUFSIZE because of the mod operation at the end of this code. And commands_in_queue gets set to 1 when the first command comes in and gets stored at command_queue[0].

bool enqueuecommand(const char* cmd) {
  if (*cmd == ';' || commands_in_queue >= BUFSIZE) return false;
  // This is dangerous if a mixing of serial and this happens
  char* command = command_queue[cmd_queue_index_w];
  strcpy(command, cmd);
  SERIAL_ECHO_START;
  SERIAL_ECHOPGM(MSG_Enqueueing);
  SERIAL_ECHO(command);
  SERIAL_ECHOLNPGM("\"");
  cmd_queue_index_w = (cmd_queue_index_w + 1) % BUFSIZE;
  commands_in_queue++;
  return true;

@AnHardt
Copy link
Contributor Author

AnHardt commented Jan 19, 2016

Of course i've seen #2890 - that's the reason for searching for hints.

At the moment i'm also quite sure that in the case where - 1 is used the buffer is not used completely. Now i wonder if that is done by purpose, to give enqueuecommand() a chance to throw in at anytime, or an error.

In between i know for sure that the rx_buffer is full sometimes and incoming chars are thrown away. The question now is why? I'm following a promising hint where two "ok"s are sent for a single g-code (M280, M301, M304, M851, ??? ).

@Roxy-3D
Copy link
Member

Roxy-3D commented Jan 19, 2016

Now i wonder if that is done by purpose, to give enqueuecommand() a chance to throw in at anytime, or an error.

That may be the case. Without some extra space, a lot of the LCD Panel control is going to fail because they enqueuecommand()'s all the time.

@AnHardt
Copy link
Contributor Author

AnHardt commented Jan 27, 2016

The other problem is that the host are counting "ok"s.
In the simple case they are sending the next line when they receive an OK
They also count them if the are coming from enqueuecommands_P() - and overfill the RX-buffer.

@Roxy-3D
Copy link
Member

Roxy-3D commented Jan 27, 2016

If the hosts are counting OK's that are generated by enqueuecommand() calls done by the LCD Panel, aren't they going to get out of sync?

@AnHardt
Copy link
Contributor Author

AnHardt commented Jan 29, 2016

@Roxy-3DPrintBoard
You are ingenious.
About what am I talking here?

enqueuecommands_P() in not only used by the menu system. Some g-codes use it. For example G29 with Z_PROBE_END_SCRIPT, or G29 (Mesh), or M428, FILAMENT_RUNOUT_SCRIPT, ...

That is no problem as long as they are not used in a script/program, but ...

@AnHardt
Copy link
Contributor Author

AnHardt commented Jan 29, 2016

If you are interested in my tests you could try to read AnHardt#14 commit by commit.

Preliminary Result:
1 Sending additional OKs with (M280, M301, M304, M851, ??? ) is evil.
2 Sending OKs for enqueuecommand() is evil.
3 The 'hack' for 250000bd has a problem.
4 115200bd seems to be ok when 1 and 2 are corrected and the hack is out.

@thinkyhead
Copy link
Member

I vaguely recall there was a detailed conversation around OK responses when the ADVANCED_OK feature came along. Maybe review some of the points there before assuming anything... If I find the thread link I will add it below.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 9, 2016

All the threads I read, where about the case where an "ok" was missed. Non of them took in account there could be an extra "ok".
The "ok" is part of the serial protocol. enqueuecommand() is, like sd-printing, not based on the serial protocol - it's a purely internal thing.
Eric introduced enqueuecommand() for the autostart feature. It worked and is still working with some extra "ok"s because usually then is not running a program - no communication over the serial. But since we use it now for filament change, G29 mesh, Z_PROBE_END_SCRIPT, M428 (to call M300 for sound instead of a direct call to buzz()), ... , it's time to repair that.
For M105 we already handle the extra "ok" when we return; instead of break; in the big case statement in MarlinMain.cpp. It's pure luck M280 (manual servo position), M301/304 (PID values), M851 (Z probe's Z offset) are rarely used in g-code programs. Else the overflow of the RX buffer would occur much more frequent. The "ok" in the responses of these commands is not required by any host i know. What could be the advantage (for the host) to know about the servo position, the PID values or if the z-probe-offset was set correctly?

Meanwhile i discovered a (the?) problem with the MarlinSerial. Searching at https://github.com/arduino/Arduino/pulls for is:pr is:open HardwareSerial shows some attempts to deal with race conditions between normal program and ISR, when the buffer pointers are 16 bit long and the baud-rates are high. We have exactly this situation and we do not need 16bit for a 128byte long buffer. Making them uint_8 dropped my error rate at 250000bd considerably.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 9, 2016

Ah, before i forget.
The line-number in ADVANCED_OK is from the wrong side of the buffer. It's the number of the line just pushed into the buffer- not the executed one.
Just activate ADVANCED_OK and send a program with RepetierHost with PingPong off. You'll get lots of messages about corrected buffer usage.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 9, 2016

Meanwhile i discovered a (the?) problem with the MarlinSerial. Searching at https://github.com/arduino/Arduino/pulls for is:pr is:open HardwareSerial shows some attempts to deal with race conditions between normal program and ISR, when the buffer pointers are 16 bit long and the baud-rates are high. We have exactly this situation and we do not need 16bit for a 128byte long buffer. Making them uint_8 dropped my error rate at 250000bd considerably.

@AnHardt Can you paste the defective lines of code into a post so we can talk about them. Is the problem with a 16 bit pointer that it isn't being incremented autonomously?

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 9, 2016

@Roxy-3DPrintBoard
A 8 bit prozessor can't read/write 16bits at the same time. This is done in two operations.
When the normal program reads the firs byte, then the interrupt modifies the number, then the normal program reads the second byte - the number the normal program gets is crap, when the interrupt changed the byte read fist.

Code? Everyehere where rx_buffer.head and rx_buffer.tail is used - ~20 places.
This gets very interesting results - especially when, like in our case with the hack, one interrupt can interrupt the other.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 9, 2016

So what is the right answer? I guess there are a couple of possibilities. We could turn off interrupts while the pointer is being updated (or checked). Another possibility is we could just use single byte indexes into a doubly dimensioned array.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 9, 2016

@Roxy-3DPrintBoard
It would really help if you'd follow my links (arduino/Arduino#3848) or at least have a brief look at my PR #2981.
So, yes, disabling interrupts is needed when the the variables are 16bit. 8bit variables in out case are completely sufficient. Using 8bit variables also speeds up the manipulating them. The chance a read-modify-write-operation is interrupted, is decreased.

I'm not convinced my PR is the complete solution for the problem, but if you feel the difference (and i do), it's definitely a PR you want. Just try it.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 9, 2016

Rereading #1922 #2000
Some claim their usb-connection never have errors. They use the AT90USB chips with the USB-chip directly integrated into the processor. They don't use MarlinSerial. I guess they (some) are right.
The rest of us, using the Arduino Megas with a separate USB-chip and a true serial connection to it, claims to have errors. The most common explanation for that is bad hardware. But that , may be, the reason for a lot of the problems, is, by far, not the one and only possible source. Nophead already told us abut the firmware of the usb-chips. But nobody talked about the idea MarlinSerial, or Arduino HardwareSerial, could have a problem

Some claim increasing BUFSIZE would help to avoid errors. They are right, if they are the ones using enqueuecommand() or other commands producing extra "ok"s, intensively. In normal PingPong (receive ok - send next line) mode these buffers are rarely (never?) used.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 9, 2016

Try to run:

; Produce Some Errors
;
; Configure  TEMP_SENSOR_0 999, TEMP_SENSOR_BED 999, PIDTEMP, PIDTEMPBED, AUTO_BED_LEVELING_FEATURE, Z_PROBE_END_SCRIPT, NUM_SERVOS 1, SERVO_ENDSTOP_ANGLES, ADVANCED_OK. For a bare Arduino Mega.

M111 S15
G28
M105
M280 P0 ;Set servo position absolute. P: servo index, S: angle or microseconds
M280 P0 ;Set servo position absolute. P: servo index, S: angle or microseconds
M280 P0 ;Set servo position absolute. P: servo index, S: angle or microseconds
M280 P0 ;Set servo position absolute. P: servo index, S: angle or microseconds
M301 ;Set PID parameters P I and D
M301 ;Set PID parameters P I and D
M301 ;Set PID parameters P I and D
M301 ;Set PID parameters P I and D
M304 ;Set bed PID parameters P I and D
M304 ;Set bed PID parameters P I and D
M304 ;Set bed PID parameters P I and D
M304 ;Set bed PID parameters P I and D
M851 Z-1 ;Set Z probe's Z offset
M851 Z-1 ;Set Z probe's Z offset
M851 Z-1 ;Set Z probe's Z offset
M851 Z-1 ;Set Z probe's Z offset
G29
G29
G29
;

; move around in small squares
G1 X0.000  Y0.000  Z10.000 F9001
G1 X10.000 Y0.000  Z10.000 F9002
G1 X10.000 Y10.000 Z10.000 F9003
G1 X0.000  Y10.000 Z10.000 F9004
;
G1 X0.000  Y0.000  Z10.000 F9005
G1 X10.000 Y0.000  Z10.000 F9006
G1 X10.000 Y10.000 Z10.000 F9007
G1 X0.000  Y10.000 Z10.000 F9008
;
G1 X0.000  Y0.000  Z10.000 F9009
G1 X10.000 Y0.000  Z10.000 F9010
G1 X10.000 Y10.000 Z10.000 F9011
G1 X0.000  Y10.000 Z10.000 F9012
;
G1 X0.000  Y0.000  Z10.000 F9013
G1 X10.000 Y0.000  Z10.000 F9014
G1 X10.000 Y10.000 Z10.000 F9015
G1 X0.000  Y10.000 Z10.000 F9016
;
G1 X0.000  Y0.000  Z10.000 F9017
G1 X10.000 Y0.000  Z10.000 F9018
G1 X10.000 Y10.000 Z10.000 F9019
G1 X0.000  Y10.000 Z10.000 F9020

on a unmodified Marlin.
The question is only when the first error occurs - not if.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 9, 2016

It would really help if you'd follow my links (arduino/Arduino#3848) or at least have a brief look at my PR #2981.

Thank you for the links. I've read them both quickly. I will need to go back and re-read them again.

So, yes, disabling interrupts is needed when the variables are 16bit. 8bit variables in our case are completely sufficient. Using 8bit variables also speeds up the manipulating them. The chance a read-modify-write-operation is interrupted, is decreased.

I'm not comfortable 'decreasing' the chances of a failure. (I know you were not suggesting this.) I think we want to eliminate any chance of problems. I understand and agree with that analysis.

I'm not convinced my PR is the complete solution for the problem, but if you feel the difference (and i do), it's definitely a PR you want. Just try it.

I'm not in a position to try it right now. We may be able to get somebody else here to try it and confirm the improvement.

Rereading #1922 #2000
Some claim their usb-connection never have errors. They use the AT90USB chips with the USB-chip directly integrated into the processor. They don't use MarlinSerial. I guess they (some) are right.
The rest of us, using the Arduino Megas with a separate USB-chip and a true serial connection to it, claims to have errors.

Is this true? Have we seen an increased number of communication bug reports from people with Arduino Mega's? Even if we don't know the answer to this question, let's keep this in mind in the future so we can get more detail when somebody complains.

The most common explanation for that is bad hardware. But that , may be, the reason for a lot of the problems, is, by far, not the one and only possible source. Nophead already told us abut the firmware of the usb-chips. But nobody talked about the idea MarlinSerial, or Arduino HardwareSerial, could have a problem

Agreed.

Some claim increasing BUFSIZE would help to avoid errors. They are right, if they are the ones using enqueuecommand() or other commands producing extra "ok"s, intensively. In normal PingPong (receive ok - send next line) mode these buffers are rarely (never?) used.

After reading those threads, a number of thoughts come to mind.

1 We probably should reduce the indexes to the serial buffers to 8 bit values to avoid non-atomic operations.

2 We could (should?) disable the interrupts around any rx_buffer.head and rx_buffer.tail usage. If we could look at the generated code from gcc with -ii files we could verify where read modify write operations are not happening and maybe not do that extra work everywhere since above we have already reduced the indexes to 8 bit values.

3 We could add a 'guard byte' at the end of every serial buffer line that is set to a never used value. When we go process the line, we could check that the 'guard byte' has not been corrupted and perhaps we can safely infer that the given line is intact and has not over flowed the buffer space. This would not protect us from a character getting mangled and placed in the buffer. But it would help us know the buffer did not get over filled.

4 This is much more work, but it kind of feels to me like we should consider working with Repetier and PronterFace to put the ability to do line by line checksums and error correction.

@Bio-Printer
Copy link

Would it make sense to use the current serial library instead of the older one that we are using? Some of these bug fixes are in the new library.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 9, 2016

How to get rid of MarlinSerial and replace it with Arduinos HardwareSerial i showed in #2945.

Like everything this has pros and cons.

  • make serial.available, peek, read atomic arduino/Arduino#3848 fixing the 16bit case is not merged.
  • HardwareSerial uses considerable more memory than ours. (+1524byte Progmem and +122byte RAM)
  • HardwareSerial has not the "hack". Without it we will lose the ability to run at 250000bd because of loosing input.
  • A TX-buffer. Currently we are waiting until everything is sent.
  • Better when the RX-buffer is smaller than 256 bytes. But as Make rx tail/head atomic #2981 shows, it only needs to change 7 lines of code to do the same (changing the type).

For me MarlinSerial is the winner.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 10, 2016

Hmmmm.... I guess my initial reaction is different than yours. Obviously, you know more about the serial stuff. But my thinking is:

•HardwareSerial uses considerable more memory than ours. (+1524byte Progmem and +122byte RAM)

An extra 1.5Kb of program memory and 122 bytes of RAM to make the communications solid don't seem that bad.

•HardwareSerial has not the "hack". Without it we will lose the ability to run at 250000bd because of loosing input.

If I remember right, you actually proved the 125000 baud rate was better in many cases because it didn't flood the processor with as many interrupts and cause it to stutter. Do many printers have the ability to fully utilize the 250,000 baud rate? If so, can we move the hack to the HardwareSerial code?

•A TX-buffer. Currently we are waiting until everything is sent.

Wouldn't a Transmit Buffer be a good thing? While that is happening we can start processing the next command and get a little bit of a head start. Right now it 'Busy Waits'. That is a waste of processing power on an already over burdened microprocessor. Do you know of any cases where switching to an interrupt driven transmit buffer would break things?

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 11, 2016

The hardware does not one interrupt per baud rate tick, not one interrupt per received bit, but one interrupt per received byte. That means the load depends on the amount of data the host sends - that is independent from the baud rate. Only the time between the (possible) interrupts decreases. If that time is shorter than the higher priority stepper interrupt lasts, we need the "hack", to ask during the stepper interrupt for new incoming input.

Lets calculate. We'll take a HR_Yoda_Bust (http://www.thingiverse.com/thing:10650)(these Yoda because i was not able to print him without reducing the faces). Slice it with CuraEngine (Default, Fast). We get a file size 13,204,071byte in 404,023 lines. Repetier claimes it will print in 1h:22m:48s = 4,968 seconds. 14 bytes/line (Linenumber and checksum). 13,204,071 + 404,023*14 = 18,860,393bytes. 10bits/byte (parity stop). 18,860,3930bit/ 4,968 = 37,964baud in average.
250,000 bd is about 6.6 times faster - 115200bd is about 3 times faster.
The question now is, how is the distribution between long moves, where not much data is needed (infill) and short moves (for the detailed shell). I can imagine, the peak transfer speed only 3 times faster than the average is not high enough.

Tried to test that but failed.

When doing a lot of fast short moves - the planner-buffer decreases fast - and there would bee a need to refill the buffer - we have not enough remaining processor power (or an other blockage) to parse the next lines to fill the buffer - so we do not send "ok"s - don't get new input.

Even with extra large buffers and prefilled B-buffer the p-buffer runs empty.
Count the parsed lines per 1/10 sec. 115200 bd, no display.

23:17:07.591 : N6444 G1 X90.503 Y125.033 E646.1796 *116
23:17:07.599 : ok N6438 P19 B2
23:17:07.599 : N6445 G1 X90.226 Y124.947 E646.1893 *116

23:17:07.607 : ok N6439 P20 B2
23:17:07.607 : N6446 G1 X89.898 Y124.888 E646.2004 *119
23:17:07.615 : ok N6440 P19 B2
23:17:07.616 : N6447 G1 X89.576 Y124.807 E646.2114 *124
23:17:07.628 : ok N6441 P18 B2
23:17:07.628 : N6448 G1 X89.181 Y124.76 E646.2246 *69
23:17:07.636 : ok N6442 P18 B2
23:17:07.636 : N6449 G1 X88.941 Y124.696 E646.2329 *113
23:17:07.644 : ok N6443 P19 B2
23:17:07.644 : N6450 G1 X88.824 Y124.649 E646.2371 *116
23:17:07.652 : ok N6444 P19 B2
23:17:07.652 : N6451 G1 X88.374 Y124.595 E646.2521 *122
23:17:07.660 : ok N6445 P19 B2
23:17:07.661 : N6452 G1 X87.984 Y124.504 E646.2654 *122
23:17:07.673 : ok N6446 P20 B2
23:17:07.673 : N6453 G1 X87.832 Y124.398 E646.2717 *114
23:17:07.681 : ok N6447 P20 B2
23:17:07.681 : N6454 G1 X87.128 Y124.074 E646.2974 *125
23:17:07.685 : ok N6448 P20 B2
23:17:07.685 : N6455 G1 X86.711 Y123.771 E646.3146 *124
23:17:07.693 : ok N6449 P20 B2
23:17:07.693 : N6456 G1 X86.622 Y123.726 E646.3179 *112

23:17:07.701 : ok N6450 P20 B2
23:17:07.702 : N6457 G1 X86.078 Y123.576 E646.3367 *114
23:17:07.710 : ok N6451 P19 B2
23:17:07.710 : N6458 G1 X85.251 Y123.016 E646.3699 *112
23:17:07.718 : ok N6452 P18 B2
23:17:07.718 : N6459 G1 X84.916 Y122.688 E646.3854 *119
23:17:07.726 : ok N6453 P19 B2
23:17:07.729 : N6460 G1 X84.637 Y122.261 E646.4024 *122

23:17:07.828 : ok N6454 P30 B1                <- 11 moves used in the last 1/10sec but non parsed
                                                 Slowed down we are able to fill the buffer again.
23:17:07.828 : N6461 G1 X84.315 Y121.952 E646.4172 *116
23:17:07.832 : ok N6455 P29 B2
23:17:07.833 : N6462 G1 X84.105 Y121.679 E646.4287 *123
23:17:07.841 : ok N6456 P28 B2
23:17:07.841 : N6463 G1 X83.927 Y121.324 E646.442 *67
23:17:07.845 : ok N6457 P27 B2
23:17:07.845 : N6464 G1 X83.499 Y120.698 E646.4672 *122
23:17:07.849 : ok N6458 P26 B3
23:17:07.849 : N6465 G1 X83.205 Y120.161 E646.4875 *112
23:17:07.853 : ok N6459 P26 B2
23:17:07.853 : N6466 G1 X83.177 Y120.097 E646.4898 *126
23:17:07.857 : ok N6460 P25 B2
23:17:07.857 : N6467 G1 X83.001 Y119.908 E646.4984 *118
23:17:07.865 : ok N6461 P24 B2
23:17:07.865 : N6468 G1 X82.542 Y119.306 E646.5236 *125
23:17:07.869 : ok N6462 P23 B2
23:17:07.869 : N6469 G1 X82.309 Y118.75 E646.5436 *69
23:17:07.873 : ok N6463 P23 B3
23:17:07.874 : N6470 G1 X82.302 Y118.719 E646.5447 *125
23:17:07.877 : ok N6464 P22 B2
23:17:07.878 : N6471 G1 X82.036 Y118.27 E646.5621 *64
23:17:07.886 : ok N6465 P21 B2
23:17:07.886 : ok N6466 P21 B2
23:17:07.886 : N6472 G1 X81.895 Y117.985 E646.5726 *121
23:17:07.886 : N6473 G1 X81.857 Y117.8 E646.5789 *127
23:17:07.894 : ok N6467 P21 B3
23:17:07.894 : N6474 G1 X81.937 Y117.575 E646.5869 *113
23:17:07.898 : ok N6468 P20 B2
23:17:07.898 : N6475 G1 X82.482 Y116.827 E646.6177 *126

23:17:07.902 : ok N6469 P19 B2
23:17:07.902 : N6476 G1 X82.624 Y116.552 E646.6279 *113
23:17:07.906 : ok N6470 P19 B2
23:17:07.906 : N6477 G1 X82.674 Y116.375 E646.634 *77
23:17:07.914 : ok N6471 P19 B2
23:17:07.914 : N6478 G1 X82.729 Y116.258 E646.6384 *125
23:17:07.918 : ok N6472 P18 B2
23:17:07.919 : N6479 G1 X82.719 Y116.097 E646.6437 *113

When it matters we send "ok\r\n" = about 40 bits. We don't have to wait for the first byte - so 30 bits. That means we wait about 0.26 ms @115200bd, 0.12 ms @250000bd

If I remember right, you actually proved the 125000 baud rate was better in many cases because it didn't flood the processor with as many interrupts and cause it to stutter.

I don't think it was me.

Do many printers have the ability to fully utilize the 250,000 baud rate?

I think so - if not overloaded by moving.

If so, can we move the hack to the HardwareSerial code?

I have not tried yet. But it can't be that difficult. But that would also lessen the beauty of having no own hardware serial code.
Because the test where a bit time consuming i'v had some time to integrate some CRITICAL_SECTIONs in the MarlinSerial code. And, wow, the test now is running for more than 20h without detecting any malformed line.

Experiments will go on.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 11, 2016

•A TX-buffer. Currently we are waiting until everything is sent.

Wouldn't a Transmit Buffer be a good thing? While that is happening we can start processing the next command and get a little bit of a head start. Right now it 'Busy Waits'. That is a waste of processing power on an already over burdened microprocessor. Do you know of any cases where switching to an interrupt driven transmit buffer would break things?

Doesn't a Transmit Buffer help us because we put stuff into the buffer and when the transmitter is empty it interrupts and grabs the next character. If that is what would happen, that lets us get a head start on what ever else has to be done while the buffer is being sent.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 11, 2016

Experiments will go on.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 11, 2016

Anyone an idea how to measure the impact of a write-buffer?

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 12, 2016

Anyone an idea how to measure the impact of a write-buffer?

On memory space? I suspect you mean with regard to how much CPU time it saves. Here is an idea that would warrant a whole new MCode. We could write a function that stored the current time and then kept track of how much idle() time it accumulated. We would have idle() mark the start and end of its loop. We would be able to measure if an interrupt driven transmit buffer saved time. I'm pretty sure it would. But with a new MCode, we could measure it.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 16, 2016

Now i have added a TX-buffer to MarlinSerial. (AnHardt#22)
The implementation is similar to the current Arduino HardwareSerial
but with max. 256 byte buffer size and hopefully interrupt save.
It was quite interesting to debug without output on the serial line. Some output on the display has been very helpful.

The use of memory, compared to RCBugFix is much better than using the original Arduino HardwareSerial.
Using Arduino HardwareSerial needs 1524byte Progmem and 122byte RAM more than the version with original MarlinSerial.

RCBugFix
Der Sketch verwendet 50.270 Bytes (19%) des Programmspeicherplatzes. Das Maximum sind 253.952 Bytes.
Globale Variablen verwenden 2.846 Bytes (34%) des dynamischen Speichers, 5.346 Bytes für lokale Variablen verbleiben. Das Maximum sind 8.192 Bytes.

Atomic-RX-Head-Tail -90 -2
Der Sketch verwendet 50.180 Bytes (19%) des Programmspeicherplatzes. Das Maximum sind 253.952 Bytes.
Globale Variablen verwenden 2.844 Bytes (34%) des dynamischen Speichers, 5.348 Bytes für lokale Variablen verbleiben. Das Maximum sind 8.192 Bytes.

Interrupt-save-serial -58 -2
Der Sketch verwendet 50.212 Bytes (19%) des Programmspeicherplatzes. Das Maximum sind 253.952 Bytes.
Globale Variablen verwenden 2.844 Bytes (34%) des dynamischen Speichers, 5.348 Bytes für lokale Variablen verbleiben. Das Maximum sind 8.192 Bytes.

addTXbuffer +28 +65
TX-buffer size 64byte
Der Sketch verwendet 50.298 Bytes (19%) des Programmspeicherplatzes. Das Maximum sind 253.952 Bytes.
Globale Variablen verwenden 2.911 Bytes (35%) des dynamischen Speichers, 5.281 Bytes für lokale Variablen verbleiben. Das Maximum sind 8.192 Bytes.

Still looking for a way to measure how much time the TX-buffer saves.
The problem is looking sharp enough to see the difference but not changing the result by the measurement (Heisenberg problem)

On the other side for +28 byte we can be less shy than for +1.5k.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 16, 2016

Still looking for a way to measure how much time the TX-buffer saves.
The problem is looking sharp enough to see the difference but not changing the result by the measurement (Heisenberg problem)

Having a counter that continues to be incremented as the processor 'Busy Waits' for the transmitter to be Ready would get you most of the information. You would know the clock cycle count of the loop that 'Busy Waits'. So, if you had a way to dump this counter 10 or 15 minutes into a print, you would know how many clock cycles were spent unproductively (doing nothing) waiting for that Transmitter Ready bit to be in the right state. (Probably, this counter would be a 4 byte long integer and it may be necessary to add some extra NOP instructions to the busy loop just to keep the counter from over flowing.)

And then combined with the information of how many bytes were transmitted (probably another counter that gets dumped at the same time), you would know the average number of clock cycles per byte spent busy waiting. If this number is more than the number of clock cycles it takes to enter the interrupt routine, fetch the character, give it to the USART, and return from interrupt, you would know it helped and by how much.

UPDATE: More creeping elegance. You could also double check these numbers by adding yet another counter in the idle() loop. If the Tx buffer interrupts are saving processor cycles, you you should see it show up as an increase in idle() time. (But in this case you would have to compare different environments. You would be comparing idle() loops with and without the TX buffer operating.)

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 17, 2016

Answering #2994 (comment) here. #2994

I'm confused. Did you do two separate solutions? Can you explain what you are (or were) thinking?

I'd cal it evolution. #2981 handled one aspect of a special kind of errors on the Serial and gave enough improvement to justify publishing it. #2994 includes #2981 and adds an other part of the puzzle. My AnHardt#22 includes #2981 and #2994. They all work on the same two files. The later are basing on the older. The order of the changes is: Do less dangerous changes with high impact first, and more risky changes with lower potential for improvement last. Normally i prefer to make PRs you could merge in each order, like #2983 #2984; dealing with the same problem - to many ok's - but fighting different causes. Here this was impossible. I could do all that changes in one commit but than this one would be so complex nobody could see the consequences. The most people here around seem to give up reading a commit if more than 20 lines of code are affected. or a two liner you get 10 highly philosophic comments - for a 200 line, nothing. (exceptions prove the rule)

Here is what I'm thinking: In 2890 alexrj (developer of Slc3r) is complaining about errors in Marlins communication. foosel (develpoer of OctoPrint) complains since last summer about problems with Marlins communication. It has rumor about repetier does not care about Marlin anymore - because of errors in the communication. If i'd be the owner of Marlin's root repository, i'd take this very serious and would try to improve the situation as soon as possible. At least i'd answer alexri's offer for help. (#2890 (comment)).

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 17, 2016

Summary:

  • Occasionally we lose incoming input because the RX-buffer is full.
    We could detect that directly if we'd want, but already see it a moment later when line-number or check-sum errors occur.
  • The hosts do send to much data because they are receiving to many "ok"s.
  • Marlin sends to many "ok"s with M280, M301, M304, M851.
  • Marlin sends "ok"s for lines injected by enqueuecommand() where it should not.
  • ADVANCED_OK was ought to improve the reliability of the protocol but miserably fails by sending the line-number from the wrong side of the B-buffer.

If everything is repaired so far:

  • The use of the B-buffer is bad. Only one buffer-line is used. The rest of them is only involved when enqueuecommand() injects lines.
  • At least with processors without native USB we still occasionally see 'strange' parts of strings that have never been sent from the host.

This errors can be reduced by:

  • Disabling the hack - but than we loose chars again when sending with 250000bd.
  • Reducing the serial speed to 115200bd - but than lose some speed because of busy waiting when writing.
  • Making the buffer-pointers 8 bit. That makes impossible a 16it value is split into bytes what could be modified by a interrupt during the reads. Speeds up manipulation of these pointers so an interrupt has less chance to get in between read-modify-write operations.

This errors could be completely avoided if all needed read-modify-write operations would be protected against interruption. This problem is also known for the Arduino HardwareSerial, but not fixed until now.

Other observations:

  • The 'hack' is not used for native USB-processors and likely not needed because they have more than one byte hardware buffer.
  • When RepetierHost is not in PingPong mode it sends more data than it should. Reducing the number what describes the amount of buffers in Marlin (usually 127) to about 110 helps a lot.
  • Pronterface does not send line-numbers and check-sums with M105 when you click 'Watch'.
  • When printing fast short segments, the bottleneck is not getting serial input fast enough, but parsing and producing new p-buffer-lines. Even if the b-buffer is full no new p-buffer lines are produced.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 17, 2016

Thank You for taking the time to write all that up. I did not understand the full extent of the problem until now.

Occasionally we lose incoming input because the RX-buffer is full. We could detect that directly if we'd want, but already see it a moment later when line-number or check-sum errors occur.

I think I would argue we want to detect it immediately. If we wait until the line numbers are incorrect, can we recover and say we want the previous line again? And, we have already tried to execute a line with an error in it, right?

The hosts do send to much data because they are receiving to many "ok"s.
Marlin sends to many "ok"s with M280, M301, M304, M851.

Why are these commands sending extra OK's ? This doesn't make sense. Did all of these commands get added by the same person and that person did not understand something? It would seem we can correct these 4 commands by just deleting the line of code:

    SERIAL_PROTOCOL(MSG_OK);

in them.

Marlin sends "ok"s for lines injected by enqueuecommand() where it should not.

What is your suggestion on this? My initial thinking is we need to make enqueuecommand() mark each command it puts in the buffer so that when it is taken out of the list of things to do we can look at it and realize it was generated internally and not something that was sent from the host. If it is something generated internally, we don't want to send an OK.

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 17, 2016

Dear @Roxy-3DPrintBoard
Recently i talked about reinventing the wheel. Please see #2983 #2984

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 17, 2016

We can detect if the ringbuffer is full when we try to write to it
here
AnHardt@700c3d6
and here
AnHardt@28e69ac

But all we can say, is: we skipped a char - nothing else.
We have to parse the line to get some information about where it happened. Exactly that we do when checking the line-numbers and the checksum.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 17, 2016

Dear @Roxy-3DPrintBoard
Recently i talked about reinventing the wheel. Please see #2983 #2984

My apologies. I missed all of these. (I was travelling, but I should have seen them when I got back home.)

We can detect if the ringbuffer is full when we try to write to it
here AnHardt/Marlin@700c3d6 and here AnHardt/Marlin@28e69ac

But all we can say, is: we skipped a char - nothing else. We have to parse the line to get some information about where it happened. Exactly that we do when checking the line-numbers and the checksum.

Instead of sending an OK, what is the NAK (Negative Acknowledgement Sequence) ? Can we send that immediately to get the line re-transmitted? I don't know anything about the GCode protocol, but it kind of looks like this is trying to recover from lost characters. Bad CheckSums cause this code to be executed:

void gcode_line_error(const char *err, bool doFlush=true) {
  SERIAL_ERROR_START;
  serialprintPGM(err);
  SERIAL_ERRORLN(gcode_LastN);
  //Serial.println(gcode_N);
  if (doFlush) FlushSerialRequestResend();
  serial_count = 0;
}

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 17, 2016

We do not now what line!
When we know we do.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 17, 2016

Why? Is it because the line number got corrupted?

It appears we always know the last successful line number:

  gcode_LastN = gcode_N;

And that code up above re-pasted here:

void gcode_line_error(const char *err, bool doFlush=true) {
  SERIAL_ERROR_START;
  serialprintPGM(err);
  SERIAL_ERRORLN(gcode_LastN);
  //Serial.println(gcode_N);
  if (doFlush) FlushSerialRequestResend();
  serial_count = 0;
}

Flushes the recieve buffer and requests a resend starting at line gcode_LastN+1

void FlushSerialRequestResend() {
  //char command_queue[cmd_queue_index_r][100]="Resend:";
  MYSERIAL.flush();
  SERIAL_PROTOCOLPGM(MSG_RESEND);
  SERIAL_PROTOCOLLN(gcode_LastN + 1);
  ok_to_send();
}

Can you help me understand the scenario where missing some characters causes a failure we don't recover from?

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 17, 2016

Because the skipping occurs at the write end of the buffer and we are reading at the other end. We have no clue about what is in the buffer.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 17, 2016

Yes, but lines up to that line should be OK.

We can keep processing lines until we get to the line that is missing some characters. And we know the line number of the last good (and correct) line. So if the code flushes the receive buffer at that point and requests a resend starting with the bad line (that was missing characters), doesn't it recover?

(Mostly, I'm trying to figure out what is left unfixed and still needs attention.)

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 17, 2016

Yes, if we have read the line and have analysed it and we find an error we exactly know what the last correct line was.
We started with

I think I would argue we want to detect it immediately.

We can detect it immediately when we want to write to the buffer and it's full.
But we have to read the buffer content to say in what line it was.

So, what to do with the information we skipped? Wait until we know what line to resend. That's what we already do. The information does not help us - except when we are debugging and want to know why this damn char got lost.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 17, 2016

Thank You for your patients!

I have a much better understanding now.

We started with
I think I would argue we want to detect it immediately.

My wording was sloppy. I really meant "We want to detect and respond as soon as we start analyzing the corrupted line." And it appears that does happen.

I don't like the simple 8 bit XOR checksum. That seems very weak. But it is what it is...

I agree that we need #2983, #2984 and #2994 ASAP. But don't we also want these
AnHardt/Marlin@700c3d6 and AnHardt/Marlin@28e69ac ???

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 17, 2016

@thinkyhead Can you get these changes merged into RCBugFix?

@AnHardt
Copy link
Contributor Author

AnHardt commented Feb 17, 2016

AnHardt/Marlin@700c3d6 and AnHardt/Marlin@28e69ac ???

Are part of my test documentation (AnHardt#14). Not very interesting if you don't want to make your own tests.

@Roxy-3D
Copy link
Member

Roxy-3D commented Feb 17, 2016

OK. Let's get #2983, #2984 and #2994 merged. Let's see if we can get the Serial Communications reliable for everybody.

@thinkyhead
Copy link
Member

#3012 is my proposed solution for the corrupted command queue. It simply adds a single line buffer for the serial stream. It introduces one extra memcpy for each completed serial command, but neatly fixes all interaction between input streams by only ever injecting whole commands into the ring buffer and never leaving it in a blocked state. It also fixes a bug where a long lines, over 96 characters, would be split up and processed as two lines.

thinkyhead added a commit to thinkyhead/Marlin that referenced this issue Mar 12, 2016
Related to the original post in MarlinFirmware#2911. The comparison of
`commands_in_queue` to `BUFSIZE` in `enqueuecommand` is actually
correct. It is the comparison to `BUFSIZE - 1` that is incorrect. This
commit patches that error.
@jbrazio
Copy link
Contributor

jbrazio commented Apr 23, 2016

Thank you for your interest making Marlin better and reporting this issue but this topic has been open for a long period of time without any further development. Marlin has been under heavy development for the past couple of months and moving to it's last mile to finish the RC cycle and release Marlin v1.1.0. We suggest you to try out the latest RCBugfix branch and reopening this issue if required.

@jbrazio jbrazio closed this as completed Apr 23, 2016
@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 10, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants