Added buffer in get_ch() for line continuation lookahead #136

xdufour · 2025-01-19T21:22:45Z

Problem statement: #135

This PR lets pnut.c::get_ch() skip over line continuations as simply as possible while staying compatible with the existing code for parsing escape sequences.

This support is optional and can be disabled by removing line 48:
#define SUPPORT_LINE_CONTINUATION

laurenthuberdeau · 2025-01-19T21:39:51Z

pnut.c

 int ch;
+int chbuf[CHBUF_SIZE];


Not sure how a buffer larger than 1 would be useful?

Not sure. I implemented it as a general buffer so that it would be future-proof if the tokenizer eventually needs longer lookahead, but it could be simplified down to a single integer variable with no tail pointer.

I'd prefer to keep it as simple as possible. We can always add the buffer and tail pointer back if we need them. Also, this should be in an #ifdef SUPPORT_LINE_CONTINUATION block.

pnut.c

tests/_exe/line_continuation.c

tests/_sh/line_continuation.c

laurenthuberdeau · 2025-01-19T22:14:28Z

pnut.c

+        chbuf_head = 0; // Set the pointer to the character buffer
+        ch = '\\';      // Restore the character that was read on this call
+      } else {          // The character is a newline, so this is a line continuation which we want to bypass
+        ch = fgetc(fp); // Consume yet another character, the next one for logical parsing


We'll want to update the line and column numbers when INCLUDE_LINE_NUMBER_ON_ERROR is defined, or the line count will get out of sync.

Done. As a side note, I haven't looked extensively at the debug error messages, but the debug output always seems to report the error one line above where it actually is. I'm guessing that's because the tokenizer is behind where the input is being consumed? Is that right and expected?

The error location can be off by a few lines when the code generator throws the error. That's because we parse each declaration fully before passing it to the code generator, so the line_number and column_number point to the last token of the declaration (the last } for function declarations) which is often incorrect.

I've been thinking of annotating the AST objects with the values of line_number and column_number when the object is created. That would allow the code generator to throw a more precise error. It would probably still be off by a few characters, but it would give a relatively good idea of which line caused the error.

When the error is thrown by the parser however, I haven't noticed the error location to be wrong. Do you have an example?
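The annotation idea above could be sketched roughly as follows. This is only an illustration: the `ast_node` struct, `new_ast`, and `report_error` are hypothetical names, and pnut's actual AST representation is not shown in this thread. Only `line_number` and `column_number` come from the discussion.

```c
#include <stdio.h>
#include <stdlib.h>

/* Tokenizer position, named as in the discussion above. */
int line_number = 1;
int column_number = 0;

/* Hypothetical AST node that remembers where it was created. */
typedef struct ast_node {
  int op;
  struct ast_node *left;
  struct ast_node *right;
  int line;   /* value of line_number when the node was built */
  int column; /* value of column_number when the node was built */
} ast_node;

/* Hypothetical constructor: snapshots the current location. */
ast_node *new_ast(int op, ast_node *left, ast_node *right) {
  ast_node *n = malloc(sizeof *n);
  if (n == NULL) {
    fputs("out of memory\n", stderr);
    exit(1);
  }
  n->op = op;
  n->left = left;
  n->right = right;
  n->line = line_number;
  n->column = column_number;
  return n;
}

/* The code generator reports the node's location, not the tokenizer's. */
void report_error(const ast_node *n, const char *msg) {
  fprintf(stderr, "%d:%d: %s\n", n->line, n->column, msg);
}
```

With something like this, an error raised long after parsing still points near the construct that caused it, at the cost of two extra integers per node.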

feeley · 2025-01-19T22:30:13Z

We need to measure how much it costs to check for \ on every character that is read. Normally this would be negligible in a C compiler, but for pnut.sh we don't want to slow down the bootstrap of pnut.exe, which is in a certain sense the only benchmark we care about. There is also something to be said for a simpler tokenizer so that the auditing of pnut.sh is simpler. Even if the checking of \ adds a trivial auditing complexity, it is the sum of all these "trivial complexities" that adds up to something non-trivial. Is it necessary to support \ for bootstrapping TCC?

xdufour · 2025-01-20T00:41:33Z

Hi, I ran the benchmark with and without the #ifdef guarding the check for \ in get_ch().

The first run is without the check, the second run is with it.

PLATFORM: Linux DufourXavier 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
SHELL: bash
PNUT_SH_OPTIONS_EXTRA: 
0.546s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.048s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
167.347s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
123.178s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
145.917s for: bash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
47.859s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-pnut-i386-exe.exe

xdufour@DufourXavier:/mnt/e/git/pnut$ ./benchmark-bootstrap.sh bash
PLATFORM: Linux DufourXavier 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
SHELL: bash
PNUT_SH_OPTIONS_EXTRA: 
0.450s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.043s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
162.041s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
129.112s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
147.307s for: bash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
49.952s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-pnut-i386-exe.exe

monnier · 2025-01-22T13:07:20Z

> we don't want to slow down the bootstrap of `pnut.exe`, which is in a certain sense the only benchmark we care about. There is also something to be said for a simpler tokenizer so that the auditing of

Indeed, supporting \ at EOL is important for macros, but I've never seen it used anywhere other than between tokens (and it should be trivial to change the source code in the few cases where that isn't already so), so I'm not sure this feature pays for itself.
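To illustrate the alternative this comment hints at, here is a rough sketch that treats a backslash-newline as whitespace only between tokens, so the check runs in the blank-skipping loop rather than on every character read. `skip_blanks` is a hypothetical helper working on an in-memory string. It is not code from pnut, and it deliberately does not handle continuations inside a token or string literal.

```c
#include <stddef.h>

/* Hypothetical helper: returns the index of the next token start in src,
 * at or after pos. A backslash immediately followed by a newline is
 * skipped like any other blank. */
size_t skip_blanks(const char *src, size_t pos) {
  for (;;) {
    char c = src[pos];
    if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
      pos += 1;
    } else if (c == '\\' && src[pos + 1] == '\n') {
      pos += 2; /* line continuation between tokens */
    } else {
      return pos; /* start of a token, or the terminating NUL */
    }
  }
}
```

The trade-off is the one discussed in the thread: less work per character and a simpler reader, in exchange for not accepting continuations in the middle of a token.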

laurenthuberdeau · 2025-01-23T18:36:49Z

pnut.c

@@ -757,7 +763,35 @@ void output_declaration_c_code(bool no_header) {
 #endif

 void get_ch() {


I think we could simplify the whole thing by splitting get_ch in two:

One function (let's call it get_ch_) that calls fgetc, does the EOF handling and updates the location (like get_ch from main).

Another function (called get_ch) that calls get_ch_ and loops whenever it encounters a \ immediately followed by a newline.

That way we wouldn't need to repeat the location logic.

And when SUPPORT_LINE_CONTINUATION is off, get_ch_ would simply be called get_ch.
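A minimal sketch of that split, under stated assumptions, might look like the following. It is not the code that was merged. `fp`, `ch`, `line_number`, and `column_number` are names taken from the diff and the discussion, while `held` and `pending` are hypothetical names for a one-character lookahead.

```c
#include <stdio.h>

#define SUPPORT_LINE_CONTINUATION

FILE *fp;              /* input stream */
int ch;                /* current character, EOF at end of input */
int line_number = 1;
int column_number = 0;

#ifndef SUPPORT_LINE_CONTINUATION
#define get_ch_ get_ch /* feature off: the raw reader is get_ch itself */
#endif

/* Raw reader: fgetc plus location bookkeeping, in one place only. */
void get_ch_(void) {
  ch = fgetc(fp);
  if (ch == '\n') {
    line_number += 1;
    column_number = 0;
  } else if (ch != EOF) {
    column_number += 1;
  }
}

#ifdef SUPPORT_LINE_CONTINUATION
int held;        /* character read ahead after a backslash (hypothetical) */
int pending = 0; /* 1 when held has not been delivered yet */

/* Logical reader: drops every backslash-newline pair. */
void get_ch(void) {
  for (;;) {
    if (pending) {
      ch = held;
      pending = 0;
    } else {
      get_ch_();
    }
    if (ch != '\\') return;
    get_ch_();          /* look one character past the backslash */
    if (ch != '\n') {   /* ordinary backslash: deliver it, keep the next char */
      held = ch;
      pending = 1;
      ch = '\\';
      return;
    }
    /* backslash-newline: both are consumed, read again */
  }
}
#endif
```

One consequence of this shape is that the location is already one character ahead when a plain backslash is returned, since the lookahead has been read. Consecutive continuations are handled by the loop without any extra state.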

Added optional buffer in get_ch() for line continuation consumption

e5699fa

laurenthuberdeau reviewed Jan 19, 2025

Changes following PR comments

97d8b2f

Merge branch 'main' into xd/line_continuation

7c2343a

laurenthuberdeau reviewed Jan 23, 2025

laurenthuberdeau force-pushed the main branch from cfa2032 to 5eaa69f on February 15, 2025 17:11

laurenthuberdeau added 2 commits February 20, 2025 09:04

Simplify get_ch by splitting it in 2

23b6ec2

Add consecutive line continuations to test file

391e186

laurenthuberdeau force-pushed the xd/line_continuation branch from 9e00727 to 391e186 on February 20, 2025 14:04

laurenthuberdeau added 4 commits February 20, 2025 09:10

Rewrite putint for zsh compatibility

8421f15

Merge branch 'main' into xd/line_continuation

445aa0a

Change constant in test for zsh compatibility

b7d57f4

zsh doesn't handle octal and hex literals properly, so use a decimal literal instead.

pnut-sh: add leading 0 to octal literal

c4656ee

laurenthuberdeau merged commit 64767f1 into udem-dlteam:main Feb 20, 2025
37 checks passed
