Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added buffer in get_ch() for line continuation lookahead #136

Merged
merged 9 commits into from
Feb 20, 2025

Conversation

xdufour
Copy link
Contributor

@xdufour xdufour commented Jan 19, 2025

Problem statement: #135

This PR aims to allow pnut.c::get_ch() as simply as possible to walk over line continuations while being compatible with the existing code for parsing escape sequences.

This support is optional and can be disabled by removing line 48:
#define SUPPORT_LINE_CONTINUATION

pnut.c Outdated
int ch;
int chbuf[CHBUF_SIZE];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how a buffer larger than 1 would be useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure - I implemented it as the concept it was meant to be such that it was future proof if the tokenizer needed to do longer lookahead eventually, but it could be simplified down to just an integer variable and without the tail pointer.

Copy link
Collaborator

@laurenthuberdeau laurenthuberdeau Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep it as simple as possible. We can always add the buffer and tail pointer back if we need it. Also, this should be in a #ifdef SUPPORT_LINE_CONTINUATION block.

pnut.c Outdated
chbuf_head = 0; // Set the pointer to the character buffer
ch = '\\'; // Restore the character that was read on this call
} else { // The character is a newline, so this is a line continuation which we want to bypass
ch = fgetc(fp); // Consume yet another character, the next one for logical parsing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to increment the line and column number if INCLUDE_LINE_NUMBER_ON_ERROR or the line count will get out of sync

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - as a side note, I haven't looked extensively at the debug error messages, but it seems the debug output always says the error is one line up from where it actually is - but I'm guessing that's because the tokenizer is behind where the input is being consumed? Is that right/ expected?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error location can be off by a few lines when the code generator throws the error. That's because we parse each declaration fully before passing it to the code generator, so the line_number and column_number point to the last token of the declaration (the last } for function declarations) which is often incorrect.

I've been thinking of annotating the AST objects with the values of line_number and column_number when the object is created. That would allow the code generator to throw a more precise error, it would probably still be off by a few characters but it would give a relatively good idea of what line caused the error.

When the error is thrown by the parser however, I haven't noticed the error location to be wrong. Do you have an example?

@feeley
Copy link
Member

feeley commented Jan 19, 2025

We need to check how much it costs to check \ on every character that is read. Normally this would be negligible in a C compiler but for pnut.sh we don't want to slow down the bootstrap of pnut.exe, which is in a certain sense the only benchmark we care about. There is also something to be said for a simpler tokenizer so that the auditing of pnut.sh is simpler. Even if the checking of \ adds a trivial auditing complexity, it is the sum of all these "trivial complexities" that adds up to something non-trivial. Is it necessary to support \ for bootstrapping TCC?

@xdufour
Copy link
Contributor Author

xdufour commented Jan 20, 2025

Hi - I have ran the benchmark with and without the #ifdef guarding the check on \ in getchar()

The first run is without, the second run is with.

PLATFORM: Linux DufourXavier 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
SHELL: bash
PNUT_SH_OPTIONS_EXTRA: 
0.546s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.048s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
167.347s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
123.178s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
145.917s for: bash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
47.859s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-pnut-i386-exe.exe

xdufour@DufourXavier:/mnt/e/git/pnut$ ./benchmark-bootstrap.sh bash
PLATFORM: Linux DufourXavier 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
SHELL: bash
PNUT_SH_OPTIONS_EXTRA: 
0.450s for: gcc -DRT_NO_INIT_GLOBALS -Dsh  pnut.c -o pnut-sh-compiled-by-gcc.exe
0.043s for: pnut-sh-compiled-by-gcc.exe -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh.sh
162.041s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dsh  pnut.c > pnut-sh-compiled-by-pnut-sh-sh.sh
129.112s for: bash pnut-sh.sh -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-by-pnut-sh-sh.sh
147.307s for: bash pnut-i386-compiled-by-pnut-sh-sh.sh -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-by-pnut-i386-sh.exe
49.952s for: pnut-i386-compiled-by-pnut-i386-sh.exe -DRT_NO_INIT_GLOBALS -Dtarget_i386_linux pnut.c > pnut-i386-compiled-pnut-i386-exe.exe```

@monnier
Copy link
Collaborator

monnier commented Jan 22, 2025 via email

pnut.c Outdated
@@ -757,7 +763,35 @@ void output_declaration_c_code(bool no_header) {
#endif

void get_ch() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could simplify the whole thing by separating get_ch in 2:

  • One function (let's call it get_ch_) that calls fgetc, does the EOF handling and updates the location (like get_ch from main).
  • Another function (called get_ch) that calls get_ch_ that loops when it encounters a \\n sequence.

That way we wouldn't need to repeat the location logic.

And when SUPPORT_LINE_CONTINUATION is off, get_ch_ would simply be called get_ch.

@laurenthuberdeau laurenthuberdeau merged commit 64767f1 into udem-dlteam:main Feb 20, 2025
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants