Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added buffer in get_ch() for line continuation lookahead #136

Merged
merged 9 commits into from
Feb 20, 2025
34 changes: 34 additions & 0 deletions pnut.c
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,8 @@
// Use positional parameter directly for function parameters that are constants
#define OPTIMIZE_CONSTANT_PARAM_not
#define SUPPORT_ADDRESS_OF_OP_not
// Make get_ch() use a length-1 character buffer to lookahead and skip line continuations
#define SUPPORT_LINE_CONTINUATION

// Shell backend codegen options
#ifndef SH_AVOID_PRINTF_USE_NOT
Expand Down Expand Up @@ -247,7 +249,11 @@ void syntax_error(char *msg) {

// tokenizer

#define CHBUF_SIZE 1
int ch;
int chbuf[CHBUF_SIZE];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how a buffer larger than 1 would be useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure - I implemented it as the concept it was meant to be such that it was future proof if the tokenizer needed to do longer lookahead eventually, but it could be simplified down to just an integer variable and without the tail pointer.

Copy link
Collaborator

@laurenthuberdeau laurenthuberdeau Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep it as simple as possible. We can always add the buffer and tail pointer back if we need it. Also, this should be in a #ifdef SUPPORT_LINE_CONTINUATION block.

int chbuf_head = -1;
int chbuf_tail = 0;
#ifdef DEBUG_EXPAND_INCLUDES
int prev_ch = EOF;
#endif
Expand Down Expand Up @@ -757,7 +763,35 @@ void output_declaration_c_code(bool no_header) {
#endif

void get_ch() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could simplify the whole thing by separating get_ch in 2:

  • One function (let's call it get_ch_) that calls fgetc, does the EOF handling and updates the location (like get_ch from main).
  • Another function (called get_ch) that calls get_ch_ that loops when it encounters a \\n sequence.

That way we wouldn't need to repeat the location logic.

And when SUPPORT_LINE_CONTINUATION is off, get_ch_ would simply be called get_ch.

#ifdef SUPPORT_LINE_CONTINUATION
if(chbuf_head > -1) {
ch = chbuf[chbuf_head++];
if(chbuf_head == chbuf_tail) {
chbuf_head = -1;
}
} else {
ch = fgetc(fp);

if(ch == '\\') {
ch = fgetc(fp);

if(ch != '\n'){ // The character isn't a newline, so this is an escape sequence:
chbuf[0] = ch; // Put the character back in the character buffer for future consumption
chbuf_tail = 1; // Set the character buffer size
chbuf_head = 0; // Set the pointer to the character buffer
ch = '\\'; // Restore the character that was read on this call
} else { // The character is a newline, so this is a line continuation which we want to bypass
ch = fgetc(fp); // Consume yet another character, the next one for logical parsing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to increment the line and column number if INCLUDE_LINE_NUMBER_ON_ERROR or the line count will get out of sync

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - as a side note, I haven't looked extensively at the debug error messages, but it seems the debug output always says the error is one line up from where it actually is - but I'm guessing that's because the tokenizer is behind where the input is being consumed? Is that right/ expected?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error location can be off by a few lines when the code generator throws the error. That's because we parse each declaration fully before passing it to the code generator, so the line_number and column_number point to the last token of the declaration (the last } for function declarations) which is often incorrect.

I've been thinking of annotating the AST objects with the values of line_number and column_number when the object is created. That would allow the code generator to throw a more precise error, it would probably still be off by a few characters but it would give a relatively good idea of what line caused the error.

When the error is thrown by the parser however, I haven't noticed the error location to be wrong. Do you have an example?

#ifdef INCLUDE_LINE_NUMBER_ON_ERROR
line_number += 1;
column_number = 0;
#endif
}
}
}
#else
ch = fgetc(fp);
#endif
if (ch == EOF) {
// If it's not the last file on the stack, EOF means that we need to switch to the next file
if (include_stack->next != 0) {
Expand Down
36 changes: 36 additions & 0 deletions tests/_all/line_continuation.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
#include <stdio.h>

void putint_aux(int n, int base) {
int d = n % base;
int top = n / base;
if (n == 0) return;
putint_aux(top, base);
putchar("0123456789abcdef"[d & 15]);
}

void putint(int n, int base) {
if (n < 0) {
putchar('-');
putint_aux(-n, base);
} else {
putint_aux(n, base);
}
}

void main() {

/**/
int foo = 0;

/\
*
*/ fo\
o +\
= 0\
x\
10\
200;

putint(foo, 10);
return 0;
}
1 change: 1 addition & 0 deletions tests/_all/line_continuation.golden
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
66048
Loading