
Add APIs to reuse token buffers in Tokenizer #1094

Merged: 6 commits into apache:main on Jan 22, 2024
Conversation

@0rphon (Contributor) commented Jan 15, 2024

This is a simple PR that adds two new methods to aid in the reuse of token buffers (a usage sketch follows the list):

  • Tokenizer::tokenize_with_location_into: operates identically to Tokenizer::tokenize_with_location, except that it lets you supply your own buffer.
  • Parser::tokens: lets you retrieve the token buffer from a parser after you're done with it.
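
A minimal sketch of the workflow these methods enable (names follow this description; note the tokenize method was renamed to tokenize_with_location_into_buf during review, and the exact signatures here are assumptions):

use sqlparser::dialect::GenericDialect;
use sqlparser::tokenizer::{TokenWithLocation, Tokenizer, TokenizerError};

fn tokenize_all(statements: &[&str]) -> Result<(), TokenizerError> {
    let dialect = GenericDialect {};
    // One buffer, reused across every statement instead of a fresh Vec each time.
    let mut buf: Vec<TokenWithLocation> = Vec::new();
    for sql in statements {
        buf.clear(); // keeps the allocation, drops the previous tokens
        Tokenizer::new(&dialect, sql).tokenize_with_location_into_buf(&mut buf)?;
        // ... hand `buf` to a parser here, then take it back (via the
        // Parser::tokens-style accessor this PR adds) for the next iteration
    }
    Ok(())
}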

@alamb (Contributor) left a comment

Thank you for the contribution @0rphon -- this PR looks pretty close to me

I had some naming suggestions.

Also, I think we should add a test of this API in tests/.. so that a future refactor can't accidentally break this API without also having to explicitly change a test.
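
A hedged sketch of what such a test might look like (the test name, the dialect, and the return type of tokenize_with_location_into_buf are assumptions, not the PR's actual test):

use sqlparser::dialect::GenericDialect;
use sqlparser::tokenizer::Tokenizer;

#[test]
fn tokenize_with_location_into_buf_fills_caller_buffer() {
    let dialect = GenericDialect {};
    let mut buf = Vec::new();

    Tokenizer::new(&dialect, "SELECT 1")
        .tokenize_with_location_into_buf(&mut buf)
        .unwrap();
    assert!(!buf.is_empty());

    // The same allocation can be cleared and reused for the next statement.
    buf.clear();
    Tokenizer::new(&dialect, "SELECT 2")
        .tokenize_with_location_into_buf(&mut buf)
        .unwrap();
    assert!(!buf.is_empty());
}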

src/parser/mod.rs (outdated, resolved review thread)
src/tokenizer.rs Outdated

/// Tokenize the statement and append tokens with location information into the provided buffer.
/// If an error is thrown, the buffer will contain all tokens that were successfully parsed before the error.
pub fn tokenize_with_location_into(

Contributor:

What do you think about calling this into_buffer, like:

Suggested change:
- pub fn tokenize_with_location_into(
+ pub fn tokenize_with_location_into_buffer(

@0rphon (Contributor Author):

I changed it to into_buf to match the naming convention of the standard library, like Read::read_buf or BufRead::fill_buf. Let me know if you'd rather it be into_buffer, though!

@alamb changed the title from "Ability to reuse token buffers" to "Add APIs to reuse token buffers in Tokenizer" on Jan 19, 2024
@coveralls commented Jan 19, 2024

Pull Request Test Coverage Report for Build 7616338243

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 87.868%

Totals:
  • Change from base Build 7527748620: +0.01%
  • Covered Lines: 18889
  • Relevant Lines: 21497

💛 - Coveralls

src/tokenizer.rs Outdated
@@ -543,21 +543,30 @@ impl<'a> Tokenizer<'a> {

/// Tokenize the statement and produce a vector of tokens with location information
pub fn tokenize_with_location(&mut self) -> Result<Vec<TokenWithLocation>, TokenizerError> {
    let mut tokens: Vec<TokenWithLocation> = vec![];
    self.tokenize_with_location_into(&mut tokens)

Reviewer:

Just for my education: what do we usually do when we make a breaking change in a public API like this one? There is a behavior change here, right?

Contributor:

I am not sure this is a breaking change, in the sense that programs that used to compile against sqlparser will still compile against sqlparser after this change. Perhaps I am missing something, but I think this PR just adds a new API.

In terms of breaking public API changes: for this crate I normally put a note in the changelog.

@0rphon (Contributor Author):

This PR shouldn't contain any breaking changes to the API. If you look at the tokenize_with_location_into_buf method that's being called here, it's functionally equivalent to the old code. I just separated it out so users can supply their own buffer if desired.

Reviewer:

Thank you! I misread. I thought that in case of failure we would still return a partially tokenized vector, but actually we are still returning an Err.

@0rphon (Contributor Author):

No worries! The new method tokenize_with_location_into_buf will actually return a partial vec of successfully parsed tokens. I specifically needed this behavior for a project I'm working on that parses a single SQL statement followed by a lot of random unparseable garbage. Then I needed the into_tokens API so I could figure out the length of the parsed statement using Parser::index and a custom token_length function I made. Once I was done with the fork, I realized my changes actually worked perfectly as a way to reuse buffers, so I figured I'd PR it as a little optimization.
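
A hedged illustration of the partial-buffer behavior described above (the failing input and exact error condition are assumptions):

use sqlparser::dialect::GenericDialect;
use sqlparser::tokenizer::Tokenizer;

fn main() {
    let dialect = GenericDialect {};
    let mut buf = Vec::new();
    // An unterminated string literal makes the tokenizer fail partway through.
    let result = Tokenizer::new(&dialect, "SELECT 1, 'oops")
        .tokenize_with_location_into_buf(&mut buf);
    assert!(result.is_err());
    // `buf` still holds the tokens recognized before the failure
    // (roughly: SELECT, whitespace, 1, comma, whitespace).
    assert!(!buf.is_empty());
}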

Reviewer:

Thanks @0rphon! I am curious how you figured out the length of the parsed statement, as I also need that for a separate issue. I think I'll want the byte length for a precise calculation, but it's non-trivial to calculate the byte length for all variants of Token.

@0rphon (Contributor Author) commented Jan 22, 2024:

I just based it off the formatting you use in impl Display for Token. It's a little hacky and definitely not accurate in every scenario, hence why I didn't add it to this PR. I was just trying to determine if a text blob contains an SQL statement, so I just needed an approximate length to display a rough output of the matched string.

// Approximate how many characters a token occupies in the source text,
// mirroring the formatting in `impl Display for Token`.
fn token_len(t: &TokenWithLocation) -> usize {
    match &t.token {
        Token::EOF => 3,
        Token::Word(w) => w.value.len() + w.quote_style.map(|_| 2).unwrap_or_default(),
        Token::Number(n, l) => n.len() + *l as usize,
        Token::Char(c) if c.is_ascii() => 1,
        // todo is this correct?
        Token::Char(_) => 4,
        Token::SingleQuotedString(s) => s.len() + 2,
        Token::DoubleQuotedString(s) => s.len() + 2,
        Token::DollarQuotedString(s) => {
            s.value.len() + s.tag.as_ref().map(|t| t.len() * 2).unwrap_or_default() + 4
        }
        Token::NationalStringLiteral(s) => s.len() + 3,
        Token::EscapedStringLiteral(s) => s.len() + 3,
        Token::HexStringLiteral(s) => s.len() + 3,
        Token::SingleQuotedByteStringLiteral(s) => s.len() + 3,
        Token::DoubleQuotedByteStringLiteral(s) => s.len() + 3,
        Token::RawStringLiteral(s) => s.len() + 3,
        Token::Comma => 1,
        Token::Whitespace(Whitespace::Space) => 1,
        Token::Whitespace(Whitespace::Newline) => 1,
        Token::Whitespace(Whitespace::Tab) => 1,
        Token::Whitespace(Whitespace::SingleLineComment { comment, prefix }) => {
            comment.len() + prefix.len()
        }
        Token::Whitespace(Whitespace::MultiLineComment(s)) => s.len() + 4,
        Token::DoubleEq => 2,
        Token::Spaceship => 3,
        Token::Eq => 1,
        Token::Neq => 2,
        Token::Lt => 1,
        Token::Gt => 1,
        Token::LtEq => 2,
        Token::GtEq => 2,
        Token::Plus => 1,
        Token::Minus => 1,
        Token::Mul => 1,
        Token::Div => 1,
        Token::DuckIntDiv => 2,
        Token::StringConcat => 2,
        Token::Mod => 1,
        Token::LParen => 1,
        Token::RParen => 1,
        Token::Period => 1,
        Token::Colon => 1,
        Token::DoubleColon => 2,
        Token::DuckAssignment => 2,
        Token::SemiColon => 1,
        Token::Backslash => 1,
        Token::LBracket => 1,
        Token::RBracket => 1,
        Token::Ampersand => 1,
        Token::Caret => 1,
        Token::Pipe => 1,
        Token::LBrace => 1,
        Token::RBrace => 1,
        Token::RArrow => 2,
        Token::Sharp => 1,
        Token::ExclamationMark => 1,
        Token::DoubleExclamationMark => 2,
        Token::Tilde => 1,
        Token::TildeAsterisk => 2,
        Token::ExclamationMarkTilde => 2,
        Token::ExclamationMarkTildeAsterisk => 3,
        Token::AtSign => 1,
        Token::ShiftLeft => 2,
        Token::ShiftRight => 2,
        Token::Overlap => 2,
        Token::PGSquareRoot => 2,
        Token::PGCubeRoot => 3,
        Token::Placeholder(s) => s.len(),
        Token::Arrow => 2,
        Token::LongArrow => 3,
        Token::HashArrow => 2,
        Token::HashLongArrow => 3,
        Token::AtArrow => 2,
        Token::ArrowAt => 2,
        Token::HashMinus => 2,
        Token::AtQuestion => 2,
        Token::AtAt => 2,
    }
}
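
For context, a hypothetical way this helper could pair with Parser::index as described above (statement_len is an invented name, not code from the PR):

// Hypothetical: approximate the source length of the tokens the parser has
// consumed so far by summing per-token lengths up to Parser::index().
fn statement_len(tokens: &[TokenWithLocation], consumed: usize) -> usize {
    tokens.iter().take(consumed).map(token_len).sum()
}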

Reviewer:

Ah, I see. I was thinking of adding the actual byte position of a token to the struct Location to be able to achieve what I want. What do you think?
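
A rough sketch of that idea (purely hypothetical, not part of this PR), extending the tokenizer's Location with an absolute byte offset alongside the existing line/column fields:

// Hypothetical extension; the field name is an assumption. With a start
// offset per token, a token's byte length becomes a subtraction between
// consecutive tokens' offsets rather than a per-variant calculation.
pub struct Location {
    pub line: u64,
    pub column: u64,
    pub byte_offset: usize, // absolute byte position of the token in the input
}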

@0rphon (Contributor Author):

That would be a great feature!

let q = "INSERT INTO customer WITH foo AS (SELECT 1) SELECT * FROM foo UNION VALUES (1)";
let mut buf = Vec::new();
Tokenizer::new(&d, q)
    .tokenize_with_location_into_buf(&mut buf)

Contributor:

👍

@alamb (Contributor) left a comment

Thank you again for this contribution @0rphon and for the review @trungda

@alamb merged commit d72f0a9 into apache:main on Jan 22, 2024. 10 checks passed.