
Add APIs to reuse token buffers in Tokenizer #1094

Merged: 6 commits into apache:main on Jan 22, 2024
Conversation

@0rphon (Contributor) commented Jan 15, 2024

This is a simple PR that adds two new methods to aid in the reuse of token buffers (a usage sketch follows the list):

  • Tokenizer::tokenize_with_location_into: operates identically to Tokenizer::tokenize_with_location, except that it lets you supply your own buffer.
  • Parser::tokens: lets you retrieve the token buffer from a parser after you're done with it.
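
A minimal sketch of the workflow these methods enable (names follow this description; note the tokenize method was renamed to tokenize_with_location_into_buf during review, and the exact signatures here are assumptions):

use sqlparser::dialect::GenericDialect;
use sqlparser::tokenizer::{TokenWithLocation, Tokenizer, TokenizerError};

fn tokenize_all(statements: &[&str]) -> Result<(), TokenizerError> {
    let dialect = GenericDialect {};
    // One buffer, reused across every statement instead of a fresh Vec each time.
    let mut buf: Vec<TokenWithLocation> = Vec::new();
    for sql in statements {
        buf.clear(); // keeps the allocation, drops the previous tokens
        Tokenizer::new(&dialect, sql).tokenize_with_location_into_buf(&mut buf)?;
        // ... hand `buf` to a parser here, then take it back (via the
        // Parser::tokens-style accessor this PR adds) for the next iteration
    }
    Ok(())
}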

@alamb (Contributor) left a comment

Thank you for the contribution @0rphon -- this PR looks pretty close to me

I had some naming suggestions.

Also, I think we should add a test of this API in tests/.. so that a future refactor can't accidentally break this API without also having to explicitly change a test.
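
A hedged sketch of what such a test might look like (the test name, the dialect, and the return type of tokenize_with_location_into_buf are assumptions, not the PR's actual test):

use sqlparser::dialect::GenericDialect;
use sqlparser::tokenizer::Tokenizer;

#[test]
fn tokenize_with_location_into_buf_fills_caller_buffer() {
    let dialect = GenericDialect {};
    let mut buf = Vec::new();

    Tokenizer::new(&dialect, "SELECT 1")
        .tokenize_with_location_into_buf(&mut buf)
        .unwrap();
    assert!(!buf.is_empty());

    // The same allocation can be cleared and reused for the next statement.
    buf.clear();
    Tokenizer::new(&dialect, "SELECT 2")
        .tokenize_with_location_into_buf(&mut buf)
        .unwrap();
    assert!(!buf.is_empty());
}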

src/parser/mod.rs (outdated, resolved review thread)
src/tokenizer.rs Outdated

/// Tokenize the statement and append tokens with location information into the provided buffer.
/// If an error is thrown, the buffer will contain all tokens that were successfully parsed before the error.
pub fn tokenize_with_location_into(

Contributor:

What do you think about calling this into_buffer, like:

Suggested change:
- pub fn tokenize_with_location_into(
+ pub fn tokenize_with_location_into_buffer(

@0rphon (Contributor Author):

I changed it to into_buf to match the naming convention of the standard library, like Read::read_buf or BufRead::fill_buf. Let me know if you'd rather it be into_buffer, though!

@alamb changed the title from "Ability to reuse token buffers" to "Add APIs to reuse token buffers in Tokenizer" on Jan 19, 2024
@coveralls commented Jan 19, 2024

Pull Request Test Coverage Report for Build 7616338243

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 87.868%

Totals:
  • Change from base Build 7527748620: +0.01%
  • Covered Lines: 18889
  • Relevant Lines: 21497

💛 - Coveralls

src/tokenizer.rs Outdated
@@ -543,21 +543,30 @@ impl<'a> Tokenizer<'a> {

/// Tokenize the statement and produce a vector of tokens with location information
pub fn tokenize_with_location(&mut self) -> Result<Vec<TokenWithLocation>, TokenizerError> {
    let mut tokens: Vec<TokenWithLocation> = vec![];
    self.tokenize_with_location_into(&mut tokens)

Reviewer:

Just for my education: what do we usually do when we make a breaking change in a public API like this one? There is a behavior change here, right?

Contributor:

I am not sure this is a breaking change, in the sense that programs that used to compile against sqlparser will still compile against sqlparser after this change. Perhaps I am missing something, but I think this PR just adds a new API.

In terms of breaking public API changes: for this crate I normally put a note in the changelog.

@0rphon (Contributor Author):

This PR shouldn't contain any breaking changes to the API. If you look at the tokenize_with_location_into_buf method that's being called here, it's functionally equivalent to the old code. I just separated it out so users can supply their own buffer if desired.

Reviewer:

Thank you! I misread. I thought that in case of failure we would still return a partially tokenized vector, but actually we are still returning an Err.

@0rphon (Contributor Author):

No worries! The new method tokenize_with_location_into_buf will actually return a partial vec of successfully parsed tokens. I specifically needed this behavior for a project I'm working on that parses a single SQL statement followed by a lot of random unparseable garbage. Then I needed the into_tokens API so I could figure out the length of the parsed statement using Parser::index and a custom token_length function I made. Once I was done with the fork, I realized my changes actually worked perfectly as a way to reuse buffers, so I figured I'd PR it as a little optimization.
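
A hedged illustration of the partial-buffer behavior described above (the failing input and exact error condition are assumptions):

use sqlparser::dialect::GenericDialect;
use sqlparser::tokenizer::Tokenizer;

fn main() {
    let dialect = GenericDialect {};
    let mut buf = Vec::new();
    // An unterminated string literal makes the tokenizer fail partway through.
    let result = Tokenizer::new(&dialect, "SELECT 1, 'oops")
        .tokenize_with_location_into_buf(&mut buf);
    assert!(result.is_err());
    // `buf` still holds the tokens recognized before the failure
    // (roughly: SELECT, whitespace, 1, comma, whitespace).
    assert!(!buf.is_empty());
}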

Reviewer:

Thanks @0rphon! I am curious how you figured out the length of the parsed statement, as I also need that for a separate issue. I think I'll want the byte length for a precise calculation, but it's non-trivial to calculate the byte length for all variants of Token.

@0rphon (Contributor Author) commented Jan 22, 2024:

I just based it off the formatting you use in impl Display for Token. It's a little hacky and definitely not accurate in every scenario, hence why I didn't add it to this PR. I was just trying to determine if a text blob contains an SQL statement, so I just needed an approximate length to display a rough output of the matched string.

// Approximate how many characters a token occupies in the source text,
// mirroring the formatting in `impl Display for Token`.
fn token_len(t: &TokenWithLocation) -> usize {
    match &t.token {
        Token::EOF => 3,
        Token::Word(w) => w.value.len() + w.quote_style.map(|_| 2).unwrap_or_default(),
        Token::Number(n, l) => n.len() + *l as usize,
        Token::Char(c) if c.is_ascii() => 1,
        // todo is this correct?
        Token::Char(_) => 4,
        Token::SingleQuotedString(s) => s.len() + 2,
        Token::DoubleQuotedString(s) => s.len() + 2,
        Token::DollarQuotedString(s) => {
            s.value.len() + s.tag.as_ref().map(|t| t.len() * 2).unwrap_or_default() + 4
        }
        Token::NationalStringLiteral(s) => s.len() + 3,
        Token::EscapedStringLiteral(s) => s.len() + 3,
        Token::HexStringLiteral(s) => s.len() + 3,
        Token::SingleQuotedByteStringLiteral(s) => s.len() + 3,
        Token::DoubleQuotedByteStringLiteral(s) => s.len() + 3,
        Token::RawStringLiteral(s) => s.len() + 3,
        Token::Comma => 1,
        Token::Whitespace(Whitespace::Space) => 1,
        Token::Whitespace(Whitespace::Newline) => 1,
        Token::Whitespace(Whitespace::Tab) => 1,
        Token::Whitespace(Whitespace::SingleLineComment { comment, prefix }) => {
            comment.len() + prefix.len()
        }
        Token::Whitespace(Whitespace::MultiLineComment(s)) => s.len() + 4,
        Token::DoubleEq => 2,
        Token::Spaceship => 3,
        Token::Eq => 1,
        Token::Neq => 2,
        Token::Lt => 1,
        Token::Gt => 1,
        Token::LtEq => 2,
        Token::GtEq => 2,
        Token::Plus => 1,
        Token::Minus => 1,
        Token::Mul => 1,
        Token::Div => 1,
        Token::DuckIntDiv => 2,
        Token::StringConcat => 2,
        Token::Mod => 1,
        Token::LParen => 1,
        Token::RParen => 1,
        Token::Period => 1,
        Token::Colon => 1,
        Token::DoubleColon => 2,
        Token::DuckAssignment => 2,
        Token::SemiColon => 1,
        Token::Backslash => 1,
        Token::LBracket => 1,
        Token::RBracket => 1,
        Token::Ampersand => 1,
        Token::Caret => 1,
        Token::Pipe => 1,
        Token::LBrace => 1,
        Token::RBrace => 1,
        Token::RArrow => 2,
        Token::Sharp => 1,
        Token::ExclamationMark => 1,
        Token::DoubleExclamationMark => 2,
        Token::Tilde => 1,
        Token::TildeAsterisk => 2,
        Token::ExclamationMarkTilde => 2,
        Token::ExclamationMarkTildeAsterisk => 3,
        Token::AtSign => 1,
        Token::ShiftLeft => 2,
        Token::ShiftRight => 2,
        Token::Overlap => 2,
        Token::PGSquareRoot => 2,
        Token::PGCubeRoot => 3,
        Token::Placeholder(s) => s.len(),
        Token::Arrow => 2,
        Token::LongArrow => 3,
        Token::HashArrow => 2,
        Token::HashLongArrow => 3,
        Token::AtArrow => 2,
        Token::ArrowAt => 2,
        Token::HashMinus => 2,
        Token::AtQuestion => 2,
        Token::AtAt => 2,
    }
}
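
For context, a hypothetical way this helper could pair with Parser::index as described above (statement_len is an invented name, not code from the PR):

// Hypothetical: approximate the source length of the tokens the parser has
// consumed so far by summing per-token lengths up to Parser::index().
fn statement_len(tokens: &[TokenWithLocation], consumed: usize) -> usize {
    tokens.iter().take(consumed).map(token_len).sum()
}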

Reviewer:

Ah, I see. I was thinking of adding the actual byte position of a token to the struct Location to be able to achieve what I want. What do you think?
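
A rough sketch of that idea (purely hypothetical, not part of this PR), extending the tokenizer's Location with an absolute byte offset alongside the existing line/column fields:

// Hypothetical extension; the field name is an assumption. With a start
// offset per token, a token's byte length becomes a subtraction between
// consecutive tokens' offsets rather than a per-variant calculation.
pub struct Location {
    pub line: u64,
    pub column: u64,
    pub byte_offset: usize, // absolute byte position of the token in the input
}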

@0rphon (Contributor Author):

That would be a great feature!

let q = "INSERT INTO customer WITH foo AS (SELECT 1) SELECT * FROM foo UNION VALUES (1)";
let mut buf = Vec::new();
Tokenizer::new(&d, q)
    .tokenize_with_location_into_buf(&mut buf)

Contributor:

👍

@alamb (Contributor) left a comment

Thank you again for this contribution @0rphon and for the review @trungda

@alamb merged commit d72f0a9 into apache:main on Jan 22, 2024. 10 checks passed.