Add APIs to reuse token buffers in Tokenizer
#1094
Conversation
Thank you for the contribution @0rphon -- this PR looks pretty close to me. I had some naming suggestions.
Also, I think we should add a test of this API in tests/.. so that a future refactor can't accidentally break this API without also having to explicitly change a test.
src/tokenizer.rs (outdated)

```rust
/// Tokenize the statement and append tokens with location information into the provided buffer.
/// If an error is thrown, the buffer will contain all tokens that were successfully parsed before the error.
pub fn tokenize_with_location_into(
```
What do you think about calling this `into_buffer`, like:

```diff
-pub fn tokenize_with_location_into(
+pub fn tokenize_with_location_into_buffer(
```
I changed it to `into_buf` to match the naming convention of the standard library, like `Read::read_buf` or `BufRead::fill_buf`. Let me know if you'd rather it be `into_buffer`, though!
Pull Request Test Coverage Report for Build 7616338243
Warning: This coverage report may be inaccurate. We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
💛 - Coveralls
src/tokenizer.rs (outdated)

```diff
@@ -543,21 +543,30 @@ impl<'a> Tokenizer<'a> {

     /// Tokenize the statement and produce a vector of tokens with location information
     pub fn tokenize_with_location(&mut self) -> Result<Vec<TokenWithLocation>, TokenizerError> {
         let mut tokens: Vec<TokenWithLocation> = vec![];
         self.tokenize_with_location_into(&mut tokens)
```
Just for my education, what do we usually do when we make a breaking change to a public API like this one? There is a behavior change here, right?
I am not sure this is a breaking change, in the sense that programs that used to compile against sqlparser will still compile against sqlparser after this change. Perhaps I am missing something, but I think this PR just adds a new API.
In terms of breaking public API changes, for this crate I normally put a note in the changelog.
This PR shouldn't contain any breaking changes to the API. If you look at the `tokenize_with_location_into_buf` method that's being called here, it's functionally equivalent to the old code; I just separated it out so users can supply their own buffer if desired.
Thank you! I misread. I thought that in case of failure we would still return a partially tokenized vector, but actually we are still returning an `Err`.
No worries! The new method `tokenize_with_location_into_buf` will actually return a partial vec of successfully parsed tokens. I specifically needed this behavior for a project I'm working on that parses a single SQL statement followed by a lot of random unparseable garbage. I then needed the `into_tokens` API so I could figure out the length of the parsed statement using `Parser::index` and a custom `token_length` function I made. Once I was done with the fork, though, I realized my changes actually worked perfectly as a way to reuse buffers, so I figured I'd PR it as a little optimization.
Thanks @0rphon. I am curious how you figure out the length of the parsed statement, as I also need that for a separate issue. I think I'll want the byte length for a precise calculation, but it's non-trivial to calculate the byte length for all variants of `Token`.
I just based it off the formatting used in `impl Display for Token`. It's a little hacky and definitely not accurate in every scenario, which is why I didn't add it to this PR. I was only trying to determine whether a text blob contains a SQL statement, so I just needed an approximate length so I could display a rough output of the matched string.
```rust
fn token_len(t: &TokenWithLocation) -> usize {
    match &t.token {
        Token::EOF => 3,
        Token::Word(w) => w.value.len() + w.quote_style.map(|_| 2).unwrap_or_default(),
        Token::Number(n, l) => n.len() + *l as usize,
        Token::Char(c) if c.is_ascii() => 1,
        // todo is this correct?
        Token::Char(_) => 4,
        Token::SingleQuotedString(s) => s.len() + 2,
        Token::DoubleQuotedString(s) => s.len() + 2,
        Token::DollarQuotedString(s) => {
            s.value.len() + s.tag.as_ref().map(|t| t.len() * 2).unwrap_or_default() + 4
        }
        Token::NationalStringLiteral(s) => s.len() + 3,
        Token::EscapedStringLiteral(s) => s.len() + 3,
        Token::HexStringLiteral(s) => s.len() + 3,
        Token::SingleQuotedByteStringLiteral(s) => s.len() + 3,
        Token::DoubleQuotedByteStringLiteral(s) => s.len() + 3,
        Token::RawStringLiteral(s) => s.len() + 3,
        Token::Comma => 1,
        Token::Whitespace(Whitespace::Space) => 1,
        Token::Whitespace(Whitespace::Newline) => 1,
        Token::Whitespace(Whitespace::Tab) => 1,
        Token::Whitespace(Whitespace::SingleLineComment { comment, prefix }) => {
            comment.len() + prefix.len()
        }
        Token::Whitespace(Whitespace::MultiLineComment(s)) => s.len() + 4,
        Token::DoubleEq => 2,
        Token::Spaceship => 3,
        Token::Eq => 1,
        Token::Neq => 2,
        Token::Lt => 1,
        Token::Gt => 1,
        Token::LtEq => 2,
        Token::GtEq => 2,
        Token::Plus => 1,
        Token::Minus => 1,
        Token::Mul => 1,
        Token::Div => 1,
        Token::DuckIntDiv => 2,
        Token::StringConcat => 2,
        Token::Mod => 1,
        Token::LParen => 1,
        Token::RParen => 1,
        Token::Period => 1,
        Token::Colon => 1,
        Token::DoubleColon => 2,
        Token::DuckAssignment => 2,
        Token::SemiColon => 1,
        Token::Backslash => 1,
        Token::LBracket => 1,
        Token::RBracket => 1,
        Token::Ampersand => 1,
        Token::Caret => 1,
        Token::Pipe => 1,
        Token::LBrace => 1,
        Token::RBrace => 1,
        Token::RArrow => 2,
        Token::Sharp => 1,
        Token::ExclamationMark => 1,
        Token::DoubleExclamationMark => 2,
        Token::Tilde => 1,
        Token::TildeAsterisk => 2,
        Token::ExclamationMarkTilde => 2,
        Token::ExclamationMarkTildeAsterisk => 3,
        Token::AtSign => 1,
        Token::ShiftLeft => 2,
        Token::ShiftRight => 2,
        Token::Overlap => 2,
        Token::PGSquareRoot => 2,
        Token::PGCubeRoot => 3,
        Token::Placeholder(s) => s.len(),
        Token::Arrow => 2,
        Token::LongArrow => 3,
        Token::HashArrow => 2,
        Token::HashLongArrow => 3,
        Token::AtArrow => 2,
        Token::ArrowAt => 2,
        Token::HashMinus => 2,
        Token::AtQuestion => 2,
        Token::AtAt => 2,
    }
}
```
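For context, here is roughly how that could be combined with `Parser::index` to approximate the length of the parsed statement. This is only an illustrative sketch (the helper name is made up, and the exact way the token buffer is obtained is not part of this PR), and it inherits all the inaccuracies noted above:

```rust
// Illustrative only: approximate the number of source characters consumed by
// summing token_len over the tokens the parser has advanced past.
// `tokens` would come from the buffer used for tokenization (or be recovered
// from the Parser via the accessor added in this PR); `index` is the parser's
// current token position as reported by Parser::index.
fn parsed_statement_len(tokens: &[TokenWithLocation], index: usize) -> usize {
    tokens.iter().take(index).map(token_len).sum()
}
```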
Ah, I see. I was thinking of adding the actual byte position of a token to the `Location` struct to be able to achieve what I want. What do you think?
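For illustration, the idea might look something like the following; the `byte_offset` field name is hypothetical and not part of this PR, while `line` and `column` are the existing fields:

```rust
// Hypothetical sketch: extend Location with the byte offset of the token's
// start within the original SQL string.
pub struct Location {
    /// Line number, starting from 1
    pub line: u64,
    /// Column number, starting from 1
    pub column: u64,
    /// Byte offset into the original query string (proposed addition)
    pub byte_offset: usize,
}
```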
That would be a great feature!
```rust
let q = "INSERT INTO customer WITH foo AS (SELECT 1) SELECT * FROM foo UNION VALUES (1)";
let mut buf = Vec::new();
Tokenizer::new(&d, q)
    .tokenize_with_location_into_buf(&mut buf)
```
👍
This is a super simple PR that adds two new methods to aid in the reuse of token buffers; a rough usage sketch follows below.

- `Tokenizer::tokenize_with_location_into`: operates identically to `Tokenizer::tokenize_with_location`, except it allows you to supply your own buffer to use.
- `Parser::tokens`: allows you to retrieve the token buffer from a parser after you're done with it.
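A minimal sketch of how the buffer-reuse side could look, assuming the `into_buf` naming from the final revision of this thread, a `GenericDialect`, and a `Result<_, TokenizerError>` return type; this is simplified and not taken verbatim from the PR:

```rust
use sqlparser::dialect::GenericDialect;
use sqlparser::tokenizer::{TokenWithLocation, Tokenizer, TokenizerError};

fn tokenize_all(statements: &[&str]) -> Result<(), TokenizerError> {
    let dialect = GenericDialect {};
    // One buffer reused for every statement instead of allocating a fresh Vec each time.
    let mut buf: Vec<TokenWithLocation> = Vec::new();
    for &sql in statements {
        buf.clear(); // keep the allocation, drop the previous statement's tokens
        Tokenizer::new(&dialect, sql).tokenize_with_location_into_buf(&mut buf)?;
        // ... use `buf`; with `Parser::tokens` the buffer could similarly be
        // recovered from a Parser once parsing is finished ...
    }
    Ok(())
}
```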