-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert Presto SQL to Calcite SqlNode #171
Convert Presto SQL to Calcite SqlNode #171
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work. Thanks a lot, Wenrui!
|
||
|
||
public class PrestoParserDriver { | ||
private final static ParsingOptions parsingOptions = new ParsingOptions(AS_DOUBLE /* anything */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like new ParsingOptions
is deprecated, maybe change it to:
ParsingOptions.builder().setDecimalLiteralTreatment(AS_DOUBLE).build();
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And why do we choose AS_DOUBLE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 3 options here. REJECT will fail the parser. It either treats the literal as double or decimal. I think either way should be good to give the same result.
It failed to parse 0.00 or some other cases.
* Licensed under the BSD-2 Clause license. | ||
* See LICENSE in the project root for license information. | ||
*/ | ||
package com.linkedin.coral.presto.parser; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a little bit confusing to have both coral-trino
module and coral-presto
module (Actually, coral-trino
as called coral-presto
before.)
Would it be better if we rename this module to something like coral-from-presto
or coral-from-trino
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I would suggest to coral-from-presto in this case since preso is a public name. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you able to use Trino dependencies and classes? I also suggest we reuse the coral-trino
module and add a package called com.linkedin.coral.trino.trino2rel
along the lines of the existing com.linkedin.coral.trino.rel2trino
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the differences between the trino dependency and public presto dependency? Is it just about the name? Not sure the original context to choose trino over presto. If it's intended to open source, I think public dependency might be more suitable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although most of their grammars are still the same, the differences between them may become larger and larger.
FYI:
- Trino-346:Add support for window frames based on GROUPS. (#5713)
- Prestodb-0.232:Add support for ALTER FUNCTION.(#13799)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. Will move this change under the coral-trino/trino2rel
} | ||
|
||
public static SqlIdentifier createSqlIdentifier(SqlParserPos pos, String... path) { | ||
return new SqlIdentifier(Arrays.asList(path), ZERO); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be
return new SqlIdentifier(Arrays.asList(path), pos);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
} | ||
SqlParserPos pos = getParserPos(node); | ||
List<SqlNode> operands = | ||
node.getArguments().stream().map(arg -> visitNode(arg, context)).collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that if inner operand is *
, then node.getArguments()
is empty. Therefore, count(*)
would be converted to count()
We might need some special handlings for such kind of corner cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that difference in the test. I think to count() and count(*) are semantically equivalent in Calcite. If it's not same, I can add some special handling here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be aligned with what coral-hive does for star:
coral/coral-hive/src/main/java/com/linkedin/coral/hive/hive2rel/parsetree/ParseTreeBuilder.java
Line 578 in f023597
protected SqlNode visitFunctionStar(ASTNode node, ParseContext ctx) { |
I think it is better to handle the corner cases here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add speical handling for count(*)
coral-presto/src/main/java/com/linkedin/coral/presto/parser/ParseTreeBuilder.java
Outdated
Show resolved
Hide resolved
|
||
@Override | ||
protected SqlNode visitExtract(Extract node, ParserVisitorContext context) { | ||
SqlParserPos pos = getParserPos(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pos
is not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
SqlLiteral functionQualifier = node.isDistinct() ? SqlLiteral.createSymbol(SqlSelectKeyword.DISTINCT, ZERO) : null; | ||
SqlCall call = createCall(unresolvedFunction, operands, functionQualifier); | ||
if (node.getWindow().isPresent()) { | ||
return OVER.createCall(pos, call, visitWindow(node.getWindow().get(), context)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it doesn't consider ignore/respect nulls
.
lag(salary, 1) ignore nulls over (partition by depname)
and lag(salary, 1) respect nulls over (partition by depname)
are returning same results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the following code to take the ignoreNull into account.
if (NULL_CARED_OPERATOR.contains(call.getKind())) { if (node.isIgnoreNulls()) { call = IGNORE_NULLS.createCall(pos, call); } else { call = RESPECT_NULLS.createCall(pos, call); } }
import static org.testng.AssertJUnit.assertEquals; | ||
|
||
|
||
public class ParseTreeBuilderTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For tests, I think it is better to be aligned with other coral modules. i.e. create different unit tests for different functions and use assertion to check if the result is correct for each unit test, which would be cleaner and easier for debug. Maybe add a TODO item if we cannot handle it in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test on each function is a good suggestion. I can add them later. The test cases here are more like brute force testing to make sure most of the presto queries can be handled here. I do check their equivalence with the Calcite parsed SqlNode. Most of them are equal except of some corner cases such as count() and count(*), identifier upper or lower case, etc.
SqlNode percentageNode = visitNode(node.getSamplePercentage(), context); | ||
float percentage = 0; | ||
if (percentageNode instanceof SqlNumericLiteral) { | ||
percentage = (float) (((SqlNumericLiteral) percentageNode).longValue(true) / 100.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there would be precision issue for float type, i.e.:
select * from foo tablesample system (10) join bar tablesample bernoulli (30) on a.id = b.id
would be converted to SELECT * FROM "FOO" TABLESAMPLE SYSTEM(10.000000149011612) INNER JOIN "BAR" TABLESAMPLE BERNOULLI(30.000001192092896) ON "A"."ID" = "B"."ID"
But there seems no way to avoid it since it requires float type, any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ljfgem, I think you meant 0.10000000149011612
. Regardless, we can try to maintain just two decimal places (by formatting to string and and back to float).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wmoustafa It's stored 0.1 but converted 10.000000149011612 by Calcite toString function. That's probably unavoidable.
public String toString() {
StringBuilder b = new StringBuilder();
b.append(this.isBernoulli ? "BERNOULLI" : "SYSTEM");
b.append('(');
b.append((double)this.samplePercentage * 100.0D);
b.append(')');
if (this.isRepeatable) {
b.append(" REPEATABLE(");
b.append(this.repeatableSeed);
b.append(')');
}
return b.toString();
}
import static org.apache.calcite.util.Litmus.IGNORE; | ||
|
||
|
||
public class CalciteUtil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like many methods in this class are not used, are they for future use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I added these util functions which are very common when we deal with Calcite SqlNode processing. I leave them for potential use cases. I could delete them if that's not a good decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would delete them if possible. We can introduce them when we need them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @wenruimeng! I am done with ParseTreeBuilder
. Will get back the to rest soon.
|
||
|
||
/** | ||
* This is a reimplementation of vertical_blank's StandardSqlFormatter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a link here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
|
||
public class ParseTreeBuilder extends AstVisitor<SqlNode, ParserVisitorContext> { | ||
private static final String UNSUPPORTED_EXCEPTION_MSG = "%s is not supported in the visit."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do something along the lines of
Line 11 in 579ce52
public class UnhandledASTTokenException extends RuntimeException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add line and column number in the error message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant using the same exception class name and structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
private final SqlTypeFactoryImpl sqlTypeFactory = new SqlTypeFactoryImpl(new HiveTypeSystem()); | ||
private final HiveFunctionResolver functionResolver = | ||
new HiveFunctionResolver(new StaticHiveFunctionRegistry(), new ConcurrentHashMap<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to replace those with non-Hive versions? Something along those lines is in #151.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could create a trino specific type system, but I didn't find the incompatibility so far. Maybe I missed some scenarios. If we found any issue, we can fix it later. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify what you meant by incompatibilities? For example, the semantics of strpos
in Trino is carried by instr
of Coral-IR. There should be a number of other instances.
new HiveFunctionResolver(new StaticHiveFunctionRegistry(), new ConcurrentHashMap<>()); | ||
|
||
// convert the Presto node parse location to the Calcite SqlParserPos | ||
private SqlParserPos getParserPos(Node node) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename this to pos
to make the code less verbose? (it has 65 usages).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
private UnsupportedOperationException getUnsupportedException(Node node) { | ||
return new UnsupportedOperationException(format(UNSUPPORTED_EXCEPTION_MSG, node.toString())); | ||
} | ||
|
||
private UnsupportedOperationException getDDLException(Node node) { | ||
return new UnsupportedOperationException(format(DDL_NOT_SUPPORT_MSG, node.toString())); | ||
} | ||
|
||
private UnsupportedOperationException getLambdaException(Node node) { | ||
return new UnsupportedOperationException(format(LAMBDA_NOT_SUPPORT_MSG, node.toString())); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combine to one method with string argument? Also please see comment on UnhandledASTTokenException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
SqlNode percentageNode = visitNode(node.getSamplePercentage(), context); | ||
float percentage = 0; | ||
if (percentageNode instanceof SqlNumericLiteral) { | ||
percentage = (float) (((SqlNumericLiteral) percentageNode).longValue(true) / 100.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ljfgem, I think you meant 0.10000000149011612
. Regardless, we can try to maintain just two decimal places (by formatting to string and and back to float).
} | ||
|
||
private RelDataType getDecimalType(String type) { | ||
String value = type.substring(8, type.length() - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a more reliable way to get those values from the Trino/Presto Node
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the trino parser, it has better way to do the conversion. Will update it here.
private RelDataType convertType(String type) { | ||
if (type.toUpperCase().startsWith("DECIMAL(")) { | ||
return getDecimalType(type); | ||
} else if (type.toUpperCase().startsWith("VARCHAR(")) { | ||
return getVarcharType(type); | ||
} else if (type.toUpperCase().startsWith("CHAR(")) { | ||
return getCharType(type); | ||
} | ||
return sqlTypeFactory.createSqlType(SqlTypeName.valueOf(type.toUpperCase())); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we leverage Decimal#getPrecesion
and Decimal#getScale
, and comparable VarcharType
methods, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the trino parser, all these types are GenericDataType object which has name as Identifier and a list of arguments. We can do the name match and arguments handling separately. In previous Presto, the type is just a string. That's why it handled in such way.
return new SqlNodeList(getChildren(node, context), getParserPos(node)); | ||
} | ||
|
||
private List<SqlNode> getListSqlNode(List<? extends Node> nodes, ParserVisitorContext context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toListOfSqlNode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@Override | ||
protected SqlNode visitWindow(Window node, ParserVisitorContext context) { | ||
SqlParserPos pos = getParserPos(node); | ||
SqlNodeList partitionList = createSqlNodeList(getListSqlNode(node.getPartitionBy(), context), pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth introducing new method for createSqlNodeList(getListSqlNode(x))
together.
We can rename getListSqlNode
to toListOfSqlNode
, and implement createSqlNodeList(getListSqlNode(x))
in toSqlNodeList
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
…lOrder so it's consistent to the Calcite parser. Fix some minor edge cases by comparing the Calcite parsed SqlNode and Translated SqlNode
c99f66d
to
2d06db8
Compare
Thank you so much @wenruimeng for this great work! |
Very nice work! Thank you for working on this. |
I am out of touch with slack conversation so trying to understand the plan:
|
It has not started, but yes, conversion to
We had a couple of discussions around this, and |
This is the first task of issue #170
This PR introduces presto-parser dependency to parse the Presto SQL string to Presto Statement and implements the Presto AST visitor to convert the Presto Statement to SqlNode. DDL and Lambda are not supported in this PR.
There are about 300+ test cases that are most modified from the Presto repo test cases in presto-parser and presto-product-tests.