Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert Presto SQL to Calcite SqlNode #171

Merged
merged 10 commits into from
Nov 13, 2021

Conversation

wenruimeng
Copy link
Contributor

@wenruimeng wenruimeng commented Oct 10, 2021

This is the first task of issue #170
This PR introduces presto-parser dependency to parse the Presto SQL string to Presto Statement and implements the Presto AST visitor to convert the Presto Statement to SqlNode. DDL and Lambda are not supported in this PR.

There are about 300+ test cases that are most modified from the Presto repo test cases in presto-parser and presto-product-tests.

Copy link
Collaborator

@ljfgem ljfgem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work. Thanks a lot, Wenrui!



public class PrestoParserDriver {
private final static ParsingOptions parsingOptions = new ParsingOptions(AS_DOUBLE /* anything */);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like new ParsingOptions is deprecated, maybe change it to:

ParsingOptions.builder().setDecimalLiteralTreatment(AS_DOUBLE).build();

?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why do we choose AS_DOUBLE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 3 options here. REJECT will fail the parser. It either treats the literal as double or decimal. I think either way should be good to give the same result.
It failed to parse 0.00 or some other cases.

* Licensed under the BSD-2 Clause license.
* See LICENSE in the project root for license information.
*/
package com.linkedin.coral.presto.parser;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a little bit confusing to have both coral-trino module and coral-presto module (Actually, coral-trino as called coral-presto before.)
Would it be better if we rename this module to something like coral-from-presto or coral-from-trino?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I would suggest to coral-from-presto in this case since preso is a public name. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you able to use Trino dependencies and classes? I also suggest we reuse the coral-trino module and add a package called com.linkedin.coral.trino.trino2rel along the lines of the existing com.linkedin.coral.trino.rel2trino.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the differences between the trino dependency and public presto dependency? Is it just about the name? Not sure the original context to choose trino over presto. If it's intended to open source, I think public dependency might be more suitable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although most of their grammars are still the same, the differences between them may become larger and larger.
FYI:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Will move this change under the coral-trino/trino2rel

}

public static SqlIdentifier createSqlIdentifier(SqlParserPos pos, String... path) {
return new SqlIdentifier(Arrays.asList(path), ZERO);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be

return new SqlIdentifier(Arrays.asList(path), pos);

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

}
SqlParserPos pos = getParserPos(node);
List<SqlNode> operands =
node.getArguments().stream().map(arg -> visitNode(arg, context)).collect(Collectors.toList());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that if inner operand is *, then node.getArguments() is empty. Therefore, count(*) would be converted to count()
We might need some special handlings for such kind of corner cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that difference in the test. I think to count() and count(*) are semantically equivalent in Calcite. If it's not same, I can add some special handling here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be aligned with what coral-hive does for star:

protected SqlNode visitFunctionStar(ASTNode node, ParseContext ctx) {

I think it is better to handle the corner cases here.

Copy link
Contributor Author

@wenruimeng wenruimeng Oct 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add speical handling for count(*)


@Override
protected SqlNode visitExtract(Extract node, ParserVisitorContext context) {
SqlParserPos pos = getParserPos(node);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pos is not used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

SqlLiteral functionQualifier = node.isDistinct() ? SqlLiteral.createSymbol(SqlSelectKeyword.DISTINCT, ZERO) : null;
SqlCall call = createCall(unresolvedFunction, operands, functionQualifier);
if (node.getWindow().isPresent()) {
return OVER.createCall(pos, call, visitWindow(node.getWindow().get(), context));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it doesn't consider ignore/respect nulls.

lag(salary, 1) ignore nulls over (partition by depname) and lag(salary, 1) respect nulls over (partition by depname) are returning same results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the following code to take the ignoreNull into account.
if (NULL_CARED_OPERATOR.contains(call.getKind())) { if (node.isIgnoreNulls()) { call = IGNORE_NULLS.createCall(pos, call); } else { call = RESPECT_NULLS.createCall(pos, call); } }

import static org.testng.AssertJUnit.assertEquals;


public class ParseTreeBuilderTest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For tests, I think it is better to be aligned with other coral modules. i.e. create different unit tests for different functions and use assertion to check if the result is correct for each unit test, which would be cleaner and easier for debug. Maybe add a TODO item if we cannot handle it in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test on each function is a good suggestion. I can add them later. The test cases here are more like brute force testing to make sure most of the presto queries can be handled here. I do check their equivalence with the Calcite parsed SqlNode. Most of them are equal except of some corner cases such as count() and count(*), identifier upper or lower case, etc.

SqlNode percentageNode = visitNode(node.getSamplePercentage(), context);
float percentage = 0;
if (percentageNode instanceof SqlNumericLiteral) {
percentage = (float) (((SqlNumericLiteral) percentageNode).longValue(true) / 100.0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there would be precision issue for float type, i.e.:
select * from foo tablesample system (10) join bar tablesample bernoulli (30) on a.id = b.id would be converted to SELECT * FROM "FOO" TABLESAMPLE SYSTEM(10.000000149011612) INNER JOIN "BAR" TABLESAMPLE BERNOULLI(30.000001192092896) ON "A"."ID" = "B"."ID"
But there seems no way to avoid it since it requires float type, any ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ljfgem, I think you meant 0.10000000149011612. Regardless, we can try to maintain just two decimal places (by formatting to string and and back to float).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wmoustafa It's stored 0.1 but converted 10.000000149011612 by Calcite toString function. That's probably unavoidable.

        public String toString() {
            StringBuilder b = new StringBuilder();
            b.append(this.isBernoulli ? "BERNOULLI" : "SYSTEM");
            b.append('(');
            b.append((double)this.samplePercentage * 100.0D);
            b.append(')');
            if (this.isRepeatable) {
                b.append(" REPEATABLE(");
                b.append(this.repeatableSeed);
                b.append(')');
            }

            return b.toString();
        }

import static org.apache.calcite.util.Litmus.IGNORE;


public class CalciteUtil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like many methods in this class are not used, are they for future use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I added these util functions which are very common when we deal with Calcite SqlNode processing. I leave them for potential use cases. I could delete them if that's not a good decision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would delete them if possible. We can introduce them when we need them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will do it

Copy link
Contributor

@wmoustafa wmoustafa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @wenruimeng! I am done with ParseTreeBuilder. Will get back the to rest soon.



/**
* This is a reimplementation of vertical_blank's StandardSqlFormatter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a link here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done



public class ParseTreeBuilder extends AstVisitor<SqlNode, ParserVisitorContext> {
private static final String UNSUPPORTED_EXCEPTION_MSG = "%s is not supported in the visit.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do something along the lines of

public class UnhandledASTTokenException extends RuntimeException {
?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add line and column number in the error message

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant using the same exception class name and structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Comment on lines 88 to 90
private final SqlTypeFactoryImpl sqlTypeFactory = new SqlTypeFactoryImpl(new HiveTypeSystem());
private final HiveFunctionResolver functionResolver =
new HiveFunctionResolver(new StaticHiveFunctionRegistry(), new ConcurrentHashMap<>());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to replace those with non-Hive versions? Something along those lines is in #151.

Copy link
Contributor Author

@wenruimeng wenruimeng Oct 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could create a trino specific type system, but I didn't find the incompatibility so far. Maybe I missed some scenarios. If we found any issue, we can fix it later. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify what you meant by incompatibilities? For example, the semantics of strpos in Trino is carried by instr of Coral-IR. There should be a number of other instances.

new HiveFunctionResolver(new StaticHiveFunctionRegistry(), new ConcurrentHashMap<>());

// convert the Presto node parse location to the Calcite SqlParserPos
private SqlParserPos getParserPos(Node node) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename this to pos to make the code less verbose? (it has 65 usages).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 100 to 110
private UnsupportedOperationException getUnsupportedException(Node node) {
return new UnsupportedOperationException(format(UNSUPPORTED_EXCEPTION_MSG, node.toString()));
}

private UnsupportedOperationException getDDLException(Node node) {
return new UnsupportedOperationException(format(DDL_NOT_SUPPORT_MSG, node.toString()));
}

private UnsupportedOperationException getLambdaException(Node node) {
return new UnsupportedOperationException(format(LAMBDA_NOT_SUPPORT_MSG, node.toString()));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine to one method with string argument? Also please see comment on UnhandledASTTokenException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

SqlNode percentageNode = visitNode(node.getSamplePercentage(), context);
float percentage = 0;
if (percentageNode instanceof SqlNumericLiteral) {
percentage = (float) (((SqlNumericLiteral) percentageNode).longValue(true) / 100.0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ljfgem, I think you meant 0.10000000149011612. Regardless, we can try to maintain just two decimal places (by formatting to string and and back to float).

}

private RelDataType getDecimalType(String type) {
String value = type.substring(8, type.length() - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more reliable way to get those values from the Trino/Presto Node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the trino parser, it has better way to do the conversion. Will update it here.

Comment on lines 1028 to 1037
private RelDataType convertType(String type) {
if (type.toUpperCase().startsWith("DECIMAL(")) {
return getDecimalType(type);
} else if (type.toUpperCase().startsWith("VARCHAR(")) {
return getVarcharType(type);
} else if (type.toUpperCase().startsWith("CHAR(")) {
return getCharType(type);
}
return sqlTypeFactory.createSqlType(SqlTypeName.valueOf(type.toUpperCase()));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we leverage Decimal#getPrecesion and Decimal#getScale, and comparable VarcharType methods, etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the trino parser, all these types are GenericDataType object which has name as Identifier and a list of arguments. We can do the name match and arguments handling separately. In previous Presto, the type is just a string. That's why it handled in such way.

return new SqlNodeList(getChildren(node, context), getParserPos(node));
}

private List<SqlNode> getListSqlNode(List<? extends Node> nodes, ParserVisitorContext context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toListOfSqlNode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Override
protected SqlNode visitWindow(Window node, ParserVisitorContext context) {
SqlParserPos pos = getParserPos(node);
SqlNodeList partitionList = createSqlNodeList(getListSqlNode(node.getPartitionBy(), context), pos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth introducing new method for createSqlNodeList(getListSqlNode(x)) together.

We can rename getListSqlNode to toListOfSqlNode, and implement createSqlNodeList(getListSqlNode(x)) in toSqlNodeList.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@wenruimeng wenruimeng force-pushed the wenrui/presto_to_sqlnode branch from c99f66d to 2d06db8 Compare October 30, 2021 01:11
@wmoustafa
Copy link
Contributor

Thank you so much @wenruimeng for this great work!

@wmoustafa wmoustafa merged commit 5dc967b into linkedin:master Nov 13, 2021
@antumbde
Copy link
Contributor

Very nice work! Thank you for working on this.

@antumbde
Copy link
Contributor

I am out of touch with slack conversation so trying to understand the plan:

  1. Current impl seems to be first phase to convert Trino parse tree to calcite parse tree. Is there WIP to convert that to rel ?
  2. It's using HiveFunctionResolver and static function registry. We should change that to TrinoResolver and TrinoFunctionRegistry. (Worth discussing how to organize code to support to/from conversions). Maybe this is already planned.

@wmoustafa
Copy link
Contributor

wmoustafa commented Nov 15, 2021

  1. Current impl seems to be first phase to convert Trino parse tree to calcite parse tree. Is there WIP to convert that to rel ?

It has not started, but yes, conversion to RelNode should be addressed in the second phase.

  1. It's using HiveFunctionResolver and static function registry. We should change that to TrinoResolver and TrinoFunctionRegistry. (Worth discussing how to organize code to support to/from conversions). Maybe this is already planned.

We had a couple of discussions around this, and HiveFunctionResolver is just a placeholder in this patch. It should go way in the step to convert to RelNode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants