Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Spark to_json function #11995

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

wecharyu
Copy link
Contributor

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 31, 2024
Copy link

netlify bot commented Dec 31, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit d468270
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67c1313fc983e3000856e91d

@zhouyuan
Copy link
Contributor

zhouyuan commented Jan 2, 2025

CC @rui-mo

Copy link
Collaborator

@assignUser assignUser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake 👍

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Added some initial comments.

@@ -44,3 +44,14 @@ JSON Functions
SELECT json_object_keys(''); -- NULL
SELECT json_object_keys(1); -- NULL
SELECT json_object_keys('"hello"'); -- NULL

.. spark:function:: to_json(json_object) -> jsonString
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use camel-case naming style for 'json_object'.


namespace facebook::velox::functions::sparksql {

class ToJsonCallToSpecialForm : public exec::FunctionCallToSpecialForm {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the result type is determined (varchar), we might don't need to implement 'to_json' as a special form. Special form is only used for the functions whose result type cannot be inferred from input type. For this case, we can implement it as a simple function.
More information on simple functions: https://github.com/facebookincubator/velox/blob/main/velox/docs/develop/scalar-functions.rst.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the special form, pls take a look again.

@wecharyu wecharyu requested a review from rui-mo January 22, 2025 11:23
});
}

class ToJsonFunction final : public exec::VectorFunction {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When implementing a function, simple function is preferred unless the implementation of vector function provides a significant performance gain which can be demonstrated with a benchmark.

Hi @wecharyu, thanks for the update! My comment is still about the way of implementation. As documented at https://github.com/facebookincubator/velox/blob/main/velox/docs/develop/scalar-functions.rst#vector-functions, simple function would be preferred if no benchmark to prove a large performance improvement. I'm wondering if there is any block issue you have met to prevent implementing it as simple function. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I cannot find a way to register the Row input for different size of input schemas when writing it as a simple function. For example, for struct(bigint), I need register a function as follows:

  registerFunction<ToJsonFunction, Varchar, Row<int64_t>>(
      {prefix + "to_json"});

and for other schemas like struct(bigint, string), I need register a new function. Is there a way to register a generic function for Row type?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your point. It makes sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may check this:

// For arbitrary nested complex types.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Converts a Json object (ROW, ARRAY or MAP) into a JSON string.
The current implementation has following limitations.

* Does not support user provided options. ::
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please provide an example for the unsupported case?

const SimpleVector<T>& input,
int row,
std::string& result,
const TypePtr& type) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add specilizations to handle different types.

result.data()[resultSize + 1] = '"';
} else if constexpr (std::is_same_v<T, UnknownValue>) {
VELOX_FAIL(
"Convert UNKNOWN to JSON: Vextors of UNKNOWN type should not contain non-null rows");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo for Vextors

result.clear();
generateJsonTyped<T>(*inputVector, row, result, inputVector->type());

flatResult.set(row, StringView(result));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the to-string functions, usually we can optimize by estimating total size, allocating the data buffer once and writing to the data buffer directly. This helps avoid generating intermediate strings and improve the performance. Some example:
https://github.com/facebookincubator/velox/blob/main/velox/expression/CastExpr-inl.h#L609-L624

});
}

// Extra length for commas and brackets
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: document are full sentences and should end with .

size_t elementsStringSize = 0;
context.applyToSelectedNoThrow(rows, [&](auto row) {
if (inputArray->isNullAt(row)) {
// "null" will be inlined in the StringView.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, please fix it throughout

@rui-mo
Copy link
Collaborator

rui-mo commented Feb 17, 2025

Hi @zhli1142015, would you like to take a review? Thanks.

@zhli1142015
Copy link
Contributor

Hi @zhli1142015, would you like to take a review? Thanks.

Sure.

@@ -16,6 +16,7 @@
#include "velox/functions/lib/RegistrationHelpers.h"
#include "velox/functions/sparksql/GetJsonObject.h"
#include "velox/functions/sparksql/JsonObjectKeys.h"
#include <expression/VectorFunction.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this include is not necessary.

auto value = input.valueAt(row);

if constexpr (std::is_same_v<T, StringView>) {
size_t resultSize = escapedStringSize(value.data(), value.size());
Copy link
Contributor

@zhli1142015 zhli1142015 Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function is not available, please sync,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, do you mean we can not use this function here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not available in main branch, I think you need to find the replacement and update your code.

std::nullopt});
auto input = makeRowVector({"a"}, {data});
auto expected = makeFlatVector<std::string>(
{R"({"a":"1970-01-01T00:00:00.000000000"})",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think timestamp in spark is with micros precision, can you confirm if this matches the spark output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark would display the milliseconds by default, and can change the format by options. Since we do not support user provided options yet, we can force the Timestamp format to be the same as Spark default behavior.
https://github.com/apache/spark/blob/412da42ab2e91a643980634d032fec0738a59590/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L646-L659

result->clearNulls(rows);
auto* rawResults = result->as<FlatVector<StringView>>();

VELOX_DYNAMIC_TYPE_DISPATCH_ALL(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think column names are not always passed from Spark to this function. Instead, col_XX is used as a placeholder. We may need to explicitly pass the column names.

#include <functions/prestosql/json/JsonStringUtil.h>
#include <type/Conversions.h>
#include <type/DecimalUtil.h>
#include <type/StringView.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove unnessary include here.

static std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() {
// T(ROW/ARRAY/MAP) -> varchar
return {exec::FunctionSignatureBuilder()
.typeVariable("T")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only allow ROW/ARRAY/MAP types as input types, we should add some check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the input type check in apply() method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check the type before calling apply. Please perform the check in the initialize method or the constructor.
Please follow either this or this.


.. spark:function:: to_json(jsonObject) -> jsonString

Converts a Json object (ROW, ARRAY or MAP) into a JSON string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please test your change with JsonFunctionsSuite and JsonExpressionsSuite, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants