feat: Add Spark to_json function #11995
base: main
Conversation
CC @rui-mo
CMake 👍
Thanks! Added some initial comments.
velox/docs/functions/spark/json.rst
Outdated
@@ -44,3 +44,14 @@ JSON Functions
    SELECT json_object_keys(''); -- NULL
    SELECT json_object_keys(1); -- NULL
    SELECT json_object_keys('"hello"'); -- NULL

.. spark:function:: to_json(json_object) -> jsonString
nit: use camel-case naming style for 'json_object'.
namespace facebook::velox::functions::sparksql {

class ToJsonCallToSpecialForm : public exec::FunctionCallToSpecialForm {
Since the result type is fixed (varchar), we might not need to implement 'to_json' as a special form. Special forms are only used for functions whose result type cannot be inferred from the input types. For this case, we can implement it as a simple function.
More information on simple functions: https://github.com/facebookincubator/velox/blob/main/velox/docs/develop/scalar-functions.rst.
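For context, a minimal sketch of the simple-function shape the docs describe, not the PR's actual code (the ToJsonFunction name and the Row&lt;int64_t&gt; input below are illustrative assumptions):

```cpp
// A simple function: the result type (VARCHAR) is fixed by the template
// signature, so the framework can infer it and no special form is needed.
template <typename TExec>
struct ToJsonFunction {
  VELOX_DEFINE_FUNCTION_TYPES(TExec);

  // Sketch for a single ROW(BIGINT) schema; serialization body omitted.
  void call(out_type<Varchar>& result, const arg_type<Row<int64_t>>& input) {
    // Write the JSON representation of 'input' into 'result' here.
  }
};

// Registration ties the C++ types to the SQL signature.
registerFunction<ToJsonFunction, Varchar, Row<int64_t>>({prefix + "to_json"});
```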
Removed the special form, please take a look again.
});
}

class ToJsonFunction final : public exec::VectorFunction { |
When implementing a function, a simple function is preferred unless a vector function provides a significant performance gain that can be demonstrated with a benchmark.
Hi @wecharyu, thanks for the update! My comment is still about the way of implementation. As documented at https://github.com/facebookincubator/velox/blob/main/velox/docs/develop/scalar-functions.rst#vector-functions, a simple function is preferred when there is no benchmark showing a large performance improvement. I'm wondering if there is any blocking issue that prevents implementing it as a simple function. Thanks.
Here I cannot find a way to register the Row input for different input schemas when writing it as a simple function. For example, for struct(bigint) I need to register a function as follows:

registerFunction<ToJsonFunction, Varchar, Row<int64_t>>(
    {prefix + "to_json"});

and for other schemas like struct(bigint, string) I need to register a new function. Is there a way to register a generic function for the Row type?
I understand your point. It makes sense to me.
You may check this:
velox/velox/functions/sparksql/Comparisons.h
Line 179 in 5d0e276
// For arbitrary nested complex types.
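To make the hint concrete, a hedged sketch of what a generic-typed registration could look like (the exact spelling is an assumption to check against scalar-functions.rst):

```cpp
// With generic types, a single registration covers ROW, ARRAY and MAP inputs.
// The call signature then takes a generic view, e.g.
//   void call(out_type<Varchar>& result, const arg_type<Generic<T1>>& input);
// and the view exposes the runtime type for dispatch during serialization.
registerFunction<ToJsonFunction, Varchar, Generic<T1>>({prefix + "to_json"});
```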
Thanks!
Converts a Json object (ROW, ARRAY or MAP) into a JSON string.
The current implementation has following limitations.

* Does not support user provided options. ::
Would you please provide an example for the unsupported case?
    const SimpleVector<T>& input,
    int row,
    std::string& result,
    const TypePtr& type) {
Perhaps add specializations to handle different types.
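For illustration, a hedged sketch of such per-type branches inside generateJsonTyped (names follow the snippet above; the branches and the escaping are simplified assumptions):

```cpp
// Simplified per-type handling; string escaping and null checks omitted.
template <typename T>
void generateJsonTyped(
    const SimpleVector<T>& input,
    int row,
    std::string& result,
    const TypePtr& type) {
  const auto value = input.valueAt(row);
  if constexpr (std::is_same_v<T, StringView>) {
    // Strings are quoted (escaping omitted in this sketch).
    result.append("\"").append(value.data(), value.size()).append("\"");
  } else if constexpr (std::is_same_v<T, bool>) {
    result.append(value ? "true" : "false");
  } else if constexpr (std::is_same_v<T, Timestamp>) {
    // Timestamps are quoted; see the format discussion further down.
    result.append("\"").append(value.toString()).append("\"");
  } else {
    // Remaining numeric types use their plain decimal representation.
    result.append(std::to_string(value));
  }
}
```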
velox/functions/sparksql/ToJson.cpp
Outdated
  result.data()[resultSize + 1] = '"';
} else if constexpr (std::is_same_v<T, UnknownValue>) {
  VELOX_FAIL(
      "Convert UNKNOWN to JSON: Vextors of UNKNOWN type should not contain non-null rows");
Typo: 'Vextors' should be 'Vectors'.
velox/functions/sparksql/ToJson.cpp
Outdated
result.clear();
generateJsonTyped<T>(*inputVector, row, result, inputVector->type());

flatResult.set(row, StringView(result));
For the to-string functions, we can usually optimize by estimating the total size, allocating the data buffer once, and writing to the data buffer directly. This avoids generating intermediate strings and improves performance. An example:
https://github.com/facebookincubator/velox/blob/main/velox/expression/CastExpr-inl.h#L609-L624
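A hedged sketch of that pattern adapted to the names in this diff (estimateJsonSize and writeJsonTo are hypothetical helpers; the buffer-reservation call mirrors the linked CastExpr example):

```cpp
// First pass: estimate the total serialized size of all selected rows.
size_t totalSize = 0;
context.applyToSelectedNoThrow(rows, [&](auto row) {
  totalSize += estimateJsonSize(*inputVector, row);  // hypothetical helper
});
// Reserve the string payload once instead of building per-row std::strings.
char* rawBuffer = flatResult.getRawStringBufferWithSpace(totalSize);
// Second pass: serialize each row directly into the reserved buffer.
context.applyToSelectedNoThrow(rows, [&](auto row) {
  const size_t written = writeJsonTo(*inputVector, row, rawBuffer);  // hypothetical
  flatResult.setNoCopy(row, StringView(rawBuffer, written));
  rawBuffer += written;
});
```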
velox/functions/sparksql/ToJson.cpp
Outdated
});
}

// Extra length for commas and brackets
nit: comments are full sentences and should end with a period.
size_t elementsStringSize = 0;
context.applyToSelectedNoThrow(rows, [&](auto row) {
  if (inputArray->isNullAt(row)) {
    // "null" will be inlined in the StringView.
ditto, please fix it throughout
Hi @zhli1142015, would you like to take a review? Thanks.
Sure.
@@ -16,6 +16,7 @@
#include "velox/functions/lib/RegistrationHelpers.h"
#include "velox/functions/sparksql/GetJsonObject.h"
#include "velox/functions/sparksql/JsonObjectKeys.h"
#include <expression/VectorFunction.h>
I think this include is not necessary.
velox/functions/sparksql/ToJson.cpp
Outdated
auto value = input.valueAt(row);

if constexpr (std::is_same_v<T, StringView>) {
  size_t resultSize = escapedStringSize(value.data(), value.size());
I think this function is not available; please sync with the main branch.
Sorry, do you mean we cannot use this function here?
This function is not available in the main branch; I think you need to find a replacement and update your code.
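If a replacement is needed, a minimal stand-in could compute a conservative upper bound of the escaped size (hypothetical helper, not an existing Velox utility):

```cpp
#include <cstddef>

// Conservative size of a string after JSON escaping: quotes and backslashes
// become two characters, control characters are counted as \uXXXX (6 chars).
size_t jsonEscapedSize(const char* data, size_t size) {
  size_t escaped = 0;
  for (size_t i = 0; i < size; ++i) {
    const unsigned char c = static_cast<unsigned char>(data[i]);
    if (c == '"' || c == '\\') {
      escaped += 2;
    } else if (c < 0x20) {
      escaped += 6;
    } else {
      escaped += 1;
    }
  }
  return escaped;
}
```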
     std::nullopt});
auto input = makeRowVector({"a"}, {data});
auto expected = makeFlatVector<std::string>(
    {R"({"a":"1970-01-01T00:00:00.000000000"})",
I think timestamps in Spark have microsecond precision; can you confirm this matches the Spark output?
Spark displays milliseconds by default, and the format can be changed via options. Since we do not support user-provided options yet, we can force the Timestamp format to match Spark's default behavior.
https://github.com/apache/spark/blob/412da42ab2e91a643980634d032fec0738a59590/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L646-L659
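Not part of the PR; just a hedged sketch of what forcing a default millisecond format could look like using plain C/POSIX formatting (the exact Spark pattern and time-zone handling are assumptions to verify against the linked suite):

```cpp
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <string>

// Formats an epoch timestamp as "yyyy-MM-dd'T'HH:mm:ss.SSS" in UTC,
// truncating nanoseconds to milliseconds to mimic Spark's default output.
std::string formatTimestampMillis(int64_t epochSeconds, uint64_t nanos) {
  std::tm tmValue{};
  const time_t seconds = static_cast<time_t>(epochSeconds);
  gmtime_r(&seconds, &tmValue);
  char buf[40];
  const auto len = std::strftime(buf, sizeof(buf), "%Y-%m-%dT%H:%M:%S", &tmValue);
  std::snprintf(
      buf + len,
      sizeof(buf) - len,
      ".%03u",
      static_cast<unsigned>(nanos / 1'000'000));
  return std::string(buf);
}
```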
result->clearNulls(rows);
auto* rawResults = result->as<FlatVector<StringView>>();

VELOX_DYNAMIC_TYPE_DISPATCH_ALL(
I think column names are not always passed from Spark to this function. Instead, col_XX is used as a placeholder. We may need to explicitly pass the column names.
velox/functions/sparksql/ToJson.cpp
Outdated
#include <functions/prestosql/json/JsonStringUtil.h>
#include <type/Conversions.h>
#include <type/DecimalUtil.h>
#include <type/StringView.h>
Please remove unnecessary includes here.
static std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() {
  // T(ROW/ARRAY/MAP) -> varchar
  return {exec::FunctionSignatureBuilder()
              .typeVariable("T")
I think we only allow ROW/ARRAY/MAP types as input types; we should add some checks.
Added the input type check in the apply() method.
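For reference, a hedged sketch of what such a check could look like inside the vector function's apply() (the error text and surrounding signature are illustrative; the guard itself is the point):

```cpp
void apply(
    const SelectivityVector& rows,
    std::vector<VectorPtr>& args,
    const TypePtr& outputType,
    exec::EvalCtx& context,
    VectorPtr& result) const override {
  // Reject anything other than ROW, ARRAY or MAP up front.
  const auto kind = args[0]->typeKind();
  VELOX_USER_CHECK(
      kind == TypeKind::ROW || kind == TypeKind::ARRAY ||
          kind == TypeKind::MAP,
      "to_json supports only ROW, ARRAY and MAP inputs, got {}",
      args[0]->type()->toString());
  // ... serialization follows.
}
```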
.. spark:function:: to_json(jsonObject) -> jsonString

   Converts a Json object (ROW, ARRAY or MAP) into a JSON string.
Please test your change with JsonFunctionsSuite and JsonExpressionsSuite, thanks.
Force-pushed from df9dbda to 8834290.
Support Spark to_json function: https://docs.databricks.com/en/sql/language-manual/functions/to_json.html