Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VL] array_size(null) results inconsistent with vanilla spark #5248

Closed
wForget opened this issue Apr 2, 2024 · 9 comments · Fixed by #6014
Closed

[VL] array_size(null) results inconsistent with vanilla spark #5248

wForget opened this issue Apr 2, 2024 · 9 comments · Fixed by #6014
Labels
bug Something isn't working triage

Comments

@wForget
Copy link
Member

wForget commented Apr 2, 2024

Backend

VL (Velox)

Bug description

array_size(null) results inconsistent with vanilla spark.

test sql:

SELECT array_size(null)

native engine returns: -1
vanilla spark returns: null

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

@wForget wForget added bug Something isn't working triage labels Apr 2, 2024
@wForget
Copy link
Member Author

wForget commented Apr 2, 2024

spark replaces array_size with size and specifies legacySizeOfNull as false, however the velox's size function respects spark.sql.legacy.sizeOfNull session conf.

@wForget
Copy link
Member Author

wForget commented Apr 2, 2024

I want to add a new SizeExpressionTransformer to convert size function result. If legacySizeOfNull is false, convert the result -1 to null.
@PHILO-HE Is this feasible?

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Apr 2, 2024

Hi @wForget, thanks for bringing up this issue!

Looks velox has a config to control the behavior. https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/Size.cpp#L35

I note Gluten sets it according to Spark's config to align with Spark's "Size" function. But, for "ArraySize" function, we expect it's always false.
https://github.com/apache/incubator-gluten/blob/main/cpp/velox/compute/WholeStageResultIterator.cc#L482

For performance consideration, it may be better to directly do some changes in velox's size function, e.g., add support for two args here. The extra arg is legacySizeOfNull flag. If Velox finds it is specified, it will use this flag and dismiss the config setting. Then on Gluten side, SizeExpressionTransformer can check whether legacySizeOfNull is consistent with Spark conf. If not, pass the flag along with the input to Velox. Does this make sense?

@wForget
Copy link
Member Author

wForget commented Apr 2, 2024

Hi @wForget, thanks for bringing up this issue!

Looks velox has a config to control the behavior. https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/Size.cpp#L35

I note Gluten sets it according to Spark's config to align with Spark's "Size" function. For "ArraySize" function, we expect it's always false. https://github.com/apache/incubator-gluten/blob/main/cpp/velox/compute/WholeStageResultIterator.cc#L482

For performance consideration, it may be better to directly do some changes in velox's size function, e.g., add support for two args here. The extra arg is legacySizeOfNull flag. If Velox finds it is specified, it will use this flag and dismiss the config setting. Then on Gluten side, SizeExpressionTransformer can check whether legacySizeOfNull is consistent with Spark conf. If not, pass the flag along with the input to Velox. Does this make sense?

@PHILO-HE Thanks for your guidance, this makes sense to me, I will try it.

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Apr 2, 2024

This is an example to let a function struct cover different inputs. Similarly, you need to add an extra initialize & call method, then register it.
https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/DateTimeFunctions.h#L999

@wForget
Copy link
Member Author

wForget commented Apr 2, 2024

This is an example to let a function struct cover different inputs. Similarly, you need to add an extra initialize & call method, then register it. https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/DateTimeFunctions.h#L999

Thanks, I'm trying to do it that way, and will send a pr later.

@gaoyangxiaozhu
Copy link
Contributor

I think before the wForget's PR ready in velox, at least we can fallback if sparkLegacySizeOfNull been set to true, right ? @PHILO-HE

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Jun 4, 2024

I think before the wForget's PR ready in velox, at least we can fallback if sparkLegacySizeOfNull been set to true, right ?

@wForget, do you have any progress? I can take over it if you have no bandwidth. @gaoyangxiaozhu, let's wait two or three days.

@wForget
Copy link
Member Author

wForget commented Jun 11, 2024

I think before the wForget's PR ready in velox, at least we can fallback if sparkLegacySizeOfNull been set to true, right ?

@wForget, do you have any progress? I can take over it if you have no bandwidth. @gaoyangxiaozhu, let's wait two or three days.

Sorry, I was interrupted by something else, please feel free to send PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants