[VL] Unsupported spark function list [please leave a comment if you plan to pick some] #4039

Open
54 of 99 tasks
PHILO-HE opened this issue Dec 14, 2023 · 87 comments
Labels
enhancement New feature or request

Comments

@PHILO-HE
Contributor

PHILO-HE commented Dec 14, 2023

Description

Listed here are the Spark functions that the Gluten Velox backend does not yet support. Please leave a comment if you'd like to pick some. In the list below, [√] means someone is already working on the corresponding function.
You can find the support status of all functions in this Gluten doc.

To avoid duplicate work, before starting, please check whether a PR has already been submitted to the Velox community, or whether the function is already implemented in Velox, which holds most SQL functions in its sparksql and prestosql folders.

Reference:


  • percentile_approx/approx_percentile (WIP, guangxin)
  • concat_ws (PR ready, feat: Add Spark concat_ws function facebookincubator/velox#8854)
  • unix_timestamp: "Only supports string type, with session timezone considered, todo: support date type"
  • locate
  • parse_url (PR drafted, not merged)
  • urldecoder: "UDF, supported by spark as a built-in function since 3.4.0."
  • normalizenanandzero
  • arrayintersects
  • default.json_split (udf, no need to impl.): "external UDF"
  • parsejsonarray: "external UDF"
  • struct
  • percentile (@Yohahaha)
  • first/first_value (@JkSelf)
  • last/last_value (@JkSelf)
  • posexplode (WIP, @marin-ma)
  • trunc (WIP, HannanKan)
  • months_between (PR ready)
  • date_trunc (WIP, HannanKan)
  • stack
  • grouping_id
  • printf (@Surbhi-Vijay)
  • space (WIP, rhh777)
  • inline (WIP, @marin-ma)
  • to_unix_timestamp: "Only supports string type, with session timezone considered. todo: support date type"
  • from_csv
  • from_json (feat: Add Spark from_json function facebookincubator/velox#11709)
  • to_json (@wecharyu)
  • json_object_keys
  • json_tuple
  • schema_of_csv
  • schema_of_json
  • to_csv
  • make_ym_interval (WIP, @marin-ma)
  • make_timestamp (WIP, @marin-ma)
  • make_interval
  • make_dt_interval
  • monotonically_increasing_id
  • from_utc_timestamp (@acvictor)
  • extract
  • exists (@lyy-pineapple)
  • date_part
  • zip_with
  • transform (@Yohahaha)
  • transform_keys
  • transform_values
  • map_from_entries (WIP, MaYan)
  • map_filter (WIP, MaYan)
  • map_entries (Done, by MaYan)
  • map_concat
  • forall (@lyy-pineapple)
  • flatten (@ivoson)
  • filter
  • filter (array) (@ivoson)
  • width_bucket
  • array_sort (@boneanxs)
  • xpath
  • xpath_boolean
  • xpath_double
  • xpath_float
  • xpath_int
  • xpath_long
  • xpath_number
  • xpath_short
  • xpath_string
  • unbase64 (WIP, @fyp711)
  • decode (partially supported if translated to caseWhen. WIP Cody)
  • initcap (WIP, velox PR: 8676)
  • unix_date (velox PR 8725, completed)
  • count_min_sketch
  • bool_and/every (@mskapilks)
  • bool_or/any/some (@mskapilks)
  • shuffle (completed)
  • bround (@xumingming)
  • format_string (@gaoyangxiaozhu)
  • format_number (@gaoyangxiaozhu)
  • soundex (@zhli1142015)
  • levenshtein (@zhli1142015)
  • cot (@honeyhexin)
  • expm1 (@Donvi)
  • stack (generator function, @xumingming)
  • randn (@Donvi)
  • empty2null (internal function, @jinchengchenghh)
  • toprettystring (internal function, @jinchengchenghh)
  • AtLeastNNonNulls (internal function, @zhli1142015)
  • GetStructField (internal function)
  • Since Spark-3.3 (related to ML, low priority)
  • regr_count
  • regr_avgx
  • regr_avgy
  • regr_r2
  • regr_sxx
  • regr_sxy
  • regr_syy
  • regr_slope
  • regr_intercept
  • Since Spark-3.3

  • Since Spark-3.4

@PHILO-HE PHILO-HE added the enhancement New feature or request label Dec 14, 2023
@PHILO-HE PHILO-HE pinned this issue Dec 14, 2023
@PHILO-HE PHILO-HE changed the title [VL] Spark function support list [please leave comment/mark if you plan to implement] [VL] Unsupported spark function list [please leave comment/mark if you plan to implement] Dec 15, 2023
@PHILO-HE PHILO-HE changed the title [VL] Unsupported spark function list [please leave comment/mark if you plan to implement] [VL] Unsupported spark function list [please leave a comment if you plan to pick some] Dec 15, 2023
@Yohahaha
Contributor

Yohahaha commented Dec 29, 2023

I'd like to support hex and unhex.

Update: hex and unhex are already supported in Gluten.

@zwangsheng
Contributor

Hi, I'd like to give the hour function a try.

@konjac
Contributor

konjac commented Jan 4, 2024

Hi, I'd like to look into map_keys.

@fyp711
Contributor

fyp711 commented Jan 11, 2024

Hi, I'd like to support find_in_set in Velox.

@HannanKan
Contributor

Hi, I'd like to support date_trunc/trunc.

@JkSelf
Contributor

JkSelf commented Jan 22, 2024

Hi, I'd like to support dense_rank.

@JkSelf
Contributor

JkSelf commented Jan 22, 2024

dense_rank is already supported in Velox: facebookincubator/velox#6289.

@zhztheplayer
Member

  • percentile_approx
  • approx_percentile: "Third argument accuracy is different with velox, velox is double but spark is long"

The two stand for the same function I assume? I'll take these two if nobody is working on it.

@PHILO-HE
Contributor Author

  • percentile_approx
  • approx_percentile: "Third argument accuracy is different with velox, velox is double but spark is long"

The two stand for the same function I assume? I'll take these two if nobody is working on it.

Yes, they are one thing. Just unify them into one checkbox. Thanks!
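
The type mismatch noted above (Spark passes accuracy as a long, Velox expects a double) can be illustrated with a small conversion sketch. Spark documents 1.0/accuracy as the relative error of the approximation; treating the Velox-side argument as that error bound is an assumption here, and `AccuracyConversion` and `sparkAccuracyToErrorBound` are hypothetical names, not Gluten or Velox APIs.

```scala
// Hypothetical helper for illustration only; not part of Gluten or Velox.
object AccuracyConversion {
  // Spark: accuracy is a positive integral value, and 1.0 / accuracy is the
  // relative error of the approximation.
  // Assumption: the native side wants that error bound as a double.
  def sparkAccuracyToErrorBound(accuracy: Long): Double = {
    require(accuracy > 0, s"accuracy must be positive, got $accuracy")
    1.0 / accuracy
  }
}
```

With Spark's default accuracy of 10000, this sketch yields an error bound of 0.0001.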

@JkSelf
Contributor

JkSelf commented Jan 22, 2024

I will take a look at the ntile window function.

@zhouyuan
Contributor

unbase64:
#4482

@zjuwangg
Contributor

Is there any plan to support the from_json function?

@yma11
Contributor

yma11 commented Jan 29, 2024

I'd like to take map_entries and map_from_entries. There are already Presto implementations in Velox, so I will need to check their consistency with Spark.

@acvictor
Contributor

I'd like to give date_from_unix_date a shot

@PHILO-HE
Contributor Author

PHILO-HE commented Feb 21, 2024

Just removed the below functions from the list, since they have been supported. Thanks! @acvictor, @Yohahaha, @fyp711, @zwangsheng, @JkSelf, etc.

to_date hour mod pow ifnull add_months next_day dense_rank find_in_set hex ntile
date_from_unix_date array_repeat array_position array_except array_distinct weekday
year month day

@acvictor
Contributor

acvictor commented Feb 21, 2024

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

@Surbhi-Vijay
Contributor

nullif is supported out of the box. Spark sends the converted expression as an If expression, which is supported in Gluten.
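
A minimal sketch of the semantics that rewrite preserves, in plain Scala with `None` standing in for SQL NULL. `NullIfSketch` and `nullIf` are hypothetical names used only for illustration; this is not Spark's or Gluten's code.

```scala
// Illustration only: models nullif(a, b) as "if (a = b) NULL else a".
object NullIfSketch {
  // When either side is NULL the comparison is not true, so the else
  // branch is taken and a is returned unchanged.
  def nullIf[A](a: Option[A], b: Option[A]): Option[A] =
    (a, b) match {
      case (Some(x), Some(y)) if x == y => None
      case _                            => a
    }
}
```

For example, `NullIfSketch.nullIf(Some(1), Some(1))` gives `None`, while `NullIfSketch.nullIf(Some(1), Some(2))` gives `Some(1)`.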

@PHILO-HE
Contributor Author

nullif is out of the box supported. Spark send the converted expression as If expression and it is supported in Gluten.

Thanks so much for your feedback! Just removed it from the list.

@acvictor
Contributor

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

Will do minute as well.

@rui-mo
Contributor

rui-mo commented Feb 26, 2024

I'd like to work on locate and arrayintersect.

@mskapilks
Contributor

I would like to work on bool_and, bool_or

@zhztheplayer
Member

zhztheplayer commented Feb 29, 2024

  • collect_list (velox supported, needs Gluten to enable array for project plan node)
  • collect_set

@PHILO-HE Should we uncheck these two? I ran a test and both functions fall back (in 3.3).

@Surbhi-Vijay
Contributor

I would like to give printf a try.

@zhli1142015
Contributor

I'd like to pick up mode, thanks

@jinchengchenghh
Contributor

Can you add empty2null to the list? @PHILO-HE

@PHILO-HE
Contributor Author

Can you add empty2null to the list? @PHILO-HE

Just added.

@jinchengchenghh
Contributor

Thanks!

@jinchengchenghh
Contributor

jinchengchenghh commented Jun 25, 2024

Can you add the function toprettystring to the list? Thanks! @PHILO-HE
The following query uses it.
I will take it.

select sum(hash(floor(l_extendedprice)) * l_discount + hash(l_orderkey) + hash(l_partkey) + hash(l_suppkey) + hash(l_linenumber) + hash(l_comment) + hash(l_shipinstruct)) as revenue from lineitem;

@zhli1142015
Contributor

I would like to take AtLeastNNonNulls, thanks.

@jinchengchenghh
Contributor

Here is a list of some other functions that are not supported:
https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc#L62
Here is a list of functions for which some data types or behaviors do not align with Spark:
https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc#L188

@zml1206
Contributor

zml1206 commented Oct 18, 2024

Hi, I'd like to support date_trunc/trunc.

@HannanKan Are you still doing this? If you don't have time, I can take over, thank you.

@zjuwangg
Contributor

zjuwangg commented Nov 7, 2024

@PHILO-HE I can try to support array_sort if no one has picked it; we need this function internally :)

@boneanxs How is this going? If you don't have time, I'd like to look into it.

@boneanxs
Contributor

boneanxs commented Nov 7, 2024

@PHILO-HE I can try to support array_sort if no one picked, we internally need this function :)

@boneanxs How about this issue goes? If you don't have time, I'd like to investigate in it.

@zjuwangg You can see this PR: facebookincubator/velox#10138. It is still under review.

@wecharyu
Contributor

wecharyu commented Dec 5, 2024

@PHILO-HE I'd like to support from_json and to_json.

@rui-mo
Contributor

rui-mo commented Dec 5, 2024

I'd like to support from_json and to_json.

@wecharyu Thanks for the update. We've got one PR for from_json: facebookincubator/velox#11709. Just assigned to_json to you.

@zjuwangg
Contributor

zjuwangg commented Dec 10, 2024

I'd like take map_entries and map_from_entries, there are already presto implementation in velox, will need check consistency .

@yma11
Do you have any further update on the consistency issue with map_from_entries?

@ayushi-agarwal
Contributor

Can you add the function get_struct_field to the list? Thanks! @PHILO-HE
Even after adding support for from_json, this query falls back to Spark due to get_struct_field:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}
val schema = new StructType().add("platformId", StringType).add("userId", StringType).add("sessionId", StringType)
val filteredDF = parquetDF.withColumn("parsed_json", from_json(col("json_column"), schema)).filter(col("parsed_json.platformId") === "IPHONE")

@PHILO-HE
Contributor Author

@ayushi-agarwal, GetStructField is an internal function. I just added it to the list.

@rui-mo
Contributor

rui-mo commented Dec 23, 2024

@ayushi-agarwal I assume GetStructField is supported via SelectionNode in https://github.com/apache/incubator-gluten/blob/main/backends-velox/src/main/scala/org/apache/gluten/expression/ExpressionTransformer.scala#L54. Would you provide more details on the fallback reason? Perhaps a new issue could be opened for the unexpected fallback. Thanks.

@PHILO-HE
Contributor Author

I'd like take map_entries and map_from_entries, there are already presto implementation in velox, will need check consistency .

@yma11 Do you have further update on the consistency issue about map_from_entries?

@zjuwangg, Ma Yan has no bandwidth for this now. I just created a Velox PR based on her work. See facebookincubator/velox#11934.

@ayushi-agarwal
Contributor

ayushi-agarwal commented Dec 23, 2024

@ayushi-agarwal I assume GetStructField is supported via SelectionNode in https://github.com/apache/incubator-gluten/blob/main/backends-velox/src/main/scala/org/apache/gluten/expression/ExpressionTransformer.scala#L54. Would you provide more details on the fallback reason? Perhaps a new issue could be opened for the unexpected fallback. Thanks.

@rui-mo It fails here: https://github.com/apache/incubator-gluten/blob/eeca5729b8612675a88e17ba7ba1f82b1cbd3955/backends-velox/src/main/scala/org/apache/gluten/expression/ExpressionTransformer.scala#L72C7-L74C73. This is the case where the child node is of ScalarFunctionNode type.
This is the fallback log for the original expression:
24/12/23 10:00:23 WARN GlutenFallbackReporter: Validation failed for plan: Filter[QueryId=4], due to: Unsupported child expression of GetStructField: from_json(StructField(platformId,StringType,true), json_column#7, Some(America/Los_Angeles)).platformId

Opened a new ticket #8306
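
When collecting fallback reasons like the one above from driver logs, a small parser can help. The sketch below is plain Scala with a regular expression modeled only on the single log line quoted in this comment; other Gluten versions may word the message differently, and `FallbackLogSketch`, `Fallback`, and `parse` are hypothetical names.

```scala
// Illustration only: extracts the plan and the reason from a
// GlutenFallbackReporter warning shaped like the line quoted above.
object FallbackLogSketch {
  // Assumption: the message reads "Validation failed for plan: <plan>, due to: <reason>".
  private val Line = """.*Validation failed for plan: (.+?), due to: (.+)""".r

  final case class Fallback(plan: String, reason: String)

  def parse(line: String): Option[Fallback] = line match {
    case Line(plan, reason) => Some(Fallback(plan, reason))
    case _                  => None
  }
}
```

Lines that do not match the assumed shape return `None`, so the helper can be mapped over a whole log file without special-casing unrelated entries.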

@zhouyuan
Contributor

I'd like to support from_json and to_json.

@wecharyu Thanks for the update. We've got one PR for from_json: facebookincubator/velox#11709. Just assigned to_json to you.

CC @zhli1142015 for his awareness

@ayushi-agarwal
Contributor

@wecharyu Do you have a patch for to_json support? We also wanted this functionality so wanted to check with you before starting.

@boneanxs
Contributor

@wecharyu Do you have a patch for to_json support? We also wanted this functionality so wanted to check with you before starting.

You can check it here: facebookincubator/velox#11995

@ayushi-agarwal
Contributor

I see the array_join function mentioned here: https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md#function-support, but I don't find it in this list. Shall we add it here as well, @PHILO-HE?

@PHILO-HE
Contributor Author

I see array_join function mentioned in here https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md#function-support but I don't find it here in the list? Shall we add it here also @PHILO-HE ?

@ayushi-agarwal, this list is for tracking unsupported or work-in-progress functions, not for listing all Spark functions and their status.
