Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GpuMapConcat support for nested type keys. #6290

Merged
merged 1 commit into from
Aug 26, 2022

Conversation

HaoYang670
Copy link
Collaborator

Signed-off-by: remzi [email protected]
This is a subtask of #5559.

The GpuMaxConcat could support array typed keys and struct typed keys.

Signed-off-by: remzi <[email protected]>
@HaoYang670
Copy link
Collaborator Author

build

frame => {
import frame.sparkSession.implicits._
frame.select(map_concat($"col1", $"col2"))
}
}

testGpuFallback(
"MapConcat with Struct keys fall back",
Copy link
Contributor

@andygrove andygrove Aug 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a corresponding test for MapType keys as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark doesn't support MapType as a key type. If you use Map as a key, the Spark catalyst will throw an error.

@andygrove
Copy link
Contributor

The changes look good, but I think we should also add integration tests so that we test this on all supported platforms.

execsAllowedNonGpu = Seq("ProjectExec", "ShuffleExchangeExec")) {
testSparkResultsAreEqual(
"MapConcat with Array keys",
ArrayKeyMapDF) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add in some integration tests as well or instead of these? It would also be good to try and add/update the data gen for the tests so it is much more likely to have collisions. This is especially true for nested types like arrays or structs where the probability of a collision goes way down.

Copy link
Collaborator Author

@HaoYang670 HaoYang670 Aug 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add in some integration tests as well or instead of these? It would also be good to try and add/update the data gen for the tests so it is much more likely to have collisions.

If what you mean is adding tests in Python Integration test, unfortunately, it is difficult. Because List or Struct is not hashable in Python.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> schema = StructType([
...     StructField('name', StringType(), True),
...     StructField('properties', MapType(ArrayType(StringType()),StringType()),True)])
>>> dataDictionary = [
...         ('James',{['hair', 'hair']:'black',['eye', 'eye']:'brown'})]
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
TypeError: unhashable type: 'list'

Copy link
Collaborator Author

@HaoYang670 HaoYang670 Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @revans2, is there any workaround way that can generate random MapType data in IT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @revans2, I find it is a little difficult to update the MapGen to support the nested keys because currently the MapGen uses dict however, list and struct are unhashable in dict.

A workaround way is firstly generating lists of structs using ListGen and StructGen and then convert the data to map type. But I am not sure whether this is a good workaround.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fine. Sorry I didn't respond sooner. I have been stuck on something else.

@HaoYang670 HaoYang670 self-assigned this Aug 15, 2022
@sameerz sameerz added the task Work required that improves the product but is not user facing label Aug 15, 2022
@HaoYang670 HaoYang670 merged commit 8f4b4cb into NVIDIA:branch-22.10 Aug 26, 2022
@HaoYang670 HaoYang670 deleted the map_concat_nested_keys branch August 26, 2022 23:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants