Avoid creating too many string objects in TaskDataWriter #44

Closed
vivek-balakrishnan-rovio opened this issue Sep 25, 2023 · 1 comment

Comments

Collaborator

vivek-balakrishnan-rovio commented Sep 25, 2023

In a recent release of the library (1.0.5), we introduced a change to read the values of all String columns as Java String. This was introduced because Spark's internal UTF8String is not compatible with DataSketches.

However, this degraded performance because too many objects are created. We noticed the problem while re-ingesting a large dataset covering over 10 years of data with many String dimensions.

We are working on a fix that coerces values to Java String only for sketch columns.
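
A minimal sketch of the idea, not the actual TaskDataWriter change. It assumes the writer knows which columns feed a sketch. `Utf8Value` is a hypothetical stand-in for Spark's UTF8String so that the example runs without Spark on the classpath, and `coerce` and `sketchColumns` are illustrative names, not the library's API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

public class SketchColumnCoercion {

    // Hypothetical stand-in for Spark's UTF8String: holds raw UTF-8 bytes.
    static final class Utf8Value {
        private final byte[] bytes;

        Utf8Value(String s) {
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            // Allocates a new java.lang.String on every call.
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Convert to java.lang.String only when the column feeds a sketch;
    // every other value is passed through without allocating a String.
    static Object coerce(String column, Object value, Set<String> sketchColumns) {
        if (value instanceof Utf8Value && sketchColumns.contains(column)) {
            return value.toString();
        }
        return value;
    }

    public static void main(String[] args) {
        // Assumption: "user_id" is the only sketch column in this example.
        Set<String> sketchColumns = Collections.singleton("user_id");

        Object sketchValue = coerce("user_id", new Utf8Value("abc"), sketchColumns);
        Object plainValue = coerce("country", new Utf8Value("FI"), sketchColumns);

        System.out.println(sketchValue.getClass().getSimpleName());
        System.out.println(plainValue.getClass().getSimpleName());
    }
}
```

With this shape, the per-row String allocation happens only for the (usually few) sketch columns, while the remaining String dimensions keep their original representation.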
