Feature request : Need for different hashing algorithm functions in queries #3105

Closed
HectorPascual opened this issue Jan 7, 2025 · 6 comments
Labels
enhancement New feature or request

Comments

@HectorPascual

HectorPascual commented Jan 7, 2025

Request for hashing with xxhash64 in a merge operation (SQL context)

Hi,

I am using the Python bindings for delta-rs (the deltalake library) together with Polars. I need to hash some columns with the xxhash64 algorithm during a merge operation (via the update parameter), but the SQL context accepts only certain hashing functions. Tracing through the Rust code, I found the list of accepted functions and realised that it comes from the DataFusion expression API (judging by the Rust library imports).

Is there any way, or are there any plans, to expand this list with other hashing algorithms? Or to register a UDF in the Python API that I can use in the SQL context of a merge?

I saw that the following Rust crate is also imported in the project: https://crates.io/crates/twox-hash. It contains an implementation of the hash algorithm in my request.

Thanks for handling my request.

Use Case
Hashing with xxhash64 or xxh3-type algorithms: https://github.com/Cyan4973/xxHash
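
For readers unfamiliar with the algorithm named in the use case, here is a minimal pure-Python sketch of XXH64 (seeded, 64-bit output). It is an illustrative port written for this thread, not code taken from delta-rs, DataFusion, or twox-hash, and it is far slower than a native implementation. The function name `xxh64` and the helper names are choices made here, not an existing API.

```python
import struct

MASK64 = (1 << 64) - 1

# The five 64-bit primes used by XXH64.
P1 = 0x9E3779B185EBCA87
P2 = 0xC2B2AE3D27D4EB4F
P3 = 0x165667B19E3779F9
P4 = 0x85EBCA77C2B2AE63
P5 = 0x27D4EB2F165667C5


def _rotl(x: int, r: int) -> int:
    """Rotate a 64-bit value left by r bits."""
    return ((x << r) | (x >> (64 - r))) & MASK64


def _round(acc: int, lane: int) -> int:
    """Mix one 8-byte lane into an accumulator."""
    acc = (acc + lane * P2) & MASK64
    return (_rotl(acc, 31) * P1) & MASK64


def _merge(h: int, acc: int) -> int:
    """Fold one accumulator into the running hash."""
    h ^= _round(0, acc)
    return (h * P1 + P4) & MASK64


def xxh64(data: bytes, seed: int = 0) -> int:
    """Return the XXH64 hash of data as an unsigned 64-bit integer."""
    n = len(data)
    pos = 0
    if n >= 32:
        # Four accumulators consume the input in 32-byte stripes.
        v = [(seed + P1 + P2) & MASK64, (seed + P2) & MASK64,
             seed & MASK64, (seed - P1) & MASK64]
        while pos + 32 <= n:
            lanes = struct.unpack_from("<4Q", data, pos)
            v = [_round(acc, lane) for acc, lane in zip(v, lanes)]
            pos += 32
        h = (_rotl(v[0], 1) + _rotl(v[1], 7)
             + _rotl(v[2], 12) + _rotl(v[3], 18)) & MASK64
        for acc in v:
            h = _merge(h, acc)
    else:
        h = (seed + P5) & MASK64
    h = (h + n) & MASK64

    # Tail: remaining 8-byte lanes, then one 4-byte word, then single bytes.
    while pos + 8 <= n:
        (lane,) = struct.unpack_from("<Q", data, pos)
        h ^= _round(0, lane)
        h = (_rotl(h, 27) * P1 + P4) & MASK64
        pos += 8
    if pos + 4 <= n:
        (word,) = struct.unpack_from("<I", data, pos)
        h ^= (word * P1) & MASK64
        h = (_rotl(h, 23) * P2 + P3) & MASK64
        pos += 4
    while pos < n:
        h ^= (data[pos] * P5) & MASK64
        h = (_rotl(h, 11) * P1) & MASK64
        pos += 1

    # Final avalanche so that every input bit affects every output bit.
    h ^= h >> 33
    h = (h * P2) & MASK64
    h ^= h >> 29
    h = (h * P3) & MASK64
    h ^= h >> 32
    return h


print(f"{xxh64(b'some column value'):016x}")
```

A quick sanity check for any port like this is the empty input with seed 0, for which the reference implementation yields 0xEF46DB3751D8E999.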

@HectorPascual HectorPascual added the enhancement New feature or request label Jan 7, 2025
@ion-elgreco
Collaborator

Any reason why you can't precompute the hashes?

I wrote a Polars plugin for that a while ago.

@HectorPascual
Author

> Any reason why you can't precompute the hashes?
>
> I wrote a Polars plugin for that a while ago.

Hi, thanks for the quick reply.

I can precompute the hashes; that is no problem. However, I am performing a merge operation on a Delta table and need to hash the rows in the destination table, so it would be more efficient to compute the hashes in the update predicate of the merge than to read the target table twice (once to compute the hashes, once to merge).

I was just curious whether there is a way to create a UDF, or to expand the list of available hashing algorithms.
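
The single-pass idea described above can be sketched in plain Python. This is only a stand-in for the merge logic, not the deltalake API: `row_hash` and `merge_by_hash` are hypothetical helpers, the target table is modelled as a dict keyed by id, and a 64-bit BLAKE2b digest from the standard library is swapped in for xxhash64, which Python does not ship. The point is that the target row is hashed at comparison time, inside the update check, so the target is visited only once.

```python
import hashlib


def row_hash(values) -> int:
    """64-bit digest of a sequence of column values (BLAKE2b stand-in, not xxhash64)."""
    h = hashlib.blake2b(digest_size=8)
    for value in values:
        data = repr(value).encode("utf-8")
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(data).to_bytes(4, "little"))
        h.update(data)
    return int.from_bytes(h.digest(), "little")


def merge_by_hash(target: dict, source: list, key: str, columns: list) -> dict:
    """Upsert source rows into target, updating only rows whose hash differs."""
    stats = {"inserted": 0, "updated": 0, "unchanged": 0}
    for row in source:
        existing = target.get(row[key])
        if existing is None:
            target[row[key]] = dict(row)
            stats["inserted"] += 1
        # The target row is hashed here, during the merge itself,
        # rather than in a separate pass over the target table.
        elif row_hash(existing[c] for c in columns) != row_hash(row[c] for c in columns):
            target[row[key]] = dict(row)
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1
    return stats


target = {
    1: {"id": 1, "name": "bolt", "qty": 10},
    2: {"id": 2, "name": "nut", "qty": 5},
}
source = [
    {"id": 1, "name": "bolt", "qty": 10},   # identical row
    {"id": 2, "name": "nut", "qty": 7},     # changed quantity
    {"id": 3, "name": "washer", "qty": 1},  # new row
]
stats = merge_by_hash(target, source, key="id", columns=["name", "qty"])
print(stats)
```

In a real merge the comparison would be expressed as a predicate evaluated by the engine, which is why the set of hash functions available in the SQL context matters.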

@ion-elgreco
Collaborator

ion-elgreco commented Jan 8, 2025

Registering UDFs is likely possible. You can take a look at datafusion-python and see whether you can replicate its registration functionality in our code base. I'm open to a PR for this.

@HectorPascual
Author

From what I saw, your Python implementation does not use DataFusion directly. In your codebase, DataFusion is used at the Rust level, underneath the Python bindings.

Is there any plan, or any chance, that the UDF functionality will be implemented in delta-rs and exposed to Python? https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html

Thank you again for the reply!

@HectorPascual
Author

HectorPascual commented Jan 23, 2025

It seems the DataFusion project has started development of this feature. It would be nice to include it in delta-rs after they release it.

@ion-elgreco
Collaborator

> It seems the DataFusion project has started development of this feature. It would be nice to include it in delta-rs after they release it.

We don't have to explicitly include it. If they do a new major release, then we have to bump the dependency.


2 participants