Initial implementation of generators #1

SemyonSinchenko · 2024-08-01T20:47:31Z

The main idea is to make the functions implemented in Rust usable in any environment (locally via some Python glue, on an Apache Spark cluster via mapInArrow, etc.).

Each function corresponds to a part of the original R code;
Each function is seed-based, so running it multiple times with the same seed always returns the same result;
Join-dataset generation also takes keys_seed, which should be the same for the whole session (all the batches);
To allow parallel generation, out-of-core generation, etc., each function returns a batch of the requested size instead of the whole dataset, as the original R implementation does;
Input to the Rust functions is assumed to be valid (as described in the docstrings), because it is much easier to validate inputs in the Python glue.
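
A minimal sketch of the seed-based batch idea, not the code from this PR. The names `splitmix64`, `key_pool`, `generate_batch` and `generate_join_batch` and their signatures are hypothetical, and a hand-written SplitMix64 step stands in for whatever random number generator src/lib.rs actually uses. It only illustrates why a per-batch seed gives reproducible batches and why keys_seed has to stay fixed across all batches of a join dataset.

```rust
/// One step of SplitMix64, a small deterministic PRNG (stand-in, see lead-in).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical: the pool of join keys depends only on `keys_seed`,
/// so every batch of a session draws from the same keys.
pub fn key_pool(keys_seed: u64, n_keys: usize) -> Vec<u64> {
    let mut state = keys_seed;
    (0..n_keys).map(|_| splitmix64(&mut state)).collect()
}

/// Hypothetical: one batch of `batch_size` ids in `0..n_ids`.
/// Inputs are assumed valid (`n_ids > 0`), mirroring the PR's convention
/// of validating in the Python glue.
pub fn generate_batch(seed: u64, batch_size: usize, n_ids: u64) -> Vec<u64> {
    let mut state = seed;
    (0..batch_size)
        .map(|_| splitmix64(&mut state) % n_ids)
        .collect()
}

/// Hypothetical: one batch of (key, value) rows for a join dataset.
/// `seed` varies per batch; `keys_seed` is shared by all batches.
/// Inputs are assumed valid (`n_keys > 0`).
pub fn generate_join_batch(
    seed: u64,
    keys_seed: u64,
    batch_size: usize,
    n_keys: usize,
) -> Vec<(u64, u64)> {
    let keys = key_pool(keys_seed, n_keys);
    let mut state = seed;
    (0..batch_size)
        .map(|_| {
            // Pick a key from the shared pool, then draw a value.
            let idx = (splitmix64(&mut state) % n_keys as u64) as usize;
            let value = splitmix64(&mut state);
            (keys[idx], value)
        })
        .collect()
}
```

With this shape, a caller (for example one Spark task per batch) would pass a different `seed` to each batch and the same `keys_seed` to all of them, so that batches generated independently still join on a common set of keys.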

Python API, PySpark API, Deltalake API, etc. will be done after the concept is finalized in the Rust part.

On branch initial-implementation
Changes to be committed:
new file: Cargo.lock
new file: Cargo.toml
new file: pyproject.toml
new file: src/lib.rs

Initial implementation of generators

e907b38

SemyonSinchenko merged commit 55de2bf into main Aug 1, 2024
