Initial implementation of generators #1

Merged
merged 1 commit into from
Aug 1, 2024

Conversation

SemyonSinchenko
Collaborator

The main idea is to make the functions implemented in Rust usable in any environment (locally via some Python glue, on an Apache Spark cluster via mapInArrow, etc.).

  • Each function corresponds to a part of the original R code;
  • Each function is seed-based, so running it multiple times with the same seed always returns the same result;
  • Join-dataset generation also takes a keys_seed, which must be the same for the whole session (all batches);
  • To allow parallel generation, out-of-core generation, etc., each function returns a batch of the requested size instead of the whole dataset, as the original R implementation does;
  • Input to the Rust functions is assumed to be valid (as described in the docstrings), because it is much easier to validate inputs in the Python glue.
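
To illustrate the seed-based, batch-oriented design described in the list above, here is a minimal dependency-free sketch. It is not the code from `src/lib.rs`: the names `SplitMix64`, `generate_batch`, `key_pool`, and `generate_join_batch`, their signatures, and the choice of PRNG are all assumptions made for illustration. As in the PR, inputs are assumed to be valid (here, `n_keys > 0`).

```rust
/// SplitMix64, a small deterministic PRNG. Chosen here only to keep the
/// sketch free of external crates; the PR may use a different generator.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical generator: returns one batch of `batch_size` values in
/// `0..n_keys`. The same `seed` always yields the same batch.
/// Assumes `n_keys > 0`; the modulo introduces a slight bias that is
/// ignored in this sketch.
fn generate_batch(seed: u64, batch_size: usize, n_keys: u64) -> Vec<u64> {
    let mut rng = SplitMix64::new(seed);
    (0..batch_size).map(|_| rng.next_u64() % n_keys).collect()
}

/// Hypothetical key pool for join datasets. It depends only on `keys_seed`,
/// so every batch of a session sees the same set of keys.
fn key_pool(keys_seed: u64, n_keys: usize) -> Vec<u64> {
    let mut rng = SplitMix64::new(keys_seed);
    (0..n_keys).map(|_| rng.next_u64()).collect()
}

/// Hypothetical join-batch generator: the per-batch `seed` decides which
/// keys are drawn, while `keys_seed` fixes the pool they are drawn from.
/// Assumes `n_keys > 0`.
fn generate_join_batch(
    seed: u64,
    keys_seed: u64,
    n_keys: usize,
    batch_size: usize,
) -> Vec<u64> {
    let pool = key_pool(keys_seed, n_keys);
    let mut rng = SplitMix64::new(seed);
    (0..batch_size)
        .map(|_| pool[(rng.next_u64() % (n_keys as u64)) as usize])
        .collect()
}
```

Under these assumptions, a caller (for example Python glue or a Spark task) would give each batch its own `seed` while passing one shared `keys_seed`, so batches generated in parallel or out of core remain joinable on the same keys.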

The Python API, PySpark API, Deltalake API, etc. will be done after the concept is finalized in the Rust part.

 On branch initial-implementation
 Changes to be committed:
	new file:   Cargo.lock
	new file:   Cargo.toml
	new file:   pyproject.toml
	new file:   src/lib.rs
@SemyonSinchenko SemyonSinchenko merged commit 55de2bf into main Aug 1, 2024