-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add_s3_connection_remote_loading_s3uri_feature #1054
add_s3_connection_remote_loading_s3uri_feature #1054
Conversation
Head branch was pushed to by a user without write access
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple small improvements
Head branch was pushed to by a user without write access
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have a couple recommendations that I'm testing locally -- will add suggested changes (namely to the testing suite)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a couple small addition recommendations
Head branch was pushed to by a user without write access
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice tests! lgtm.
* add_s3_connection_remote_loading_s3uri_feature * pre-commit fix * created S3Helper class and refactored data_utils and unit test * enhanced test_data.py with test_read_s3_uri * enhanced unit tests and refactored is_s3_uri * refactored some unit-tests structure * rename TestCreateS3Client to TestS3Helper
* Add null ratio to column stats (#1052) * Delay transforming priority_order into ndarray (#1045) In the changed code, we had a mypy error because numpy ndarrays are not compatible with random.Random.shuffle() (expected argument type is MutableSequence[Any]) We fix this by first instantiating priority_order as a list, then shuffling it, then creating an ndarray from it afterwards. * Rename references to degree of freedom from df to deg_of_free (#1056) * change references to degrees of freedom in chi2 from df to deg_of_free * reformated using black pre-commit hook * add_s3_connection_remote_loading_s3uri_feature (#1054) * add_s3_connection_remote_loading_s3uri_feature * pre-commit fix * created S3Helper class and refactored data_utils and unit test * enhanced test_data.py with test_read_s3_uri * enhanced unit tests and refactored is_s3_uri * refactored some unit-tests structure * rename TestCreateS3Client to TestS3Helper * fix directions for contrib branch (#1059) * Feature: Plugins (#1060) * Reservoir sampling (#826) * add code for reservoir sampling and insert sample_nrows options * pre commit fix * add tests for reservoir sampling * fixed mypy issues * fix import to relative path --------- Co-authored-by: Taylor Turner <[email protected]> Co-authored-by: Richard Bann <[email protected]> * plugins loading + preset plugin fetching implementation (#911) * test * Plugin implementation * comments added to functions * plugin test implementation for plugin presets * forgot an import * added None catch * preset plugin test * removing stuff I forgot to delete * snake_case function names * relative path * relative path * made new file for plugin testing * forgot to delete function from old file * now ive fixed if statement * ok this should be it * Plugin testing (#947) * test * plugin test implementation for plugin presets * forgot an import * added None catch * preset plugin test * snake_case function names * relative path * relative path * forgot to delete function from old file * nothing yet, just want this in two different repos * new test for plugins feature and small update to plugin init * pass * didnt want dir to be overwritten * forgot a dir * fix isort pre-commit * reservoir sample * fix imports * fix testing * fix req to match dev --------- Co-authored-by: Rushabh Vinchhi <[email protected]> Co-authored-by: Richard Bann <[email protected]> Co-authored-by: Liz Smith <[email protected]> * version bump (#1064) * empty test --------- Co-authored-by: Suprabhat Gurrala <[email protected]> Co-authored-by: Junho Lee <[email protected]> Co-authored-by: Main Uddin Khan <[email protected]> Co-authored-by: Mohammad Motamedi <[email protected]> Co-authored-by: Rushabh Vinchhi <[email protected]> Co-authored-by: Richard Bann <[email protected]> Co-authored-by: Liz Smith <[email protected]>
Pull Request Summary:
This PR introduces the new S3 connection feature to the DataProfiler repository. It enables DataProfiler to read data directly from remote S3 paths (s3_uri), enhancing its flexibility and data source compatibility.
Changes Made:
Added
S3Helper
class to facilitate S3 connectivity for DataProfiler.The class accommodates various scenarios:
Added a new unit test
test_s3_helper.py
module to ensure the functionality of the new S3 connection feature. Also, enhanced the existingtest_data.py
andtest_data_utils.py
unit tests.Details:
create_s3_connection: The new function is introduced to create an S3 connection for DataProfiler. It provides flexibility in setting AWS credentials and obtaining IAM permissions for various use cases.
Input Parameters: The function accepts input parameters for AWS access key, secret key, session token, and region name. This allows for explicit credential provisioning.
Environment Variables: In cases where input parameters are not provided, the function falls back to using environment variables for AWS credentials. This provides an alternative for setting credentials.
Unit Test Added (TestS3Helper):
test_s3_connection: A new unit test has been added to ensure the functionality and correctness of the
create_s3_connection
function. This test covers various scenarios, including different input parameter combinations.Data Class Testing: The
Data
class has been tested successfully to load various data files using S3 URIs in theeditable_repo
. This demonstrates the practical usability of the S3 connection feature.This PR enhances the S3 connectivity of DataProfiler, making it more versatile in handling different AWS credential scenarios. The unit test (
test_s3_connection.py, test_data.py and test_data_utils.py
) and executing a number of Data load operations for various data types (i.e.CSV, Parquet, TXT and JSON
) by installing the library in editable mode validate the functionality.Please review and test the changes, and let us know your feedback.