Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add_s3_connection_remote_loading_s3uri_feature #1054

Conversation

mhmotamedi
Copy link
Contributor

@mhmotamedi mhmotamedi commented Oct 26, 2023

Pull Request Summary:

This PR introduces the new S3 connection feature to the DataProfiler repository. It enables DataProfiler to read data directly from remote S3 paths (s3_uri), enhancing its flexibility and data source compatibility.

Changes Made:

  1. Added S3Helper class to facilitate S3 connectivity for DataProfiler.

  2. The class accommodates various scenarios:

    • Accepting input parameters for AWS access key, secret key, session token, and region name.
    • Utilizing environment variables for AWS credentials.
  3. Added a new unit test test_s3_helper.py module to ensure the functionality of the new S3 connection feature. Also, enhanced the existing test_data.py and test_data_utils.py unit tests.

Details:

  • create_s3_connection: The new function is introduced to create an S3 connection for DataProfiler. It provides flexibility in setting AWS credentials and obtaining IAM permissions for various use cases.

  • Input Parameters: The function accepts input parameters for AWS access key, secret key, session token, and region name. This allows for explicit credential provisioning.

  • Environment Variables: In cases where input parameters are not provided, the function falls back to using environment variables for AWS credentials. This provides an alternative for setting credentials.

Unit Test Added (TestS3Helper):

  • test_s3_connection: A new unit test has been added to ensure the functionality and correctness of the create_s3_connection function. This test covers various scenarios, including different input parameter combinations.

  • Data Class Testing: The Data class has been tested successfully to load various data files using S3 URIs in the editable_repo. This demonstrates the practical usability of the S3 connection feature.

This PR enhances the S3 connectivity of DataProfiler, making it more versatile in handling different AWS credential scenarios. The unit test (test_s3_connection.py, test_data.py and test_data_utils.py) and executing a number of Data load operations for various data types (i.e. CSV, Parquet, TXT and JSON) by installing the library in editable mode validate the functionality.

Please review and test the changes, and let us know your feedback.

@mhmotamedi mhmotamedi requested a review from a team as a code owner October 26, 2023 01:12
@CLAassistant
Copy link

CLAassistant commented Oct 26, 2023

CLA assistant check
All committers have signed the CLA.

@taylorfturner taylorfturner added High Priority Dramatic improvement, inaccurate calculation(s) or bug / feature making the library unusable New Feature A feature addition not currently in the library labels Oct 26, 2023
@taylorfturner taylorfturner enabled auto-merge (squash) October 26, 2023 01:53
ksneab7
ksneab7 previously approved these changes Oct 26, 2023
dataprofiler/tests/data_readers/test_s3_connection.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
.gitignore Show resolved Hide resolved
@ksneab7 ksneab7 self-requested a review October 26, 2023 13:13
auto-merge was automatically disabled October 27, 2023 06:27

Head branch was pushed to by a user without write access

@taylorfturner taylorfturner enabled auto-merge (squash) October 27, 2023 13:06
Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple small improvements

.gitignore Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
auto-merge was automatically disabled October 27, 2023 23:35

Head branch was pushed to by a user without write access

Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a couple recommendations that I'm testing locally -- will add suggested changes (namely to the testing suite)

Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple small addition recommendations

dataprofiler/tests/data_readers/test_data.py Show resolved Hide resolved
dataprofiler/tests/data_readers/test_data.py Show resolved Hide resolved
dataprofiler/tests/data_readers/test_data.py Show resolved Hide resolved
@taylorfturner taylorfturner enabled auto-merge (squash) October 30, 2023 23:53
auto-merge was automatically disabled October 31, 2023 00:16

Head branch was pushed to by a user without write access

@taylorfturner taylorfturner enabled auto-merge (squash) October 31, 2023 00:35
Copy link
Contributor

@tyfarnan tyfarnan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice tests! lgtm.

@taylorfturner taylorfturner merged commit 5f52fff into capitalone:dev Oct 31, 2023
4 checks passed
taylorfturner pushed a commit to taylorfturner/DataProfiler that referenced this pull request Nov 13, 2023
* add_s3_connection_remote_loading_s3uri_feature

* pre-commit fix

* created S3Helper class and refactored data_utils and unit test

* enhanced test_data.py with test_read_s3_uri

* enhanced unit tests and refactored is_s3_uri

* refactored some unit-tests structure

* rename TestCreateS3Client to TestS3Helper
micdavis pushed a commit that referenced this pull request Nov 13, 2023
* Add null ratio to column stats (#1052)

* Delay transforming priority_order into ndarray (#1045)

In the changed code, we had a mypy error because numpy ndarrays are not
compatible with random.Random.shuffle() (expected argument type is
MutableSequence[Any])

We fix this by first instantiating priority_order as a list, then
shuffling it, then creating an ndarray from it afterwards.

* Rename references to degree of freedom from df to deg_of_free (#1056)

* change references to degrees of freedom in chi2 from df to deg_of_free

* reformated using black pre-commit hook

* add_s3_connection_remote_loading_s3uri_feature (#1054)

* add_s3_connection_remote_loading_s3uri_feature

* pre-commit fix

* created S3Helper class and refactored data_utils and unit test

* enhanced test_data.py with test_read_s3_uri

* enhanced unit tests and refactored is_s3_uri

* refactored some unit-tests structure

* rename TestCreateS3Client to TestS3Helper

* fix directions for contrib branch (#1059)

* Feature: Plugins (#1060)

* Reservoir sampling (#826)

* add code for reservoir sampling and insert sample_nrows options

* pre commit fix

* add tests for reservoir sampling

* fixed mypy issues

* fix import to relative path

---------

Co-authored-by: Taylor Turner <[email protected]>
Co-authored-by: Richard Bann <[email protected]>

* plugins loading + preset plugin fetching implementation (#911)

* test

* Plugin implementation

* comments added to functions

* plugin test implementation for plugin presets

* forgot an import

* added None catch

* preset plugin test

* removing stuff I forgot to delete

* snake_case function names

* relative path

* relative path

* made new file for plugin testing

* forgot to delete function from old file

* now ive fixed if statement

* ok this should be it

* Plugin testing (#947)

* test

* plugin test implementation for plugin presets

* forgot an import

* added None catch

* preset plugin test

* snake_case function names

* relative path

* relative path

* forgot to delete function from old file

* nothing yet, just want this in two different repos

* new test for plugins feature and small update to plugin init

* pass

* didnt want dir to be overwritten

* forgot a dir

* fix isort pre-commit

* reservoir sample

* fix imports

* fix testing

* fix req to match dev

---------

Co-authored-by: Rushabh Vinchhi <[email protected]>
Co-authored-by: Richard Bann <[email protected]>
Co-authored-by: Liz Smith <[email protected]>

* version bump (#1064)

* empty test

---------

Co-authored-by: Suprabhat Gurrala <[email protected]>
Co-authored-by: Junho Lee <[email protected]>
Co-authored-by: Main Uddin Khan <[email protected]>
Co-authored-by: Mohammad Motamedi <[email protected]>
Co-authored-by: Rushabh Vinchhi <[email protected]>
Co-authored-by: Richard Bann <[email protected]>
Co-authored-by: Liz Smith <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
High Priority Dramatic improvement, inaccurate calculation(s) or bug / feature making the library unusable New Feature A feature addition not currently in the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants