Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: added parquet sampling #1070

Merged
merged 11 commits into from
Dec 12, 2023
Merged

Conversation

menglinw
Copy link
Contributor

@menglinw menglinw commented Nov 14, 2023

parquet sampling function developed in data_utils.py;
Added sample_nrows argument in ParquetData class;
Added test_len_sampled_data in test_parquet_data.py

@menglinw menglinw requested a review from a team as a code owner November 14, 2023 23:59
@CLAassistant
Copy link

CLAassistant commented Nov 14, 2023

CLA assistant check
All committers have signed the CLA.

@taylorfturner taylorfturner changed the title feat: added parquet sampling Feature: added parquet sampling Nov 15, 2023
@taylorfturner
Copy link
Contributor

taylorfturner commented Nov 15, 2023

@menglinw two things:

  • you'll want to open the PR into dev per docs here
  • also you'll want to rebase onto dev to resolve the conflict with dataprofiler/data_readers/data_utils.py

@menglinw menglinw changed the base branch from main to dev November 15, 2023 17:58
…ows argument in ParquetData class; Added test_len_sampled_data in test_parquet_data.py
@taylorfturner taylorfturner added the New Feature A feature addition not currently in the library label Nov 16, 2023
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple comments / thoughts

dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/tests/data_readers/test_parquet_data.py Outdated Show resolved Hide resolved
1. added type of return in sample_parquet function;
2. changed variable names in sample_parquet function to more descriptive names (select -> sample_index, out -> sample_df);
3. created convert_unicode_col_to_utf8 function to reduce repeating code in sample_parquet and read_parquet_df functions
…a_utils.py) to be more descriptive (types -> input_column_types, col -> iter_column), other part unchanged

2. test_parquet_data.py, move import statement to the top of file

3. test_parquet_data.py, merged all tests about parquet sample feature to their original tests
micdavis
micdavis previously approved these changes Dec 8, 2023
@taylorfturner taylorfturner enabled auto-merge (squash) December 8, 2023 21:38
Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's update the requirements.tct

requirements.txt Outdated Show resolved Hide resolved
auto-merge was automatically disabled December 11, 2023 16:34

Head branch was pushed to by a user without write access

@taylorfturner taylorfturner enabled auto-merge (squash) December 11, 2023 16:59
dataprofiler/data_readers/data_utils.py Outdated Show resolved Hide resolved
dataprofiler/tests/data_readers/test_parquet_data.py Outdated Show resolved Hide resolved
dataprofiler/tests/data_readers/test_parquet_data.py Outdated Show resolved Hide resolved
dataprofiler/tests/data_readers/test_parquet_data.py Outdated Show resolved Hide resolved
Comment on lines 156 to +174
if data_format == "dataframe":
import pandas as pd

self.assertIsInstance(data, pd.DataFrame)
elif data_format in ["records", "json"]:
self.assertIsInstance(data, list)
self.assertIsInstance(data[0], str)

input_data_obj_sampled = Data(
input_file["path"], options={"sample_nrows": 100}
)
for data_format in list(input_data_obj_sampled._data_formats.keys()):
input_data_obj_sampled.data_format = data_format
self.assertEqual(input_data_obj_sampled.data_format, data_format)
data_sampled = input_data_obj_sampled.data
if data_format == "dataframe":
self.assertIsInstance(data_sampled, pd.DataFrame)
elif data_format in ["records", "json"]:
self.assertIsInstance(data_sampled, list)
self.assertIsInstance(data_sampled[0], str)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this reads a little verbose.... wondering if needed since its adding an option to the test...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is necessary. The new parquet sampling feature is using a different parquet reading method and then format, so I think it might be safer to check if the sampled data still meet our format requirements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind too much if tests are verbose, as long as their readable. Better to be verbose than miss edge cases imo.

2. test_len_data method keep one sample length test
3. remove sampling test in test_specifying_data_type
4. remove sampling test in test_reload_data
auto-merge was automatically disabled December 11, 2023 19:37

Head branch was pushed to by a user without write access

@taylorfturner taylorfturner enabled auto-merge (squash) December 11, 2023 22:25
@taylorfturner taylorfturner merged commit 0d56dac into capitalone:dev Dec 12, 2023
5 checks passed
micdavis pushed a commit that referenced this pull request Jan 11, 2024
* Feature: added parquet sampling (#1070)

* parquet sampling function developed in data_utils.py; Added sample_nrows argument in ParquetData class; Added test_len_sampled_data in test_parquet_data.py

* resolved conflict with dev, added more tests

* fixed sample empty column bug

* fixed comments in data_utils.py, including:
1. added type of return in sample_parquet function;
2. changed variable names in sample_parquet function to more descriptive names (select -> sample_index, out -> sample_df);
3. created convert_unicode_col_to_utf8 function to reduce repeating code in sample_parquet and read_parquet_df functions

* 1. renamed variable names in covert_unicode_col_to_utf8 function (data_utils.py) to be more descriptive (types -> input_column_types, col -> iter_column), other part unchanged

2. test_parquet_data.py, move import statement to the top of file

3. test_parquet_data.py, merged all tests about parquet sample feature to their original tests

* checked the datatype and input file path before and after reload with sampling option enabled

* test

* delete test edit in avro_data.py, updated fastavro version in  requirment.txt

* remove fastavro.reader type

* change fastavro version back to original

* 1. sample_parquet function description
2. test_len_data method keep one sample length test
3. remove sampling test in test_specifying_data_type
4. remove sampling test in test_reload_data

* Depedency: `matplotlib` version bump  (#1072)

* bump tag matplotlib

* bumpt to most recent

* 3.9.0 update

* Bump actions/setup-python from 4 to 5 (#1078)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 4 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Turner <[email protected]>

* Make _assimilate_histogram not use self (#1071)

Co-authored-by: Taylor Turner <[email protected]>

* version bump

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: WML <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Junho Lee <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.10.8 New Feature A feature addition not currently in the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants