Documentation: feature/options branch docs updates (#921)
* reset ignore, update .gitignore, update documentation on presets (#874)

* reset ignore

* taylor's requested change

* taylor's requested change

* Fixed documentation for `sampling_ratio` option (#873)

* Add documentation for `sampling_ratio` option

* Update `sample_ratio` to `sampling_ratio` in documentations

Co-authored-by: Taylor Turner <[email protected]>

* Update `sampling_ratio` documentation

Co-authored-by: Taylor Turner <[email protected]>

* Update `sampling_ratio` documentation

* Updated `sampling_ratio` documentation

* Updated `sampling_ratio` documentation

* Updated `sampling_ratio` documentation

---------

Co-authored-by: Taylor Turner <[email protected]>

* update (#882)

* Add documentation for `median_abs_deviation` option (#881)

* Add documentation for `median_abs_deviation` option

* Updated `median_abs_deviation` documentation

* Row statistics option documentation (#883)

* updated documentation on new row_statistic options

* added documentation for row_statistics_options

* fixed typing of hll and included the note that it activates when hll is chosen as the hashing method

* removed space

* fixed quotation mark

* added micdavis comments

* fixed doc descriptions for unique_count

* small changes to docs

---------

Co-authored-by: JGSweets <[email protected]>

* rendering issue

* rendering issue

* update to fix rendering

---------

Co-authored-by: Liz Smith <[email protected]>
Co-authored-by: clee1152 <[email protected]>
Co-authored-by: Richard Bann <[email protected]>
Co-authored-by: JGSweets <[email protected]>
5 people authored Jun 27, 2023
1 parent 5a254f4 commit fbee256
Showing 2 changed files with 55 additions and 5 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -27,7 +27,7 @@ smart_data_profiler\.egg-info/

venv3/

data_profiler.egg-info/*
*egg-info/*

**~
venv/
58 changes: 54 additions & 4 deletions docs/source/profiler.rst
@@ -593,13 +593,43 @@ must be specified as structured or unstructured when setting (ie. datalabeler op
Below is a breakdown of all the options.
* **ProfilerOptions** - The top-level options class that contains options for the Profiler class
* **presets** - A pre-configured mapping of a string name to group of options: "complete", "data_types", and "numeric_stats_disabled".
By default is None
* **presets** - A pre-configured mapping of a string name to group of options:
* **default is None**
* **"complete"**
.. code-block:: python
options = ProfilerOptions(presets="complete")
* **"data_types"**
.. code-block:: python
options = ProfilerOptions(presets="data_types")
* **"numeric_stats_disabled"**
.. code-block:: python
options = ProfilerOptions(presets="numeric_stats_disabled")
* **"lower_memory_sketching"**
.. code-block:: python
options = ProfilerOptions(presets="lower_memory_sketching")
* **structured_options** - Options responsible for all structured data
* **multiprocess** - Option to enable multiprocessing. Automatically selects the optimal number of processes to utilize based on system constraints.
* is_enabled - (Boolean) Enables or disables multiprocessing
* **sampling_ratio** - A percentage, as a decimal, ranging from greater than 0 to less than or equal to 1 indicating how much input data to sample. Default value set to 0.2.
* **int** - Options for the integer columns
* is_enabled - (Boolean) Enables or disables the integer operations
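The ``sampling_ratio`` bounds described in this hunk (greater than 0, at most 1, default 0.2) can be illustrated with a small standalone sketch. The helper ``sample_rows`` is hypothetical and is not DataProfiler's sampling code; the real profiler may apply additional rules, such as a minimum sample size, that are not shown here.

```python
import random


def sample_rows(rows, sampling_ratio=0.2, seed=None):
    """Return a random subset of rows sized by sampling_ratio.

    Hypothetical helper for illustration only; it is not part of DataProfiler.
    """
    # The documented valid range is 0 < sampling_ratio <= 1.
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be greater than 0 and at most 1")

    rows = list(rows)
    if not rows:
        return []

    # Assumption: always keep at least one row so tiny inputs stay usable.
    sample_size = max(1, round(len(rows) * sampling_ratio))
    rng = random.Random(seed)
    return rng.sample(rows, sample_size)


subset = sample_rows(range(100), sampling_ratio=0.2, seed=0)
print(len(subset))
```

With the default ratio, roughly one fifth of the input rows are kept; a ratio of 1 keeps every row.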
@@ -629,6 +659,9 @@ Below is an breakdown of all the options.
* kurtosis - Finds kurtosis of all values in a column
* is_enabled - (Boolean) Enables or disables kurtosis
* median_abs_deviation - Finds median absolute deviation of all values in a column
* is_enabled - (Boolean) Enables or disables median absolute deviation
* num_zeros - Finds the count of zeros in a column
* is_enabled - (Boolean) Enables or disables num_zeros
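For readers unfamiliar with the statistic behind the new ``median_abs_deviation`` option, the following is a minimal sketch of the textbook definition: the median of the absolute deviations from the median. It uses only the Python standard library and computes the exact value; whether the profiler computes it exactly or approximates it is not stated in this diff, so treat this as an illustration of the concept rather than the library's implementation.

```python
from statistics import median


def median_abs_deviation(values):
    """Exact median absolute deviation of a sequence of numbers.

    Illustrative helper only; not DataProfiler's implementation.
    """
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")

    center = median(values)
    # Median of the absolute distances from the median.
    return median(abs(v - center) for v in values)


print(median_abs_deviation([1, 1, 2, 2, 4, 6, 9]))
```

Unlike the standard deviation, this measure is barely affected by a few extreme values, which is why it is often reported alongside variance.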
@@ -680,6 +713,9 @@ Below is an breakdown of all the options.
* kurtosis - Finds kurtosis of all values in a column
* is_enabled - (Boolean) Enables or disables kurtosis
* median_abs_deviation - Finds median absolute deviation of all values in a column
* is_enabled - (Boolean) Enables or disables median absolute deviation
* is_numeric_stats_enabled - (Boolean) enable or disable all numeric stats
* num_zeros - Finds the count of zeros in a column
@@ -730,6 +766,9 @@ Below is an breakdown of all the options.
* kurtosis - Finds kurtosis of all values in a column
* is_enabled - (Boolean) Enables or disables kurtosis
* median_abs_deviation - Finds median absolute deviation of all values in a column
* is_enabled - (Boolean) Enables or disables median absolute deviation
* bias_correction - Applies bias correction to variance, skewness, and kurtosis calculations
* is_enabled - (Boolean) Enables or disables bias correction
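The ``bias_correction`` option shown as context in this hunk applies to variance, skewness, and kurtosis. As a rough illustration for variance only, the sketch below contrasts the uncorrected estimate (divide by ``n``) with the corrected one (divide by ``n - 1``, Bessel's correction). The function name and the mapping of the flag to these two formulas are assumptions made for the example, not a description of the profiler's code.

```python
def variance(values, bias_correction=True):
    """Variance with optional Bessel's correction.

    Illustrative helper only; not DataProfiler's implementation.
    """
    values = list(values)
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")

    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)

    # Corrected: divide by n - 1. Uncorrected: divide by n.
    divisor = n - 1 if bias_correction else n
    return squared_deviations / divisor


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data, bias_correction=False))
print(variance(data, bias_correction=True))
```

The corrected value is always the larger of the two, and the gap shrinks as the number of values grows.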
@@ -771,12 +810,23 @@ Below is an breakdown of all the options.
* data_labeler_dirpath - (String) Directory path to data labeler
* data_labeler_object - (BaseDataLabeler) Datalabeler to replace
the default labeler
* max_sample_size - (Int) The max number of samples for the data
* max_sample_size - (Int) The max number of samples for the data
labeler
* **correlation** - option set for correlation profiling
* **correlation** - Option set for correlation profiling
* is_enabled - (Boolean) Enables or disables performing correlation profiling
* columns - Columns considered to calculate correlation
* **row_statistics** - (Boolean) Option to enable/disable row statistics calculations
* unique_count - (UniqueCountOptions) Option to enable/disable unique row count calculations
* is_enabled - (Bool) Enables or disables options for unique row count
* hashing_method - (String) Property to specify row hashing method ("full" | "hll")
* hll - (HyperLogLogOptions) Options for alternative method of estimating unique row count (activated when `hll` is the selected hashing_method)
* seed - (Int) Used to set HLL hashing function seed
* register_count - (Int) Number of registers is equal to 2^register_count
* null_count - (Boolean) Option to enable/disable functionalities for row_has_null_ratio and row_is_null_ratio
* **chi2_homogeneity** - Options for the chi-squared test matrix
* is_enabled - (Boolean) Enables or disables performing chi-squared tests for homogeneity between the categorical columns of the dataset.
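The ``row_statistics`` options added in this hunk distinguish an exact ("full") unique row count from a HyperLogLog ("hll") estimate whose register count is ``2^register_count``. The sketch below illustrates that trade-off with two hypothetical helpers. The ``1.04 / sqrt(m)`` figure is the standard HyperLogLog relative error for ``m`` registers; the hashing scheme in ``exact_unique_row_count`` is an assumption for illustration and is not taken from DataProfiler.

```python
import hashlib
import math


def hll_register_stats(register_count):
    """Return (number of registers, typical relative error) for HyperLogLog.

    Illustrative helper only; not DataProfiler's implementation.
    """
    num_registers = 2 ** register_count
    relative_error = 1.04 / math.sqrt(num_registers)
    return num_registers, relative_error


def exact_unique_row_count(rows):
    """Count distinct rows by hashing each full row (exact, memory grows with data)."""
    seen = set()
    for row in rows:
        # Assumption: join cell values with a separator unlikely to occur in data.
        key = "\x1f".join(map(str, row)).encode("utf-8")
        seen.add(hashlib.sha256(key).hexdigest())
    return len(seen)


print(hll_register_stats(10))
print(exact_unique_row_count([(1, "a"), (1, "a"), (2, "b")]))
```

Raising ``register_count`` by one doubles the number of registers and shrinks the expected error by a factor of about 1.41, at the cost of more memory.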