Staging/dev/profile serialization (capitalone#940)
* initial changes to categoricalColumn decoder (capitalone#818)

* Implemented decoding for numerical stats mixin and integer profiles (capitalone#844)

* hot fixes for encode and decode of numeric stats mixin and intcol profiler (capitalone#852)

* Float column profiler encode decode (capitalone#854)

* hot fixes for encode and decode of numeric stats mixin and intcol profiler

* cleaned up type checking and updated numericstatsmixin readin helper to give type conversions to more attributes

* Added docstring to the _load_stats_helper function

* Update dataprofiler/profilers/numerical_column_stats.py

Co-authored-by: Taylor Turner <[email protected]>

* Update dataprofiler/profilers/numerical_column_stats.py

* fix for nan values issue in pytesting

* Implementation of float profiler encode and decode process

---------

Co-authored-by: Taylor Turner <[email protected]>

* Json decode date time column (capitalone#861)

* more verbose error log with types for easy debug

* add load_from_dict to handle timestamps

* add json decode tests

* include DateTimeColumn class

* Added decoding for encoding of ordered column profiles (capitalone#864)

* Added ordered col test to ensure correct response to update when different ordering of values is introduced (capitalone#868)

* added decode text_column_profiler functionality and tests (capitalone#870)

* Created encoder for the datalabelercolumn (capitalone#869)

* feat: add test and compiler serialization (capitalone#884)

* [WIP] Adds tests validating serialization with Primitive type for compiler (capitalone#885)

* feat: add test and compiler serialization

* fix: move primitive tests to own class

* feat: add primitive col compiler save tests

* fix: float serializers asserts

* Adds deserialization for compilers and validates tests for Primitive; fixes numerical deserialization (capitalone#886)

* feat: add test and compiler serialization

* fix: move primitive tests to own class

* feat: add primitive col compiler save tests

* fix: float serializers asserts

* feat: add tests and allow primitive compiler to deserialize

* fix: bug in numeric stats deserial

* fix: missing `)` after conflict resolution

* Add Serialization and Deserialization Tests for Stats Compiler, plus refactors for order Typing (capitalone#887)

* fix: organize categorical and add get function

* refactor: reorganize tests and add stats test

* feat: order typing

* feat: add serial and deserial for stats compiler

* fix: bug when sample_size == 0

* ready datalabeler for deserialization and improvement on serialization for datalabeler (capitalone#879)

* Deserialization of datalabeler (capitalone#891)

* Added initial profiler decoding for datalabeler column (WIP)

* Initial implementation for deserialization of datalabelercolumn

* Fix LSP violations (capitalone#840)

* Make profiler superclasses generic

Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and
BaseCompiler generic, to avoid casting in subclass diff() methods and
violating LSP in principle.

* Add needed cast import

---------

Co-authored-by: Junho Lee <[email protected]>

* Encode Options (capitalone#875)

* encode testing

* encode dataLabeler testing

* encode structuredOptions testing

* cleaned up datalabeler test

* added text options

* [WIP] ColumnDataLabelerCompiler: serialize / deserialize (capitalone#888)

* formatting

* update formatting

* setting up full test suite for DataLabelerCompiler

* update isort

* updates to test -- still failing

* update

* Quick Test update (capitalone#893)

* update

* string in list

* formatting

* Decode options (capitalone#894)

* refactored options encode testing

* updated test name

* updated class names

* fixing test

* initial base option decode

* initial tests

* refactor: allow options to go through all (capitalone#902)

* refactor: allow options to go through all

* fix: bug

* StructuredColProfiler Encode / Decode  (capitalone#901)

* refactor: allow options to go through all

* fix: bug

* update

* update

* update

* updates

* update

* Fixes for taylors StructuredCol Issue

* update

* update

* remove try/except

---------

Co-authored-by: Jeremy Goodsitt <[email protected]>
Co-authored-by: ksneab7 <[email protected]>

* fix: bug and add tests for structuredcolprofiler (capitalone#904)

* fix: bug and add tests

* fix: limit scipy requirements till problem understood and fixed

* Structured profiler encode decode (capitalone#903)

* refactor: allow options to go through all

* fix: bug in loading options

* update

* update

* Fixes for taylors StructuredCol Issue

* Created load and save code from structuredprofiler

* intermediate commit for fixing structured profile

---------

Co-authored-by: Jeremy Goodsitt <[email protected]>
Co-authored-by: taylorfturner <[email protected]>

* [WIP] Added NoImplementationError for UnstructuredProfiler (capitalone#907)

* refactor: allow options to go through all

* fix: bug in loading options

* update

* update

* Fixes for taylors StructuredCol Issue

* Created load and save code from structuredprofiler

* intermediate commit for fixing structured profile

* test fix

* mypy fixes for typing issues

* fix for none case of the datalabeler in options

* Added mock of datalabeler to structured profile test

* Added tests for encoding of the Structured profiler

* Update dataprofiler/profilers/json_decoder.py

Co-authored-by: Michael Davis <[email protected]>

* Update dataprofiler/profilers/profile_builder.py

Co-authored-by: Michael Davis <[email protected]>

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Michael Davis <[email protected]>

* Pr fixes

* Fixed typo in test

* Update dataprofiler/profilers/json_decoder.py

Co-authored-by: Taylor Turner <[email protected]>

* Update dataprofiler/profilers/profile_builder.py

Co-authored-by: Michael Davis <[email protected]>

* Update dataprofiler/tests/profilers/utils.py

Co-authored-by: Taylor Turner <[email protected]>

* Update dataprofiler/profilers/profile_builder.py

Co-authored-by: Michael Davis <[email protected]>

* Fixes for unneeded callout for _profile check

* small change

---------

Co-authored-by: Jeremy Goodsitt <[email protected]>
Co-authored-by: taylorfturner <[email protected]>
Co-authored-by: ksneab7 <[email protected]>
Co-authored-by: ksneab7 <[email protected]>

* Added testing for values for test_json_decode_after_update (capitalone#915)

* Reuse passed labeler (capitalone#924)

* refactor: loading labeler for reuse and abstract loading

* refactor: use for DataLabelerColumn as well

* fix: don't error if doesn't exist

* refactor: allow for config dict to be passed entire way

* fix: compiler tests

* fix: structCol tests

* fix: test

* BaseProfiler save() for json (capitalone#923)

* added save for top level and tests

* small refactor

* small fix

* refactor: use seed for sample for consistency (capitalone#927)

* refactor: use seed for sample for consistency

* fix: formatting and variables

* WIP top level load (capitalone#925)

* quick hot fix for input validation on save() save_metho (capitalone#931)

* BaseProfiler: `load_method` hotfix (capitalone#932)

* added load_method

* updated tests

* fix: null_rep mat should calculate even if datetime (capitalone#933)

* Notebook Example save/load Profile (capitalone#930)

* update example data profiler demo save/load

* update notebook cells

* Update examples/data_profiler_demo.ipynb

* Update examples/data_profiler_demo.ipynb

* fix: order bug (capitalone#939)

* fix: typo on rebase

* fix: typing and bugs from rebase

* fix: options tests due to merge and loading new options

---------

Co-authored-by: Michael Davis <[email protected]>
Co-authored-by: ksneab7 <[email protected]>
Co-authored-by: Taylor Turner <[email protected]>
Co-authored-by: Tyler <[email protected]>
Co-authored-by: Junho Lee <[email protected]>
Co-authored-by: ksneab7 <[email protected]>
7 people authored and clee1152 committed Aug 1, 2023
1 parent 64bd57e commit 0bc2a42
Showing 3 changed files with 111 additions and 0 deletions.
16 changes: 16 additions & 0 deletions dataprofiler/profilers/base_column_profilers.py
Expand Up @@ -252,23 +252,39 @@ def report(self, remove_disabled_flag: bool = False) -> dict:
    def load_from_dict(
        cls: type[BaseColumnProfilerT],
        data: dict[str, Any],
<<<<<<< HEAD
        config: dict | None = None,
=======
        options: dict | None = None,
>>>>>>> 28d65fc (Staging/dev/profile serialization (#940))
    ) -> BaseColumnProfilerT:
        """
        Parse attribute from json dictionary into self.
        :param data: dictionary with attributes and values.
        :type data: dict[string, Any]
<<<<<<< HEAD
        :param config: config for loading column profiler params from dictionary
        :type config: Dict | None
=======
        :param options: options for loading column profiler params from dictionary
        :type options: Dict | None
>>>>>>> 28d65fc (Staging/dev/profile serialization (#940))
        :return: Profiler with attributes populated.
        :rtype: BaseColumnProfiler
        """
<<<<<<< HEAD
        if config is None:
            config = {}

        class_options = config.get(cls.__name__)
=======
        if options is None:
            options = {}

        class_options = options.get(cls.__name__)
>>>>>>> 28d65fc (Staging/dev/profile serialization (#940))
        profile: BaseColumnProfilerT = cls(data["name"], class_options)

        time_vals = data.pop("times")
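
The two sides of the conflict in the hunk above differ only in the parameter name (`config` versus `options`); in both, the per-class settings are looked up by `cls.__name__` and passed to the constructor. A minimal sketch of that lookup follows. `ToyColumnProfiler` and its attributes are hypothetical stand-ins for illustration, not dataprofiler classes, and the copy of `data` is a choice made here rather than something the hunk does.

```python
from __future__ import annotations

from typing import Any


class ToyColumnProfiler:
    """Hypothetical stand-in for a column profiler; not a dataprofiler class."""

    def __init__(self, name: str, options: dict | None = None) -> None:
        self.name = name
        self.options = options or {}
        self.times: dict[str, float] = {}

    @classmethod
    def load_from_dict(
        cls, data: dict[str, Any], config: dict | None = None
    ) -> "ToyColumnProfiler":
        if config is None:
            config = {}
        # Settings are keyed by class name, mirroring config.get(cls.__name__).
        class_options = config.get(cls.__name__)
        profile = cls(data["name"], class_options)
        # Work on a copy so the caller's dict is left untouched
        # (the hunk above pops "times" from the dict it was given).
        remaining = dict(data)
        profile.times = dict(remaining.pop("times", {}))
        # Copy every remaining serialized attribute onto the new instance.
        for attr, value in remaining.items():
            setattr(profile, attr, value)
        return profile


loaded = ToyColumnProfiler.load_from_dict(
    {"name": "age", "times": {"update": 0.25}, "sample_size": 3},
    config={"ToyColumnProfiler": {"is_enabled": True}},
)
```

Keying the settings by class name lets one dictionary be handed down through nested loaders, with each class picking out only its own entry.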
91 changes: 91 additions & 0 deletions dataprofiler/profilers/json_decoder.py
Expand Up @@ -132,6 +132,97 @@ def load_column_profile(
    """
    Construct subclass of BaseColumnProfiler given a serialized JSON.
    Expected format of serialized_json (see json_encoder):
        {
            "class": <str name of class that was serialized>
            "data": {
                <attr1>: <value1>
                <attr2>: <value2>
                ...
            }
        }
    :param serialized_json: JSON representation of column profiler that was
        # serialized using the custom encoder in profilers.json_encoder
    :type serialized_json: a dict that was created by calling json.loads on
        a JSON representation using the custom encoder
    :param config: config for overriding data params when loading from dict
    :type config: Dict | None
    :return: subclass of BaseColumnProfiler that has been deserialized from
        JSON
    """
    column_profiler_cls: type[
        BaseColumnProfiler[BaseColumnProfiler]
    ] = get_column_profiler_class(serialized_json["class"])
    return column_profiler_cls.load_from_dict(serialized_json["data"], config)


def load_compiler(
    serialized_json: dict, config: dict | None = None
) -> col_pro_compiler.BaseCompiler:
    """
    Construct subclass of BaseCompiler given a serialized JSON.
    Expected format of serialized_json (see json_encoder):
        {
            "class": <str name of class that was serialized>
            "data": {
                <attr1>: <value1>
                <attr2>: <value2>
                ...
            }
        }
    :param serialized_json: JSON representation of profile compiler that was
        serialized using the custom encoder in profilers.json_encoder
    :type serialized_json: a dict that was created by calling json.loads on
        a JSON representation using the custom encoder
    :param config: config for overriding data params when loading from dict
    :type config: Dict | None
    :return: subclass of BaseCompiler that has been deserialized from
        JSON
    """
    column_profiler_cls: type[col_pro_compiler.BaseCompiler] = get_compiler_class(
        serialized_json["class"]
    )
    return column_profiler_cls.load_from_dict(serialized_json["data"], config)


def load_option(serialized_json: dict, config: dict | None = None) -> BaseOption:
    """
    Construct subclass of BaseOption given a serialized JSON.
    Expected format of serialized_json (see json_encoder):
        {
            "class": <str name of class that was serialized>
            "data": {
                <attr1>: <value1>
                <attr2>: <value2>
                ...
            }
        }
    :param serialized_json: JSON representation of option that was
        serialized using the custom encoder in profilers.json_encoder
    :type serialized_json: a dict that was created by calling json.loads on
        a JSON representation using the custom encoder
    :param config: config for overriding data params when loading from dict
    :type config: Dict | None
    :return: subclass of BaseOption that has been deserialized from
        JSON
    """
    option_cls: type[BaseOption] = get_option_class(serialized_json["class"])
    return option_cls.load_from_dict(serialized_json["data"], config)


def load_profiler(serialized_json: dict, config=None) -> BaseProfiler:
    """
    Construct subclass of BaseProfiler given a serialized JSON.
    Expected format of serialized_json (see json_encoder):
        {
            "class": <str name of class that was serialized>
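
Each loader in the json_decoder.py hunk above follows the same pattern: read the `"class"` name from the serialized dict, resolve it to a class, and hand `"data"` (plus the optional `config`) to that class's `load_from_dict`. The sketch below shows a round trip through such an envelope using only the standard library. `ToyOption`, `ToyEncoder`, `_REGISTRY`, and `load_toy` are hypothetical names introduced for illustration; they are assumptions, not the library's encoder or its `get_*_class` helpers.

```python
from __future__ import annotations

import json


class ToyOption:
    """Hypothetical option class; not part of dataprofiler."""

    def __init__(self, is_enabled: bool = True) -> None:
        self.is_enabled = is_enabled

    @classmethod
    def load_from_dict(cls, data: dict, config: dict | None = None) -> "ToyOption":
        option = cls()
        # Copy each serialized attribute onto the fresh instance.
        for attr, value in data.items():
            setattr(option, attr, value)
        return option


class ToyEncoder(json.JSONEncoder):
    """Wraps known objects in the {"class": ..., "data": ...} envelope."""

    def default(self, o):
        if isinstance(o, ToyOption):
            return {"class": type(o).__name__, "data": vars(o)}
        return super().default(o)


# Hypothetical registry mapping serialized class names back to classes.
_REGISTRY = {"ToyOption": ToyOption}


def load_toy(serialized_json: dict, config: dict | None = None):
    """Resolve the class named in the envelope and delegate to its loader."""
    class_name = serialized_json["class"]
    cls = _REGISTRY.get(class_name)
    if cls is None:
        raise ValueError(f"Invalid class {class_name!r} in serialized JSON")
    return cls.load_from_dict(serialized_json["data"], config)


text = json.dumps(ToyOption(is_enabled=False), cls=ToyEncoder)
restored = load_toy(json.loads(text))
```

Storing the class name next to the data means the decoder needs no prior knowledge of which subclass was saved; an explicit registry also keeps unknown names from being resolved to arbitrary classes.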
4 changes: 4 additions & 0 deletions dataprofiler/profilers/unstructured_text_profile.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,11 @@
from numpy import ndarray
from pandas import DataFrame, Series

<<<<<<< HEAD
from . import profiler_utils
=======
from . import utils
>>>>>>> 28d65fc (Staging/dev/profile serialization (#940))
from .base_column_profilers import BaseColumnProfiler
from .profiler_options import TextProfilerOptions
