Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove enum_values keywords in serialization #565

Closed
wants to merge 8 commits into from

Conversation

FabiPi3
Copy link
Contributor

@FabiPi3 FabiPi3 commented Mar 13, 2024

Instead of storing only the value of an Enum in the output, I suggest to store the whole enum class in order to be able to recreate it properly. This will be possible due to the new enum support in monty and after merging materialsvirtuallab/monty#640 this.

@utf
Copy link
Member

utf commented Mar 13, 2024

I'm concerned this feature might cause an issue for downstream packages. Often we use Enums as keys for dictionaries. This feature in monty seems like it will break this behaviour since dictionaries cannot be hashed and therefore cannot be used as keys. Have you tried running the atomate2 tests with this new branch of jobflow?

@FabiPi3
Copy link
Contributor Author

FabiPi3 commented Mar 13, 2024

@utf Thanks for your reply. I tried running the atomate2 tests locally and I can't see any issues arising when changing to my latest changes in monty and jobflow

@utf
Copy link
Member

utf commented Mar 13, 2024

Thanks for checking @FabiPi3

Copy link

codecov bot commented Mar 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.42%. Comparing base (eda2a65) to head (978e66f).
Report is 14 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #565   +/-   ##
=======================================
  Coverage   99.42%   99.42%           
=======================================
  Files          21       21           
  Lines        1564     1564           
  Branches      425      425           
=======================================
  Hits         1555     1555           
  Misses          9        9           
Files Coverage Δ
src/jobflow/core/job.py 100.00% <100.00%> (ø)
src/jobflow/core/reference.py 100.00% <100.00%> (ø)

@FabiPi3
Copy link
Contributor Author

FabiPi3 commented Mar 13, 2024

In general, I would guess that your described use case is still covered. Looking into the actual implementation of jsanitize, we see:

    if isinstance(obj, dict):
        return {
            str(k): jsanitize(
                v,
                strict=strict,
                allow_bson=allow_bson,
                enum_values=enum_values,
                recursive_msonable=recursive_msonable,
            )
            for k, v in obj.items()
        }

which means for me that only the values are serialized, the keys are still taken as strings. Hope this helps.

@gpetretto
Copy link
Contributor

Hi @FabiPi3, I am also thinking that there might some consequences to this change. To illustrate this I made an example with different objects and (de)serialization, including pydantic base models, that are often present in jobflow's output.

Here is the code:

from monty.json import MSONable, jsanitize
from enum import Enum
from pydantic import BaseModel

class E(Enum):
    A = "A"

class ME(MSONable, Enum):
    B = "B"

class BM1(BaseModel):
    e: E = E.A
    
class BM2(BaseModel):
    me: ME = ME.B


d = {"e": E.A, "me": ME.B}

print("Dictionary")
for obj in [{"e": E.A}, {"me": ME.B}]:
    for enum_values in (True, False):
        try:
            print(f"{obj}, {enum_values}:", jsanitize(obj, strict=True, allow_bson=True, enum_values=enum_values))
        except AttributeError:
            print(f"{obj}, {enum_values}: failed")


print("\npydantic")
for cls in [BM1, BM2]:
    for enum_values in (True, False):
        try:
            print(f"{cls}, {enum_values}:", jsanitize(cls(), strict=True, allow_bson=True, enum_values=enum_values))
            try:
                print(cls.model_validate(jsanitize(cls(), strict=True, allow_bson=True, enum_values=enum_values)))
            except Exception as e:
                print(f"{obj}, {enum_values}: pydantic failed: {e}")    
        except AttributeError:
            print(f"{obj}, {enum_values}: jsanitize failed")

Here is the output with the standard monty version:

Dictionary
{'e': <E.A: 'A'>}, True: {'e': 'A'}
{'e': <E.A: 'A'>}, False: failed
{'me': <ME.B: 'B'>}, True: {'me': 'B'}
{'me': <ME.B: 'B'>}, False: {'me': {'@module': '__main__', '@class': 'ME', '@version': None, 'value': 'B'}}

pydantic
<class '__main__.BM1'>, True: {'e': 'A', '@module': '__main__', '@class': 'BM1', '@version': None}
e=<E.A: 'A'>
{'me': <ME.B: 'B'>}, False: jsanitize failed
<class '__main__.BM2'>, True: {'me': 'B', '@module': '__main__', '@class': 'BM2', '@version': None}
{'me': <ME.B: 'B'>}, True: pydantic failed: 1 validation error for BM2
me
  Value error, Must provide ME, the as_dict form, or the proper [type=value_error, input_value='B', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
<class '__main__.BM2'>, False: {'me': {'@module': '__main__', '@class': 'ME', '@version': None, 'value': 'B'}, '@module': '__main__', '@class': 'BM2', '@version': None}
me=<ME.B: 'B'>

Here is the output of your new monty version:

Dictionary
{'e': <E.A: 'A'>}, True: {'e': 'A'}
{'e': <E.A: 'A'>}, False: {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}}
{'me': <ME.B: 'B'>}, True: {'me': 'B'}
{'me': <ME.B: 'B'>}, False: {'me': {'@module': '__main__', '@class': 'ME', '@version': None, 'value': 'B'}}

pydantic
<class '__main__.BM1'>, True: {'e': 'A', '@module': '__main__', '@class': 'BM1', '@version': None}
e=<E.A: 'A'>
<class '__main__.BM1'>, False: {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}, '@module': '__main__', '@class': 'BM1', '@version': None}
{'me': <ME.B: 'B'>}, False: pydantic failed: 1 validation error for BM1
e
  Input should be 'A' [type=enum, input_value={'value': 'A', '@module':...: 'E', '@version': None}, input_type=dict]
<class '__main__.BM2'>, True: {'me': 'B', '@module': '__main__', '@class': 'BM2', '@version': None}
{'me': <ME.B: 'B'>}, True: pydantic failed: 1 validation error for BM2
me
  Value error, Must provide ME, the as_dict form, or the proper [type=value_error, input_value='B', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
<class '__main__.BM2'>, False: {'me': {'@module': '__main__', '@class': 'ME', '@version': None, 'value': 'B'}, '@module': '__main__', '@class': 'BM2', '@version': None}
me=<ME.B: 'B'>

and just to hopefully make it clearer, here is the diff of the two outputs:

3c3
< {'e': <E.A: 'A'>}, False: failed
---
> {'e': <E.A: 'A'>}, False: {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}}
10c10,13
< {'me': <ME.B: 'B'>}, False: jsanitize failed
---
> <class '__main__.BM1'>, False: {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}, '@module': '__main__', '@class': 'BM1', '@version': None}
> {'me': <ME.B: 'B'>}, False: pydantic failed: 1 validation error for BM1
> e
>   Input should be 'A' [type=enum, input_value={'value': 'A', '@module':...: 'E', '@version': None}, input_type=dict]

Considering that you propose to pass from jsanitize(strict=True, allow_bson=True, enum_values=True) to jsanitize(strict=True, allow_bson=True, enum_values=False), I can see some key differences

  1. the jsanitize for {'e': E.A} that failed, now passes and is consistent with what the MontyEncoder does. Which is good.

  2. The jsanitized document is also used for storing the output. This means that in all the cases we pass from a serialized version of the form {'e': 'A'} to one of the form {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}} in the output database. So, any mongodb query of the type {"e": "A"} would be broken and should be modified in {"e.value": "A"}. I am not sure how many such queries could be out there, but I suppose this could definitely break something.

  3. In the case of BM1, going from enum_values=True to enum_values=False in the new version of monty also means that the pydantic model cannot be generated with model_validate from the serialized dictionary. In Jobflow MontyDecoder is used to deserialize, and thus keeps working correctly. So it is not strictly an issue for jobflow, but I am wondering if there are some cases where this could fail (there are in jobflow-remote, but I suppose can be fixed, if needed).

In general there is definitely a benefit to this, but some issues may arise.

I would also suggest to run the tests of emmet (and maybe maggma?) with the new monty version, to check that eveything works fine there as well.

@FabiPi3
Copy link
Contributor Author

FabiPi3 commented Mar 19, 2024

@gpetretto Thanks for your detailed answer. I cannot really judge how large those issues would be, but in my opinion everything should be fixable.

Concerning the tests of emmet and maggma, everything that worked locally for me also worked with the new monty version. At least I couldn't see any issues arising there with the changes in monty. Hope this helps.

@FabiPi3
Copy link
Contributor Author

FabiPi3 commented Apr 5, 2024

The pull request in monty was accepted. We could go ahead and merge this now. If I understand, there are no really big issues preventing this. Please comment again about your opinion.

Else I would close this (and adapt my application code). Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants