Tidy3dBaseModel mutability #377

momchil-flex · 2022-05-23T02:25:19Z

momchil-flex
May 23, 2022
Maintainer

I think we should have another serious discussion on whether we want to keep models mutable. I can offer three things I think would work better if they were immutable.

1. We have seen that the approach of a static context manager + _hash to speed things up when caching properties may not be as successful as we were hoping in the case of large models, where the hash function itself can be quite costly. Furthermore, there are still issues with making it work recursively with all models. If models were immutable instead, we could just compute self._cached_property once upon init and then always return that upon self.property.
2. We have also seen that it is not possible to support two different ways of defining things for backwards compatibility. This always raises the problem of what overwrites what, and leaves holes for misuse where things are not correctly set. If the models were immutable, we could support multiple ways of defining things, and raise deprecation warnings for several releases before removing the old arguments.
3. Accidental misuse. There are some scenarios where unexpected things may happen; all of those are technically wrong usage by the user, but it would be harder to do that with immutable models. A very common example:

sims = []
for source in sources:
    sim.sources = [source]
    sims.append(sim)
# Final list sims will contain identical simulations
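
To make the aliasing in the loop above concrete, here is a minimal, stdlib-only sketch. `MutableSim` and `FrozenSim` are hypothetical stand-ins for a simulation model, not tidy3d classes, and the frozen variant uses `dataclasses` rather than pydantic, so details may differ from what tidy3d would do.

```python
from dataclasses import dataclass, replace, FrozenInstanceError
from typing import Tuple


@dataclass
class MutableSim:
    """Hypothetical mutable stand-in for a simulation model."""
    sources: Tuple[str, ...] = ()


@dataclass(frozen=True)
class FrozenSim:
    """Hypothetical immutable stand-in for the same model."""
    sources: Tuple[str, ...] = ()


sources = ["src_a", "src_b", "src_c"]

# Mutable version: every list entry is the same object, so all entries
# end up holding the last source.
sim = MutableSim()
aliased = []
for source in sources:
    sim.sources = (source,)
    aliased.append(sim)

# Frozen version: attribute assignment raises, so the only way to vary
# a field is to build a new object for each source.
base = FrozenSim()
try:
    base.sources = ("src_a",)
    assignment_blocked = False
except FrozenInstanceError:
    assignment_blocked = True

distinct = [replace(base, sources=(source,)) for source in sources]
```

With the frozen variant the buggy line fails immediately instead of silently producing three references to one object.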

A more contrived but still possible scenario:

sim_data = web.load(task_id)
# Maybe I want to see what the simulation looks like with a different grid
sim_data.simulation.grid_spec = td.GridSpec.uniform(dl=0.2)
... # do something
# Then I forgot about this and I wanted to save my data
sim_data.to_file("my_file.hdf5")
# Now the data stored in the file doesn't correspond to the simulation

As far as I can see there are two main arguments to keep mutability.

1. It is sometimes more convenient and saves some lines of code. I personally think this is not that important compared to the arguments above.
2. The main one imo: how difficult would it be to enforce immutability now? One approach would be to freeze Tidy3dBaseModel, and to have all attributes either be a subclass of that, or a Tuple. This will probably be hard to implement for e.g. numpy arrays and xarray objects. I think a better approach is for everything to be a model, tuple, or private attribute. For example:

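from typing import List
import pydantic as pd
# Tidy3dBaseModel and Structure are assumed to be imported from tidy3d

class Simulation(Tidy3dBaseModel):
    _structures: List[Structure] = pd.PrivateAttr([])

    def __init__(self, structures, **kwargs):
        """Only private attributes are set in init, and they are not touched anywhere else."""
        super().__init__(**kwargs)  # run pydantic's init first so the private attribute can be set afterwards
        self._structures = structures

    @property
    def structures(self):
        return self._structures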

The same principle can be applied to numpy arrays, data arrays, or anything else. One problem I can envision is that the documentation generation function may have to be modified a bit. Another one is that I'm not sure how the validators work exactly in this case, but I'm sure plenty of people have made immutable pydantic models and there will be plenty of information on how to make it work.

tylerflex · 2022-05-23T16:25:33Z

tylerflex
May 23, 2022
Maintainer

Intro

Thanks for writing this up, it is good to have an ongoing discussion on this topic. Of course, we should always make the best design decision given our needs. I think there are good arguments for taking a more immutable approach. However, I feel like there are many good reasons for keeping mutability, beyond “saving a few lines”. There are reasons dicts, lists, sets, numpy arrays, etc. are mutable objects. It tends to make them much easier to work with. This is also a very major change; before we make it, we should continue discussing and make sure we fully understand the ramifications.

A few facts (as far as I understand them) to agree on before moving onto the details:

1. Pydantic only supports a “faux-immutability” and, as they warn: “Immutability in python is never strict. If developers are determined/stupid they can always modify a so-called "immutable" object.” I therefore worry about making assumptions and optimizations based on immutability when the models are, in fact, still mutable.
2. If the Simulation (or any model contained within) contains any mutable objects, the entire model becomes effectively mutable. So if we want to be more sure that the models are “static” (in the hash sense), everything must be converted to float, int, tuple, str, or bool. Even then, I'm not sure if there is any pydantic weirdness that could potentially mutate the contents of a model instance.
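
Both facts can be illustrated without pydantic. The sketch below uses a frozen stdlib dataclass as an analogue for a frozen model; `FrozenModel` is a hypothetical name, and pydantic's own behaviour may differ in detail.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import List


@dataclass(frozen=True)
class FrozenModel:
    """Hypothetical frozen container holding one scalar and one list."""
    run_time: float = 1.0
    structures: List[str] = field(default_factory=list)


model = FrozenModel()

# Plain attribute assignment is blocked...
try:
    model.run_time = 2.0
    blocked = False
except FrozenInstanceError:
    blocked = True

# ...but a mutable field can still be changed in place,
model.structures.append("box")

# and a determined caller can bypass the guard entirely.
object.__setattr__(model, "run_time", 2.0)
```

After this runs, the "frozen" object has a different `run_time` and a different `structures` list than it was created with, which is the concern about caching on the assumption of immutability.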

Hash vs. Private Attributes

We have seen that the approach of a static context manager + _hash to speed things up when caching properties may not be as successful as we were hoping in the case of large models, where the hash function itself can be quite costly. Furthermore, there are still issues with making it work recursively with all models. If models were immutable instead, we could just compute self._cached_property once upon init and then always return that upon self.property.

For reference, what is the expensive part of hashing large models? Is it the self.json() or the hash() of that? Even if we make the Simulation immutable or “frozen”, we still can’t fully guarantee it has not changed without doing some sort of comparison of the .json(), unless we take extreme steps to enforce that every Field throughout the code is immutable (and even that might not be guaranteed).
Whether the models are mutable or not, we could always switch the current implementation to one based on private attributes (like a cached property) instead of hash. The private attributes could be set and retrieved in the static context. We lose a bit of generality compared to the hash approach because we may need to explicitly define these private attributes, but if the gains are large enough doing it this way, I’m sure we could. For example, we can create a different @cache(private_attr: str) decorator that accepts the name of the private attribute to store the value within.
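
One way the `@cache(private_attr)` idea could look is sketched below. This is an illustration under assumptions, not the tidy3d implementation: `cache`, `Box`, and the `_volume` attribute are hypothetical names, and a plain class stands in for a pydantic model.

```python
import functools


def cache(private_attr: str):
    """Decorator factory: store a method's result on the instance under `private_attr`."""

    def decorator(method):
        @functools.wraps(method)
        def wrapper(self):
            # Compute only if the private attribute has not been set yet.
            if getattr(self, private_attr, None) is None:
                setattr(self, private_attr, method(self))
            return getattr(self, private_attr)

        return property(wrapper)

    return decorator


class Box:
    """Hypothetical model with one expensive derived quantity."""

    def __init__(self, size):
        self.size = tuple(size)
        self._volume = None  # private cache slot, filled on first access
        self.n_computed = 0  # counts how often the expensive code ran

    @cache("_volume")
    def volume(self):
        self.n_computed += 1
        x, y, z = self.size
        return x * y * z


box = Box((1.0, 2.0, 3.0))
first = box.volume
second = box.volume
```

No hashing is involved, so the lookup cost does not grow with model size, but the cached value is only valid as long as `size` never changes after construction. A result of `None` would be recomputed on every access in this simple version.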

Backwards compatibility

We have also seen that it is not possible to support two different ways of defining things for backwards compatibility. This always raises the problem of what overwrites what, and leaves holes for misuse where things are not correctly set. If the models were immutable, we could support multiple ways of defining things, and raise deprecation warnings for several releases before removing the old arguments.

I guess we could revisit this issue. A few things:

Note that a version translator could be useful here, once we have that.
As a side note: I wonder if something like this could be a good solution that we hadn’t thought of:

class MyModel(Tidy3dBaseModel):

    old_field = Field()  # the one we are phasing out
    new_field = Field()  # the one we'll use exclusively soon

    @pydantic.validator("old_field")
    def warn_and_convert(cls, val, values):
        """set the values['new_field']"""
        log.warning('old_field will not be supported')
        new_val = some_conversion(val)
        values['new_field'] = new_val  # key must match the field name
        return val  # or None
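
For comparison, the same old-name/new-name mapping can be resolved once at construction time. The sketch below uses a frozen stdlib dataclass instead of pydantic, so it is only an analogue: `MyModel`, `old_field`, and `new_field` mirror the snippet above, while `some_conversion` is a hypothetical placeholder with a made-up body.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


def some_conversion(val: float) -> float:
    """Hypothetical conversion from the old convention to the new one."""
    return 2.0 * val


@dataclass(frozen=True)
class MyModel:
    new_field: Optional[float] = None
    old_field: Optional[float] = None  # deprecated spelling

    def __post_init__(self):
        if self.old_field is not None:
            if self.new_field is not None:
                raise ValueError("pass either old_field or new_field, not both")
            warnings.warn("old_field is deprecated, use new_field", DeprecationWarning)
            # Frozen dataclasses block normal assignment, so use object.__setattr__.
            object.__setattr__(self, "new_field", some_conversion(self.old_field))


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = MyModel(old_field=1.5)

modern = MyModel(new_field=3.0)
```

Both spellings end up with the same `new_field`, and the legacy one emits a single deprecation warning.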

Bugs

There are some scenarios where unexpected things may happen; all of those are technically wrong usage by the user, but it would be harder to do that with immutable models.

I agree that there are some potential issues. But isn’t this true in python generally? Both of your examples seem like things that could be done in any python context. Why do we want to protect the user from the normal workings of the language? The first example seems like it will be very easy to discover and fix. The second example seems quite unlikely to occur.

I’m far more worried about us (the developers) making some mistake by assuming the Simulations are immutable when they have been changed. For example, using a cached property when some of the data has been modified (i.e. a polygon was dilated). These bugs seem much harder to identify. I guess there’s always going to be a tradeoff between performance and correctness.
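
The kind of bug described here can be reproduced in a few lines. This is a stdlib-only sketch with a hypothetical `Polygon` class; `functools.cached_property` stands in for whatever caching mechanism is actually used.

```python
from functools import cached_property


class Polygon:
    """Hypothetical mutable geometry with a cached derived value."""

    def __init__(self, vertices):
        self.vertices = list(vertices)

    @cached_property
    def bounds(self):
        xs = [x for x, _ in self.vertices]
        ys = [y for _, y in self.vertices]
        return (min(xs), min(ys), max(xs), max(ys))


poly = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
before = poly.bounds  # computed and cached here

# "Dilate" the polygon in place by scaling every vertex by 2.
poly.vertices = [(2 * x, 2 * y) for x, y in poly.vertices]

stale = poly.bounds  # still the cached value from before the change
fresh = Polygon(poly.vertices).bounds  # recomputed on a new object
```

Nothing raises and nothing warns: `stale` simply disagrees with the current vertices.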

My original design decision (models are mutable, use properties often) was to make sure everything was always “correct”, even when the model was changed. The immutability move scares me because it seems impossible to fully enforce, and means we are introducing the potential for bugs that we have no need to worry about currently.

Benefits of Mutability

As far as I can see there are two main arguments to keep mutability.

It is sometimes more convenient and saves some lines of code. I personally think this is not that important compared to the arguments above.

I feel like this is a bit of an oversimplification of the benefits. As just one example, mutability is usually a more convenient, and also more robust, way to construct variants of the same simulation. Consider:

sim = Simulation(...)
# many cells / lines later, I need to make one or more copies of the same simulation with a different run_time or some other fields.

# copy and paste the init method; what if I add a `subpixel` argument to the call above? I need to remember to propagate this to all Simulation() calls.
sim2 = Simulation(...)

# wrap everything in a function; this is cumbersome, and all the call signatures need to be propagated if, say, a kwarg is added.
sim2 = make_sim(run_time)

# much more convenient and always works, even if the original sim is modified.
sim2 = sim.copy()
sim2.run_time = ...

This pattern shows up all the time: optimization loops, adjoint gradient, resolution / parameter scanning.

Enforcing immutability

2. The main one imo: how difficult would it be to enforce immutability now? One approach would be to freeze Tidy3dBaseModel, and to have all attributes either be a subclass of that, or a Tuple. This will probably be hard to implement for e.g. numpy arrays and xarray objects. I think a better approach is for everything to be a model, tuple, or private attribute.

Yeah, this also worries me the most, as I’ve probably made clear. I don't even know how to begin with the numpy stuff. There is a way to set them as unwritable, but it's still not robust because any view into the data in an unwritable array can still allow mutation. We’d likely need to cast all numpy arrays to tuples and then cast them back to numpy arrays in the validators and methods, which will be annoying to work with and potentially have its own performance issues when dealing with large data (custom sources?).
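
A short NumPy sketch of what the write flag does and does not protect against. It assumes NumPy is installed, and the variable names are purely illustrative.

```python
import numpy as np

# A read-only view does not protect the data if the base array is still writable.
base = np.zeros(3)
view = base[:]
view.setflags(write=False)

try:
    view[0] = 1.0
    view_write_blocked = False
except ValueError:
    view_write_blocked = True

base[0] = 5.0  # allowed, and the change is visible through the read-only view

# Copying first and then locking removes the shared, writable base...
locked = np.array(base, copy=True)
locked.setflags(write=False)
base[1] = 7.0  # no longer affects `locked`

# ...but an array that owns its data can simply be unlocked again.
locked.setflags(write=True)
locked[2] = 9.0
```

So the flag guards against accidental writes through one particular handle, but it is not a guarantee that the underlying data never changes.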

Wrapping up

Anyway, I do recognize the advantages of taking an immutable approach.

If we can enforce it in a robust way, it would make the code more performant (we could essentially just compute and cache every property once on receiving the file).
It could cut down on errors from users not being aware of mutability, although I would argue this is just how python is.

However, there are almost mirroring advantages to mutability.

We always know things are correct (properties return up to date values, fields are always validated). There is no guessing, it is guaranteed.
Mutability makes it nice to write code that makes modifications of simulations with minimal effort or refactoring. It also makes it convenient to do things in notebooks (i.e. reduce the grid size and plot the simulation again). Basically, it’s less restrictive and more expressive. For the front end, at least, it feels the nicest to work with.

I’m probably still team “mutability” unless the following conditions can be met:

We can ensure without a doubt that the Simulation cannot be modified (#1 above).
We can’t come up with a private attribute approach that solves the performance issues.

As a side note: maybe we can compromise a bit and take a stab at a subclassed Tidy3dBaseModel for the backend that is immutable and performant. We could try to enforce this rule on the backend, and I think it would make even more sense there.

3 replies

momchil-flex May 23, 2022
Maintainer Author

I actually don't agree with this

If the Simulation (or any model contained within) contains any mutable objects, the entire model becomes effectively mutable. So if we want to be more sure that the models are “static” (in the hash sense), everything must be converted to float, int, tuple, str, or bool. Even then, I'm not sure if there is any pydantic weirdness that could potentially mutate the contents of a model instance.

and this

Yeah, this also worries me the most, as I’ve probably made clear. I don't even know how to begin with the numpy stuff. There is a way to set them as unwritable, but it's still not robust because any view into the data in an unwritable array can still allow mutation. We’d likely need to cast all numpy arrays to tuples and then cast them back to numpy arrays in the validators and methods, which will be annoying to work with and potentially have its own performance issues when dealing with large data (custom sources?).

and I think we can

ensure without a doubt that the Simulation cannot be modified (#1 above).

Basically the rule is simple: every Field is either a subclass of Tidy3dBaseModel (which is frozen) or a private attribute. So the user cannot modify either of those. Like, there are ways, but it would for example be to subclass Simulation and write a method that modifies the private attribute? Technically possible, but that's really far-fetched.

So with numpy arrays for example my example works exactly the same:

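import numpy as np
import pydantic as pd
# Tidy3dBaseModel and Array are assumed to be imported from tidy3d

class PolySlab(Tidy3dBaseModel):
    _vertices: Array[float] = pd.PrivateAttr([])

    def __init__(self, vertices: Array[float], **kwargs):
        """Only private attributes are set in init, and they are not touched anywhere else."""
        super().__init__(**kwargs)  # run pydantic's init first so the private attribute can be set afterwards
        self._vertices = np.copy(vertices)  # copy so later changes to the caller's array don't leak in

    @property
    def vertices(self):
        return self._vertices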

Internally, the dangers are that a developer modifies the private attributes e.g. in some methods where they shouldn't be modified. But this does sound like something that can be caught (even pylint warns you) and simple rules can be established. It is not like we don't have to put a lot of effort and thought right now in ensuring that all validators work as expected to make sure that mutability is always well supported.

tylerflex May 23, 2022
Maintainer

every Field is either a subclass of Tidy3dBaseModel (which is frozen) or a private attribute.

Can you explain this? What about things like PolySlab.vertices?

momchil-flex May 23, 2022
Maintainer Author

Some other comments

Backwards compatibility

Note that a version translator could be useful here, once we have that.

It will help, but it would still mean that people have to change their scripts as soon as they upgrade to a new version.

As a side note: I wonder if something like this could be a good solution that we hadn’t thought of:

class MyModel(Tidy3dBaseModel):

    old_field = Field()  # the one we are phasing out
    new_field = Field()  # the one we'll use exclusively soon

    @pydantic.validator("old_field")
    def warn_and_convert(cls, val, values):
        """set the values['new_field']"""
        log.warning('old_field will not be supported')
        new_val = some_conversion(val)
        values['new_field'] = new_val  # key must match the field name
        return val  # or None

The problem with this is that if I then do model.new_field = something_else, model.old_field becomes stale.

I feel like this is a bit of an oversimplification of the benefits. As just one example, mutability is usually a more convenient, and also more robust, way to construct variants of the same simulation. Consider:

sim = Simulation(...)
# many cells / lines later, I need to make one or more copies of the same simulation with a different run_time or some other fields.

# copy and paste the init method; what if I add a `subpixel` argument to the call above? I need to remember to propagate this to all Simulation() calls.
sim2 = Simulation(...)

# wrap everything in a function; this is cumbersome, and all the call signatures need to be propagated if, say, a kwarg is added.
sim2 = make_sim(run_time)

# much more convenient and always works, even if the original sim is modified.
sim2 = sim.copy()
sim2.run_time = ...

This pattern shows up all the time: optimization loops, adjoint gradient, resolution / parameter scanning.

With immutable models, you could also do something like

sim_dict = sim.dict()
sim_dict["run_time"] = new_value
new_sim = Simulation.parse_obj(sim_dict)  # parse_obj takes a dict; parse_raw expects a str/bytes payload
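
The same round trip, sketched with a frozen stdlib dataclass so that it runs without tidy3d or pydantic. `Sim` and its fields are hypothetical; the point is only that rebuilding through the constructor re-runs validation on the changed value.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Sim:
    """Hypothetical immutable simulation with a validated field."""
    run_time: float
    subpixel: bool = True

    def __post_init__(self):
        if self.run_time <= 0:
            raise ValueError("run_time must be positive")


sim = Sim(run_time=1e-12)

# Export, edit the plain dict, and rebuild a new validated object.
sim_dict = asdict(sim)
sim_dict["run_time"] = 2e-12
new_sim = Sim(**sim_dict)

# An invalid edit is caught when the object is rebuilt.
sim_dict["run_time"] = -1.0
try:
    Sim(**sim_dict)
    rejected = False
except ValueError:
    rejected = True
```

In pydantic v1, `model.copy(update={...})` is a shorter spelling of a similar idea, though as far as I know the updated values are not validated on that path.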

However, there are almost mirroring advantages to mutability.

We always know things are correct (properties return up to date values, fields are always validated). There is no guessing, it is guaranteed.

I am actually not at all sure that enforcing everything is correct with mutability (no stale values, etc.) is easier than enforcing immutability...
