Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract interface for Input classes #2211

Merged
merged 76 commits into from
Mar 14, 2022

Conversation

rkingsbury
Copy link
Contributor

After much discussion and feedback, here is a draft abstract interface for pymatgen I/O classes.

Thanks @utf @shyamd @arepstein @mkhorton and @computron for your thoughts so far, and feel free to add more here!

Motivation

By defining a minimal interface for all Input classes in pymatgen, we can

  1. Make it easier for users to learn pymatgen IO for different codes by providing more consistent interface
  2. Make it easier for users to adapt scripts and workflows for different codes
  3. Make FireTask or FireWork call signatures more consistent
  4. Make FireTasks and FireWorks more portable among different codes, lowering barriers to new workflow development

Objective

The vision here is to define a minimal common interface for all Input classes in pymatgen, whether for solid DFT, molecular DFT, classical MD, AIMD, etc.

There are two classes. Input is a data container holding all input data for a specific calculation. It defines write_inputs (required) and from_files (optional) methods , and an is_valid (optional) property to perform basic input validation.

InputSettings is a generator class that holds settings or instructions for how to create an Input, given a structure. It's only requirement is a get_input() method.

Each code will subclass Input one time, e.g. VaspInput(Input), and subclass InputSettings multiple times for each type of calculation, e.g. VaspRelaxation(InputSettings), VaspStatic(InputSettings), etc.

For barebones support of any code, it is possible to useInput directly without writing an associated InputSettings class (see the TemplateInput example).

Implementation

This PR is a vehicle for iterating on the interface. Refactoring existing I/O classes to use the interface is not in scope for this PR. However, I have included a pseudo-functional example VASP implementation to illustrate how this interface could be used in practice. I may add other pseudo-examples, but will leave functional implementations for later.

I also provide a functional concrete implementation of Input (TemplateInput) that gives a low-barrier way for new users to transition from bash scripts to pymatgen I/O classes.

Implications for atomate

Once we have a consistent call signature for Input, it should be possible to write a generic FireTask for writing inputs that will work with many different codes, e.g.

class WriteInput(FireTaskBase):
	"""
	Args:
		input_set: Input class
		calc_dir: Union[str, Path]
	"""
	required_params = ["input_set", "calc_dir"]

    def run_task(self, fw_spec):

        input_set = self["input_set"]
        calc_dir = self["calc_dir"]

        input_set.write_inputs(calc_dir)

This would dramatically lower the barrier to supporting new codes in atomate. As long as a code has the associated Input class defined, the FireTask could be used without modification. Of course customized and more complex FireTasks may still be needed and can be built just as they are today.

@shyuep
Copy link
Member

shyuep commented Aug 9, 2021

I am in favor of a more concrete API definition, but I am not sure about this implementation.

  1. Input conflicts with the Python built-in input method.
  2. I fail to see what the current Input offers beyond the VaspInputSet abstract class. Certainly we can abstract out parts of the VaspInputSet class to make it general. But there doesn't seem to be any further generalizability beyond VASP. It seems that the main definitions are from_files and write_inputs (which I would prefer to rename completely), which are already pretty standard throughout pymatgen.
  3. Finally, I think we should be clear about what input/output means. In the context of pymatgne, input/output generally means between Python objects and files/other sources. It does not mean VASP input files or output files. Both of those are actually inputs as far as pymatgen is concerned.

@rkingsbury
Copy link
Contributor Author

I am in favor of a more concrete API definition, but I am not sure about this implementation.

Thanks for your thoughts @shyuep! Let me ask some clarifying questions.

Input conflicts with the Python built-in input method.

Excellent point. Since we are trying to separate the data containers from the input settings I was trying to find a pair of names that were clear and evocative for this distinction. I worry that InputSet and InputSettings would be confusing. What would you think about InputSet (the container) and InputGenerator (for the settings)?

I fail to see what the current Input offers beyond the VaspInputSet abstract class. Certainly we can abstract out parts of the VaspInputSet class to make it general. But there doesn't seem to be any further generalizability beyond VASP. It seems that the main definitions are from_files and write_inputs (which I would prefer to rename completely), which are already pretty standard throughout pymatgen.

Yes, to be honest the number one thing I'm hoping to achieve with this is to standardize the method name and call signature of from_files and write_inputs. There are analogous methods throughout pymatgen but some go by different names and have different call signatures. How do you want to rename write_inputs?

The other big benefit beyond the current implementation of VaspInputSet (which @utf can probably elaborate on more) is to enforce a clear separation between the data container and the generator. This should remove many points of confusion w.r.t inheritance and streamline workflow development.

Finally, I think we should be clear about what input/output means. In the context of pymatgne, input/output generally means between Python objects and files/other sources. It does not mean VASP input files or output files. Both of those are actually inputs as far as pymatgen is concerned.

I see your point. Would the renaming convention in Item 1 address? Do you have other ideas about how we would clarify this distinction? We could go further and define the base class as CodeInputSet or something.

@utf
Copy link
Member

utf commented Aug 23, 2021

I fail to see what the current Input offers beyond the VaspInputSet abstract class. Certainly we can abstract out parts of the VaspInputSet class to make it general. But there doesn't seem to be any further generalizability beyond VASP. It seems that the main definitions are from_files and write_inputs (which I would prefer to rename completely), which are already pretty standard throughout pymatgen.

I think part of the confusion with VaspInputSet is that it is used as both a generator class and as a container for VASP inputs. This means initialising the class requires all the data needed to generate the inputs (structure, etc) but the actual inputs are generated dynamically (i.e., when accessing the incar attribute).

The pain points are then:

  1. If you have a VaspInputSet it is not obvious how to edit the INCAR or KPOINTS. I.e., you can't just do vis.incar["NSW"] = 10 because the inputs are generated on the fly.
  2. On the flip side, the structure is needed on class creation even though the inputs are generated dynamically, which means that you can't use the input set as a template for delayed input creation.
  3. It can be very hard to track down why a certain setting is in your INCAR/KPOINTS because every VaspInputSet subclass implements the Incar, kpoints, etc generation in slightly different ways. E.g., some subclasses choose to edit the k-points in the init method, some choose to do it in the kpoints attribute, some in the write_inputs function, some in a mixture of all them.

This proposal is to have:

  1. A fixed input set class that isn’t subclassed for specific calculation types. The init, write_input, and validate functions always do exactly the same thing regardless of whether your calculation is a GGA relaxation, hybrid line mode band structure, or were generated outside of pymatgen.
  2. Input set generators for generating the input set class. These contain all the crazy logic for generating the input sets. You can subclass the generators but these only have one function get_input_set, so the tracking down their behaviour should be much simpler.

As @computron wrote in an email:

the "VaspInput" class of pymatgen is almost like the Input class proposed. And the VaspInputSet would be like the generators proposed if it just ditched all the internal representations of kpoints, poscar, etc. as well as functions for writing out inputs itself, and basically just had the existing method get_vasp_input() retained.

I think this model can be used beyond the VASP module. The typical pattern used in pymatgen for VASP, QCHEM, CP2K, FEFF, etc is: 1) Make an abstract input set to represent input files, 2) create subclasses of this input set for different calculation types. But in each case, the input sets suffer from the generator/container intermixing and related issues.

@shyuep
Copy link
Member

shyuep commented Aug 23, 2021

I think part of the confusion with VaspInputSet is that it is used as both a generator class and as a container for VASP inputs. This means initialising the class requires all the data needed to generate the inputs (structure, etc) but the actual inputs are generated dynamically (i.e., when accessing the incar attribute).

There is a separate VaspInput class in pymatgen.vasp.inputs.

The pain points are then:

  1. If you have a VaspInputSet it is not obvious how to edit the INCAR or KPOINTS. I.e., you can't just do vis.incar["NSW"] = 10 because the inputs are generated on the fly.

The idea of an InputSet is to have a relatively rigid way of defining rules for how an input is generated. There is absolutely no reason why you cannot do what you just wrote. incar = vis.incar; incar["NSW"] = 10 works perfectly fine. So does user_incar_settings={"NSW": 10}.

  1. On the flip side, the structure is needed on class creation even though the inputs are generated dynamically, which means that you can't use the input set as a template for delayed input creation.

This has been debated ad nauseum. Previously the input sets did not require structure and generated on the fly. But that led to circular dependencies. E.g., if you are generating inputs for a Bandstructure calculation, you sometimes transform to standard symmetry directions. Basically, that means that if vis.get_incar, vis.get_kpoints, etc. will all call the same structure transformation - 4x for your standard set of inputs. Further, sometimes the incar setting or the kpoint setting can depend on each other. This is why an input set is tied specifically to a structure today. If you have a better idea of how to solve this problem, by all means propose it.

  1. It can be very hard to track down why a certain setting is in your INCAR/KPOINTS because every VaspInputSet subclass implements the Incar, kpoints, etc generation in slightly different ways. E.g., some subclasses choose to edit the k-points in the init method, some choose to do it in the kpoints property function, some in the write_inputs function, some in a mixture of all them.

This proposal is to have:

  1. A fixed input set class that isn’t subclassed for specific calculation types. The init, write_input, and validate functions always do exactly the same thing regardless of whether your calculation is a GGA relaxation, hybrid line mode band structure, or were generated outside of pymatgen.
  2. Input set generators for generating the input set class. These contain all the crazy logic for generating the input sets. You can subclass the generators but these only have one function get_input_set, so the tracking down their behaviour should be much simpler.

Again, this is exactly what the situation it is today. Your so-called fix input set class is basically what the VaspInput class is. Your generator is basically what the VaspInputSet is. We can of course do a rename and call it VaspInputGenerator instead of VaspInputSet, but that is semantics, not implementation.

As @computron wrote in an email:

the "VaspInput" class of pymatgen is almost like the Input class proposed. And the VaspInputSet would be like the generators proposed if it just ditched all the internal representations of kpoints, poscar, etc. as well as functions for writing out inputs itself, and basically just had the existing method get_vasp_input() retained.

I think this model can be used beyond the VASP module. The typical pattern used in pymatgen for VASP, QCHEM, CP2K, FEFF, etc is: 1) Make an abstract input set to represent input files, 2) create subclasses of this input set for different calculation types. But in each case, the input sets suffer from the generator/container intermixing and related issues.

@computron should be well-aware of the reasons why structure was included in the VaspInputSet in the first place. Solve the issue of the bandstructure input set generation and let me know when you have a good solution. You will find that both types of solutions have their flaws and pain points for different types of calculations.

@utf
Copy link
Member

utf commented Aug 23, 2021

The idea of an InputSet is to have a relatively rigid way of defining rules for how an input is generated. There is absolutely no reason why you cannot do what you just wrote. incar = vis.incar; incar["NSW"] = 10 works perfectly fine.

But now you no longer have the nice container class for writing all the inputs to file and validating that they are correct. Note that validating an INCAR is reasonable requires having access to the poscar and kpoints files too. Hence, the proposed api allows for the user to write vis.incar["NSW"] = 10; vis.validate() which is natural. The vis.user_incar_settings={"NSW": 10} example is messy and difficult to discover imo.

Previously the input sets did not require structure and generated on the fly. But that led to circular dependencies. E.g., if you are generating inputs for a Bandstructure calculation, you sometimes transform to standard symmetry directions. Basically, that means that if vis.get_incar, vis.get_kpoints, etc. will all call the same structure transformation - 4x for your standard set of inputs. Further, sometimes the incar setting or the kpoint setting can depend on each other. This is why an input set is tied specifically to a structure today. If you have a better idea of how to solve this problem, by all means propose it.

I think this example showcases the strength of current proposal nicely. You have an InputSetGenerator with a single function get_input_set. You generate the input set with InputSetGenerator.get_input_set(structure) and you get back an InputSet set that has fully consistent incar, kpoints, poscar and potcar objects. If the structure needs to be transformed, this is done in get_input_set and the kpoints, incar etc are all consistent with the transformed structure.

The reason why circular dependencies arose was because the old generators allowed you to return individual incar/poscar/kpoints objects. That is not possible with this proposal. You get a full input set back that is fully consistent.

@shyuep
Copy link
Member

shyuep commented Aug 23, 2021

But now you no longer have the nice container class for writing all the inputs to file and validating that they are correct. Note that validating an INCAR is reasonable requires having access to the poscar and kpoints files too. Hence, the proposed api allows for the user to write vis.incar["NSW"] = 10; vis.validate() which is natural. The vis.user_incar_settings={"NSW": 10} example is messy and difficult to discover imo.

Again, I would point out, as @computron has pointed out, that there is a VaspInput class. By all means add a validate() method in VaspInput if you wish. I don't even see the point of a validate method. A VaspInputSet/Generator is basically a rigid set of rules to generate consistent inputs. Why would you ever generate a consistent set of inputs and then go back to validate that your inputs generated are consistent? If proper unittests are written, the inputs generated must be consistent. Feel free to demonstrate an example "validate()" method that actually increases utility.

I think this example showcases the strength of current proposal nicely. You have an input set generator with a single function get_input_set. You generate the input set with generator.get_input_set(structure) and you get back a vasp input set that has fully consistent incar, kpoints, poscar and potcar objects. If the structure needs to be transformed, this is done in get_input_set and the kpoints, incar etc are all consistent with the transformed structure.

The reason why circular dependencies arose was because the old generators allowed you to return individual incar/poscar/kpoints objects. That is not possible with this proposal. You get a full input set back that is fully consistent.

Sure, if you want to go this route. The whole idea was to allow people to just generate individual components, rather than shoving the entire VaspInput down their throats whether they want it or not. There are situations where I just need one of the inputs and not all of them. The current implementation generates the inputs as needed.

Note that this is independent from the validate question. I personally think the validate() method is overengineering for no reason.

@utf
Copy link
Member

utf commented Aug 23, 2021

Feel free to demonstrate an example "validate()" method that actually increases utility.

There are two use cases that I have in mind:

  1. It is common to customise an input set generator with additional settings, e.g., the user_incar_settings={"NSW": 10} example mentioned above. Due to the complexity of some of the generators it is nice to have a quick way to check that the customised input is valid.
  2. I've had several conversations with VASP users who use pymatgen for managing VASP inputs but do not use the VaspInputSets and who wanted a way to validate their inputs. I see this feature (and the input set container in general) as making pymatgen more useful for these users.

Sure, if you want to go this route. The whole idea was to allow people to just generate individual components, rather than shoving the entire VaspInput down their throats whether they want it or not. There are situations where I just need one of the inputs and not all of them. The current implementation generates the inputs as needed.

To me, the naming and most common usage of the vasp input sets (at least in atomate) is to generate full input "sets" rather than individual inputs. As you mentioned, generating individual inputs can be dangerous as sometimes the inputs are not consistent. Regardless, in this proposal it is easy enough to get a single input using vis = InputSetGenerator.get_input_set(structure); vis.incar.

I think the fact that there is a lot of overlap between VaspInput/VaspInputSet and this proposal is a good thing, as ultimately we don't want to do anything radical and just improve the API for consistency/easier use.

Perhaps @rkingsbury can comment but I presume the idea is that these new input sets will be developed along side the existing sets, at least at first?

@shyuep
Copy link
Member

shyuep commented Aug 23, 2021

  1. I would like to see how you implement the validate method. If it is merely checking list of species in POTCAR and POSCAR and INCAR are consistent and U values listing, then I would say it is not worth doing at all or we can implement this simple method in the VaspInput method that works generically for all sets of vasp inputs. If it is to check structure consistency etc. like what the bandstructure input sets require, I would say people should just use the VaspInputSets rather than trying to be clever and to do something in a way that it is not intended. There are limits to what kind of protection we want to give users.
  2. Atomate is not the only "customer" of pymatgen. Generating individual inputs is dangerous only if you do not include the structure in the generation process. If the input set is tied to a structure (like the current VaspInputSets), then the individual inputs are always consistent to the extent that the individual generators use the same processed structure. Your suggested "solutions" means that all four inputs are generated, even if I only want the INCAR file.

@utf
Copy link
Member

utf commented Aug 24, 2021

Going back to the original proposal, my thoughts are:

  1. What about InputSet and InputSetGenerator as names for the container and generator abstract classes. The generator method would then be called InputSetGenerator.get_input_set().
  2. Input.from_files is misleading as the function accepts a directory as input. What about Input.from_directory instead?
  3. In write_inputs, some of the argument names are quite long and some override python builtins (e.g., dir). My recommendations would be:
    • dir -> directory (and also in from_files).
    • make_dir_if_not_present -> make_dir.
    • overwrite_if_present -> overwrite.
  4. What happens if a code doesn't require a structure/atomic coordinates as input or if it doesn't make sense to override from a previous directory. I don't think there should be any required/optional arguments for the get_input_set method.

Other than that, I think this proposal is on the right track. I especially like the template code. Obviously the VASP specific code is very rough. If we are happy to move forward with this proposal then I will add some more detailed comments on that section.

@mkhorton
Copy link
Member

Thanks everyone for the discussion here so far. I agree with @shyuep in the first post that I am also in favor of a more concrete API definition, so thank you all for your input here and I hope we can agree upon a solution and get this merged.

Note on language

I want to request that we keep threads civil, discussion and disagreements are great if worded politely and constructively. I want to avoid language like "shoving [...] down their throats" as not appropriate for this venue.

Making individual files/objects accessible like .incar

From @shyuep:

Your suggested "solutions" means that all four inputs are generated, even if I only want the INCAR file.

And context from @utf:

The reason why circular dependencies arose was because the old generators allowed you to return individual incar/poscar/kpoints objects. That is not possible with this proposal. You get a full input set back that is fully consistent.

I do see the benefit in being able to access the individual classes as necessary, e.g. I might sometimes create an input set just to see which POTCARs it uses, and then access the Potcar object for additional information.

@utf, can you clarify why this restriction would be useful? I see your note that about confusion because "some subclasses choose to edit the k-points in the init method, some choose to do it in the kpoints attribute, some in the write_inputs function, some in a mixture of all them." I agree this is an issue. Perhaps we can retain .incar, .kpoints etc., but maybe there's a way to enforce that these simply return the objects from the input set and so cannot contain any logic within the property itself?

Inclusion of validate method

I would like to see how you implement the validate method [...] There are limits to what kind of protection we want to give users.

Agreed, I actually also think that the logic within the input set should ensure that inputs are sensible and internally-consistent. However, given the amount of customization possible, I can imagine scenarios where this might be useful. At worst, this is harmless. At best, it might catch some dangerous side cases.

Naming of classes and methods

From @utf:

  1. What about InputSet and InputSetGenerator as names for the container and generator abstract classes. The generator method would then be called InputSetGenerator.get_input_set().

Ok.

  1. Input.from_files is misleading as the function accepts a directory as input. What about Input.from_directory instead?

Agreed.

  1. In write_inputs, some of the argument names are quite long and some override python builtins (e.g., dir). My recommendations would be:
    dir -> directory (and also in from_files).
    make_dir_if_not_present -> make_dir.
    overwrite_if_present -> overwrite.

Agreed.

What happens if a code doesn't require a structure/atomic coordinates as input or if it doesn't make sense to override from a previous directory. I don't think there should be any required/optional arguments for the get_input_set method.

I think this makes sense too, it makes it very general but that's what a good base class should be.

@utf
Copy link
Member

utf commented Aug 26, 2021

Perhaps we can retain .incar, .kpoints etc., but maybe there's a way to enforce that these simply retrn the objects from the input set and so cannot contain any logic within the property itself?

The .incar etc attributes only work if the generator has already been initialised with the structure. In this proposal, the structure is not needed to initialise the generator, and the generator can be serialized/deserialized and used to generate many sets of input files, not just one set belonging to a certain structure. The .incar, .kpoints attributes are on the InputSet class instead. If you look at how these input sets will be used there is little difference between vis = InputSetGenerator().get_input_set(structure, prev_dir); vis.incar (proposed usage) instead of vis = VaspInputSet.from_prev(structure, prev_dir); vis.incar (current usage).

Based on this discussion there are two approaches for the (VASP) input sets. I've tried to summarise the pros and cons of each approach.

Structure on initialisation (current approach)

Usage:

vis = VaspInputSet(structure, user_incar_settings=uis)
# or, for some input sets that support overriding from previous calculations
vis = VaspInputSet.from_prev_calc(prev_dir, structure, user_incar_settings=uis)
vis.incar
vis.write_inputs()

Pros:

  • Individual inputs accessible from generator.
  • Single object.

Cons:

  • Hybrid generator/container approach.
  • Structure needed to initialise input set. This means, i) the generator can only generate inputs for a single structure; and ii) you cannot use a customised input set as a generic generator for multiple structures.
  • Messy overrides of .kpoints, etc in subclasses, with code to generate inputs split between init, from_prev and .incar, .kpoints, etc attributes.
  • No generic container for VASP inputs, i.e., cannot be used by users who just want to manage VASP inputs but not generate them using pymatgen.

Pure generator approach (this proposal)

Usage:

vis = InputSetGenerator(user_incar_settings=uis).get_input_set(structure, prev_dir)
vis.incar
vis.write_inputs()

Pros:

  • Generator can be initialised with custom options, serialised and passed around, allowing multiple input sets to be generated for different structures. Very useful for atomate/passing input sets to collaborators.
  • All code to generate input set, with/without a previous directory, lives in one place (get_input_set).
  • Clean separation between generator code (i.e., logic to generate inputs) and the representation of the input set itself (which works without the generator).

Cons:

  • Two objects required (generator and container).

Note that the difference between the approaches is not a question of computational efficiency. While in the current approach you can type .incar to access just the Incar object, calling this attribute also calls .kpoints and .potcar behind the scenes, even though the KPOINTS or POTCAR objects are not returned.

Validate method

If it is merely checking list of species in POTCAR and POSCAR and INCAR are consistent and U values listing ... we can implement this simple method in the VaspInput method that works generically for all sets of vasp inputs.

This is exactly what this proposal is suggesting. I.e., the validate function should be generic enough to work on all sets of VASP inputs. From the top of my head there are many other things we can check for:

  • KSPACING and Kpoints files conflicts.
  • Num k-points compatible with ISMEAR (i.e., > 4 k-points if using tetrahedron method).
  • NKRED compatible with KPOINTS file
  • MAGMOM settings reasonable (also w.r.t. SOC)
  • POTCAR functional matches GGA tag in INCAR.
  • etc

I actually also think that the logic within the input set should ensure that inputs are sensible and internally-consistent.

This could be achieved by calling .validate() before returning the input set.

@mkhorton
Copy link
Member

Thanks @utf, so with the generator approach, the .incar attribute remains accessible (once the input set has been generated), in this case I don't see the problem? The generator approach certainly seems cleaner overall.

@utf
Copy link
Member

utf commented Aug 26, 2021

so with the generator approach, the .incar attribute remains accessible (once the input set has been generated)

Yes, exactly.

@rkingsbury
Copy link
Contributor Author

Thanks all for the robust discussion and especially @utf for the thorough explanations.

I've pushed some changes implementing @utf and @mkhorton 's suggestions on naming conventions and updated TemplateInputSet to more closely follow the generator / container paradigm we're articulating here.

On the vision

From @utf

I think the fact that there is a lot of overlap between VaspInput/VaspInputSet and this proposal is a good thing, as ultimately we don't want to do anything radical and just improve the API for consistency/easier use.

Perhaps @rkingsbury can comment but I presume the idea is that these new input sets will be developed along side the existing sets, at least at first?

Exactly. I don't think we necessarily need to adopt this interface in the VASP input classes right away (or maybe not ever). Those classes are the oldest, most established, and most complex in pymatgen. While there might be advantages to moving toward this framework as @utf has said, it doesn't need to happen right away. I don't have anything to add that hasn't already been said re: advantages or disadvantages over the existing VASP infrastructure.

I think the biggest value to this interface at this time is to ensure greater consistency among input classes as pymatgen continues to grow. We are seeing continued interest in supporting new codes in pymatgen (xtb, CREST, orca, HOOMD just to name a few I am personally aware of) so I think defining this sooner rather than later will lower the barrier to additional contributions while also streamlining downstream workflows (including bespoke python scripts as well as mainstream atomate workflows).

Moreover, it seems to me that molecular codes are considerably more "interchangeable" than solid state codes, which provides strong motivation for a consistent interface. I feel that if someone uses pymatgen to set up a basic calculation in QChem, it should be relatively trivial to port it to Gaussian or NWChem using pymatgen tools (i.e., the InputSet arguments should be similar). Right now each of these has a different interface even though the underlying input file structure and arguments are not that different. This PR lays a foundation for making these more consistent.

On the validate method

I agree that the internals of the InputSetGenerator should ensure that the input is valid. However, there are some limits and edge cases I have run into where this breaks down. I'll give 2 examples I have in mind

  1. QChem: Suppose I have a list of small molecules I want to calculate in DFT. I loop through the list and create a OptSet for each structure. If one of my molecules happens to be a single atom, I will still get a valid InputSet, but the calculation will fail because QChem doesn't like "optimizations" to run on single atoms (they are really single points). Instead of waiting until my calculation fails, this could be caught, either on creation or on the backend, by a validate method.

  2. It would be very easy for a future developer to subclassTemplateInputSetGenerator to support a new code (in fact, the existing write_lammps_inputs method is basically equivalent to this). I could imagine someone wanting not to change the variable substitution logic that's provided in TemplateInputSet, but maybe they want to add some rules to the validate method as a way to catch non-sensical variable substitutions.

I personally think having a separate validate method to contain these checks and then calling it on creation of a new InputSet would provide a clear separation of concerns, but if the consensus is that this is overly complex, we can certainly just leave it to InputSetGenerator developers to integrate that logic into the respective classes.

on get_input_set

From @utf and @mkhorton :

What happens if a code doesn't require a structure/atomic coordinates as input or if it doesn't make sense to override from a previous directory. I don't think there should be any required/optional arguments for the get_input_set method.

I think this makes sense too, it makes it very general but that's what a good base class should be.

OK, I removed the required arguments. Actually this was needed even to make my TemplateInputSetGenerator example work within the paradigm we have here. Outside of that esoteric example though, how likely is it that we will have a code that doesn't require some kind of coordinates? Would a reasonable alternative be to make the coordinates arg Optional and let developers set it to None in the relatively rare cases when a structure isn't required?

Because I think get_input_set will be used extensively in scripts and atomate workflows, I just feel a little attached to the idea that enforcing that the first arg has to be coordinates, and specifying the types that are allowed, could have a lot of value in terms of code consistency. Maybe that can be saved for later refinement though.

@shyuep
Copy link
Member

shyuep commented Aug 31, 2021

Let me suggest a path forward based on the comments.

  1. Build up VaspInput in pymatgen.vasp.inputs to support validation etc. as @utf has suggested.
  2. Develop Mix-in classes specifying what is expected of an input/output class. This can be in a pymatgen/io/core module (we cannot use __init__.py since pymatgen/io is now a namespace package). I think this is better than using a strict hierarchy. This can be based on @rkingsbury initial implementation in this PR.
  3. Since everyone has such strong feelings about not including structure, we will develop a parallel equivalent of VaspInputGenerators that does what VaspInputSets do today, but in a no-structure format. This can be in a pymatgne/io/vasp/generators.py module. Once we are satisifed that the generators are a complete replacement for VaspInputSets and have demonstratable benefits, we will deprecate VaspInputSets.

A final note that I always favor a simple implementation over a complex one. Most other codes are far too serpentine and complex and difficult to use. Do not overengineer something because a 0.01% of people want to do this specialized thing, even if that 0.01% happens to be someone in the Materials Project. If the 0.01% wnat to do something special, they can write their own specialized method. Pymatgen caters to the 80% use case with 20% effort.

@utf
Copy link
Member

utf commented Sep 6, 2021

Let me suggest a path forward based on the comments.

This sounds very reasonable to me.

@rkingsbury If you're happy going this route and can update this PR to address point 2, I can submit a PR for points 1 and 3.

@rkingsbury
Copy link
Contributor Author

Let me suggest a path forward based on the comments.

Thank you very much for the proposal @shyuep; apologies for my delayed response.

@rkingsbury If you're happy going this route and can update this PR to address point 2, I can submit a PR for points 1 and 3.

Thank you @utf ! I am definitely supportive of some type of stateless generator approach, although after reflecting on this discussion for a while I do have some concerns that we haven't yet converged to the right solution. I'll elaborate below.

Reviewing our motivation

I really appreciate all the thoughtful comments so far. As this discussion has evolved, it's clear that our two objectives are to

  1. Define a more consistent interface for pymatgen Input classes
  2. Make it possible to serialize and pass around instructions for generating input sets without having to include structures and specific customizations.

Before we proceed with anything, I want to make sure that we're honoring those objectives. Although we're getting there, I don't feel the paradigm currently under discussion accomplishes either of them particularly well, but I think it can with a few modifications. Let me elaborate below (apologies for the long post!)

Concern that the Generator / Container paradigm does not accomplish our objectives

I'm supportive of a stateless input generator class. However, as I've reflected on our discussion, tried to explain the Generator / Container paradigm to others in our group, and developed the TemplateInputSet classes, I am concerned that the Generator | Container duality is going to be difficult for the average user or volunteer developer to understand. This relates to @shyuep comment about keeping the code as simple as possible. I do see the value in the two class approach for our operations within MP, and I'm concerned that having two classes is going to create more ambiguity about where to put input construction code rather than less.

With the current paradigm under discussion, we could easily wind up with Input creation code in two places if we do not do more to enforce the separation of concerns we intend here. For example, based only on what is required by respective abstract classes, I could do something like this:

class VaspInputSetGenerator(InputSetGenerator):

     def get_input_set(structure):
            return VaspInputSet(structure)

class VaspInputSet(InputSet):

    def __init__(structure):
          self.structure = structure
          # code for creating POTCAR, INCAR, etc. here

But this would defeat the purpose of having the two classes, since all the input construction code still resides in VaspInputSet. Instead, according to our discussion, the proper way to go about this would be:

class VaspInputSetGenerator(InputSetGenerator):

     def get_input_set(structure):
           # code that creates POSCAR, POTCAR, INCAR, KPOINTS here
            return VaspInputSet(poscar, potcar, incar, kpoints)

class VaspInputSet(InputSet):

    def __init__(poscar, potcar, incar, kpoints):
          self.poscar = poscar
          self.incar = incar
          etc.

I'm not sure that this PR or the proposed path forward do enough to enforce Approach No. 2, and what I think could happen is that we wind up with Input creation code split arbitrarily across the Generator and Container classes in different ways for different codes, and then we would have introduced additional complexity with limited benefit. In fact, if you look at my TemplateInputSetGenerator / TemplateInputSet implementation as it stands right now, I have some code in the InputSet __init__ method that really should be in the Generator class because I've had a hard time wrapping my head around the "right" way to divide things.

Approach No. 2 also makes InputSet less natural to use by themselves, because more user-friendly input arguments have to be passed to the InputSetGenerator which constructs less user-friendly pymatgen objects to pass to InputSet. So we effectively force all users to understand the Generator / Container duality.

If I put on the "hat" of a user that's experienced running calculations in some code but relatively new to pymatgen and to high throughput, that user is going to think about calculation inputs first and foremost as sets of files. If that user wants to use pymatgen to generate input files, our paradigm will require navigating two different classes. To take VASP as an example:

VaspInputSetGenerator(user_incar_settings) -> VaspInputSet(structure) -> write_inputs()

instead of the current one-step process:

VaspInputSet -> write_inputs()

This creates a potential source of confusion about where to pass different kwargs - do they go to the generator or the container or the write_inputs method? Thinking beyond VASP, in some codes like LAMMPS there are kwargs that are more natural to customize when specific input files are created (like timestep, run time, etc.) rather than on class creation, whereas others like force field settings should definitely be specified at generation. With the paradigm under discussion these might be spread out among 3 places, e.g.:

LammpsInputGenerator(force_field_settings) -> LammpsInputSet(structure) -> write_inputs(timestep, run_time, etc.)

So again there is much potential for confusion about what arguments go where; although possibly one that could be mitigated with excellent documentation. Still, I think minimizing this kind of confusion was a major objective of @utf original suggestion for the Generator / Container paradigm, so I think we to refine it a bit to keep things as simple as possible.

Proposed modified paradigm that accomplishes both objectives

Goal: Accomplish easy serialization / stateless generation without requiring two classes AND facilitate a consistent interface

I propose that we make the Container class optional and enforce an optional get_input_data method in the InputSetGenerator that will yield a container, if defined. Although Container classes are critical for high-throughput work, parsing to databases, etc., they also create a lot of additional development overhead and are not necessary for users that don't use atomate and only want to automate job setup using pymatgen.

The key change here is that write_inputs would move into the generator class so that an intermediate Container class is not necessary (but still allowed). validate and from_directory would only make sense in the context of a Container class and would stay where they are.

For VASP, this would look like:

class VaspInputSetGenerator(InputSetGenerator):
     
     def __init__(user_incar_settings, etc.):
          ....
     
     # optional, not needed by every class
     def get_input_data(structure) -> VaspInputSet:
           # code that creates POSCAR, POTCAR, INCAR, KPOINTS here
            return VaspInputSet(poscar, potcar, incar, kpoints)
    
    # optional, would require definition of a container class
     def from_directory(dir) -> VaspInputSet:
            ...

     # required
     def write_inputs(structure):
           # code that writes files to disk
           # can make use of get_input_data if it's defined, but does not have to
           vis = self.get_input_data()
           for d in [vis.incar, vis.poscar, vis.kpoints, etc.]:
                vis.write_file()

and for LAMMPS, with no container class:

class LammpsInputGenerator(InputSetGenerator):

    def __init__(force_field_settings, etc.):
          ...

     # required
     def write_inputs(structure, timestep, run_time, etc.):
           # code that creates necessary inputs from structure and kwargs and also writes files to disk

If we adopt this approach, we need to think carefully about class names because the generator class would be the primary user-facing class while the container class would primarily exist to facilitate internal work, databasing, atomate workflow, etc. In that case, I'd propose that we use the name InputSet for the generator class (currently InputSetGenerator) and come up with something new like InputContainer for container classes (currently InputSet).

We could also leave InputContainer (currently InputSet) out of the abstract interface entirely and leave it to specific code developers to implement those if they wish.

This approach retains more flexibility by not being overly prescriptive about the structure of the Container classes, and also allows us to enforce a minimal interface to InputSetGenerator (or whatever we call it). By locating the input generation logic in write_inputs(), we make it possible to serialize the class independently from structure or customizations without requiring two user-facing classes for every code.

Regarding implementation:

Develop Mix-in classes specifying what is expected of an input/output class. This can be in a pymatgen/io/core module (we cannot use init.py since pymatgen/io is now a namespace package). I think this is better than using a strict hierarchy. This can be based on @rkingsbury initial implementation in this PR.

@shyuep I'm not sure I understand Mix-in classes very well. Based on Stringify, it looks like these can define additional class methods but not define interfaces (i.e., can't use @abc.abstractmethod inside a Mix-in class). Is that right, or am I misunderstanding? Here, for example InputSet.write_inputs is necessary for every Input class, but there is no way we could define that in a mix-in class because its contents will vary from code to code. Hence, it would need to be an abstract method.

I believe this proposed modification will accomplish both of our stated objectives. @utf @shyuep please let me know what you think, and I'll move forward.

@utf
Copy link
Member

utf commented Sep 14, 2021

Hi @rkingsbury, thanks for the detailed response. I have some concerns with your proposal and still feel that having the container/generator split will provide the most clarity down the line.

To start with, most users are used to the idea of their inputs as files. Having a pymatgen class that closely maps the input files to objects requires minimal cognitive load.

I think could happen is that we wind up with Input creation code split arbitrarily across the Generator and Container classes in different ways for different codes

The separation between input set container and generator shouldn't be so difficult to decide in practice. E.g., the VaspInputSet shouldn't take a structure as input, only Incar, Kpoints, Potcar and Poscar objects. To put it another way, you should be able to generate the container class from a directory containing the input files, there is no ambiguity in this case over what the input set contains and so the init method shouldn't contain any logic for customising the settings.

In fact, if you look at my TemplateInputSetGenerator / TemplateInputSet implementation as it stands right now, I have some code in the InputSet init method that really should be in the Generator class because I've had a hard time wrapping my head around the "right" way to divide things.

I think part of the confusion is that the TemplateInputSet you wrote doesn't need a generator. The generator is only to create an input set with a certain set of defaults. In this case, I would just have the TemplateInputSet class and no generator.

Thinking beyond VASP, in some codes like LAMMPS there are kwargs that are more natural to customize when specific input files are created (like timestep, run time, etc.) rather than on class creation

I am a little concerned that some settings are only controlled when writing the files:

  • I imagine it will get messy deciding how to split up the settings between the generator init and the write_inputs function. And later on down the line when someone else wants easier control of a certain setting, they will add it to the write_inputs function leading to kwarg bloat.
  • What happens if a user specifies one of the settings in the class init as well, how do you decide which one to use (note, respecting user_incar_settings is currently broken in several of the VASP input sets. I think this is a tricky problem to handle and having custom write_input functions will make this harder).
  • I can also imagine this leading to a divergence between subclasses of the generators. I.e., some generators accept some kwargs and other generators accept others.
  • This would break the recipe for a generic write input set fire task that you suggested in your original message.

This is one of the issues avoided by having a single container object with a single write_inputs function. There is only one way to write the inputs and it is the same whichever input generator you're using.

@rkingsbury
Copy link
Contributor Author

Thanks @utf ; your points are well taken. Based on what you said about the TemplateInputSet, what would you think about turning my proposal on its head and saying that InputSet (the container) is always required, whereas the generator class is optional?

I think part of the confusion is that the TemplateInputSet you wrote doesn't need a generator. The generator is only to create an input set with a certain set of defaults. In this case, I would just have the TemplateInputSet class and no generator.

If a generator class is defined, all the customization code should go in there. If it is not defined, it will just go in the __init__ for the Container. The container class can then have the previously-discussed from_directory and write_inputs methods with no customziation.

I'll update my examples to illustrate:

class VaspInputSetGenerator(InputSetGenerator):
     
     def __init__(user_incar_settings, etc.):
          # all code to construct intermediate objects (e.g. incar) goes here
          # all customization is done with this method
          ....
    
    def get_input_set(structure) -> VaspInputSet:
        # code that generates incar, poscar, kpoints, etc.
        # based on user_incar_settings
        return VaspInputSet(incar, potcar, kpoints, structure)


class VaspInputSet(InputSet):
    def __init__(incar, potcar, kpoints, structure):
          # since we have a generator, the init is based on pre-built objects
          self.incar = incar
          self.potcar = potcar
          etc.

     def write_inputs(directory, make_dir, overwrite):
          ...

     def from_directory(directory) -> VaspInputSet:
            ...

    ...

And for LAMMPS, if we omit the Generator class:

class LammpsInputSet(InputSet):

    def __init__(structure, timestep, run_time, force_field_settings, etc.):
          # In this case all the code that customizes and generates intermediate objects (if needed) is here
          # could be moved to a Generator class in the future if we adopt one
          ...

     def write_inputs(directory, make_dir, overwrite):
          ...

     def from_directory(directory) -> LammpsInputSet:
            ...

So to summarize, if we adopted what I'm proposing in this post, what would change relative to the current state of affairs is:

  1. We standardize the interface for write_inputs and from_directory for all InputSet classes
  2. We have the option to make InputSetGenerator classes, in which case a lot of __init__ code for customization would move from the InputSet to the InputSetGenerator. This is probably an option that will be most appropriate for the older, more complex, and more established Input classes and less appropriate for the newer / simpler ones.

I personally like this approach a lot; I think it achieves both of our stated objectives and should also be intelligible to new users. What do you think?

@utf
Copy link
Member

utf commented Sep 17, 2021

I think that makes sense. I think you only start needing generators when you have multiple default input settings. I.e., let's say you had different input settings for a tight LAMMPS run vs coarse LAMMPS run, these should go into separate generators.

@coveralls
Copy link

coveralls commented Sep 17, 2021

Coverage Status

Coverage decreased (-0.6%) to 83.511% when pulling eefa879 on rkingsbury:abstract_input into 3376f27 on materialsproject:master.

@rkingsbury
Copy link
Contributor Author

Good morning team. I wanted to revive this discussion in hopes of moving this PR towards a resolution.

I've been working with several grad students who are starting to adopt this interface in IO work involving packmol, LAMMPS, and OpenMM, and have gotten positive feedback on the idea of separating the settings or "recipe" from the system-specific information like coordinates. So I think the interface is solid.

It seems the following issues still require some attention before we can merge. @shyuep @mkhorton please let me know how you'd like to proceed on these, and let me know if there are other matters to be discussed.

VASP implementation

@utf has kindly agreed to provide the VASP implmentation of this, which I think should be kept in a separate PR. However, there are some useful bits of code in this PR (code stubs, documentation of the new Generator approach, etc.) that relate to VASP. Would it be best to 1) move those bits to a new branch that @utf can pull from, or 2) wait for @utf to open a PR built on this one, and merge the two concurrently? If 2), I would like to get this PR to the point that the maintainers are ready to merge as soon as possible (even if the merge doesn't happen for a while), so that I know the interface is stable.

Documentation

Related to the above, I would like to make sure we add new top-level documentation for pymatgen IO. I have that in this PR, but it references VASP input generators. We could either 1) revise to use Packmol as the example (since that generator is part of this PR) or 2) hold the docs updates until the VASP implementation is done.

Naming

We need to make a final decision on whether we're sticking with InputSet or going for a different name like InputDeck, InputDict, InputCollection, or InputGroup. Personally, I think when most users hear "set" they will see that as equivalent to "collection" or "group" and are not going to think specifically about the python Set class. For that reason and for historical reasons I think we can keep it InputSet, but I don't have a strong opinion.

@rkingsbury
Copy link
Contributor Author

@utf I have removed all changes to the VASP infrastructure from this PR (reset to the master branch), as I see you're working on the VASP implementation elsewhere. Do you have any additional comments on the interface defined here?

@utf
Copy link
Member

utf commented Feb 4, 2022

Looks good to me @rkingsbury

@rkingsbury
Copy link
Contributor Author

@mkhorton @shyuep, now that @utf has signed off on the interface and has working VASP implementations in progress elsewhere, I believe this is ready for a final review / decision.

I will need a bit of assistance making the packmol CI tests works properly on Mac and Windows (I believe it's an issue with how to install packmol in those environments); other than that I believe everything is in good shape.

@shyuep
Copy link
Member

shyuep commented Mar 7, 2022

Pls remove packmol as a requirement on Mac and Windows. And you should structure the tests such that they are skipped if packmol doesn't exist.

@rkingsbury
Copy link
Contributor Author

Pls remove packmol as a requirement on Mac and Windows. And you should structure the tests such that they are skipped if packmol doesn't exist.

Done! All tests passing. I'll note that there is a Windows executable of packmol that I think would be valuable to test given the different way Windows handles pathnames, but it isn't conda installable so setting it up in the CI might be difficult. If that's something we want to try to tackle I'd suggest we do it in a separate PR.

@shyuep shyuep merged commit 3a83f48 into materialsproject:master Mar 14, 2022
@rkingsbury rkingsbury deleted the abstract_input branch August 10, 2022 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants