-
Notifications
You must be signed in to change notification settings - Fork 877
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Abstract interface for Input classes #2211
Conversation
I am in favor of a more concrete API definition, but I am not sure about this implementation.
|
Thanks for your thoughts @shyuep! Let me ask some clarifying questions.
Excellent point. Since we are trying to separate the data containers from the input settings I was trying to find a pair of names that were clear and evocative for this distinction. I worry that
Yes, to be honest the number one thing I'm hoping to achieve with this is to standardize the method name and call signature of The other big benefit beyond the current implementation of
I see your point. Would the renaming convention in Item 1 address? Do you have other ideas about how we would clarify this distinction? We could go further and define the base class as |
I think part of the confusion with VaspInputSet is that it is used as both a generator class and as a container for VASP inputs. This means initialising the class requires all the data needed to generate the inputs (structure, etc) but the actual inputs are generated dynamically (i.e., when accessing the The pain points are then:
This proposal is to have:
As @computron wrote in an email:
I think this model can be used beyond the VASP module. The typical pattern used in pymatgen for VASP, QCHEM, CP2K, FEFF, etc is: 1) Make an abstract input set to represent input files, 2) create subclasses of this input set for different calculation types. But in each case, the input sets suffer from the generator/container intermixing and related issues. |
There is a separate VaspInput class in pymatgen.vasp.inputs.
The idea of an InputSet is to have a relatively rigid way of defining rules for how an input is generated. There is absolutely no reason why you cannot do what you just wrote.
This has been debated ad nauseum. Previously the input sets did not require structure and generated on the fly. But that led to circular dependencies. E.g., if you are generating inputs for a Bandstructure calculation, you sometimes transform to standard symmetry directions. Basically, that means that if vis.get_incar, vis.get_kpoints, etc. will all call the same structure transformation - 4x for your standard set of inputs. Further, sometimes the incar setting or the kpoint setting can depend on each other. This is why an input set is tied specifically to a structure today. If you have a better idea of how to solve this problem, by all means propose it.
Again, this is exactly what the situation it is today. Your so-called fix input set class is basically what the VaspInput class is. Your generator is basically what the VaspInputSet is. We can of course do a rename and call it VaspInputGenerator instead of VaspInputSet, but that is semantics, not implementation.
@computron should be well-aware of the reasons why structure was included in the VaspInputSet in the first place. Solve the issue of the bandstructure input set generation and let me know when you have a good solution. You will find that both types of solutions have their flaws and pain points for different types of calculations. |
But now you no longer have the nice container class for writing all the inputs to file and validating that they are correct. Note that validating an INCAR is reasonable requires having access to the poscar and kpoints files too. Hence, the proposed api allows for the user to write
I think this example showcases the strength of current proposal nicely. You have an The reason why circular dependencies arose was because the old generators allowed you to return individual incar/poscar/kpoints objects. That is not possible with this proposal. You get a full input set back that is fully consistent. |
Again, I would point out, as @computron has pointed out, that there is a VaspInput class. By all means add a validate() method in VaspInput if you wish. I don't even see the point of a validate method. A VaspInputSet/Generator is basically a rigid set of rules to generate consistent inputs. Why would you ever generate a consistent set of inputs and then go back to validate that your inputs generated are consistent? If proper unittests are written, the inputs generated must be consistent. Feel free to demonstrate an example "validate()" method that actually increases utility.
Sure, if you want to go this route. The whole idea was to allow people to just generate individual components, rather than shoving the entire VaspInput down their throats whether they want it or not. There are situations where I just need one of the inputs and not all of them. The current implementation generates the inputs as needed. Note that this is independent from the validate question. I personally think the validate() method is overengineering for no reason. |
There are two use cases that I have in mind:
To me, the naming and most common usage of the vasp input sets (at least in atomate) is to generate full input "sets" rather than individual inputs. As you mentioned, generating individual inputs can be dangerous as sometimes the inputs are not consistent. Regardless, in this proposal it is easy enough to get a single input using I think the fact that there is a lot of overlap between VaspInput/VaspInputSet and this proposal is a good thing, as ultimately we don't want to do anything radical and just improve the API for consistency/easier use. Perhaps @rkingsbury can comment but I presume the idea is that these new input sets will be developed along side the existing sets, at least at first? |
|
Going back to the original proposal, my thoughts are:
Other than that, I think this proposal is on the right track. I especially like the template code. Obviously the VASP specific code is very rough. If we are happy to move forward with this proposal then I will add some more detailed comments on that section. |
Thanks everyone for the discussion here so far. I agree with @shyuep in the first post that I am also in favor of a more concrete API definition, so thank you all for your input here and I hope we can agree upon a solution and get this merged. Note on languageI want to request that we keep threads civil, discussion and disagreements are great if worded politely and constructively. I want to avoid language like "shoving [...] down their throats" as not appropriate for this venue. Making individual files/objects accessible like
|
The Based on this discussion there are two approaches for the (VASP) input sets. I've tried to summarise the pros and cons of each approach. Structure on initialisation (current approach)Usage: vis = VaspInputSet(structure, user_incar_settings=uis)
# or, for some input sets that support overriding from previous calculations
vis = VaspInputSet.from_prev_calc(prev_dir, structure, user_incar_settings=uis)
vis.incar
vis.write_inputs() Pros:
Cons:
Pure generator approach (this proposal)Usage: vis = InputSetGenerator(user_incar_settings=uis).get_input_set(structure, prev_dir)
vis.incar
vis.write_inputs() Pros:
Cons:
Note that the difference between the approaches is not a question of computational efficiency. While in the current approach you can type Validate method
This is exactly what this proposal is suggesting. I.e., the validate function should be generic enough to work on all sets of VASP inputs. From the top of my head there are many other things we can check for:
This could be achieved by calling |
Thanks @utf, so with the generator approach, the |
Yes, exactly. |
Thanks all for the robust discussion and especially @utf for the thorough explanations. I've pushed some changes implementing @utf and @mkhorton 's suggestions on naming conventions and updated On the visionFrom @utf
Exactly. I don't think we necessarily need to adopt this interface in the VASP input classes right away (or maybe not ever). Those classes are the oldest, most established, and most complex in pymatgen. While there might be advantages to moving toward this framework as @utf has said, it doesn't need to happen right away. I don't have anything to add that hasn't already been said re: advantages or disadvantages over the existing VASP infrastructure. I think the biggest value to this interface at this time is to ensure greater consistency among input classes as pymatgen continues to grow. We are seeing continued interest in supporting new codes in pymatgen (xtb, CREST, orca, HOOMD just to name a few I am personally aware of) so I think defining this sooner rather than later will lower the barrier to additional contributions while also streamlining downstream workflows (including bespoke python scripts as well as mainstream atomate workflows). Moreover, it seems to me that molecular codes are considerably more "interchangeable" than solid state codes, which provides strong motivation for a consistent interface. I feel that if someone uses pymatgen to set up a basic calculation in QChem, it should be relatively trivial to port it to Gaussian or NWChem using pymatgen tools (i.e., the InputSet arguments should be similar). Right now each of these has a different interface even though the underlying input file structure and arguments are not that different. This PR lays a foundation for making these more consistent. On the validate methodI agree that the internals of the
I personally think having a separate on
|
Let me suggest a path forward based on the comments.
A final note that I always favor a simple implementation over a complex one. Most other codes are far too serpentine and complex and difficult to use. Do not overengineer something because a 0.01% of people want to do this specialized thing, even if that 0.01% happens to be someone in the Materials Project. If the 0.01% wnat to do something special, they can write their own specialized method. Pymatgen caters to the 80% use case with 20% effort. |
This sounds very reasonable to me. @rkingsbury If you're happy going this route and can update this PR to address point 2, I can submit a PR for points 1 and 3. |
Thank you very much for the proposal @shyuep; apologies for my delayed response.
Thank you @utf ! I am definitely supportive of some type of stateless generator approach, although after reflecting on this discussion for a while I do have some concerns that we haven't yet converged to the right solution. I'll elaborate below. Reviewing our motivationI really appreciate all the thoughtful comments so far. As this discussion has evolved, it's clear that our two objectives are to
Before we proceed with anything, I want to make sure that we're honoring those objectives. Although we're getting there, I don't feel the paradigm currently under discussion accomplishes either of them particularly well, but I think it can with a few modifications. Let me elaborate below (apologies for the long post!) Concern that the Generator / Container paradigm does not accomplish our objectivesI'm supportive of a stateless input generator class. However, as I've reflected on our discussion, tried to explain the Generator / Container paradigm to others in our group, and developed the With the current paradigm under discussion, we could easily wind up with Input creation code in two places if we do not do more to enforce the separation of concerns we intend here. For example, based only on what is required by respective abstract classes, I could do something like this:
But this would defeat the purpose of having the two classes, since all the input construction code still resides in
I'm not sure that this PR or the proposed path forward do enough to enforce Approach No. 2, and what I think could happen is that we wind up with Input creation code split arbitrarily across the Generator and Container classes in different ways for different codes, and then we would have introduced additional complexity with limited benefit. In fact, if you look at my Approach No. 2 also makes If I put on the "hat" of a user that's experienced running calculations in some code but relatively new to pymatgen and to high throughput, that user is going to think about calculation inputs first and foremost as sets of files. If that user wants to use pymatgen to generate input files, our paradigm will require navigating two different classes. To take VASP as an example:
instead of the current one-step process:
This creates a potential source of confusion about where to pass different kwargs - do they go to the generator or the container or the write_inputs method? Thinking beyond VASP, in some codes like LAMMPS there are kwargs that are more natural to customize when specific input files are created (like timestep, run time, etc.) rather than on class creation, whereas others like force field settings should definitely be specified at generation. With the paradigm under discussion these might be spread out among 3 places, e.g.:
So again there is much potential for confusion about what arguments go where; although possibly one that could be mitigated with excellent documentation. Still, I think minimizing this kind of confusion was a major objective of @utf original suggestion for the Generator / Container paradigm, so I think we to refine it a bit to keep things as simple as possible. Proposed modified paradigm that accomplishes both objectivesGoal: Accomplish easy serialization / stateless generation without requiring two classes AND facilitate a consistent interface I propose that we make the Container class optional and enforce an optional The key change here is that For VASP, this would look like:
and for LAMMPS, with no container class:
If we adopt this approach, we need to think carefully about class names because the generator class would be the primary user-facing class while the container class would primarily exist to facilitate internal work, databasing, atomate workflow, etc. In that case, I'd propose that we use the name We could also leave This approach retains more flexibility by not being overly prescriptive about the structure of the Container classes, and also allows us to enforce a minimal interface to Regarding implementation:
@shyuep I'm not sure I understand Mix-in classes very well. Based on I believe this proposed modification will accomplish both of our stated objectives. @utf @shyuep please let me know what you think, and I'll move forward. |
Hi @rkingsbury, thanks for the detailed response. I have some concerns with your proposal and still feel that having the container/generator split will provide the most clarity down the line. To start with, most users are used to the idea of their inputs as files. Having a pymatgen class that closely maps the input files to objects requires minimal cognitive load.
The separation between input set container and generator shouldn't be so difficult to decide in practice. E.g., the VaspInputSet shouldn't take a structure as input, only Incar, Kpoints, Potcar and Poscar objects. To put it another way, you should be able to generate the container class from a directory containing the input files, there is no ambiguity in this case over what the input set contains and so the init method shouldn't contain any logic for customising the settings.
I think part of the confusion is that the TemplateInputSet you wrote doesn't need a generator. The generator is only to create an input set with a certain set of defaults. In this case, I would just have the TemplateInputSet class and no generator.
I am a little concerned that some settings are only controlled when writing the files:
This is one of the issues avoided by having a single container object with a single write_inputs function. There is only one way to write the inputs and it is the same whichever input generator you're using. |
Thanks @utf ; your points are well taken. Based on what you said about the
If a generator class is defined, all the customization code should go in there. If it is not defined, it will just go in the I'll update my examples to illustrate:
And for LAMMPS, if we omit the Generator class:
So to summarize, if we adopted what I'm proposing in this post, what would change relative to the current state of affairs is:
I personally like this approach a lot; I think it achieves both of our stated objectives and should also be intelligible to new users. What do you think? |
I think that makes sense. I think you only start needing generators when you have multiple default input settings. I.e., let's say you had different input settings for a tight LAMMPS run vs coarse LAMMPS run, these should go into separate generators. |
5d0f6e9
to
e799e25
Compare
Good morning team. I wanted to revive this discussion in hopes of moving this PR towards a resolution. I've been working with several grad students who are starting to adopt this interface in IO work involving packmol, LAMMPS, and OpenMM, and have gotten positive feedback on the idea of separating the settings or "recipe" from the system-specific information like coordinates. So I think the interface is solid. It seems the following issues still require some attention before we can merge. @shyuep @mkhorton please let me know how you'd like to proceed on these, and let me know if there are other matters to be discussed. VASP implementation@utf has kindly agreed to provide the VASP implmentation of this, which I think should be kept in a separate PR. However, there are some useful bits of code in this PR (code stubs, documentation of the new Generator approach, etc.) that relate to VASP. Would it be best to 1) move those bits to a new branch that @utf can pull from, or 2) wait for @utf to open a PR built on this one, and merge the two concurrently? If 2), I would like to get this PR to the point that the maintainers are ready to merge as soon as possible (even if the merge doesn't happen for a while), so that I know the interface is stable. DocumentationRelated to the above, I would like to make sure we add new top-level documentation for pymatgen IO. I have that in this PR, but it references VASP input generators. We could either 1) revise to use Packmol as the example (since that generator is part of this PR) or 2) hold the docs updates until the VASP implementation is done. NamingWe need to make a final decision on whether we're sticking with |
…into abstract_input
… into abstract_input
@utf I have removed all changes to the VASP infrastructure from this PR (reset to the master branch), as I see you're working on the VASP implementation elsewhere. Do you have any additional comments on the interface defined here? |
Looks good to me @rkingsbury |
@mkhorton @shyuep, now that @utf has signed off on the interface and has working VASP implementations in progress elsewhere, I believe this is ready for a final review / decision. I will need a bit of assistance making the |
for more information, see https://pre-commit.ci
…into abstract_input
for more information, see https://pre-commit.ci
Pls remove packmol as a requirement on Mac and Windows. And you should structure the tests such that they are skipped if packmol doesn't exist. |
Done! All tests passing. I'll note that there is a Windows executable of packmol that I think would be valuable to test given the different way Windows handles pathnames, but it isn't conda installable so setting it up in the CI might be difficult. If that's something we want to try to tackle I'd suggest we do it in a separate PR. |
After much discussion and feedback, here is a draft abstract interface for pymatgen I/O classes.
Thanks @utf @shyamd @arepstein @mkhorton and @computron for your thoughts so far, and feel free to add more here!
Motivation
By defining a minimal interface for all
Input
classes in pymatgen, we canFireTask
orFireWork
call signatures more consistentFireTasks
andFireWorks
more portable among different codes, lowering barriers to new workflow developmentObjective
The vision here is to define a minimal common interface for all
Input
classes in pymatgen, whether for solid DFT, molecular DFT, classical MD, AIMD, etc.There are two classes.
Input
is a data container holding all input data for a specific calculation. It defineswrite_inputs
(required) andfrom_files
(optional) methods , and anis_valid
(optional) property to perform basic input validation.InputSettings
is a generator class that holds settings or instructions for how to create anInput
, given a structure. It's only requirement is aget_input()
method.Each code will subclass
Input
one time, e.g.VaspInput(Input)
, and subclassInputSettings
multiple times for each type of calculation, e.g.VaspRelaxation(InputSettings)
,VaspStatic(InputSettings)
, etc.For barebones support of any code, it is possible to use
Input
directly without writing an associatedInputSettings
class (see theTemplateInput
example).Implementation
This PR is a vehicle for iterating on the interface. Refactoring existing I/O classes to use the interface is not in scope for this PR. However, I have included a pseudo-functional example VASP implementation to illustrate how this interface could be used in practice. I may add other pseudo-examples, but will leave functional implementations for later.
I also provide a functional concrete implementation of
Input
(TemplateInput
) that gives a low-barrier way for new users to transition from bash scripts to pymatgen I/O classes.Implications for
atomate
Once we have a consistent call signature for
Input
, it should be possible to write a genericFireTask
for writing inputs that will work with many different codes, e.g.This would dramatically lower the barrier to supporting new codes in
atomate
. As long as a code has the associatedInput
class defined, theFireTask
could be used without modification. Of course customized and more complexFireTasks
may still be needed and can be built just as they are today.