'DataFrame' object has no attribute 'oligomeric_detail' when training on custom PDBs #25

Open
ntoxeg opened this issue Aug 19, 2024 · 4 comments

Comments

@ntoxeg

ntoxeg commented Aug 19, 2024

After processing some PDB files from PINDER with process_pdb_files.py, I get the following error when running the training:

python -W ignore experiments/train_se3_flows.py data.dataset=pdb
Error executing job with overrides: ['data.dataset=pdb']
Traceback (most recent call last):
  File "/home/greg/protein-frame-flow/experiments/train_se3_flows.py", line 111, in main
    exp = Experiment(cfg=cfg)
  File "/home/greg/protein-frame-flow/experiments/train_se3_flows.py", line 28, in __init__
    self._setup_dataset()
  File "/home/greg/protein-frame-flow/experiments/train_se3_flows.py", line 43, in _setup_dataset
    self._train_dataset, self._valid_dataset = eu.dataset_creation(
  File "/home/greg/protein-frame-flow/experiments/utils.py", line 179, in dataset_creation
    train_dataset = dataset_class(
  File "/home/greg/protein-frame-flow/data/datasets.py", line 332, in __init__
    metadata_csv = self._filter_metadata(self.raw_csv)
  File "/home/greg/protein-frame-flow/data/datasets.py", line 356, in _filter_metadata
    raw_csv.oligomeric_detail.isin(filter_cfg.oligomeric)]
  File "/home/greg/miniconda3/envs/fm/lib/python3.10/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'oligomeric_detail'

While processing, the script gives me warnings like "UserWarning: Unlikely unit cell vectors detected in PDB file likely resulting from a dummy CRYST1 record. Discarding unit cell vectors." I've only found an old issue about this in the library's repository, and it seems it was solved years ago.

A side question: what do I need to do to make a .clusters file? The script doesn't seem to produce one either.

@jasonkyuyim
Collaborator

jasonkyuyim commented Aug 19, 2024

Yes, you will need to manually add the oligomeric_detail column and populate it with the corresponding oligomeric state. This column is provided in the mmCIF files, but I realize now that I didn't code up the PDB parser to read the oligomeric detail. As a quick fix, you can populate the column with a default value or remove the filter altogether. Since PINDER is populated with multimers, you'll have to implement the chain and residue indices properly.
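
A minimal sketch of the "populate the column with a default value" quick fix, assuming the metadata CSV written by process_pdb_files.py loads cleanly with pandas. The helper name, the file name `metadata.csv`, and the default label `"monomeric"` are placeholders rather than anything confirmed by the repository; the label should match whatever `filter_cfg.oligomeric` lists in your config.

```python
import pandas as pd


def add_default_oligomeric_detail(metadata: pd.DataFrame, default: str) -> pd.DataFrame:
    """Return a copy of the metadata with an `oligomeric_detail` column.

    Existing values are left alone; a missing column or missing values
    are filled with `default`.
    """
    out = metadata.copy()
    if "oligomeric_detail" not in out.columns:
        out["oligomeric_detail"] = default
    else:
        out["oligomeric_detail"] = out["oligomeric_detail"].fillna(default)
    return out


# Hypothetical usage; the path and the default label are assumptions:
# metadata = pd.read_csv("metadata.csv")
# add_default_oligomeric_detail(metadata, "monomeric").to_csv("metadata.csv", index=False)
```

Note that a blanket default mislabels multimers, so this only makes the filter pass; it does not recover the real oligomeric state.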

Regarding the warnings, if it don't break then don't bother :). I ignore the warnings unless it seems bad.

The PINDER files are taken from the PDB, if I'm not mistaken, so the cluster file should have an entry for every PDB ID. The clusters I used are from 2021, so you can download a more recent cluster file from RCSB PDB: https://www.rcsb.org/docs/programmatic-access/file-download-services#sequence-clusters-data. Note that if a cluster isn't found, a new cluster will be assigned.
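
To illustrate the lookup described above, here is a rough sketch. It assumes the cluster file has one cluster per line, with whitespace-separated members of the form `<PDB ID>_<entity>`. That layout and both function names are assumptions for illustration; the repository's own loader may differ.

```python
from typing import Dict, Iterable


def read_clusters(lines: Iterable[str]) -> Dict[str, int]:
    """Map each PDB ID to the index of the first cluster line it appears on."""
    pdb_to_cluster: Dict[str, int] = {}
    for cluster_idx, line in enumerate(lines):
        for member in line.split():
            # Assumed member format: "<PDB ID>_<entity>", e.g. "1ABC_1".
            pdb_id = member.split("_")[0].upper()
            pdb_to_cluster.setdefault(pdb_id, cluster_idx)
    return pdb_to_cluster


def cluster_for(pdb_id: str, pdb_to_cluster: Dict[str, int]) -> int:
    """Look up a cluster, assigning a fresh index when the ID is unknown."""
    key = pdb_id.upper()
    if key not in pdb_to_cluster:
        pdb_to_cluster[key] = max(pdb_to_cluster.values(), default=-1) + 1
    return pdb_to_cluster[key]


# Hypothetical usage; the file name is an assumption:
# with open("pdb.clusters") as handle:
#     mapping = read_clusters(handle)
# cluster_for("1abc", mapping)
```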

@ntoxeg
Author

ntoxeg commented Aug 20, 2024

Sadly, there are more issues. Now I get a crash on pivot table generation:

Traceback (most recent call last):
  File "/home/greg/protein-frame-flow/experiments/train_se3_flows.py", line 111, in main
    exp = Experiment(cfg=cfg)
  File "/home/greg/protein-frame-flow/experiments/train_se3_flows.py", line 28, in __init__
    self._setup_dataset()
  File "/home/greg/protein-frame-flow/experiments/train_se3_flows.py", line 43, in _setup_dataset
    self._train_dataset, self._valid_dataset = eu.dataset_creation(
  File "/home/greg/protein-frame-flow/experiments/utils.py", line 179, in dataset_creation
    train_dataset = dataset_class(
  File "/home/greg/protein-frame-flow/data/datasets.py", line 333, in __init__
    metadata_csv = self._filter_metadata(self.raw_csv)
  File "/home/greg/protein-frame-flow/data/datasets.py", line 363, in _filter_metadata
    data_csv = _rog_filter(data_csv, filter_cfg.rog_quantile)
  File "/home/greg/protein-frame-flow/data/datasets.py", line 26, in _rog_filter
    y_quant = y_quant.radius_gyration.to_numpy()
  File "/home/greg/miniconda3/envs/fm/lib/python3.10/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'radius_gyration'

Inspection shows that the table generated by

    y_quant = pd.pivot_table(
        df,
        values='radius_gyration',
        index='modeled_seq_len',
        aggfunc=lambda x: np.quantile(x, quantile)
    )

is empty.
The column radius_gyration is present in the metadata file and so is modeled_seq_len.
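
One possible explanation for an empty pivot table when both columns exist is that `df` has no rows left by the time it reaches `_rog_filter`, for example because an earlier filter removed everything. The sketch below wraps the same `pd.pivot_table` call in a guard that fails early with a clearer message. The function name is hypothetical and this is not the repository's code.

```python
import numpy as np
import pandas as pd


def rog_quantile_table(df: pd.DataFrame, quantile: float) -> pd.DataFrame:
    """Per-length quantile of radius_gyration, failing early on empty input."""
    if df.empty:
        raise ValueError(
            "No rows left before the radius-of-gyration filter; "
            "an earlier filter probably removed everything."
        )
    return pd.pivot_table(
        df,
        values="radius_gyration",
        index="modeled_seq_len",
        aggfunc=lambda x: np.quantile(x, quantile),
    )
```

Printing `len(df)` after each filtering step in `_filter_metadata` is another quick way to see which filter drops the rows.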

@ntoxeg
Author

ntoxeg commented Aug 22, 2024

Ok, it seems that I had to disable filtering based on oligomeric detail, otherwise it filtered down to nothing.
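
For anyone hitting the same thing, a hedged sketch of making the oligomeric filter optional rather than deleting it. It skips the filter when no allowed values are configured or the column is absent, and reports the values actually present when nothing matches. `filter_oligomeric` is a hypothetical helper, not part of the repository.

```python
from typing import Optional, Sequence

import pandas as pd


def filter_oligomeric(df: pd.DataFrame, allowed: Optional[Sequence[str]]) -> pd.DataFrame:
    """Keep rows whose oligomeric_detail is in `allowed`; skip when not applicable."""
    if not allowed or "oligomeric_detail" not in df.columns:
        return df
    kept = df[df["oligomeric_detail"].isin(allowed)]
    if kept.empty and not df.empty:
        # Surface the mismatch instead of silently passing on an empty table.
        found = sorted(df["oligomeric_detail"].dropna().unique())
        raise ValueError(
            f"Oligomeric filter {list(allowed)} matched no rows; values present: {found}"
        )
    return kept
```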

@jasonkyuyim
Collaborator

Yeah it sounds like you can remove the oligomeric detail filtering for your purposes.
