automatic official FASTA file fetching, several new utility functions related to structure including flexible 3d alignment that supports different length chains to be aligned! #101

Merged
merged 147 commits into from
May 26, 2024
Commits
7563e3e
adding deduplication and cluster generation generic tool
Mar 8, 2023
bef1c25
renamed few arguments
Mar 8, 2023
7140781
...
Mar 9, 2023
68647da
PR comments
Mar 9, 2023
3011c2c
printing key generating output files in cluster
Mar 10, 2023
8647a06
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Mar 10, 2023
b3bcd11
static code checker fixes
Mar 10, 2023
1485738
reduced dependencies and did some cleanup
Mar 10, 2023
91a98fd
added visualizations utils for antibodies
Mar 10, 2023
f77178b
static code check fixes
Mar 10, 2023
a7bccd3
black mypy flake8 fixes
Mar 10, 2023
8db8791
dealing with large fasta files
Mar 12, 2023
2305539
returning a consistent amount of elements in tuple
Mar 12, 2023
8705282
when clustering with mmseqs2, now also outputting a FASTA file with t…
Mar 15, 2023
19bd0fe
moved all mmseqs DB to a workspace to avoid clutter
Mar 15, 2023
bed6a90
renamed
Mar 15, 2023
9a0b4a4
solved conflicts
Mar 15, 2023
bfe4ba5
better conflict merge
Mar 15, 2023
5da1694
added splitting based on cluster.tsv
Mar 15, 2023
9aeeeb2
better docstring
Mar 15, 2023
a3f6033
static checkers fixes
Mar 15, 2023
41546df
PR coments
Mar 15, 2023
56fafdd
balanced sampling and mmap lines reader
Mar 16, 2023
8579b30
...
Mar 16, 2023
50c3dc8
solved a flipped file creations in cluster_using_mmseqs and refactori…
Mar 19, 2023
63f1cb4
added proper caching of return answer from cluster
Mar 22, 2023
c31a5f1
solved conflict
Mar 22, 2023
075b4c8
splits and clusters
Mar 27, 2023
1fa7507
pdb clustering related and also adding requirements
Mar 28, 2023
cb01191
pdb prepare_data related
Mar 29, 2023
1ab88e3
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Mar 29, 2023
e02c357
pdb prepare_data ddfb related
Mar 29, 2023
b10d033
static code checkers fixes
Mar 29, 2023
834e5bb
fixed cluster balancing code
Mar 31, 2023
4fc16d0
...
Apr 2, 2023
c1ddcd5
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Apr 2, 2023
3601591
...
Apr 2, 2023
25f42cc
removing a dep
Apr 2, 2023
250b6b5
solving CUDA mismatch issue in unit tests
Apr 6, 2023
72049a4
...
Apr 6, 2023
ff315fc
solving CI/CD stuff
Apr 14, 2023
fd7a0a2
solving auto tests issues
Apr 14, 2023
4293c89
...
Apr 14, 2023
2d66c04
restored pytoda
Apr 14, 2023
ac9b1c7
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Apr 14, 2023
771a7af
added a method to provide the max observed token ID in a fast tokenizer
Apr 15, 2023
1f95800
...
Apr 15, 2023
77ab96a
added another method to fast tokenizer, allowing to extract sentinal …
Apr 16, 2023
b530614
static code checkers
Apr 16, 2023
f660fbb
...
Apr 16, 2023
d69d6ce
unit tests
Apr 16, 2023
0d7cd36
...
Apr 24, 2023
a7cfb3c
solved conflicts
Apr 24, 2023
8b7b60d
...
Apr 24, 2023
d6588de
fixed a typo and added abnumber requirement
Apr 30, 2023
fd2468b
solved conflict
Apr 30, 2023
4ebc2b8
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
May 3, 2023
ec730fd
added pseudo beta loading
May 3, 2023
872f871
...
May 24, 2023
756f92b
...
May 24, 2023
b6a3f7c
adding ablang embedding extraction
May 27, 2023
5d96384
...
May 30, 2023
887c813
...
May 30, 2023
b5ad9f1
cleanup
May 30, 2023
c09b063
...
May 31, 2023
1292931
adding metrics for pairwise protein sequences comparison
Jun 2, 2023
68db74f
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 2, 2023
3921c50
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 6, 2023
500f371
added utility to get antibody regions
Jun 6, 2023
8be708c
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 9, 2023
f2c2abe
pairwise sequence metrics and added sapiens humanness score
Jun 13, 2023
f0aa9f2
static code
Jun 13, 2023
1edb7be
PR comments
Jun 14, 2023
5f08672
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 14, 2023
e49702e
drastic reduction of needed memory in indexed_fasta_custom hdf5 offse…
Jun 20, 2023
7ff1148
Merge branch 'yoels' of https://github.com/BiomedSciAI/fuse-drug into…
Jun 20, 2023
0d1fb4f
avoiding parsing mmcif twice, added more flexible pdb writing allowin…
Jul 16, 2023
5eaf047
solved conflicts
Jul 16, 2023
0f40b5d
...
Jul 16, 2023
e4a9ef3
added the function that stores trajectory into a single PDB file
Jul 16, 2023
aa5cded
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jul 18, 2023
ff9da56
adding a class for protein complex which will also handle negative pairs
Jul 23, 2023
b58641a
added flatten to protein_complex
Jul 23, 2023
78d0855
multimer spatial crop
Jul 23, 2023
b95be12
deprecated get_chain_native_features()
Jul 24, 2023
1d1adb1
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jul 24, 2023
8f47dff
...
Jul 30, 2023
f6813bf
multimer
Jul 31, 2023
0210eae
Merge branch 'yoels' of https://github.com/BiomedSciAI/fuse-drug into…
Jul 31, 2023
abc7477
multimer
Jul 31, 2023
1e83440
moving to tiny_openfold
Jul 31, 2023
af5847b
extracting residues types info, and information about presence of rna…
Aug 1, 2023
3e0010c
...
Aug 2, 2023
dfc12be
adding support in spatial crop for K>2 chains
Aug 2, 2023
624d176
basic spatial crop works
Aug 2, 2023
018cc53
spatial crop was tested on few cases and verified to work
Aug 2, 2023
480982b
...
Aug 2, 2023
7c97c6b
no spatial crop if already less than needed size
Aug 2, 2023
5f1f616
fallbacking to regular crop when there is no intersection
Aug 2, 2023
19352c1
added residue_index and chain_index
Aug 4, 2023
a6ab950
residue_index and chain_index were added twice
Aug 4, 2023
906372e
...
Aug 4, 2023
f44743b
fixed requirement and disallowed positional args in load_protein_stru…
Aug 28, 2023
c236bbc
fixing requirements
Aug 28, 2023
6bb199e
PR suggestions
Aug 28, 2023
1b16848
fixing requirements
Aug 29, 2023
f4cb0b6
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Aug 29, 2023
c45514a
removed readlines()
Aug 31, 2023
15df396
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Aug 31, 2023
e0f17a7
proper cores auto deduction
Sep 6, 2023
89d4f8e
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Sep 6, 2023
b10492a
extended modulartokenizer to support local (per section) max length
Sep 13, 2023
63131e3
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Sep 13, 2023
d3cfd36
PR fix
Sep 13, 2023
9db980e
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Sep 19, 2023
33199b6
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Sep 21, 2023
3f34a82
...
Oct 6, 2023
0d15e3d
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Oct 6, 2023
d8b73cb
...
Oct 6, 2023
98996a4
...
Jan 10, 2024
c4d9c65
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jan 10, 2024
769b249
...
Feb 7, 2024
35fa988
advancing on flexible multi chain alignment
Feb 7, 2024
d808fca
flexible alignment
Feb 8, 2024
fcceb44
flexible structure alignment is working well!
Feb 8, 2024
2b8fdca
...
Feb 14, 2024
25600ab
...
Feb 14, 2024
fe6105e
Merge branch 'yoels' of https://github.com/BiomedSciAI/fuse-drug into…
Feb 14, 2024
7ddd58e
flexible align on multiple with table as input
Feb 18, 2024
e4b76bc
advanced on flexible align
Feb 18, 2024
00b17f5
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Feb 18, 2024
e9f6140
Merge branch 'yoels' of https://github.com/BiomedSciAI/fuse-drug into…
Feb 18, 2024
ae667b6
flexible multiple align and also extracting and optionally renaming c…
Feb 18, 2024
1307305
...
Feb 21, 2024
8a280b3
supporting saving pdb when given atom37 pos as input
Mar 3, 2024
70d722d
Merge branch 'yoels' of https://github.com/BiomedSciAI/fuse-drug into…
Mar 3, 2024
fb0cfe1
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Mar 5, 2024
6adcb61
Merge branch 'yoels' of https://github.com/BiomedSciAI/fuse-drug into…
Mar 10, 2024
72ca1f0
complex protein
Mar 10, 2024
7c22f43
Merge branch 'yoels' of https://github.com/BiomedSciAI/fuse-drug into…
Mar 10, 2024
a635acc
...
Mar 17, 2024
77f27ad
Merge branch 'yoels' of https://github.com/BiomedSciAI/fuse-drug into…
Mar 17, 2024
87e3ff5
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Mar 31, 2024
34c865a
static code tests fixes
Mar 31, 2024
72b2328
fixed handling of bfactors - they are per atom, not per residue!
Apr 4, 2024
a0a6c02
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
May 16, 2024
43a9a34
PR comments
May 26, 2024
8 changes: 6 additions & 2 deletions fusedrug/data/protein/antibody/antibody.py
@@ -1,4 +1,4 @@
-from typing import List, Dict
+from typing import List, Dict, Optional
 from fusedrug.data.protein.structure.sabdab import load_sabdab_dataframe
 import pandas as pd
 from collections import namedtuple
@@ -33,12 +33,16 @@ def get_antibody_regions(sequence: str, scheme: str = "chothia") -> Dict[str, st
     return ans


-def get_antibodies_info_from_sabdab(antibodies_pdb_ids: List[str]) -> List[Antibody]:
+def get_antibodies_info_from_sabdab(
+    antibodies_pdb_ids: Optional[List[str]] = None,
+) -> List[Antibody]:
     """
     Collects information on all provided antibodies_pdb_ids based on SabDab DB.

     """
     sabdab_df = load_sabdab_dataframe()
+    if antibodies_pdb_ids is None:
+        antibodies_pdb_ids = sabdab_df.pdb.unique().tolist()
     antibodies = []
     for pdb_id in antibodies_pdb_ids:
         found = sabdab_df[sabdab_df.pdb == pdb_id]
38 changes: 38 additions & 0 deletions fusedrug/data/protein/sequence/official_pdb_fasta.py
@@ -0,0 +1,38 @@
from io import StringIO
from Bio import SeqIO
from urllib.request import urlopen
from typing import Dict


def get_fasta_from_rcsb(pdb_id: str) -> Dict:  # TODO: consider adding caching
    """
    Given some pdb_id, (like "7vux"), we will retrieve its fasta file from rcsb database and return it as a dict {chain: sequence}.
    """
    fasta_data = (
        urlopen(f"https://www.rcsb.org/fasta/entry/{pdb_id.upper()}")
        .read()
        .decode("utf-8")
    )
    fasta_file_handle = StringIO(fasta_data)
    chains_full_seq = SeqIO.to_dict(
        SeqIO.parse(fasta_file_handle, "fasta"),
        key_function=lambda rec: _description_to_author_chain_id(rec.description),
    )
    chains_full_seq = {k: str(d.seq) for (k, d) in chains_full_seq.items()}
    return chains_full_seq


def _description_to_author_chain_id(description: str) -> str:
    loc = description.find(" ")
    assert loc >= 0
    description = description[loc + 1 :]
    loc = description.find(",")
    if loc >= 0:
        description = description[:loc]

    token = "auth "
    loc = description.find(token)
    if loc >= 0:
        return description[loc + len(token)]

    return description[0]
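
To make the header parsing in `_description_to_author_chain_id` easier to review, here is a minimal standalone sketch of the same logic applied to two header strings. The header strings are illustrative assumptions about the RCSB FASTA layout (`<entry>_<entity>|Chains <ids>|<name>|<organism>`), not output captured from the service, and `author_chain_id` is a hypothetical name used only in this sketch.

```python
def author_chain_id(description: str) -> str:
    """Standalone copy of the parsing logic, for illustration only."""
    # Drop everything up to and including the first space ("7VUX_1|Chains ").
    loc = description.find(" ")
    if loc < 0:
        raise ValueError(f"unexpected FASTA description: {description!r}")
    rest = description[loc + 1 :]

    # When an entity lists several chains ("A, B|..."), keep only the first one.
    comma = rest.find(",")
    if comma >= 0:
        rest = rest[:comma]

    # Prefer the author-assigned id when present ("A[auth H]" -> "H").
    token = "auth "
    auth = rest.find(token)
    if auth >= 0:
        return rest[auth + len(token)]

    return rest[0]


# Hypothetical header strings, assumed to follow the RCSB layout.
examples = [
    "7VUX_1|Chains A[auth H]|Heavy chain|Homo sapiens (9606)",
    "1ABC_1|Chains A, B|Some protein|Homo sapiens (9606)",
]
for header in examples:
    print(header, "->", author_chain_id(header))
```

Two properties of the logic are visible here: only a single character is returned, so a multi-character author chain id would be truncated, and an entity shared by several chains ends up keyed by its first chain only.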
104 changes: 104 additions & 0 deletions fusedrug/data/protein/structure/align_multiple_antibodies.py
@@ -0,0 +1,104 @@
from os.path import join, dirname
from fusedrug.data.protein.structure.flexible_align_chains_structure import (
flexible_align_chains_structure,
)
from jsonargparse import CLI
import pandas as pd
from typing import Optional
import numpy as np


def main(
    input_excel_filename: str,
    unique_id_column: str,
    reference_heavy_chain_pdb_filename_column: str,
    reference_heavy_chain_id_column: str,
    heavy_chain_pdb_filename_column: str,
    heavy_chain_id_column: str,
    light_chain_pdb_filename_column: str,
    light_chain_id_column: str,
    aligned_using_only_heavy_chain: bool = True,
    output_structure_file_prefix: str = "aligned_antibody_",
    output_excel_filename: Optional[str] = None,
    output_excel_aligned_heavy_chain_pdb_filename_column: str = "aligned_heavy_chain_pdb_filename",
    output_excel_aligned_heavy_chain_id_column: str = "aligned_heavy_chain_id",
    output_excel_aligned_light_chain_pdb_filename_column: str = "aligned_light_chain_pdb_filename",
    output_excel_aligned_light_chain_id_column: str = "aligned_light_chain_id",
) -> pd.DataFrame:

    assert (
        aligned_using_only_heavy_chain
    ), "only supporting aligned_using_only_heavy_chain=True for now. Note that flexible_align_chains_structure is indeed flexible enough to support this, if needed."

    df = pd.read_excel(input_excel_filename, index_col=unique_id_column)

    # base = '/dccstor/dsa-ab-cli-val-0/2024_feb_delivery/top_100_with_indels/antibody_dimers_af2_predicted_structure'
    # reference_heavy_chain = '/dccstor/dsa-ab-cli-val-0/targets/PD-1/7VUX/relaxed_complex/PD1_7VUX_H_eq.pdb'

    df[output_excel_aligned_heavy_chain_pdb_filename_column] = np.nan
    df[output_excel_aligned_heavy_chain_id_column] = np.nan
    df[output_excel_aligned_light_chain_pdb_filename_column] = np.nan
    df[output_excel_aligned_light_chain_id_column] = np.nan

    for index, row in df.iterrows():
        reference_heavy_chain_pdb_filename = row[
            reference_heavy_chain_pdb_filename_column
        ]
        reference_heavy_chain_id = row[reference_heavy_chain_id_column]
        # reference_light_chain_id = row[reference_light_chain_id_column]

        # heavy chain
        heavy_chain_pdb_filename = row[heavy_chain_pdb_filename_column]
        heavy_chain_id = row[heavy_chain_id_column]  # 'A'
        # light chain
        light_chain_pdb_filename = row[light_chain_pdb_filename_column]
        light_chain_id = row[light_chain_id_column]  # 'B'

        output_aligned_fn = join(
            dirname(heavy_chain_pdb_filename), output_structure_file_prefix
        )

        if not isinstance(reference_heavy_chain_pdb_filename, str):
            print(
                f"ERROR: expected reference_heavy_chain_pdb_filename to be string, but got {reference_heavy_chain_pdb_filename} of type {type(reference_heavy_chain_pdb_filename)}"
            )
            continue

        if len(reference_heavy_chain_pdb_filename) < 2:
            print(
                f'ERROR: expected reference_heavy_chain_pdb_filename to be string, but got a suspicious empty or extremely short one: "{reference_heavy_chain_pdb_filename}"'
            )
            continue

        flexible_align_chains_structure(
            dynamic_ordered_chains=[(heavy_chain_pdb_filename, heavy_chain_id)],
            apply_rigid_transformation_to_dynamic_chain_ids=[
                (heavy_chain_pdb_filename, heavy_chain_id),
                (light_chain_pdb_filename, light_chain_id),
            ],
            static_ordered_chains=[
                (reference_heavy_chain_pdb_filename, reference_heavy_chain_id)
            ],
            output_pdb_filename_extentionless=output_aligned_fn,
        )

        # heavy chain
        df.loc[index, output_excel_aligned_heavy_chain_pdb_filename_column] = (
            output_aligned_fn + f"_chain_{heavy_chain_id}.pdb"
        )
        df.loc[index, output_excel_aligned_heavy_chain_id_column] = heavy_chain_id
        # light chain
        df.loc[index, output_excel_aligned_light_chain_pdb_filename_column] = (
            output_aligned_fn + f"_chain_{light_chain_id}.pdb"
        )
        df.loc[index, output_excel_aligned_light_chain_id_column] = light_chain_id

    if output_excel_filename is not None:
        df.to_excel(output_excel_filename)
        print("saved ", output_excel_filename)

    return df


if __name__ == "__main__":
    CLI(main)
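
The loop in this script derives `output_aligned_fn` from the directory of the heavy chain PDB plus `output_structure_file_prefix`, and records `<output_aligned_fn>_chain_<id>.pdb` in the table. One consequence worth checking is that two rows whose heavy chain files share a directory resolve to identical output paths. The sketch below reproduces only that path arithmetic. `aligned_output_paths` and its `unique_id` parameter are hypothetical and not part of this PR, and the sketch assumes that `flexible_align_chains_structure` really writes files with the `_chain_<id>.pdb` suffix.

```python
from os.path import dirname, join
from typing import Dict, Optional


def aligned_output_paths(
    heavy_chain_pdb_filename: str,
    heavy_chain_id: str,
    light_chain_id: str,
    output_structure_file_prefix: str = "aligned_antibody_",
    unique_id: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: build the per-chain output paths for one row."""
    prefix = output_structure_file_prefix
    if unique_id is not None:
        # Assumption: appending the row id keeps rows in one directory apart.
        prefix += str(unique_id)
    base = join(dirname(heavy_chain_pdb_filename), prefix)
    return {
        "heavy": f"{base}_chain_{heavy_chain_id}.pdb",
        "light": f"{base}_chain_{light_chain_id}.pdb",
    }


same_dir = aligned_output_paths("preds/ab1.pdb", "A", "B")
also_same_dir = aligned_output_paths("preds/ab2.pdb", "A", "B")
print(same_dir == also_same_dir)  # both rows resolve to the same files

print(aligned_output_paths("preds/ab1.pdb", "A", "B", unique_id="ab1"))
```

If the input table keeps each antibody in its own directory this is harmless; otherwise, folding the value of `unique_id_column` into the prefix, as the optional argument above does, would be one way to keep outputs distinct.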
71 changes: 71 additions & 0 deletions fusedrug/data/protein/structure/extract_chains_to_pdbs.py
@@ -0,0 +1,71 @@
from jsonargparse import CLI
from fusedrug.data.protein.structure.structure_io import (
load_pdb_chain_features,
save_structure_file,
)
from typing import Optional


def main(
Collaborator commented:

what is this script for? please explain in the comments section.

Collaborator Author replied:

added this desc in the docstring:

"Takes an input PDB file and splits it into separate files, one per described chain, allowing the chains to be renamed if desired"

    *,
    input_pdb_path: str,
    orig_name_chains_to_extract: str,
    output_pdb_path_extensionless: str,
    output_chain_ids_to_extract: Optional[str] = None,
) -> None:
    """
    Takes an input PDB file and splits it into separate files, one per described chain, allowing the chains to be renamed if desired.

    Args:
        input_pdb_path: path to the input PDB file
        orig_name_chains_to_extract: '_' separated chain ids, as named in the input file
        output_pdb_path_extensionless: output path, without a file extension
        output_chain_ids_to_extract: '_' separated chain ids
            if not provided, will keep original chain ids

    """

    orig_name_chains_to_extract = orig_name_chains_to_extract.split("_")
    if output_chain_ids_to_extract is None:
        # orig_name_chains_to_extract is already a list here, so copy it
        # instead of calling .split() a second time
        output_chain_ids_to_extract = list(orig_name_chains_to_extract)
    else:
        output_chain_ids_to_extract = output_chain_ids_to_extract.split("_")

    assert len(orig_name_chains_to_extract) > 0
    assert len(orig_name_chains_to_extract) == len(output_chain_ids_to_extract)
    assert len(orig_name_chains_to_extract[0]) == 1

    loaded_chains = {}
    for orig_chain_id in orig_name_chains_to_extract:
        loaded_chains[orig_chain_id] = load_pdb_chain_features(
            input_pdb_path, orig_chain_id
        )

    mapping = dict(zip(orig_name_chains_to_extract, output_chain_ids_to_extract))

    loaded_chains_mapped = {
        mapping[chain_id]: data for (chain_id, data) in loaded_chains.items()
    }

    save_structure_file(
        output_filename_extensionless=output_pdb_path_extensionless,
        pdb_id="unknown",
        chain_to_atom14={
            chain_id: data["atom14_gt_positions"]
            for (chain_id, data) in loaded_chains_mapped.items()
        },
        chain_to_aa_str_seq={
            chain_id: data["aasequence_str"]
            for (chain_id, data) in loaded_chains_mapped.items()
        },
        chain_to_aa_index_seq={
            chain_id: data["aatype"]
            for (chain_id, data) in loaded_chains_mapped.items()
        },
        save_cif=False,
        mask=None,  # TODO: check
    )


if __name__ == "__main__":
    CLI(main)
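
The chain renaming in this script comes down to splitting two '_' separated strings and zipping them into a mapping. The sketch below isolates that step so it can be exercised without any PDB file. `build_chain_mapping` is a hypothetical helper, not part of this PR, and it is slightly stricter than the script: it checks every original chain id rather than only the first, which is an assumption about the intended behavior.

```python
from typing import Dict, Optional


def build_chain_mapping(
    orig_name_chains_to_extract: str,
    output_chain_ids_to_extract: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: map original chain ids to output chain ids."""
    orig_ids = orig_name_chains_to_extract.split("_")
    if output_chain_ids_to_extract is None:
        out_ids = list(orig_ids)  # keep the original names
    else:
        out_ids = output_chain_ids_to_extract.split("_")

    if len(orig_ids) != len(out_ids):
        raise ValueError("input and output chain id lists differ in length")
    # "".split("_") yields [""], so check every id instead of the list length.
    if any(len(chain_id) != 1 for chain_id in orig_ids):
        raise ValueError("each original chain id must be a single character")
    return dict(zip(orig_ids, out_ids))


print(build_chain_mapping("A_B", "H_L"))  # rename A to H and B to L
print(build_chain_mapping("A_B"))  # no renaming requested
```

Raising `ValueError` instead of using `assert` is a design choice for the sketch only: the checks then survive running Python with `-O`, which strips assertions.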