Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update GECCO parsing code to avoid reading files more than once #11

Merged
merged 1 commit into from
Jul 11, 2022
Merged

Update GECCO parsing code to avoid reading files more than once #11

merged 1 commit into from
Jul 11, 2022

Conversation

althonos
Copy link
Contributor

Hi Rauf!

This is just a small PR to update the parser for GECCO so that you don't have to manually extract the biosynthetic type. When you load a SeqRecord with Biopython, the GenBank structured comment will be available in record.annotations['structured_comment'] so you can get the GECCO-data directly from there.

@raufs
Copy link
Collaborator

raufs commented Jul 11, 2022

Changes look great, thank you for the easier / more-clean parsing suggestion. I am just going to add a try catch around the parsing because in a downstream application, I generate BGCs using lsaBGC-Expansion.py and parse it using the same function, might have a separate parser for those in the future.

@raufs raufs merged commit d95e713 into Kalan-Lab:main Jul 11, 2022
@raufs
Copy link
Collaborator

raufs commented Jul 11, 2022

Also, thank you for the advice to switch over from e-value reported in GECCO BGC Genbanks to using the p-value from the enrichment analysis involved in the training for determining "protcore-ish" domains/genes. I think the p-value "note" feature currently in the BGC genbanks is something different and relates to the e-value. Do you know how I could easily gather this specific enrichment p-value, is there a specific file/table in GECCO that provides this for the most current model or do you think it is possible to update the Genbanks in the future to feature this specific value?

@althonos
Copy link
Contributor Author

I think the p-value "note" feature currently in the BGC genbanks is something different and relates to the e-value.

Yes, the p-value in the note is the HMMER p-value, which is obtained directly after annotating each domain (E = p * Z, where E is the independent E-value, p is the p-value, and Z is the total number of HMM comparisons done).

Do you know how I could easily gather this specific enrichment p-value, is there a specific file/table in GECCO that provides this for the most current model or do you think it is possible to update the Genbanks in the future to feature this specific value?

If you have GECCO installed, you can get the significance list with the following code:

from gecco.crf import ClusterCRF
crf = ClusterCRF.trained()
crf.significance  # dictionary of Pfam accession to Fisher p-value

This could vary based on the version of the training data, but at the moment we do not plan on retraining before a long time. So you could use this code and generate a table which will be valid for GECCO v0.9.2 onwards.

Do you think it is possible to update the Genbanks in the future to feature this specific value?

I could actually add that, which would make it feasible to support GenBank files from different versions.

@althonos
Copy link
Contributor Author

Actually, I'm thinking that perhaps it would make more sense to use the CRF weights rather than the Fisher significance table, because with the CRF weights you know if a feature is positively or negatively associated with BGC regions.

To get it from the current model:

from gecco.crf import ClusterCRF
crf = ClusterCRF.trained()
weights = { acc:weight for (acc,state),weight in crf.model.state_features_.items() if state == '1' }

If you look at the 5 domain features with the highest weight, you get:

PF19151 11.385894 Sublancin
PF10439 11.43806 Bacteriocin class II with double-glycine leader peptide
PF19476 11.681125 N/A
PF19155 12.620316 Family of unknown function (DUF5837)
PF17951 12.654051 Fatty acid synthase meander beta sheet domain

The ones with the highest positive weights are the closest thing we have to "core domains" in GECCO 😃

@raufs
Copy link
Collaborator

raufs commented Jul 13, 2022

Awesome, Thank you again Martin! Really appreciate the quick catch here and solution, have updated the code to now use CRF weights instead of e-values!

raufs added a commit that referenced this pull request Apr 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants