Update GECCO parsing code to avoid reading files more than once #11

althonos · 2022-07-11T15:31:46Z

Hi Rauf!

This is just a small PR to update the GECCO parser so that you don't have to extract the biosynthetic type manually. When you load a SeqRecord with Biopython, the GenBank structured comment is available in record.annotations['structured_comment'], so you can get the GECCO data directly from there.
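As a rough illustration of the access pattern described above, here is a minimal sketch. It assumes the structured comment is stored under a "GECCO-Data" key and that one of its fields is called "biosyn_class"; both names are assumptions and may differ between GECCO versions. A SimpleNamespace stands in for a record that would normally come from Bio.SeqIO.

```python
from types import SimpleNamespace


def gecco_structured_comment(record, key="GECCO-Data"):
    """Return the GECCO structured comment of a SeqRecord-like object as a dict.

    `key` is an assumed name for the structured comment block; an empty
    dict is returned when the record carries no such comment.
    """
    comments = record.annotations.get("structured_comment", {})
    return dict(comments.get(key, {}))


# Stand-in for a record loaded with Bio.SeqIO.read(path, "genbank").
# The field names and values below are hypothetical.
record = SimpleNamespace(
    annotations={
        "structured_comment": {
            "GECCO-Data": {"version": "GECCO v0.9.2", "biosyn_class": "Polyketide"}
        }
    }
)

data = gecco_structured_comment(record)
for field, value in data.items():
    print(field, value, sep="\t")
```

Because the whole record is parsed once, every field of the comment is available without reopening the file.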

raufs · 2022-07-11T19:57:10Z

The changes look great, thank you for the simpler, cleaner parsing suggestion. I am just going to add a try/except around the parsing because, in a downstream application, I generate BGCs using lsaBGC-Expansion.py and parse them with the same function. I might add a separate parser for those in the future.
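The guard mentioned above could look roughly like the sketch below. The function name, the "GECCO-Data" and "biosyn_class" keys, and the "NA" fallback are all assumptions for illustration, not the actual lsaBGC code.

```python
from types import SimpleNamespace


def biosynthetic_type(record, default="NA"):
    """Return the GECCO biosynthetic type, or `default` if it is absent.

    Records produced by other tools may lack the GECCO structured
    comment, so the lookup is wrapped in a try/except.
    """
    try:
        # Key names are assumed; adjust them to the GenBank files at hand.
        return record.annotations["structured_comment"]["GECCO-Data"]["biosyn_class"]
    except (KeyError, TypeError):
        return default


# Hypothetical stand-ins for a GECCO record and a record from another tool.
gecco_record = SimpleNamespace(
    annotations={"structured_comment": {"GECCO-Data": {"biosyn_class": "NRP"}}}
)
other_record = SimpleNamespace(annotations={})

print(biosynthetic_type(gecco_record))
print(biosynthetic_type(other_record))
```

Catching only KeyError and TypeError keeps unrelated failures, such as a malformed file, visible instead of silently mapping them to the fallback.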

raufs · 2022-07-11T20:06:02Z

Also, thank you for the advice to switch from the e-value reported in GECCO BGC GenBank files to the p-value from the enrichment analysis used in training for determining "protcore-ish" domains/genes. I think the p-value "note" feature currently in the BGC genbanks is something different and relates to the e-value. Do you know how I could easily gather this specific enrichment p-value, is there a specific file/table in GECCO that provides this for the most current model or do you think it is possible to update the Genbanks in the future to feature this specific value?

althonos · 2022-07-11T22:44:25Z

> I think the p-value "note" feature currently in the BGC genbanks is something different and relates to the e-value.

Yes, the p-value in the note is the HMMER p-value, which is obtained directly after annotating each domain (E = p * Z, where E is the independent E-value, p is the p-value, and Z is the total number of HMM comparisons done).
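The relation E = p * Z can be checked with a few lines of arithmetic. The numbers below are made up for illustration and do not come from a real GECCO run.

```python
def evalue_from_pvalue(p, z):
    """Independent E-value from a p-value and the number of comparisons Z."""
    return p * z


def pvalue_from_evalue(e, z):
    """Invert the relation to recover the p-value from an E-value."""
    return e / z


# Hypothetical values: a domain hit with p = 1e-9 against Z = 20000 comparisons.
p = 1e-9
z = 20000
e = evalue_from_pvalue(p, z)
print(e, pvalue_from_evalue(e, z))
```

The same p-value therefore yields a larger E-value when more HMM comparisons are made, which is why the two numbers should not be used interchangeably.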

> Do you know how I could easily gather this specific enrichment p-value, is there a specific file/table in GECCO that provides this for the most current model or do you think it is possible to update the Genbanks in the future to feature this specific value?

If you have GECCO installed, you can get the significance list with the following code:

from gecco.crf import ClusterCRF
crf = ClusterCRF.trained()
crf.significance  # dictionary of Pfam accession to Fisher p-value

This could vary with the version of the training data, but at the moment we do not plan to retrain for a long time. So you could use this code to generate a table that will be valid for GECCO v0.9.2 onwards.
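One way to turn that dictionary into a table is sketched below. The helper name, the column names, and the toy values are hypothetical; in practice the mapping passed in would be crf.significance from the snippet above.

```python
import csv
import io


def write_significance_table(significance, handle):
    """Write a mapping of Pfam accession to p-value as a TSV table.

    Rows are sorted from the most to the least significant p-value.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["pfam_accession", "fisher_pvalue"])
    for accession, pvalue in sorted(significance.items(), key=lambda item: item[1]):
        writer.writerow([accession, pvalue])


# Toy mapping with placeholder accessions and made-up p-values.
toy_significance = {"PF_A": 0.5, "PF_B": 1e-6}

buffer = io.StringIO()
write_significance_table(toy_significance, buffer)
table_text = buffer.getvalue()
print(table_text)
```

Passing an open file handle instead of the StringIO buffer would write the table to disk.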

> Do you think it is possible to update the Genbanks in the future to feature this specific value?

I could actually add that, which would make it feasible to support GenBank files from different versions.

althonos · 2022-07-11T22:55:03Z

Actually, I'm thinking that perhaps it would make more sense to use the CRF weights rather than the Fisher significance table, because with the CRF weights you know if a feature is positively or negatively associated with BGC regions.

To get it from the current model:

from gecco.crf import ClusterCRF
crf = ClusterCRF.trained()
weights = {acc: weight for (acc, state), weight in crf.model.state_features_.items() if state == '1'}

If you look at the 5 domain features with the highest weight, you get:

PF19151 11.385894 Sublancin
PF10439 11.43806 Bacteriocin class II with double-glycine leader peptide
PF19476 11.681125 N/A
PF19155 12.620316 Family of unknown function (DUF5837)
PF17951 12.654051 Fatty acid synthase meander beta sheet domain

The ones with the highest positive weights are the closest thing we have to "core domains" in GECCO 😃
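For reference, ranking the weights could be done along these lines. This is a sketch on a toy dictionary shaped like state_features_ (keys are (accession, state) tuples); three weights are copied from the list above, and the remaining entries are hypothetical and only there to show the filtering.

```python
def positive_state_weights(state_features, state="1"):
    """Keep the weights attached to the given state, keyed by accession."""
    return {acc: weight for (acc, st), weight in state_features.items() if st == state}


def top_domains(weights, n=5):
    """Return the n (accession, weight) pairs with the highest weights."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:n]


toy_state_features = {
    ("PF17951", "1"): 12.654051,
    ("PF19155", "1"): 12.620316,
    ("PF19476", "1"): 11.681125,
    ("PF_EXAMPLE", "1"): -3.0,  # hypothetical negatively associated domain
    ("PF17951", "0"): -1.0,     # hypothetical weight for the other state
}

weights = positive_state_weights(toy_state_features)
for accession, weight in top_domains(weights, n=3):
    print(accession, weight, sep="\t")
```

Sorting in descending order puts the strongest positive associations first, whereas the list above is printed in ascending order.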

raufs · 2022-07-13T01:15:10Z

Awesome, thank you again, Martin! I really appreciate the quick catch and the solution. I have updated the code to use CRF weights instead of e-values!

update from main

Update GECCO parsing code to avoid reading files more than once

f3aac6a

raufs merged commit d95e713 into Kalan-Lab:main Jul 11, 2022

raufs added a commit that referenced this pull request Apr 15, 2023

Merge pull request #11 from Kalan-Lab/main

dec1157
