-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement genome ranking functionality for position plot genomes #25
Implement genome ranking functionality for position plot genomes #25
Conversation
from scipy.stats import entropy | ||
|
||
|
||
def gini(array): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be replaced with: https://scikit.bio/docs/dev/generated/skbio.diversity.alpha.gini_index.html ?
return (np.sum((2 * index - n - 1) * array)) / (n * np.sum(array)) | ||
|
||
|
||
def compute_entropy(group): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be replaced with https://github.com/scikit-bio/scikit-bio/blob/main/skbio/diversity/alpha/_base.py#L1197 ?
|
||
|
||
def rank_genome_of_interest(plotdir): | ||
all_files = glob.glob(f"{plotdir}/*.tsv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are temp files right? I see the issue with the current design, where the position data are organized during plotting and that organization is then used for ranking, however we don't have a great way (currently) to bubble that information out in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah agree. I thought about that but that requires some sort of refactoring on the plotting code to separate the information extraction and plotting part. Definitely see this as a valuable follow up.
@@ -279,6 +280,12 @@ def per_sample_group(parquet_coverage, sample_metadata, sample_metadata_column, | |||
sample_metadata_column, output, monte, monte_iters, | |||
target_names) | |||
|
|||
outdir = os.path.dirname(output) | |||
ranked_genomes = rank_genome_of_interest(outdir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay ya, they look ephemeral
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. Is it a good idea to direct them to tmp directories for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, that would be a not fun refactor at the moment. The glob over the files should be more specific though, just picking up *.tsv
will have false positives. Are the individual TSVs large at all or just a few kb?
Note: This initial commit unfortunately needs a bit of pandas (
conda install pandas
). Will spend more time on converting pandas code to polars in _rank.py but making a quick commit now in case this is blocking other moving parts.How to use:
run the normal per-sample-group plots, e.g.
and in the output folder look for a tsv called genome_ranks.tsv, which should look like this
Top-ranked genomes technically have a more variable distribution of coverages along the genome in its corresponding sample group.
Key idea:
For each individual reference genome, rank all sample groups based on entropy, Gini coefficient, and coefficient of variation (CV) of the coverages along the genome.
The final ranking is computed using the sum of individual ranks:
receive the same rank, and the next rank is not skipped.
The metrics of the sample group with the most variable distribution of coverages are then compared with that of other reference genomes. These genomes are then ranked from highest to lowest coverage variability.