Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement genome ranking functionality for position plot genomes #25

Merged
merged 1 commit into from
Feb 5, 2025

Conversation

sherlyn99
Copy link
Collaborator

Note: This initial commit unfortunately needs a bit of pandas (conda install pandas). Will spend more time on converting pandas code to polars in _rank.py but making a quick commit now in case this is blocking other moving parts.

How to use:
run the normal per-sample-group plots, e.g.

micov per-sample-group \
 --parquet-coverage "./example/parquet/example" \
 --sample-metadata "./example/metadata/sample_metadata.txt" \
 --sample-metadata-column "dog" \
 --features-to-keep "./example/metadata/feature_metadata.txt" \
 --output "./example/plots/per_sample_groups/example" \
 --plot

and in the output folder look for a tsv called genome_ranks.tsv, which should look like this

	filename	group	mean	std	entropy	gini	cv	ranking
1	dog.G001916215.G001916215.dog.position-plot-1_10000th-scale.tsv	No	1.8827646544181977	4.183058576530245	0.9243424908282577	0.4480621660210689	2.221763918668248	1.0
2	dog.G003524985.G003524985.dog.position-plot-1_10000th-scale.tsv	Yes	1.392638036809816	2.3025363975409947	0.6346803814643782	0.27010080808626796	1.6533631400845028	2.0
3	dog.G000378745.G000378745.dog.position-plot-1_10000th-scale.tsv	Yes	9.015510204081632	16.576193292599758	3.612127243379198	0.7112575301761415	1.838630639572139	3.0
4	dog.G000735515.G000735515.dog.position-plot-1_10000th-scale.tsv	No	7.396215257244235	9.121161291216286	3.9249983018906462	0.5910730913219644	1.233220096221855	4.0
5	dog.G001723545.G001723545.dog.position-plot-1_10000th-scale.tsv	No	1.32	0.9569756888948657	0.9239024399278413	0.2182535413304644	0.7249815824961103	5.0

Top-ranked genomes technically have a more variable distribution of coverages along the genome in its corresponding sample group.

Key idea:
For each individual reference genome, rank all sample groups based on entropy, Gini coefficient, and coefficient of variation (CV) of the coverages along the genome.

Ranking criteria:
- Entropy: Lower values indicate more variable distributions 
  (ranked in ascending order).
- Gini coefficient: Higher values indicate more uneven distributions 
  (ranked in descending order).
- Coefficient of Variation (CV): Higher values indicate greater 
  relative variation (ranked in descending order).

The final ranking is computed using the sum of individual ranks:

  • Groups with lower rank sums are considered more variable.
  • Ties are handled using the "dense" ranking method, meaning tied values
    receive the same rank, and the next rank is not skipped.

The metrics of the sample group with the most variable distribution of coverages are then compared with that of other reference genomes. These genomes are then ranked from highest to lowest coverage variability.

from scipy.stats import entropy


def gini(array):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return (np.sum((2 * index - n - 1) * array)) / (n * np.sum(array))


def compute_entropy(group):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



def rank_genome_of_interest(plotdir):
all_files = glob.glob(f"{plotdir}/*.tsv")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are temp files right? I see the issue with the current design, where the position data are organized during plotting and that organization is then used for ranking, however we don't have a great way (currently) to bubble that information out in memory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah agree. I thought about that but that requires some sort of refactoring on the plotting code to separate the information extraction and plotting part. Definitely see this as a valuable follow up.

@@ -279,6 +280,12 @@ def per_sample_group(parquet_coverage, sample_metadata, sample_metadata_column,
sample_metadata_column, output, monte, monte_iters,
target_names)

outdir = os.path.dirname(output)
ranked_genomes = rank_genome_of_interest(outdir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay ya, they look ephemeral

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. Is it a good idea to direct them to tmp directories for now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, that would be a not fun refactor at the moment. The glob over the files should be more specific though, just picking up *.tsv will have false positives. Are the individual TSVs large at all or just a few kb?

@sherlyn99 sherlyn99 merged commit 71aaa34 into biocore:main Feb 5, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants