Implement genome ranking functionality for position plot genomes #25

sherlyn99 · 2025-02-04T15:07:37Z

Note: This initial commit unfortunately needs a bit of pandas (conda install pandas). Will spend more time on converting pandas code to polars in _rank.py but making a quick commit now in case this is blocking other moving parts.

How to use:
run the normal per-sample-group plots, e.g.

micov per-sample-group \
 --parquet-coverage "./example/parquet/example" \
 --sample-metadata "./example/metadata/sample_metadata.txt" \
 --sample-metadata-column "dog" \
 --features-to-keep "./example/metadata/feature_metadata.txt" \
 --output "./example/plots/per_sample_groups/example" \
 --plot

and in the output folder look for a tsv called genome_ranks.tsv, which should look like this

	filename	group	mean	std	entropy	gini	cv	ranking
1	dog.G001916215.G001916215.dog.position-plot-1_10000th-scale.tsv	No	1.8827646544181977	4.183058576530245	0.9243424908282577	0.4480621660210689	2.221763918668248	1.0
2	dog.G003524985.G003524985.dog.position-plot-1_10000th-scale.tsv	Yes	1.392638036809816	2.3025363975409947	0.6346803814643782	0.27010080808626796	1.6533631400845028	2.0
3	dog.G000378745.G000378745.dog.position-plot-1_10000th-scale.tsv	Yes	9.015510204081632	16.576193292599758	3.612127243379198	0.7112575301761415	1.838630639572139	3.0
4	dog.G000735515.G000735515.dog.position-plot-1_10000th-scale.tsv	No	7.396215257244235	9.121161291216286	3.9249983018906462	0.5910730913219644	1.233220096221855	4.0
5	dog.G001723545.G001723545.dog.position-plot-1_10000th-scale.tsv	No	1.32	0.9569756888948657	0.9239024399278413	0.2182535413304644	0.7249815824961103	5.0

Top-ranked genomes technically have a more variable distribution of coverages along the genome in its corresponding sample group.

Key idea:
For each individual reference genome, rank all sample groups based on entropy, Gini coefficient, and coefficient of variation (CV) of the coverages along the genome.

Ranking criteria:
- Entropy: Lower values indicate more variable distributions 
  (ranked in ascending order).
- Gini coefficient: Higher values indicate more uneven distributions 
  (ranked in descending order).
- Coefficient of Variation (CV): Higher values indicate greater 
  relative variation (ranked in descending order).

The final ranking is computed using the sum of individual ranks:

Groups with lower rank sums are considered more variable.
Ties are handled using the "dense" ranking method, meaning tied values
receive the same rank, and the next rank is not skipped.

The metrics of the sample group with the most variable distribution of coverages are then compared with that of other reference genomes. These genomes are then ranked from highest to lowest coverage variability.

wasade · 2025-02-04T16:21:40Z

micov/_rank.py

+from scipy.stats import entropy
+
+
+def gini(array):


Can this be replaced with: https://scikit.bio/docs/dev/generated/skbio.diversity.alpha.gini_index.html ?

wasade · 2025-02-04T16:23:00Z

micov/_rank.py

+    return (np.sum((2 * index - n - 1) * array)) / (n * np.sum(array))
+
+
+def compute_entropy(group):


Can this be replaced with https://github.com/scikit-bio/scikit-bio/blob/main/skbio/diversity/alpha/_base.py#L1197 ?

wasade · 2025-02-04T16:26:49Z

micov/_rank.py

+
+
+def rank_genome_of_interest(plotdir):
+    all_files = glob.glob(f"{plotdir}/*.tsv")


I think these are temp files right? I see the issue with the current design, where the position data are organized during plotting and that organization is then used for ranking, however we don't have a great way (currently) to bubble that information out in memory.

yeah agree. I thought about that but that requires some sort of refactoring on the plotting code to separate the information extraction and plotting part. Definitely see this as a valuable follow up.

wasade · 2025-02-04T16:27:43Z

micov/cli.py

@@ -279,6 +280,12 @@ def per_sample_group(parquet_coverage, sample_metadata, sample_metadata_column,
                     sample_metadata_column, output, monte, monte_iters,
                     target_names)

+    outdir = os.path.dirname(output)
+    ranked_genomes = rank_genome_of_interest(outdir)


Okay ya, they look ephemeral

yes. Is it a good idea to direct them to tmp directories for now?

no, that would be a not fun refactor at the moment. The glob over the files should be more specific though, just picking up *.tsv will have false positives. Are the individual TSVs large at all or just a few kb?

added code for ranking genomes of interest for position plots.

46e0ee9

wasade reviewed Feb 4, 2025

View reviewed changes

sherlyn99 merged commit 71aaa34 into biocore:main Feb 5, 2025
26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement genome ranking functionality for position plot genomes #25

Implement genome ranking functionality for position plot genomes #25

sherlyn99 commented Feb 4, 2025

wasade Feb 4, 2025

wasade Feb 4, 2025

wasade Feb 4, 2025

sherlyn99 Feb 4, 2025

wasade Feb 4, 2025

sherlyn99 Feb 4, 2025

wasade Feb 4, 2025

		return (np.sum((2 * index - n - 1) * array)) / (n * np.sum(array))


		def compute_entropy(group):



		def rank_genome_of_interest(plotdir):
		all_files = glob.glob(f"{plotdir}/*.tsv")

Implement genome ranking functionality for position plot genomes #25

Implement genome ranking functionality for position plot genomes #25

Conversation

sherlyn99 commented Feb 4, 2025

wasade Feb 4, 2025

Choose a reason for hiding this comment

wasade Feb 4, 2025

Choose a reason for hiding this comment

wasade Feb 4, 2025

Choose a reason for hiding this comment

sherlyn99 Feb 4, 2025

Choose a reason for hiding this comment

wasade Feb 4, 2025

Choose a reason for hiding this comment

sherlyn99 Feb 4, 2025

Choose a reason for hiding this comment

wasade Feb 4, 2025

Choose a reason for hiding this comment