Investigate possible missing orthology #923

Open
3 tasks
kevinschaper opened this issue Dec 3, 2024 · 4 comments
@kevinschaper
Member

About 70% of our Panther ingest ends up in the dangling edges bin, and @leokim-l noticed, in a process they're running, that we may have low orthology coverage between human genes and genes from species other than mouse.

  • Improve QC output to get a better picture of which nodes are missing from the graph, which kinds of nodes are successfully merged, and which kinds of nodes work from the ingest without any normalization
  • Look at whether there are Panther associations that we're bringing through the process that we don't actually want (is it being filtered by taxon currently? does it match the taxon list of genes we have?)
  • Dipper brought in ZFIN's human curated orthology; right now we only have Panther. Possibly split that off from this issue as its own new modular ingest.
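
As a rough illustration of the first task, here is a minimal Python sketch that tallies dangling-edge endpoints by CURIE prefix, to show which identifier namespaces are missing from the graph. The function names, the `subject`/`object` field names, and the sample identifiers are assumptions made for this example, not the actual monarch-ingest QC code or real Panther output.

```python
from collections import Counter


def curie_prefix(curie: str) -> str:
    """Return the part of a CURIE before the first colon, e.g. 'HGNC' for 'HGNC:1'."""
    return curie.split(":", 1)[0]


def missing_prefix_counts(edges, node_ids):
    """Count, per CURIE prefix, the edge endpoints that are absent from node_ids."""
    counts = Counter()
    for edge in edges:
        for field in ("subject", "object"):
            if edge[field] not in node_ids:
                counts[curie_prefix(edge[field])] += 1
    return counts


# Placeholder identifiers for illustration only (not real ingest output).
node_ids = {"HGNC:1", "MGI:1"}
edges = [
    {"subject": "HGNC:1", "object": "MGI:1"},             # both endpoints present
    {"subject": "HGNC:1", "object": "UniProtKB:X1"},      # object missing
    {"subject": "Ensembl:E1", "object": "UniProtKB:X2"},  # both endpoints missing
]

for prefix, n in missing_prefix_counts(edges, node_ids).most_common():
    print(prefix, n)
```

Grouping the same tally by taxon as well would help answer the second task, assuming taxon is available on the edge or node rows.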

Additional info:

This is the neo4j query that is showing few results for species other than mouse:

`MATCH
(upheno:biolink:PhenotypicFeature WHERE upheno.id STARTS WITH "UPHENO:")<-[:biolink:subclass_of]-(phenotype:biolink:PhenotypicFeature)<-[gena:biolink:has_phenotype]-(gene:biolink:Gene)-[:biolink:orthologous_to]-(human_gene:biolink:Gene WHERE "NCBITaxon:9606" IN human_gene.in_taxon)
RETURN
    upheno.id,
    phenotype.id,
    gene.id,
    gena.negated,
    CASE WHEN gene.in_taxon IS NOT NULL AND size(gene.in_taxon) > 0
         THEN REDUCE(s = "", x IN gene.in_taxon | s + x + CASE WHEN x <> gene.in_taxon[size(gene.in_taxon)-1] THEN "|" ELSE "" END)
         ELSE "" END AS gene_in_taxon,
    human_gene.id,
    gena.primary_knowledge_source,
    gena.publications`

and here is the visualization showing the difference in counts
image_480

@kevinschaper kevinschaper self-assigned this Dec 3, 2024
@leokim-l
Member

leokim-l commented Dec 4, 2024

Figure and analysis by Peter, tagging him since he is the person most interested in this. @hansenp

@kevinschaper
Member Author

Right now I'm adding ZFIN's curated orthology, which should give us the best possible connections between human and zebrafish. I'm also planning to fix the missing XB-GENEPAGE to XB-GENE mappings, which will give us XenBase's own orthology.

I just ran across issues in monarch-ingest where we looked at Panther's dangling edges: monarch-initiative/monarch-ingest#446 & #351

What we missed in 351 was that the counts in Panther didn't change, even though the counts of what came out of our ingest did. That will probably require carefully tracing through the utils functions related to the ingest to see whether some filtering happens there. I didn't include my methodology for the counting there (boo past me!), but the presence of subject_taxon_label and object_taxon_label sure looks like it was coming from the finished KG. I wonder if the kind of identifier used by Panther changed, moving from one where we had good mappings to one where we didn't?
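
One way to check the identifier-change hypothesis is to count gene ID sources per species in two Panther releases and compare the tallies. The sketch below assumes each ortholog row is tab-separated, with the first two columns shaped like `SPECIES|GeneSource=GeneID|ProteinSource=ProteinID`. That layout, the function names, and the sample rows are assumptions for illustration and should be checked against the actual files.

```python
from collections import Counter


def parse_gene_field(field: str):
    """Split 'SPECIES|GeneSource=GeneID|ProteinSource=ProteinID' into (species, gene_source)."""
    species, gene_part, _protein_part = field.split("|", 2)
    gene_source = gene_part.split("=", 1)[0]
    return species, gene_source


def source_counts(lines):
    """Tally (species, gene ID source) pairs over the first two columns of each row."""
    counts = Counter()
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        for field in columns[:2]:
            counts[parse_gene_field(field)] += 1
    return counts


# Made-up rows in the assumed layout, for illustration only (not real Panther data).
sample = [
    "HUMAN|HGNC=0001|UniProtKB=P00001\tDANRE|ZFIN=ZDB-GENE-000001|UniProtKB=Q00001\tLDO",
    "HUMAN|HGNC=0001|UniProtKB=P00001\tDANRE|Ensembl=ENSDARG000001|UniProtKB=Q00002\tO",
]

for (species, source), n in sorted(source_counts(sample).items()):
    print(species, source, n)
```

Running this over an older and a newer release and diffing the two counters would show whether a species shifted from a source we map well to one we don't.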

@kevinschaper
Member Author

Separately, I did some looking into DIOPT updates. They claim a 2021 build, but are missing ZFIN orthology that existed in 2020. It's wonderful when we can pull from an aggregator to get many sources in one easy go, but I don't think it's a great idea when the update cadence is so irregular.

@leokim-l
Member

Slightly related: we asked ourselves what happens when running a reduced query (just pasting here the important part):

(phenotype:biolink:PhenotypicFeature)<-[gena:biolink:has_phenotype]-(gene:biolink:Gene)-[:biolink:orthologous_to]-(human_gene:biolink:Gene WHERE "NCBITaxon:9606" IN human_gene.in_taxon)

we get the following number of items:

  • fission yeast (NCBITaxon:4896): 180361
  • zebrafish (NCBITaxon:7955): 59874

whereas with the full query this changes to:

  • fission yeast (NCBITaxon:4896): 180361 --> 739
  • zebrafish (NCBITaxon:7955): 59874 --> 20412

Is this a bug or a feature of uPheno? Sorry if this is trivial, thanks! @matentzn
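
To put the drop in proportion, a small Python sketch using only the counts listed above computes the fraction of rows that survive when the uPheno `subclass_of` hop is added: roughly 0.4% for fission yeast and roughly 34% for zebrafish. The helper name `retained_fraction` is made up for this example.

```python
# (rows from the reduced query, rows from the full query), as reported above
counts = {
    "NCBITaxon:4896": (180361, 739),    # fission yeast
    "NCBITaxon:7955": (59874, 20412),   # zebrafish
}


def retained_fraction(reduced: int, full: int) -> float:
    """Fraction of reduced-query rows still present in the full query."""
    return full / reduced


for taxon, (reduced, full) in counts.items():
    print(f"{taxon}: {retained_fraction(reduced, full):.2%} retained")
```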
