Uniprot data to ingest #52

Closed
pombase-admin opened this issue Jan 9, 2012 · 65 comments

@pombase-admin

Most of the information in UniProt is already in GeneDB, but UniProt may have information about catalytic activities and residues which we could potentially load into Chado.

Original comment by: ValWood

@ValWood

This comment was marked as outdated.

@ValWood ValWood changed the title from "uniprot data" to "Uniprot data to ingest" Dec 30, 2021
@ValWood
Member

ValWood commented Nov 29, 2023

Also mentioned here: #1126

@kimrutherford

This comment was marked as outdated.

@ValWood

This comment was marked as outdated.

@ValWood

This comment was marked as outdated.

@kimrutherford

This comment was marked as outdated.

@ValWood
Member

ValWood commented Jul 26, 2024

  • Virus hosts SHOULD HAVE NO DATA

  • Sequences

    • Alternative products (isoforms) A LIST OF PROTEINS WITH ISOFORMS WOULD BE USEFUL TO CHECK BUT ISN'T AN AREA OF FOCUS FOR US RIGHT NOW
  • Function

    • Active site DISPLAY IN PROTEIN VIEWER

    • Binding site DISPLAY IN PROTEIN VIEWER

    • Cofactor A LIST WOULD BE USEFUL BUT I THINK WE ARE COVERED BY GO

    • Kinetics WE COULD DISPLAY THIS

  • Interaction WE CAN LOOK AT THESE LATER BUT SHOULD BE COVERED

    • Interacts with
    • Subunit structure
  • PTM / Processing (I CAN PROVIDE A MAPPING FOR THESE BUT WE ONLY NEED TO REPORT WHAT WE DON'T ALREADY HAVE)

    • Cross-link

    • Disulfide bond

    • Glycosylation

    • Initiator methionine

    • Lipidation

    • Modified residue

    • Peptide

    • Post-translational modification

    • Propeptide

    • Signal peptide

    • Transit peptide

    • Beta strand THESE WOULD BE USEFUL IN THE PROTEIN VIEWER, WE WERE GOING TO GET THEM FROM PDB

    • Helix

    • Turn

    • Compositional bias

@kimrutherford
Member

kimrutherford commented Jul 26, 2024

Thanks Val.

Note to self, here's the updated API URL, adding the extra fields:
https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Cft_signal%2Cft_transit%2Cxref_pombase%2Cft_binding%2Cft_act_site%2Ccc_catalytic_activity%2Cgene_synonym%2Ccc_ptm%2Cft_mod_res%2Ccc_cofactor%2Ckinetics&format=tsv&query=%28%28organism_id%3A284812%29%29

and the commands to update the data file in SVN:

cd pombe-embl
curl 'https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Cft_signal%2Cft_transit%2Cxref_pombase%2Cft_binding%2Cft_act_site%2Ccc_catalytic_activity%2Cgene_synonym%2Ccc_ptm%2Cft_mod_res%2Ccc_cofactor%2Ckinetics&format=tsv&query=%28%28organism_id%3A284812%29%29' | gzip -d > external_data/uniprot_data_from_api.tsv
svn commit external_data/uniprot_data_from_api.tsv
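
A minimal sketch of how the URL above could be assembled from a readable field list instead of being edited by hand. The endpoint, field names and query are copied from the URL in this comment; the names `UNIPROT_STREAM_URL`, `FIELDS` and `build_uniprot_url` are hypothetical and not part of any PomBase code.

```python
from urllib.parse import urlencode

# Endpoint copied from the URL in this comment.
UNIPROT_STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"

# Field names exactly as they appear in the URL, in the same order.
FIELDS = [
    "accession", "ft_signal", "ft_transit", "xref_pombase",
    "ft_binding", "ft_act_site", "cc_catalytic_activity", "gene_synonym",
    "cc_ptm", "ft_mod_res", "cc_cofactor", "kinetics",
]


def build_uniprot_url(fields, organism_id=284812):
    """Return the stream URL for the given fields (hypothetical helper)."""
    params = {
        "compressed": "true",
        "fields": ",".join(fields),
        "format": "tsv",
        "query": f"((organism_id:{organism_id}))",
    }
    # urlencode percent-encodes the commas, colons and parentheses.
    return f"{UNIPROT_STREAM_URL}?{urlencode(params)}"


url = build_uniprot_url(FIELDS)
print(url)
```

Adding another field would then mean appending one name to `FIELDS`, which may be easier to review than a change inside the percent-encoded string.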

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Aug 1, 2024
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Aug 1, 2024
kimrutherford added a commit to pombase/pombase-python-web that referenced this issue Aug 2, 2024
@kimrutherford

This comment was marked as resolved.

@kimrutherford

This comment was marked as resolved.

@ValWood

This comment was marked as outdated.

@ValWood

This comment was marked as resolved.

@ValWood
Member

ValWood commented Sep 3, 2024

Review of features

  • Names & Taxonomy
    • Entry Name IGNORE
    • Gene Names IGNORE
    • Gene Names (ordered locus) IGNORE
    • Gene Names (ORF) IGNORE
    • Gene Names (primary) IGNORE
    • Gene Names (synonym) IGNORE
    • Organism IGNORE
    • Organism (ID) IGNORE
    • Protein names IGNORE
    • Proteomes IGNORE
    • Taxonomic lineage IGNORE
    • Taxonomic lineage (Ids) IGNORE
    • Virus hosts IGNORE
  • Sequences
    • Alternative products (isoforms) IGNORE FOR NOW
    • Alternative sequence IGNORE FOR NOW
    • Erroneous gene model prediction IGNORE
    • Fragment IGNORE
    • Gene encoded by IGNORE
    • Length IGNORE
    • Mass IGNORE
    • Mass spectrometry IGNORE
    • Natural variant IGNORE
    • Non-adjacent residues IGNORE
    • Non-standard residue IGNORE
    • Non-terminal residue IGNORE
    • Polymorphism IGNORE
    • RNA Editing IGNORE
    • Sequence IGNORE
    • Sequence caution IGNORE
    • Sequence conflict IGNORE
    • Sequence uncertainty IGNORE
    • Sequence version IGNORE
  • Function IGNORE
    • Absorption IGNORE
    • Active site IMPORT
    • Binding site IMPORT
    • Catalytic activity IGNORE
      - Cofactor CAN I SEE A LIST TO CHECK WE HAVE EVERYTHING. IGNORE (SEE BELOW)
    • DNA binding IGNORE (COVERED BY GO)
    • EC number IGNORE FOR NOW (there are a number of routes to get this, but we do not currently display)
    • Activity regulation IGNORE (GO)
    • Function [CC] IGNORE (GO)
    • Kinetics IGNORE FOR NOW (we might do this later)
    • Pathway IGNORE (GO)
    • pH dependence IGNORE
    • Redox potential IGNORE
    • Rhea ID IGNORE (COVERED BY GO)
    • Site ??? What is this?
    • Temperature dependence IGNORE
  • Miscellaneous
    • Annotation IGNORE
    • Caution IGNORE
    • Keywords IGNORE
    • Keyword ID IGNORE
    • Miscellaneous [CC] IGNORE
    • Protein existence IGNORE
    • Reviewed IGNORE
    • Tools IGNORE
    • UniParc IGNORE
    • Comments IGNORE
    • Features
  • Interaction
    • Interacts with IGNORE FOR NOW
    • Subunit structure IGNORE FOR NOW
  • Expression IGNORE
    • Developmental stage IGNORE
    • Induction IGNORE
    • Tissue specificity IGNORE
  • Gene Ontology (GO) IGNORE
    • Gene Ontology (biological process) IGNORE
    • Gene Ontology (cellular component) IGNORE
    • Gene Ontology (GO) IGNORE
    • Gene Ontology (molecular function) IGNORE
    • Gene Ontology IDs IGNORE
  • Pathology & Biotech IGNORE
    • Allergenic Properties IGNORE
    • Biotechnological use IGNORE
    • Disruption phenotype IGNORE FOR NOW
    • Involvement in disease IGNORE
    • Mutagenesis IGNORE FOR NOW
    • Pharmaceutical use IGNORE
    • Toxic dose IGNORE
  • Subcellular location IGNORE (COVERED BY GO)
    • Intramembrane (COVERED BY GO)
    • Subcellular location [CC] (COVERED BY GO)
    • Topological domain
    • Transmembrane IMPORTED?
  • PTM / Processing
    • Chain IMPORTED? (Not very useful)
    • Cross-link IMPORTED
    • Disulfide bond IMPORTED
    • Glycosylation IMPORTED
    • Initiator methionine IGNORE
      - Lipidation IMPORTED?
      - Modified residue What does this provide? How does it differ from Post-translational modification?
    • Peptide IGNORE
      - Post-translational modification What does this provide? How does it differ from Modified residue?
    • Propeptide IMPORTED
    • Signal peptide IMPORTED
    • Transit peptide IMPORTED
  • Structure IGNORE
    • 3D IGNORE
    • Beta strand IMPORTED
    • Helix IMPORTED
    • Turn IMPORTED
  • Publications IGNORE
    • PubMed ID IGNORE
    • DOI ID IGNORE
  • Date of IGNORE
    • Date of creation IGNORE
    • Date of last modification IGNORE
    • Date of last sequence modification IGNORE
    • Entry version IGNORE
  • Family & Domains IGNORE
    - Coiled coil IGNORE ????? (we get this from InterPro)
    • Compositional bias
    • Domain [CC] IGNORE (we get this from InterPro)
    • Domain [FT] IGNORE (we get this from InterPro)
    • Motif IGNORE (we get this from InterPro)
    • Protein families IGNORE (we get this from InterPro)
    • Region IGNORE
    • Repeat IGNORE
    • Sequence similarities IGNORE
    • Zinc finger IGNORE

@ValWood
Member

ValWood commented Sep 3, 2024

I think this is the current situation. Can you confirm and answer my queries about the bold ones?

@ValWood ValWood added announce and removed announce labels Sep 3, 2024
@ValWood
Member

ValWood commented Sep 3, 2024

I'm going to close this. Everything is done but I have one question for some data we could import.
I'm not sure what it covers though. I will open a new ticket about that.

@ValWood
Member

ValWood commented Sep 3, 2024

final:
#1213

@ValWood ValWood closed this as completed Sep 3, 2024
@kimrutherford

This comment was marked as outdated.

@kimrutherford
Member

Cofactor CAN I SEE A LIST TO CHECK WE HAVE EVERYTHING

uniprotkb_organism_id_284812_2024_09_04-cofactor.tsv.txt
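
A rough sketch of how a cofactor list like the attached one could be pulled out of a UniProt TSV download. The column headers (`Entry`, `Cofactor`), the `Name=...;` layout of the cofactor text, and the sample rows are all assumptions made up for illustration, and `list_cofactors` is a hypothetical helper, so check them against the real file before relying on this.

```python
import csv
import io
import re

# Invented sample rows; the headers and value layout are assumptions.
SAMPLE_TSV = (
    "Entry\tPomBase\tCofactor\n"
    "X00001\tSPXX0000.01;\tCOFACTOR: Name=Mg(2+);\n"
    "X00002\tSPXX0000.02;\t\n"
    "X00003\tSPXX0000.03;\tCOFACTOR: Name=Zn(2+);\n"
)


def list_cofactors(tsv_handle, column="Cofactor"):
    """Return (accession, [cofactor names]) for rows with cofactor data."""
    # QUOTE_NONE keeps double quotes inside feature text as literal characters.
    reader = csv.DictReader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    found = []
    for row in reader:
        text = (row.get(column) or "").strip()
        if not text:
            continue  # no cofactor annotation for this entry
        names = re.findall(r"Name=([^;]+);", text)
        found.append((row["Entry"], names))
    return found


for accession, names in list_cofactors(io.StringIO(SAMPLE_TSV)):
    print(accession, ", ".join(names))
```

With a real download, `io.StringIO(SAMPLE_TSV)` would be replaced by an open file handle.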

@ValWood
Member

ValWood commented Sep 4, 2024

OK we can ignore SITE

@ValWood
Member

ValWood commented Sep 4, 2024

OK, I checked the first 20 or so cofactors and all are present as IEA GO binding annotations (I expected they would be, but I wanted to check). Any examples where the coordinates are known have binding site annotations in the protein viewer, so we can ignore cofactor.

@kimrutherford

This comment was marked as outdated.

@kimrutherford

This comment was marked as outdated.

@kimrutherford
Member

Coiled coil IGNORE ????? (we get this from InterPro)

I downloaded the coiled coil data from Pfam while it still existed. InterPro doesn't provide coiled coil data in their XML file.

We also get the low complexity regions and disordered regions from the pfam download. The file is from 2021 so it's quite out of date now.

@ValWood
Member

ValWood commented Sep 5, 2024

I will ask if InterPro could include it in their XML....

@ValWood
Member

ValWood commented Sep 5, 2024

We can map all of these:

LIPID 485; /note="GPI-anchor amidated serine"; /evidence="ECO:0000255"
maps to N-seryl-glycosylphosphatidylinositolethanolamine
https://www.pombase.org/term/MOD:00171
(currently 1 in PomBase)

LIPID 202; /note="S-geranylgeranyl cysteine"; /evidence="ECO:0000250|UniProtKB:P62745"
maps to geranylgeranylated residue (All should be cysteine)
https://www.pombase.org/term/MOD:00441
(currently 2 in PomBase)

LIPID 404; /note="S-farnesyl cysteine"; /evidence="ECO:0000250"
maps to S-farnesyl-L-cysteine
https://www.pombase.org/term/MOD:00111
(currently 3 in PomBase)

LIPID 116; /note="Phosphatidylethanolamine amidated glycine"; /evidence="ECO:0000250|UniProtKB:P38182"
maps to N-glycyl-1-(phosphatidyl)ethanolamine
https://www.pombase.org/term/MOD:00351
(currently 1 in PomBase)

@ValWood
Member

ValWood commented Sep 5, 2024

LIPID 2; /note="N-myristoyl glycine"; /evidence="ECO:0000269|PubMed:14722091"
maps to N-myristoylglycine
https://www.pombase.org/term/MOD:00068
(currently 21 in PomBase)

What does that leave?

@ValWood
Member

ValWood commented Sep 5, 2024

I will ask if InterPro could include it in their XML....

I mailed InterPro to ask if this is possible, but now I've started worrying that we have a lot of features from InterPro that have incorrect coordinates relative to the current PomBase sequences (because they are based on UniProt), i.e. everything that has any features does have coordinate changes.
This will include InterPro domains but also the coiled coils etc. How much hassle would it be, instead of updating with every InterPro release, to run InterProScan locally a couple of times a year?
I would rather the features had accurate coordinates, even if that means lagging a little behind InterPro in content (because there is never much new for pombe in a release).

@kimrutherford
Member

How much hassle would it be, instead of updating with every InterPro release, to run InterProScan locally a couple of times a year?

Last time I tried to install InterProScan it was too hard and I failed. That was a long time ago though and they now provide a helpful Docker image. I'll give it a go. There will be a bit of downstream work because the output format of InterProScan isn't the same as the XML file from InterPro.

@kimrutherford
Member

What does that leave?

Here's the complete list. Only 51 genes have this type of data.

uniprotkb_lipid.tsv.txt

@ValWood
Member

ValWood commented Sep 6, 2024

The remaining mappings

SPBC13G1.11; LIPID 193; /note="S-palmitoyl cysteine"; /evidence="ECO:0000250";
maps to https://www.pombase.org/term/MOD:00115
S-palmitoyl-L-cysteine

SPBPJ4664.02; LIPID 3944; /note="GPI-anchor amidated alanine"; /evidence="ECO:0000255"
https://www.pombase.org/term/MOD:00818
glycosylphosphatidylinositolated residue
(MOD does not have further specificity)

SPAC212.08c; LIPID 96; /note="GPI-anchor amidated glycine"; /evidence="ECO:0000255"
https://www.pombase.org/term/MOD:00818
glycosylphosphatidylinositolated residue
(MOD does not have further specificity)

SPCC1322.10; LIPID 242; /note="GPI-like-anchor amidated asparagine"; /evidence="ECO:0000255"
Ignore
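
The mappings listed in this comment and the two Sep 5 comments above could be expressed as a lookup table along these lines. This is only a sketch: the note strings and MOD IDs are taken from this thread, but the regular expression assumes the `LIPID <position>; /note="..."` layout shown in the examples, and the names `LIPID_NOTE_TO_MOD`, `IGNORED_NOTES` and `map_lipid_features` are hypothetical.

```python
import re

# UniProt LIPID note -> MOD term ID, as listed in this thread.
LIPID_NOTE_TO_MOD = {
    "GPI-anchor amidated serine": "MOD:00171",
    "S-geranylgeranyl cysteine": "MOD:00441",
    "S-farnesyl cysteine": "MOD:00111",
    "Phosphatidylethanolamine amidated glycine": "MOD:00351",
    "N-myristoyl glycine": "MOD:00068",
    "S-palmitoyl cysteine": "MOD:00115",
    "GPI-anchor amidated alanine": "MOD:00818",
    "GPI-anchor amidated glycine": "MOD:00818",
}

# Notes that the thread says to skip.
IGNORED_NOTES = {"GPI-like-anchor amidated asparagine"}

# Assumed layout: LIPID <position>; /note="<text>"; /evidence="..."
LIPID_RE = re.compile(r'LIPID (\d+); /note="([^"]+)"')


def map_lipid_features(feature_text):
    """Split LIPID features into (mapped, unmapped) lists."""
    mapped, unmapped = [], []
    for position, note in LIPID_RE.findall(feature_text):
        if note in IGNORED_NOTES:
            continue
        term = LIPID_NOTE_TO_MOD.get(note)
        if term is None:
            unmapped.append((int(position), note))  # needs a curator decision
        else:
            mapped.append((int(position), term))
    return mapped, unmapped


example = 'LIPID 2; /note="N-myristoyl glycine"; /evidence="ECO:0000269|PubMed:14722091"'
print(map_lipid_features(example))
```

Returning the unmapped notes separately means any new note text appearing in a later UniProt release would be reported rather than silently dropped.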

@kimrutherford
Member

I've moved the Lipidation stuff to:

And the InterPro task is here:

So I think we can close this.

4 participants