Dataset Card #14

Open
merdivane opened this issue May 29, 2023 · 1 comment

merdivane commented May 29, 2023

Let's collect datasets here. To avoid confusion: a "dataset" is curated data derived from a database that we can use directly in ML and AI models. A "database" is a source such as ClinVar, which hosts raw data online that we still have to retrieve and convert into a dataset.

Dataset Card

  • Description and task (predicting pathogenicity, carcinogenicity, binding affinity)
  • Number of instances (for example, 10k pathogenic and 50k benign)
  • File size
  • File format
  • Paper (where the dataset is used)
  • Snippet of data
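
To keep cards consistent across datasets, the fields above could be captured in a small Python structure. This is only a sketch: the class name `DatasetCard`, the field names, and the text layout are my own assumptions, not something agreed in this thread, and the values you pass in are whatever the real dataset reports.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DatasetCard:
    """One record per dataset, mirroring the bullet list above (hypothetical layout)."""
    name: str
    description: str           # what the data is and which task it supports
    instances: Dict[str, int]  # label -> number of instances
    file_size: str
    file_format: str
    paper: str                 # where the dataset is used
    snippet: str               # a few example rows

    def to_text(self) -> str:
        """Render the card as plain text, one field per line."""
        counts = ", ".join(f"{n} {label}" for label, n in self.instances.items())
        return "\n".join([
            f"Dataset Card: {self.name}",
            f"  • Description and task: {self.description}",
            f"  • Instances: {counts}",
            f"  • File size: {self.file_size}",
            f"  • File format: {self.file_format}",
            f"  • Paper: {self.paper}",
            f"  • Snippet of data: {self.snippet}",
        ])
```

Filling one of these in per dataset and calling `to_text()` would give a card we can paste straight into a comment here.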

ecuracosta commented May 30, 2023

Hey team,

I have some exciting news to share regarding building our datasets. I came across an API that I believe will be extremely valuable for our data requirements. It's called the Proteins API, provided by the European Bioinformatics Institute (EBI).

The Proteins service provides an interface for accessing UniProtKB entries and UniProtKB isoform entries. The features service provides protein functional annotations from UniProt Knowledgebase (UniProtKB) protein entries. The variation, proteomics and antigen services provide annotations imported and mapped from large scale data sources, such as 1000 Genomes, ExAC (Exome Aggregation Consortium), ClinVar (Clinical significance of Variants), TCGA (The Cancer Genome Atlas), COSMIC (Catalogue Of Somatic Mutations In Cancer), TOPMed (Trans-Omics for Precision Medicine), gnomAD (Genome Aggregation Database), PeptideAtlas, MaxQB (MaxQuant DataBase), EPD (Encyclopedia of Proteome Dynamics), ProteomicsDB and HPA, along with UniProtKB annotations for these feature types (if applicable). And there is more.

You can find the documentation for the API here: Proteins API Documentation

I've already started experimenting with this API using the Python requests library and have successfully retrieved responses. Here's an example snippet I've been working on:

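import requests

# Protein search endpoint; paging and filters are passed as query parameters.
base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
params = {
    "offset": 0,
    "size": 100,
    "accession": "A0A1B0GTW7",  # replace with the accession of the protein you are interested in
    "format": "json"
}
headers = {"Accept": "application/json"}  # ask for a JSON response explicitly

response = requests.get(base_url, params=params, headers=headers, timeout=30)

if response.status_code == 200:
    data = response.json()  # parsed JSON body
else:
    data = None  # keep the name defined so later code can check for failure
    print(f"Request failed with status code {response.status_code}")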

Some examples of what you can get:

  1. Inside "comments" you have "FUNCTION" where you can read: 'Putative metalloproteinase that plays a role in left-right patterning process'
  2. Also inside "comments" you have "DISEASE" where you can read: 'Heterotaxy, visceral, 12, autosomal' or a longer form 'A form of visceral heterotaxy, a complex disorder due to disruption of the normal left-right asymmetry of the thoracoabdominal organs. Visceral heterotaxy or situs ambiguus results in randomization of the placement of visceral organs, including the heart, lungs, liver, spleen, and stomach. The organs are oriented randomly with respect to the left-right axis and with respect to one another. It can be associated with a variety of congenital defects including cardiac malformations. Early death may occur. HTX12 inheritance is autosomal recessive.'
  3. You can also find sequence alterations leading to disease inside "features"
  4. References
  5. The wild-type (WT) sequence
  6. And a HUGE amount of data from all the databases listed above...
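
As a rough illustration of items 1 and 2, here is how the FUNCTION and DISEASE texts might be pulled out of one entry. Please treat the JSON layout as an assumption on my part: I'm assuming each comment has a "type", that FUNCTION comments carry a "text" list of {"value": ...} items, and that DISEASE comments carry a "description" with a "value". The helper name `comment_texts` and the hand-built `sample_entry` are hypothetical, so check them against a real response before relying on this.

```python
from typing import List


def comment_texts(entry: dict, comment_type: str) -> List[str]:
    """Collect free-text values from an entry's comments of the given type."""
    texts: List[str] = []
    for comment in entry.get("comments", []):
        if comment.get("type") != comment_type:
            continue
        # Assumed FUNCTION-style layout: "text" is a list of {"value": ...}.
        text_items = comment.get("text", [])
        if isinstance(text_items, list):
            for item in text_items:
                if isinstance(item, dict) and "value" in item:
                    texts.append(item["value"])
        # Assumed DISEASE-style layout: "description" is {"value": ...}.
        description = comment.get("description")
        if isinstance(description, dict) and "value" in description:
            texts.append(description["value"])
    return texts


# Hand-built sample (not a real API response); texts are taken from the list above.
sample_entry = {
    "comments": [
        {
            "type": "FUNCTION",
            "text": [{"value": "Putative metalloproteinase that plays a role "
                               "in left-right patterning process"}],
        },
        {
            "type": "DISEASE",
            "description": {"value": "HTX12 inheritance is autosomal recessive."},
        },
    ]
}

print(comment_texts(sample_entry, "FUNCTION"))
print(comment_texts(sample_entry, "DISEASE"))
```

Missing keys simply produce an empty list, so the same helper can be run over many entries without special-casing proteins that have no disease annotation.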

We can use this API to build any dataset we want from (almost?) any source we want!
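
For anything bigger than one protein we would need to walk through the results page by page with the offset and size parameters from my snippet. A minimal sketch of that loop is below. The names `iter_entries` and `fetch_page` are hypothetical, and I'm assuming that offset counts records and that the last page comes back with fewer than `size` entries; I haven't verified either against the documentation.

```python
from typing import Callable, Iterator, List


def iter_entries(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 100,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield entries page by page until a short or empty page signals the end.

    fetch_page(offset, size) is supplied by the caller and should return the
    list of entries for that page, for example by wrapping a requests.get call.
    max_pages is a safety cap so a misbehaving source cannot loop forever.
    """
    for page in range(max_pages):
        batch = fetch_page(page * page_size, page_size)
        yield from batch
        if len(batch) < page_size:
            break


# Usage with an in-memory stand-in for the API (no network involved).
fake_records = [{"id": i} for i in range(250)]
entries = list(iter_entries(lambda offset, size: fake_records[offset:offset + size]))
print(len(entries))
```

Passing the fetch function in keeps the paging logic separate from the HTTP call, so it can be tried on fake data first and then pointed at the real endpoint.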
