Google Summer of Code 2019 Wrap up

The cBioPortal for Cancer Genomics is a resource designed to provide broad community access to cancer genomic data. It provides a unique user-friendly and biology-centric computational user interface, with the goal of making genomic data more easily accessible to translational scientists, biologists, and clinicians. The public instance of cBioPortal is now one of the most popular online resources for cancer genomics data and attracts more than 3,000 unique visitors (cancer researchers and clinicians) per day. In addition, there are dozens of local instances installed in medical centers, universities, government institutions, and pharmaceutical companies around the globe.

The cBioPortal software is available under an open source license via GitHub. The software is now developed and maintained by a multi-institutional team, consisting of Memorial Sloan Kettering Cancer Center (MSK), the Dana Farber Cancer Institute, Princess Margaret Cancer Centre in Toronto, Children's Hospital of Philadelphia, The Hyve in the Netherlands, Bilkent University in Ankara, Turkey, and Weill Cornell Medicine.

We have had a very exciting couple of months again mentoring students through Google Summer of Code (GSoC). This time we had 6 students who all did a great job! It is always fun to meet so many new bright individuals. Our students were based in China, Turkey and the US. At the end of the summer everyone enthusiastically presented their work to our global team on Google Hangouts. The projects ranged from visualization prototypes to new data import pipelines and backend optimizations. Here we present a summary of their work.

Project 1: Clonal evolution visualization web-tools

Wenjie Sun, who just finished his PhD at Zhejiang University, was already a frequent user of cBioPortal before joining us for GSoC. He helped develop new features in cBioPortal to visualize the evolution of a tumor, a topic he already knew well from his PhD, where one of the projects involved reconstructing tumor evolution in colorectal cancer patients using samples from primary tumors and metastases. Wenjie developed a pipeline to cluster mutation data in cBioPortal and then built visualizations in React that use this clustering information to show the tumor evolution.

For more information on this project see Wenjie’s blog post: https://wenjiesun.blogspot.com/2019/08/gsoc-2018-clonal-evolution.html

Project 2: Integrating PathwayMapper into cBioPortal

PathwayMapper is a web-based pathway curation tool for interactive creation, editing, and sharing of cancer pathways. The tool lets remote users collaborate and concurrently modify pathways using ShareDB with built-in conflict resolution, and it is implemented as a ReactJS component. This summer Ziya Erkoç, a talented computer science student at Bilkent University in Turkey, implemented a new tab on the results page of the cBioPortal that integrates directly with PathwayMapper. For a given query it indicates how often each gene is altered in each pathway. The pathways are pulled from PathwayMapper's curated collection. This visualization gives insight into which pathways are altered across a cohort of patients. Ziya did a great job bringing this to completion, from initial prototype all the way to final product review. The integration is scheduled for the next release. A more detailed description of his project can be found here:

https://docs.google.com/document/d/1lcD5f0DSTkwfYg8F39ae61H17jWJzOXHAZDZpc9bUcw/edit
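To make the per-pathway summary concrete, here is a minimal Python sketch of the underlying calculation: for each pathway, the fraction of samples in which each member gene is altered. It is not the PathwayMapper or cBioPortal implementation; the function name, pathway names, gene lists, and sample IDs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def gene_alteration_frequency(pathways, altered_genes_by_sample):
    """Return {pathway: {gene: fraction of samples with that gene altered}}.

    pathways: dict mapping a pathway name to a list of gene symbols.
    altered_genes_by_sample: dict mapping a sample ID to the set of genes
        altered in that sample.
    """
    n_samples = len(altered_genes_by_sample)

    # Count, per gene, how many samples carry at least one alteration in it.
    altered_counts = defaultdict(int)
    for genes in altered_genes_by_sample.values():
        for gene in set(genes):
            altered_counts[gene] += 1

    # Express each count as a fraction of the cohort, grouped by pathway.
    return {
        pathway: {
            gene: (altered_counts[gene] / n_samples if n_samples else 0.0)
            for gene in genes
        }
        for pathway, genes in pathways.items()
    }


# Hypothetical example data (not taken from PathwayMapper).
pathways = {"RTK-RAS": ["EGFR", "KRAS", "BRAF"], "TP53": ["TP53", "MDM2"]}
samples = {
    "S1": {"KRAS", "TP53"},
    "S2": {"TP53"},
    "S3": {"EGFR", "KRAS"},
    "S4": set(),
}
freq = gene_alteration_frequency(pathways, samples)
```

In this toy cohort of four samples, `freq["RTK-RAS"]["KRAS"]` is 0.5 because KRAS is altered in two of them. A real integration would take the alteration calls from the query results rather than from a hand-written dictionary.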

Project 3: OncoKB Analysis in Study View

OncoKB, a precision oncology knowledge base, has been widely used in cBioPortal for analysing patient and sample genomic information together with clinical input. Because of performance limitations and the size of the genomic data associated with a given cancer study, study-wide OncoKB analysis was not possible until the start of this GSoC project. The student, Andrew Cvekl, submitted an excellent proposal. We decided to move forward with it and to limit the scope to prototypes that explore how the data should be generated and visualized.

We began with a GSoC project idea and quickly dove into a preliminary backend implementation. Andrew built a pipeline to annotate mutation data using the OncoKB API. Initially we considered including oncogenicity, mutation effect, and highest level of therapeutic implications. Later we realized that they all have a very similar workflow, so we narrowed the scope to showing oncogenicity as a pie chart and enabling driver filtering for the Mutated Genes table. The first goal of the project was to show a pie chart in the study view with OncoKB data, specifically HAS_DRIVER. To do so, the pipeline includes a step that stores the annotation in the clinical table, which is used to populate the pie charts.

project3a

During annotation, each mutation is annotated individually. The study view, however, focuses on samples, so for study-level analysis we need to know whether a sample has a driver mutation. This means converting the mutation-level annotation to the sample level; more specifically, converting oncogenicity to HAS_DRIVER. Together we wrote a converter to connect these two pieces of information.
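The conversion from mutation level to sample level can be sketched roughly as follows. This is a minimal Python illustration, not the converter written during the project. The record fields (`sample_id`, `gene`, `oncogenicity`), the function name, and the set of oncogenicity labels that count as a driver are assumptions made for the example.

```python
from collections import Counter

# Assumption: these oncogenicity labels mark a mutation as a driver.
DRIVER_ONCOGENICITY = {"Oncogenic", "Likely Oncogenic"}


def has_driver_by_sample(annotated_mutations, all_sample_ids):
    """Collapse mutation-level oncogenicity into a per-sample HAS_DRIVER flag.

    Samples without any annotated mutation are kept and flagged False, so
    that they still appear in study-level summaries.
    """
    has_driver = {sample_id: False for sample_id in all_sample_ids}
    for mutation in annotated_mutations:
        if mutation["oncogenicity"] in DRIVER_ONCOGENICITY:
            has_driver[mutation["sample_id"]] = True
    return has_driver


# Hypothetical annotated mutations, one record per mutation.
mutations = [
    {"sample_id": "S1", "gene": "KRAS", "oncogenicity": "Oncogenic"},
    {"sample_id": "S1", "gene": "TTN", "oncogenicity": "Unknown"},
    {"sample_id": "S2", "gene": "TTN", "oncogenicity": "Likely Neutral"},
]
flags = has_driver_by_sample(mutations, ["S1", "S2", "S3"])

# The counts per flag value are what a HAS_DRIVER pie chart would display.
pie_counts = Counter(flags.values())
```

Here one sample out of three ends up with `HAS_DRIVER` set. A single driver mutation is enough to flag a sample, which is why the flag is only ever switched from False to True.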

project3b

project3c

Andrew also updated the endpoint for the Mutated Genes table to include an option that filters the mutations to show driver-only events.

As the final products, a pie chart and a selection checkbox, shown above, have been added to the cBioPortal study view, following the cBioPortal frontend development standards.

We really appreciate the time and effort Andrew has contributed to the project. It has shed some light on the path toward the goal. Some issues will need to be resolved before the feature moves into production, such as performance for large studies and data discrepancies between pre-annotation and real-time annotation. Still, what Andrew has achieved is significant to the project and will certainly be continued in future development.

Project 4: Addition of novel CPTAC proteogenomic datasets to the cBioPortal

In cBioPortal, most cancer studies have historically focused on analyzing genomic data. The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort dedicated to understanding molecular profiles in cancers using proteogenomic analyses. Along with classical genomic data such as somatic mutation status, copy number variation, and RNA-sequencing data, CPTAC provides protein and phosphoprotein data generated by mass spectrometry. These extensive datasets are currently being used in studies to better classify the molecular profiles of each available tumor type and to advance cancer diagnostics and treatment.

In GSoC 2017, Pamela Wu from NYU worked on integrating the first batch of CPTAC data into cBioPortal, documented in this paper. This year, another GSoC student, Lizabeth Katsnelson, continued this work by integrating newly generated CPTAC proteogenomic datasets into the cBioPortal, collaborating with Dr. David Fenyö, a CPTAC member at NYU Langone Health. Unlike the first set of CPTAC data, the new datasets include not only mass spectrometry proteomics data (protein and phosphoprotein levels), but also somatic mutations, copy number variation, and transcript levels from RNA-sequencing. During the summer Lizabeth worked with the CPTAC working groups and successfully curated a few studies. The colon cancer CPTAC study has been released. The others will be released to the public portal once they are published in a scientific journal.

Project 5: ETL pipeline development for TCGA data from GDC Portal

The Genomic Data Commons (GDC) Portal is a data sharing platform developed at the National Cancer Institute (NCI) to enable precision medicine in oncology. It houses harmonized data from various large-scale genomics projects including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). The GDC portal also provides tools and programmatic interfaces to facilitate the selection, visualization, and download of clinical and genomic data. The goal of this project was to develop an ETL pipeline for TCGA data from the GDC portal to the cBioPortal for cancer genomics.

We were fortunate this summer to have Zachary Heins as a GSoC student working on this project. The project was started by another GSoC student in 2017; Zachary made a lot of progress and exceeded our expectations with his efforts this summer. This was made possible by several people at the GDC who co-mentored the project and provided valuable advice throughout the project term.

Because several dependencies were out of date, the pipeline had fallen into an unworkable state. Zack brought the pipeline “back to life” and we now have a functional tool that can download data directly from the GDC portal and transform it into a format suitable for import into the cBioPortal. The data types that the pipeline can now process are (1) clinical, (2) mutation, (3) copy-number alterations, and (4) RNAseq expression data. The work from this summer is now in the master branch of the cBioPortal GDC ETL pipeline GitHub repository.
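As a rough illustration of what the transform step does for clinical data, the Python sketch below turns a few patient records into a tab-delimited clinical file. It is not code from the GDC ETL repository. The input field names (`submitter_id`, `age`, `gender`), the attribute list, and the use of `NA` for missing values are assumptions, and the header layout (four `#`-prefixed metadata rows followed by the column names) reflects our understanding of the cBioPortal clinical file format and should be checked against the file format documentation.

```python
import csv
import io

# Hypothetical attribute definitions:
# (column name, display name, description, datatype, priority)
ATTRIBUTES = [
    ("PATIENT_ID", "Patient Identifier", "Patient identifier", "STRING", "1"),
    ("AGE", "Age", "Age at diagnosis in years", "NUMBER", "1"),
    ("SEX", "Sex", "Sex of the patient", "STRING", "1"),
]


def transform_record(record):
    """Map one source record (hypothetical fields) onto the output columns."""
    return {
        "PATIENT_ID": record["submitter_id"],
        "AGE": record.get("age", "NA"),
        "SEX": record.get("gender", "NA"),
    }


def write_clinical_file(records, handle):
    """Write metadata rows, the column-name row, then one row per record."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Four metadata rows: display names, descriptions, datatypes, priorities.
    for index in (1, 2, 3, 4):
        row = [attribute[index] for attribute in ATTRIBUTES]
        row[0] = "#" + row[0]
        writer.writerow(row)
    columns = [attribute[0] for attribute in ATTRIBUTES]
    writer.writerow(columns)
    for record in records:
        transformed = transform_record(record)
        writer.writerow([transformed[column] for column in columns])


# Hypothetical input records, standing in for downloaded clinical data.
records = [
    {"submitter_id": "PATIENT-0001", "age": 61, "gender": "female"},
    {"submitter_id": "PATIENT-0002", "gender": "male"},
]
buffer = io.StringIO()
write_clinical_file(records, buffer)
lines = buffer.getvalue().splitlines()
```

Keeping the field mapping in `transform_record` separate from the file writing means that a change in the source schema only touches one function, which matters for a pipeline that depends on an external portal.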

Project 6: Spark and Parquet Backend for cBioPortal Web API

cBioPortal utilizes a Spring MVC architecture with MyBatis for the persistence layer and a relational database (MySQL) for data storage. As the number and size of cancer datasets increase, high-performance computing and storage will only become more vital in providing an adequate cBioPortal user experience. The primary goal of this project was to create a prototype which improves performance of the existing web APIs that support the Study Summary View for large sample cohorts.

For more information on this project see Doori’s GSoC submission on GitHub: https://github.com/doori/GSoC-submission
