First draft of literate programming data story.

It uses only the columns common to all tables.
googlegenomics · Oct 2, 2014 · a770822 · cassiedoll · Oct 3, 2014 · deflaux
1 parent 2f8da16
commit a770822
Showing 8 changed files with 756 additions and 10 deletions.
diff --git a/R/.gitignore b/R/.gitignore
@@ -0,0 +1 @@
+.httr-oauth
diff --git a/R/README.Rmd b/R/README.Rmd
@@ -0,0 +1,126 @@
+<!-- R Markdown Documentation, DO NOT EDIT THE PLAIN MARKDOWN VERSION OF THIS FILE -->
+
+<!-- Copyright 2014 Google Inc. All rights reserved. -->
+
+<!-- Licensed under the Apache License, Version 2.0 (the "License"); -->
+<!-- you may not use this file except in compliance with the License. -->
+<!-- You may obtain a copy of the License at -->
+
+<!--     http://www.apache.org/licenses/LICENSE-2.0 -->
+
+<!-- Unless required by applicable law or agreed to in writing, software -->
+<!-- distributed under the License is distributed on an "AS IS" BASIS, -->
+<!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -->
+<!-- See the License for the specific language governing permissions and -->
+<!-- limitations under the License. -->
+
+Literate Programming with R and BigQuery
+========================================================
+
+R Markdown Introduction
+-------------------------
+
+This is an [R Markdown](http://rmarkdown.rstudio.com/) document.  By using R Markdown, we can write R code in a [literate programming](http://en.wikipedia.org/wiki/Literate_programming) style, interleaving snippets of code with narrative content.  This document can be read, but it can also be executed.  Most importantly, it can be rendered so that the results of an R analysis at a point in time are captured.
+
+It is written in [Markdown](http://daringfireball.net/projects/markdown/syntax), a simple formatting syntax for authoring web pages.  You can embed an R code chunk like this:
+```{r default data, comment=NA}
+summary(cars)
+```
+
+You can also embed plots, for example:
+```{r plot example, fig.align="center"}
+plot(cars)
+```
+
+See the [`rmarkdown` package](http://cran.r-project.org/web/packages/rmarkdown/index.html) for more detail about how to use R Markdown from R.  [RStudio](http://www.rstudio.com/) supports [R Markdown](http://rmarkdown.rstudio.com/) from its user interface.
+
+BigQuery Analysis of Variants
+--------------
+
+Now let us move on to [literate programming](http://en.wikipedia.org/wiki/Literate_programming) for [BigQuery](https://developers.google.com/bigquery/).
+
+If you have not used the [bigrquery](https://github.com/hadley/bigrquery) package previously, you will likely need to do something like the following to get it installed:
+
+```{r one time setup, eval=FALSE}
+### Only needed the first time around
+install.packages("devtools")
+devtools::install_github("hadley/assertthat")  # install_github expects "owner/repo"
+devtools::install_github("hadley/bigrquery")
+```
+
+Next we will load our needed packages into our session:
+```{r initialize}
+library(bigrquery)
+library(ggplot2)
+library(xtable)
+```
+
+And write a little convenience function:
+```{r}
+project <- "google.com:biggene"                           # put your projectID here
+table <- "genomics-public-data:platinum_genomes.variants" # put your table here
+DisplayAndDispatchQuery <- function(queryUri) {
+  # Read in the SQL from a file or URL.
+  querySql <- readChar(queryUri, nchars=1e6)
+  # Find and replace the table name placeholder with our table name.
+  querySql <- sub("_THE_TABLE_", table, querySql, fixed=TRUE)
+  # Display the updated SQL.
+  cat(querySql)
+  # Dispatch the query to BigQuery for execution.
+  query_exec(querySql, project)
+}
+```
+
+Now we're ready to execute our query, bringing the results down to our R session for further examination:
+```{r comment=NA}
+result <- DisplayAndDispatchQuery("../sql/sample-variant-counts-for-brca1.sql")
+```
+
+Let us examine our query result:
+```{r result, comment=NA}
+head(result)
+summary(result)
+str(result)
+```
+We can see that what we get back from bigrquery is an R dataframe holding our query results.
+
+Data Visualization
+-------------------
+Now that our results are in a dataframe, we can easily visualize them:
+```{r viz, fig.align="center"}
+ggplot(result, aes(x=call_set_name, y=variant_count)) +
+  geom_bar(stat="identity") + coord_flip() +
+  ggtitle("Count of Variants Per Sample")
+```
+It is clear that the number of variants within BRCA1 for each sample corresponds roughly to two levels.
+
+We can then examine the variant level data more closely:
+```{r comment=NA}
+result <- DisplayAndDispatchQuery("../sql/variant-level-data-for-brca1.sql")
+```
+Number of rows returned by this query: `r nrow(result)`.
+
+Displaying the first few rows of the dataframe of results:
+```{r echo=FALSE, message=FALSE, warning=FALSE, comment=NA, results="asis"}
+print(xtable(head(result)), type="html", include.rownames=FALSE)
+```
+
+
+And also work with the sample level data: 
+```{r comment=NA}
+result <- DisplayAndDispatchQuery("../sql/sample-level-data-for-brca1.sql")
+```
+Number of rows returned by this query: `r nrow(result)`.
+
+
+Displaying the first few rows of the dataframe of results:
+```{r echo=FALSE, message=FALSE, warning=FALSE, comment=NA, results="asis"}
+print(xtable(head(result)), type="html", include.rownames=FALSE)
+```
+
+Provenance
+-------------------
+Lastly, let us capture version information about R and loaded packages for the sake of provenance.
+```{r provenance, comment=NA}
+sessionInfo()
+```
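Review note on the `DisplayAndDispatchQuery` helper in `R/README.Rmd`: its placeholder substitution can be tried in isolation, without a BigQuery project. The following is a minimal sketch, not part of the commit. `RenderQuery` and the inline template string are hypothetical names introduced here for illustration. It assumes `gsub` rather than the commit's `sub`, so that every occurrence of the placeholder is replaced, which matters only if a query names the table more than once.

```r
# Hypothetical helper (not in the commit): fill in the table placeholder
# without dispatching anything to BigQuery.
RenderQuery <- function(querySql, table, placeholder = "_THE_TABLE_") {
  # fixed = TRUE treats the placeholder as a literal string, not a regex.
  gsub(placeholder, table, querySql, fixed = TRUE)
}

# Illustrative template; the real queries live under sql/ in this commit.
template <- "SELECT reference_name FROM [_THE_TABLE_] LIMIT 10"
rendered <- RenderQuery(template, "genomics-public-data:platinum_genomes.variants")
cat(rendered, "\n")
```

Keeping the rendering step separate from `query_exec` would also make it possible to inspect the final SQL before any query is billed.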
diff --git a/R/README.html b/R/README.html
diff --git a/README.md b/README.md
@@ -1,29 +1,28 @@
 getting-started-bigquery
 ========================
 
-The repository contains examples of how to use BigQuery with genomics data. The code within each language-specific folder demonstrates the same query - a simple query upon Platinum Genomes.  For more detail about this data see [Google Genomics Public Data](https://developers.google.com/genomics/datasets/platinum-genomes).
+The repository contains examples of how to use BigQuery with genomics data. The code within each language-specific folder demonstrates the same set of queries upon the Platinum Genomes dataset.  For more detail about this data see [Google Genomics Public Data](https://developers.google.com/genomics/datasets/platinum-genomes).
 
 Getting Started
 -------------------------------------
 
 1. [Sign up for BigQuery](https://developers.google.com/bigquery/sign-up).
 1. Go to the BigQuery [Browser Tool](https://bigquery.cloud.google.com).
 1. Click on **"Compose Query"**.
-1. Copy and paste the following query into the dialog box:
+1. Copy and paste the following query into the dialog box and click on **"Run Query"**:
 ```
 SELECT
-  contig_name,
-  COUNT( contig_name) AS num_variants,
-  COUNT(call.callset_name) AS num_variant_calls
+  reference_name,
+  COUNT(reference_name) AS num_variants,
+  COUNT(call.call_set_name) AS num_variant_calls
 FROM
   [genomics-public-data:platinum_genomes.variants]
 GROUP BY
-  contig_name
+  reference_name
 ORDER BY
-  contig_name
+  reference_name
 ```
-1. Click on **"Run Query"**
-1. View the results!
+View the results!
 
 Google Genomics Public Data
 -------------------------------------
@@ -35,7 +34,7 @@ To add the [Google Genomics Public Data](https://developers.google.com/genomics/
   <img src="figure/display.png" title="Display project" alt="Display Project" style="display: block; margin: auto;" />
   1. Enter `genomics-public-data` in the _‘Add Project’_ dialog.
   <img src="figure/add.png" title="Add Project" alt="Add Project" style="display: block; margin: auto;" />
-  1. Now the [Google Genomics Public Data](https://developers.google.com/genomics/datasets/platinum-genomes) datasets appear in the left navigation pane of the BigQuery [Browser Tool](https://bigquery.cloud.google.com).
+Now the [Google Genomics Public Data](https://developers.google.com/genomics/datasets/platinum-genomes) datasets appear in the left navigation pane of the BigQuery [Browser Tool](https://bigquery.cloud.google.com).
 
 What next?
 ----------

diff --git a/sql/sample-level-data-for-brca1.sql b/sql/sample-level-data-for-brca1.sql
@@ -0,0 +1,23 @@
+# Retrieve sample-level information for BRCA1 variants.
+SELECT
+  reference_name,
+  start,
+  end,
+  reference_bases,
+  GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
+  call.call_set_name,
+  GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
+  call.phaseset,
+  call.genotype_likelihood,
+FROM
+  [_THE_TABLE_]
+WHERE
+  reference_name = 'chr17'
+  AND start BETWEEN 41196311
+  AND 41277499
+HAVING
+  alternate_bases IS NOT NULL
+ORDER BY
+  start,
+  alternate_bases,
+  call.call_set_name
diff --git a/sql/sample-variant-counts-for-brca1.sql b/sql/sample-variant-counts-for-brca1.sql
@@ -0,0 +1,30 @@
+# Sample variant counts within BRCA1.
+SELECT
+  call_set_name,
+  COUNT(call_set_name) AS variant_count,
+FROM (
+  SELECT
+    reference_name,
+    start,
+    end,
+    reference_bases,
+    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
+    call.call_set_name AS call_set_name,
+    NTH(1,
+      call.genotype) WITHIN call AS first_allele,
+    NTH(2,
+      call.genotype) WITHIN call AS second_allele,
+  FROM
+      [_THE_TABLE_]
+  WHERE
+    reference_name = 'chr17'
+    AND start BETWEEN 41196311
+    AND 41277499
+  HAVING
+    first_allele > 0
+    OR second_allele > 0
+    )
+GROUP BY
+  call_set_name
+ORDER BY
+  call_set_name
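The counting logic in the query above can be mirrored on a small in-memory data frame, which may help when checking results by hand. This is only a sketch under stated assumptions: the `calls` data frame is made-up toy data, not Platinum Genomes output, and its columns imitate the inner SELECT's `call_set_name`, `first_allele`, and `second_allele`.

```r
# Toy stand-in for the inner SELECT: one row per (variant, sample) call.
# Values are invented for illustration only.
calls <- data.frame(
  call_set_name = c("S1", "S1", "S1", "S2", "S2"),
  first_allele  = c(0, 0, 1, 0, 1),
  second_allele = c(0, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# HAVING first_allele > 0 OR second_allele > 0:
# keep only calls that carry at least one non-reference allele.
non_ref <- calls[calls$first_allele > 0 | calls$second_allele > 0, ]

# GROUP BY call_set_name with COUNT(call_set_name);
# table() returns names in sorted order, similar to ORDER BY call_set_name.
variant_counts <- as.data.frame(
  table(call_set_name = non_ref$call_set_name),
  responseName = "variant_count",
  stringsAsFactors = FALSE
)
print(variant_counts)
```

The toy data contains no missing alleles. Real calls with a single allele would yield a missing second allele, which this sketch does not handle.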
diff --git a/sql/sample-variant-counts.sql b/sql/sample-variant-counts.sql
@@ -0,0 +1,26 @@
+# Sample variant counts.
+SELECT
+  call_set_name,
+  COUNT(call_set_name) AS variant_count,
+FROM (
+  SELECT
+    reference_name,
+    start,
+    end,
+    reference_bases,
+    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
+    call.call_set_name AS call_set_name,
+    NTH(1,
+      call.genotype) WITHIN call AS first_allele,
+    NTH(2,
+      call.genotype) WITHIN call AS second_allele,
+  FROM
+      [_THE_TABLE_]
+  HAVING
+    first_allele > 0
+    OR second_allele > 0
+    )
+GROUP BY
+  call_set_name
+ORDER BY
+  call_set_name
diff --git a/sql/variant-level-data-for-brca1.sql b/sql/variant-level-data-for-brca1.sql
@@ -0,0 +1,22 @@
+# Retrieve variant-level information for BRCA1 variants.
+SELECT
+  reference_name,
+  start,
+  end,
+  reference_bases,
+  GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
+  quality,
+  GROUP_CONCAT(filter) WITHIN RECORD AS filter,
+  GROUP_CONCAT(names) WITHIN RECORD AS names,
+  COUNT(call.call_set_name) WITHIN RECORD AS num_samples,
+FROM
+  [_THE_TABLE_]
+WHERE
+  reference_name = 'chr17'
+  AND start BETWEEN 41196311
+  AND 41277499
+HAVING
+  alternate_bases IS NOT NULL
+ORDER BY
+  start,
+  alternate_bases
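All three BRCA1 queries in this commit share the same region filter. As a rough illustration, the same predicate can be expressed in R on a toy data frame. The `variants` rows below are invented, and `InBrca1` is a hypothetical helper name. `BETWEEN` is inclusive at both ends, which the sketch reproduces with `>=` and `<=`.

```r
# Region bounds copied from the WHERE clauses in the sql/ files.
brca1_start <- 41196311
brca1_end   <- 41277499

# Hypothetical helper: TRUE where a variant starts inside the BRCA1 window.
InBrca1 <- function(reference_name, start) {
  reference_name == "chr17" & start >= brca1_start & start <= brca1_end
}

# Invented rows: one just outside the window, two on its edges, one on another chromosome.
variants <- data.frame(
  reference_name = c("chr17", "chr17", "chr17", "chr13"),
  start          = c(41196310, 41196311, 41277499, 41200000),
  stringsAsFactors = FALSE
)

brca1_variants <- variants[InBrca1(variants$reference_name, variants$start), ]
print(brca1_variants)
```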