This repository has been archived by the owner on Sep 14, 2021. It is now read-only.

Commit

First draft of literate programming data story.
It uses only the columns common to all tables.
deflaux committed Oct 2, 2014
1 parent 2f8da16 commit a770822
Showing 8 changed files with 756 additions and 10 deletions.
1 change: 1 addition & 0 deletions R/.gitignore
@@ -0,0 +1 @@
.httr-oauth
126 changes: 126 additions & 0 deletions R/README.Rmd
@@ -0,0 +1,126 @@
<!-- R Markdown Documentation, DO NOT EDIT THE PLAIN MARKDOWN VERSION OF THIS FILE -->

<!-- Copyright 2014 Google Inc. All rights reserved. -->

<!-- Licensed under the Apache License, Version 2.0 (the "License"); -->
<!-- you may not use this file except in compliance with the License. -->
<!-- You may obtain a copy of the License at -->

<!-- http://www.apache.org/licenses/LICENSE-2.0 -->

<!-- Unless required by applicable law or agreed to in writing, software -->
<!-- distributed under the License is distributed on an "AS IS" BASIS, -->
<!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -->
<!-- See the License for the specific language governing permissions and -->
<!-- limitations under the License. -->

Literate Programming with R and BigQuery
========================================================

R Markdown Introduction
-------------------------

This is an [R Markdown](http://rmarkdown.rstudio.com/) document. By using R Markdown, we can write R code in a [literate programming](http://en.wikipedia.org/wiki/Literate_programming) style, interleaving snippets of code with narrative content. This document can be read, but it can also be executed. Most importantly, it can be rendered so that the results of an R analysis at a point in time are captured.

It is written in [Markdown](http://daringfireball.net/projects/markdown/syntax), a simple formatting syntax for authoring web pages. You can embed an R code chunk like this:
```{r default data, comment=NA}
summary(cars)

```

cassiedoll (Contributor), Oct 3, 2014:

awesome. i love the new readme and setup. everything makes perfect sense now.

just one thought - i'm not sure you really need the two 'cars' examples here. The text that links out to markdown stuff seems good, but the cars stuff just confused me a little.

deflaux (Author, Contributor), Oct 3, 2014:

nope, they can go

I will remove them

You can also embed plots, for example:
```{r plot example, fig.align="center"}
plot(cars)
```

See the [`rmarkdown` package](http://cran.r-project.org/web/packages/rmarkdown/index.html) for more detail about how to use R Markdown from R. [RStudio](http://www.rstudio.com/) supports [R Markdown](http://rmarkdown.rstudio.com/) directly from its user interface.

BigQuery Analysis of Variants
--------------

Now let us move on to [literate programming](http://en.wikipedia.org/wiki/Literate_programming) for [BigQuery](https://developers.google.com/bigquery/).

If you have not used the [bigrquery](https://github.com/hadley/bigrquery) package previously, you will likely need to do something like the following to get it installed:

```{r one time setup, eval=FALSE}
### Only needed the first time around
install.packages("devtools")
devtools::install_github("assertthat")
devtools::install_github("bigrquery")
```

Next we will load our needed packages into our session:
```{r initialize}
library(bigrquery)
library(ggplot2)
library(xtable)
```

And write a little convenience function:
```{r}
project <- "google.com:biggene" # put your projectID here
table <- "genomics-public-data:platinum_genomes.variants" # put your table here
DisplayAndDispatchQuery <- function(queryUri) {
  # Read in the SQL from a file or URL.
  querySql <- readChar(queryUri, nchars=1e6)
  # Find and replace the table name placeholder with our table name.
  querySql <- sub("_THE_TABLE_", table, querySql, fixed=TRUE)
  # Display the updated SQL.
  cat(querySql)
  # Dispatch the query to BigQuery for execution.
  query_exec(querySql, project)
}
```
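
As a minimal sketch of the placeholder substitution step in isolation, the snippet below fills in the table name without contacting BigQuery. The names `exampleSql`, `exampleTable`, and `filledSql` are hypothetical, and the inline query string is an assumption standing in for the contents of a SQL file.

```r
# Hypothetical inline query standing in for the contents of a .sql file.
exampleSql <- "SELECT COUNT(*) FROM [_THE_TABLE_]"
exampleTable <- "genomics-public-data:platinum_genomes.variants"

# fixed=TRUE matches the placeholder as a literal string rather than a regex.
# sub() replaces only the first match; gsub() would replace every occurrence.
filledSql <- sub("_THE_TABLE_", exampleTable, exampleSql, fixed=TRUE)
cat(filledSql, "\n")
```

Because `sub` stops after the first match, a query file that mentions the placeholder more than once would likely need `gsub` instead.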

Now we're ready to execute our query, bringing the results down to our R session for further examination:
```{r comment=NA}
result <- DisplayAndDispatchQuery("../sql/sample-variant-counts-for-brca1.sql")
```

Let us examine our query result:
```{r result, comment=NA}
head(result)
summary(result)
str(result)
```
We can see that what we get back from bigrquery is an R dataframe holding our query results.

Data Visualization
-------------------
Now that our results are in a dataframe, we can easily apply data visualization to our results:
```{r viz, fig.align="center"}
ggplot(result, aes(x=call_set_name, y=variant_count)) +
  geom_bar(stat="identity") + coord_flip() +
  ggtitle("Count of Variants Per Sample")
```
It is clear from the plot that the number of variants within BRCA1 for each sample falls roughly into two levels.
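
One rough way to check the "two levels" reading is to bin the counts. This is only a sketch: the data frame `counts` is hypothetical, with invented values standing in for the real query result, and equal-width binning with `cut` is an assumption about how one might separate the levels, not part of the original analysis.

```r
# Hypothetical counts, invented purely for illustration; real values come from the query.
counts <- data.frame(
  call_set_name = c("sampleA", "sampleB", "sampleC", "sampleD"),
  variant_count = c(120, 125, 210, 205)
)

# Split the observed range into two equal-width bins and label them.
counts$level <- cut(counts$variant_count, breaks=2, labels=c("low", "high"))

# Tabulate how many samples fall into each bin.
print(table(counts$level))
```

With the real `result` dataframe, the same `cut` call could be applied to `result$variant_count`, though whether two equal-width bins match the levels seen in the plot would need to be checked against the data.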

We can then examine the variant level data more closely:
```{r comment=NA}
result <- DisplayAndDispatchQuery("../sql/variant-level-data-for-brca1.sql")
```
Number of rows returned by this query: `r nrow(result)`.

Displaying the first few rows of the dataframe of results:
```{r echo=FALSE, message=FALSE, warning=FALSE, comment=NA, results="asis"}
print(xtable(head(result)), type="html", include.rownames=F)
```


And also work with the sample level data:
```{r comment=NA}
result <- DisplayAndDispatchQuery("../sql/sample-level-data-for-brca1.sql")
```
Number of rows returned by this query: `r nrow(result)`.


Displaying the first few rows of the dataframe of results:
```{r echo=FALSE, message=FALSE, warning=FALSE, comment=NA, results="asis"}
print(xtable(head(result)), type="html", include.rownames=F)
```

Provenance
-------------------
Lastly, let us capture version information about R and loaded packages for the sake of provenance.
```{r provenance, comment=NA}
sessionInfo()
```
519 changes: 519 additions & 0 deletions R/README.html

Large diffs are not rendered by default.

19 changes: 9 additions & 10 deletions README.md
@@ -1,29 +1,28 @@
 getting-started-bigquery
 ========================

-The repository contains examples of how to use BigQuery with genomics data. The code within each language-specific folder demonstrates the same query - a simple query upon Platinum Genomes. For more detail about this data see [Google Genomics Public Data](https://developers.google.com/genomics/datasets/platinum-genomes).
+The repository contains examples of how to use BigQuery with genomics data. The code within each language-specific folder demonstrates the same set of queries upon the Platinum Genomes dataset. For more detail about this data see [Google Genomics Public Data](https://developers.google.com/genomics/datasets/platinum-genomes).

 Getting Started
 -------------------------------------

 1. [Sign up for BigQuery](https://developers.google.com/bigquery/sign-up).
 1. Go to the BigQuery [Browser Tool](https://bigquery.cloud.google.com).
 1. Click on **"Compose Query"**.
-1. Copy and paste the following query into the dialog box:
+1. Copy and paste the following query into the dialog box and click on **"Run Query"**:
 ```
 SELECT
-contig_name,
-COUNT( contig_name) AS num_variants,
-COUNT(call.callset_name) AS num_variant_calls
+reference_name,
+COUNT(reference_name) AS num_variants,
+COUNT(call.call_set_name) AS num_variant_calls
 FROM
 [genomics-public-data:platinum_genomes.variants]
 GROUP BY
-contig_name
+reference_name
 ORDER BY
-contig_name
+reference_name
 ```
-1. Click on **"Run Query"**
-1. View the results!
+View the results!

 Google Genomics Public Data
 -------------------------------------
@@ -35,7 +34,7 @@ To add the [Google Genomics Public Data](https://developers.google.com/genomics/
 <img src="figure/display.png" title="Display project" alt="Display Project" style="display: block; margin: auto;" />
 1. Enter `genomics-public-data` in the _‘Add Project’_ dialog.
 <img src="figure/add.png" title="Add Project" alt="Add Project" style="display: block; margin: auto;" />
-1. Now the [Google Genomics Public Data](https://developers.google.com/genomics/datasets/platinum-genomes) datasets appear in the left navigation pane of the BigQuery [Browser Tool](https://bigquery.cloud.google.com).
+Now the [Google Genomics Public Data](https://developers.google.com/genomics/datasets/platinum-genomes) datasets appear in the left navigation pane of the BigQuery [Browser Tool](https://bigquery.cloud.google.com).

 What next?
 ----------
23 changes: 23 additions & 0 deletions sql/sample-level-data-for-brca1.sql
@@ -0,0 +1,23 @@
# Retrieve sample-level information for BRCA1 variants.
SELECT
  reference_name,
  start,
  end,
  reference_bases,
  GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
  call.call_set_name,
  GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
  call.phaseset,
  call.genotype_likelihood,
FROM
  [_THE_TABLE_]
WHERE
  reference_name = 'chr17'
  AND start BETWEEN 41196311
  AND 41277499
HAVING
  alternate_bases IS NOT NULL
ORDER BY
  start,
  alternate_bases,
  call.call_set_name
30 changes: 30 additions & 0 deletions sql/sample-variant-counts-for-brca1.sql
@@ -0,0 +1,30 @@
# Sample variant counts within BRCA1.
SELECT
  call_set_name,
  COUNT(call_set_name) AS variant_count,
FROM (
  SELECT
    reference_name,
    start,
    END,
    reference_bases,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
    call.call_set_name AS call_set_name,
    NTH(1, call.genotype) WITHIN call AS first_allele,
    NTH(2, call.genotype) WITHIN call AS second_allele,
  FROM
    [_THE_TABLE_]
  WHERE
    reference_name = 'chr17'
    AND start BETWEEN 41196311
    AND 41277499
  HAVING
    first_allele > 0
    OR second_allele > 0
  )
GROUP BY
  call_set_name
ORDER BY
  call_set_name
26 changes: 26 additions & 0 deletions sql/sample-variant-counts.sql
@@ -0,0 +1,26 @@
# Sample variant counts.
SELECT
  call_set_name,
  COUNT(call_set_name) AS variant_count,
FROM (
  SELECT
    reference_name,
    start,
    END,
    reference_bases,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
    call.call_set_name AS call_set_name,
    NTH(1, call.genotype) WITHIN call AS first_allele,
    NTH(2, call.genotype) WITHIN call AS second_allele,
  FROM
    [_THE_TABLE_]
  HAVING
    first_allele > 0
    OR second_allele > 0
  )
GROUP BY
  call_set_name
ORDER BY
  call_set_name
22 changes: 22 additions & 0 deletions sql/variant-level-data-for-brca1.sql
@@ -0,0 +1,22 @@
# Retrieve variant-level information for BRCA1 variants.
SELECT
  reference_name,
  start,
  end,
  reference_bases,
  GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
  quality,
  GROUP_CONCAT(filter) WITHIN RECORD AS filter,
  GROUP_CONCAT(names) WITHIN RECORD AS names,
  COUNT(call.call_set_name) WITHIN RECORD AS num_samples,
FROM
  [_THE_TABLE_]
WHERE
  reference_name = 'chr17'
  AND start BETWEEN 41196311
  AND 41277499
HAVING
  alternate_bases IS NOT NULL
ORDER BY
  start,
  alternate_bases

0 comments on commit a770822
