404
+ +Page not found
+ + +diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/404.html b/404.html new file mode 100644 index 0000000..77d0f92 --- /dev/null +++ b/404.html @@ -0,0 +1,145 @@ + + +
+ + + + +Page not found
+ + +GeneID | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|---|
<chr> | <int> | <int> | <int> | <int> | <int> | <int> | |
5 | WBGene00000001 | 3227 | 2168 | 2589 | 5659 | 2619 | 5239 |
6 | WBGene00000002 | 270 | 203 | 266 | 355 | 191 | 425 |
7 | WBGene00000003 | 341 | 415 | 411 | 387 | 255 | 499 |
8 | WBGene00000004 | 584 | 438 | 518 | 1028 | 541 | 888 |
9 | WBGene00000005 | 383 | 395 | 483 | 119 | 65 | 189 |
10 | WBGene00000006 | 343 | 344 | 334 | 206 | 114 | 220 |
GeneID | N2_day1_rep2 | |
---|---|---|
<chr> | <int> | |
1 | N_unmapped | 1400596 |
2 | N_multimapping | 1305129 |
3 | N_noFeature | 152183 |
4 | N_ambiguous | 439830 |
5 | WBGene00000003 | 415 |
6 | WBGene00000007 | 513 |
GeneID | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|---|
<chr> | <int> | <int> | <int> | <int> | <int> | <int> | |
WBGene00000001 | WBGene00000001 | 3227 | 2168 | 2589 | 5659 | 2619 | 5239 |
WBGene00000002 | WBGene00000002 | 270 | 203 | 266 | 355 | 191 | 425 |
WBGene00000003 | WBGene00000003 | 341 | 415 | 411 | 387 | 255 | 499 |
WBGene00000004 | WBGene00000004 | 584 | 438 | 518 | 1028 | 541 | 888 |
WBGene00000005 | WBGene00000005 | 383 | 395 | 483 | 119 | 65 | 189 |
WBGene00000006 | WBGene00000006 | 343 | 344 | 334 | 206 | 114 | 220 |
N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> | <int> | |
WBGene00000001 | 3227 | 2168 | 2589 | 5659 | 2619 | 5239 |
WBGene00000002 | 270 | 203 | 266 | 355 | 191 | 425 |
WBGene00000003 | 341 | 415 | 411 | 387 | 255 | 499 |
WBGene00000004 | 584 | 438 | 518 | 1028 | 541 | 888 |
WBGene00000005 | 383 | 395 | 483 | 119 | 65 | 189 |
WBGene00000006 | 343 | 344 | 334 | 206 | 114 | 220 |
N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | |
---|---|---|---|
<int> | <int> | <int> | |
WBGene00000001 | 3227 | 2168 | 2589 |
WBGene00000002 | 270 | 203 | 266 |
WBGene00000003 | 341 | 415 | 411 |
WBGene00000004 | 584 | 438 | 518 |
WBGene00000005 | 383 | 395 | 483 |
WBGene00000006 | 343 | 344 | 334 |
colSums(merged_df) | |
---|---|
<dbl> | |
N2_day1_rep1 | 37398898 |
N2_day1_rep2 | 29488709 |
N2_day1_rep3 | 34593136 |
N2_day7_rep1 | 48275683 |
N2_day7_rep2 | 23204449 |
N2_day7_rep3 | 46005617 |
group |
---|
N2_day1 |
N2_day1 |
N2_day1 |
N2_day7 |
N2_day7 |
N2_day7 |
group | |
---|---|
N2_day1_rep1 | N2_day1 |
N2_day1_rep2 | N2_day1 |
N2_day1_rep3 | N2_day1 |
N2_day7_rep1 | N2_day7 |
N2_day7_rep2 | N2_day7 |
N2_day7_rep3 | N2_day7 |
N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
WBGene00000001 | 3219.3986 | 2674.1107 | 2740.3156 | 4313.59368 | 3660.57703 | 3673.8895 |
WBGene00000002 | 269.3640 | 250.3895 | 281.5465 | 270.60006 | 266.96075 | 298.0346 |
WBGene00000003 | 340.1968 | 511.8800 | 435.0211 | 294.99218 | 356.41357 | 349.9276 |
WBGene00000004 | 582.6243 | 540.2493 | 548.2748 | 783.59680 | 756.15585 | 622.7169 |
WBGene00000005 | 382.0978 | 487.2111 | 511.2292 | 90.70819 | 90.85052 | 132.5377 |
WBGene00000006 | 342.1920 | 424.3054 | 353.5208 | 157.02426 | 159.33783 | 154.2767 |
count | group | |
---|---|---|
<dbl> | <fct> | |
N2_day1_rep1 | 37716.448 | N2_day1 |
N2_day1_rep2 | 46554.449 | N2_day1 |
N2_day1_rep3 | 45402.524 | N2_day1 |
N2_day7_rep1 | 1544.064 | N2_day7 |
N2_day7_rep2 | 1564.527 | N2_day7 |
N2_day7_rep3 | 1489.270 | N2_day7 |
N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
WBGene00000001 | 8.079308 | 7.681560 | 7.859027 | 8.641002 | 7.870548 | 8.563886 |
WBGene00000002 | 5.598422 | 5.313206 | 5.583496 | 5.872118 | 5.252273 | 6.052089 |
WBGene00000003 | 5.831882 | 6.028279 | 6.018593 | 5.958425 | 5.541264 | 6.212606 |
WBGene00000004 | 6.369901 | 6.082219 | 6.249975 | 6.935370 | 6.293419 | 6.788972 |
WBGene00000005 | 5.948035 | 5.978886 | 6.180017 | 4.779123 | 4.174387 | 5.241747 |
WBGene00000006 | 5.837730 | 5.840642 | 5.811141 | 5.327876 | 4.736198 | 5.393628 |
gene | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | pseudo_reference | |
---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | WBGene00000001 | 8.079308 | 7.681560 | 7.859027 | 8.641002 | 7.870548 | 8.563886 | 8.115889 |
2 | WBGene00000002 | 5.598422 | 5.313206 | 5.583496 | 5.872118 | 5.252273 | 6.052089 | 5.611934 |
3 | WBGene00000003 | 5.831882 | 6.028279 | 6.018593 | 5.958425 | 5.541264 | 6.212606 | 5.931841 |
4 | WBGene00000004 | 6.369901 | 6.082219 | 6.249975 | 6.935370 | 6.293419 | 6.788972 | 6.453309 |
5 | WBGene00000005 | 5.948035 | 5.978886 | 6.180017 | 4.779123 | 4.174387 | 5.241747 | 5.383699 |
6 | WBGene00000006 | 5.837730 | 5.840642 | 5.811141 | 5.327876 | 4.736198 | 5.393628 | 5.491203 |
gene | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | pseudo_reference | |
---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | WBGene00000001 | 8.079308 | 7.681560 | 7.859027 | 8.641002 | 7.870548 | 8.563886 | 8.115889 |
2 | WBGene00000002 | 5.598422 | 5.313206 | 5.583496 | 5.872118 | 5.252273 | 6.052089 | 5.611934 |
3 | WBGene00000003 | 5.831882 | 6.028279 | 6.018593 | 5.958425 | 5.541264 | 6.212606 | 5.931841 |
4 | WBGene00000004 | 6.369901 | 6.082219 | 6.249975 | 6.935370 | 6.293419 | 6.788972 | 6.453309 |
5 | WBGene00000005 | 5.948035 | 5.978886 | 6.180017 | 4.779123 | 4.174387 | 5.241747 | 5.383699 |
6 | WBGene00000006 | 5.837730 | 5.840642 | 5.811141 | 5.327876 | 4.736198 | 5.393628 | 5.491203 |
gene | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | pseudo_reference | |
---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | WBGene00000001 | 8.079308 | 7.681560 | 7.859027 | 8.641002 | 7.870548 | 8.563886 | 3347.2307 |
2 | WBGene00000002 | 5.598422 | 5.313206 | 5.583496 | 5.872118 | 5.252273 | 6.052089 | 273.6730 |
3 | WBGene00000003 | 5.831882 | 6.028279 | 6.018593 | 5.958425 | 5.541264 | 6.212606 | 376.8478 |
4 | WBGene00000004 | 6.369901 | 6.082219 | 6.249975 | 6.935370 | 6.293419 | 6.788972 | 634.7996 |
5 | WBGene00000005 | 5.948035 | 5.978886 | 6.180017 | 4.779123 | 4.174387 | 5.241747 | 217.8266 |
6 | WBGene00000006 | 5.837730 | 5.840642 | 5.811141 | 5.327876 | 4.736198 | 5.393628 | 242.5487 |
N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | 0.9640805 | 0.6476996 | 0.7734752 | 1.6906513 | 0.7824378 | 1.5651745 |
2 | 0.9865787 | 0.7417610 | 0.9719628 | 1.2971683 | 0.6979131 | 1.5529480 |
3 | 0.9048746 | 1.1012403 | 1.0906259 | 1.0269398 | 0.6766657 | 1.3241420 |
4 | 0.9199753 | 0.6899815 | 0.8160055 | 1.6194086 | 0.8522374 | 1.3988666 |
5 | 1.7582795 | 1.8133692 | 2.2173603 | 0.5463062 | 0.2984025 | 0.8676627 |
6 | 1.4141490 | 1.4182718 | 1.3770430 | 0.8493139 | 0.4700087 | 0.9070343 |
N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
WBGene00000001 | 3219.3986 | 2674.1107 | 2740.3156 | 4313.59368 | 3660.57703 | 3673.8895 |
WBGene00000002 | 269.3640 | 250.3895 | 281.5465 | 270.60006 | 266.96075 | 298.0346 |
WBGene00000003 | 340.1968 | 511.8800 | 435.0211 | 294.99218 | 356.41357 | 349.9276 |
WBGene00000004 | 582.6243 | 540.2493 | 548.2748 | 783.59680 | 756.15585 | 622.7169 |
WBGene00000005 | 382.0978 | 487.2111 | 511.2292 | 90.70819 | 90.85052 | 132.5377 |
WBGene00000006 | 342.1920 | 424.3054 | 353.5208 | 157.02426 | 159.33783 | 154.2767 |
N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
WBGene00000001 | 3219.3986 | 2674.1107 | 2740.3156 | 4313.59368 | 3660.57703 | 3673.8895 |
WBGene00000002 | 269.3640 | 250.3895 | 281.5465 | 270.60006 | 266.96075 | 298.0346 |
WBGene00000003 | 340.1968 | 511.8800 | 435.0211 | 294.99218 | 356.41357 | 349.9276 |
WBGene00000004 | 582.6243 | 540.2493 | 548.2748 | 783.59680 | 756.15585 | 622.7169 |
WBGene00000005 | 382.0978 | 487.2111 | 511.2292 | 90.70819 | 90.85052 | 132.5377 |
WBGene00000006 | 342.1920 | 424.3054 | 353.5208 | 157.02426 | 159.33783 | 154.2767 |
if (!require("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
+BiocManager::install("DESeq2")
+BiocManager::install("EnhancedVolcano")
+
+Installing package into ‘/usr/local/lib/R/site-library’
+(as ‘lib’ is unspecified)
+
+'getOption("repos")' replaces Bioconductor standard repositories, see
+'help("repositories", package = "BiocManager")' for details.
+Replacement repositories:
+ CRAN: https://cran.rstudio.com
+
+Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)
+
+Installing package(s) 'BiocVersion', 'DESeq2'
+
+also installing the dependencies ‘formatR’, ‘UCSC.utils’, ‘GenomeInfoDbData’, ‘zlibbioc’, ‘abind’, ‘SparseArray’, ‘lambda.r’, ‘futile.options’, ‘GenomeInfoDb’, ‘XVector’, ‘S4Arrays’, ‘DelayedArray’, ‘futile.logger’, ‘snow’, ‘BH’, ‘S4Vectors’, ‘IRanges’, ‘GenomicRanges’, ‘SummarizedExperiment’, ‘BiocGenerics’, ‘Biobase’, ‘BiocParallel’, ‘matrixStats’, ‘locfit’, ‘MatrixGenerics’, ‘RcppArmadillo’
+
+
+Old packages: 'gtable'
+
+'getOption("repos")' replaces Bioconductor standard repositories, see
+'help("repositories", package = "BiocManager")' for details.
+Replacement repositories:
+ CRAN: https://cran.rstudio.com
+
+Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)
+
+Installing package(s) 'EnhancedVolcano'
+
+also installing the dependency ‘ggrepel’
+
+
+Old packages: 'gtable'
+
+library(DESeq2)
+library(dplyr)
+library(EnhancedVolcano)
+
+Loading required package: S4Vectors
+
+Loading required package: stats4
+
+Loading required package: BiocGenerics
+
+
+Attaching package: ‘BiocGenerics’
+
+
+The following objects are masked from ‘package:stats’:
+
+ IQR, mad, sd, var, xtabs
+
+
+The following objects are masked from ‘package:base’:
+
+ anyDuplicated, aperm, append, as.data.frame, basename, cbind,
+ colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
+ get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
+ match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
+ Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
+ tapply, union, unique, unsplit, which.max, which.min
+
+
+
+Attaching package: ‘S4Vectors’
+
+
+The following object is masked from ‘package:utils’:
+
+ findMatches
+
+
+The following objects are masked from ‘package:base’:
+
+ expand.grid, I, unname
+
+
+Loading required package: IRanges
+
+Loading required package: GenomicRanges
+
+Loading required package: GenomeInfoDb
+
+Loading required package: SummarizedExperiment
+
+Loading required package: MatrixGenerics
+
+Loading required package: matrixStats
+
+
+Attaching package: ‘MatrixGenerics’
+
+
+The following objects are masked from ‘package:matrixStats’:
+
+ colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+ colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+ colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+ colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+ colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+ colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+ colWeightedMeans, colWeightedMedians, colWeightedSds,
+ colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+ rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+ rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+ rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+ rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+ rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+ rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+ rowWeightedSds, rowWeightedVars
+
+
+Loading required package: Biobase
+
+Welcome to Bioconductor
+
+ Vignettes contain introductory material; view with
+ 'browseVignettes()'. To cite Bioconductor, see
+ 'citation("Biobase")', and for packages 'citation("pkgname")'.
+
+
+
+Attaching package: ‘Biobase’
+
+
+The following object is masked from ‘package:MatrixGenerics’:
+
+ rowMedians
+
+
+The following objects are masked from ‘package:matrixStats’:
+
+ anyMissing, rowMedians
+
+
+
+Attaching package: ‘dplyr’
+
+
+The following object is masked from ‘package:Biobase’:
+
+ combine
+
+
+The following object is masked from ‘package:matrixStats’:
+
+ count
+
+
+The following objects are masked from ‘package:GenomicRanges’:
+
+ intersect, setdiff, union
+
+
+The following object is masked from ‘package:GenomeInfoDb’:
+
+ intersect
+
+
+The following objects are masked from ‘package:IRanges’:
+
+ collapse, desc, intersect, setdiff, slice, union
+
+
+The following objects are masked from ‘package:S4Vectors’:
+
+ first, intersect, rename, setdiff, setequal, union
+
+
+The following objects are masked from ‘package:BiocGenerics’:
+
+ combine, intersect, setdiff, union
+
+
+The following objects are masked from ‘package:stats’:
+
+ filter, lag
+
+
+The following objects are masked from ‘package:base’:
+
+ intersect, setdiff, setequal, union
+
+
+Loading required package: ggplot2
+
+Loading required package: ggrepel
+
+getwd()
+
+'/content'
+list.files()
+
+
+file_paths <- list.files(pattern = "*.ReadsPerGene.out.tab")
+file_paths
+
+
+# Function to read the STAR ReadsPerGene.out.tab file
+read_star_file <- function(file_path) {
+ # Read the file
+ df <- read.table(file_path, header = FALSE, stringsAsFactors = FALSE)
+
+ # Keep only the first (gene) and second (unstranded counts) columns
+ df <- df %>% select(V1, V2)
+
+ # Rename the columns for clarity (GeneID and counts for this sample)
+ colnames(df) <- c("GeneID", gsub(".ReadsPerGene.out.tab", "", basename(file_path)))
+
+ return(df)
+}
+
+# Read all files into a list of data frames
+list_of_dfs <- lapply(file_paths, read_star_file)
+
+# Merge all data frames by the GeneID column
+merged_df <- Reduce(function(x, y) merge(x, y, by = "GeneID"), list_of_dfs)
+
+merged_df <- merged_df[-c(1:4), ]
+
+# Check the first few rows of the combined data frame
+head(merged_df)
+
+# Optionally, write the combined data frame to a CSV file
+write.csv(merged_df, "combined_gene_counts.csv", row.names = FALSE)
+
+GeneID | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|---|
<chr> | <int> | <int> | <int> | <int> | <int> | <int> | |
5 | WBGene00000001 | 3227 | 2168 | 2589 | 5659 | 2619 | 5239 |
6 | WBGene00000002 | 270 | 203 | 266 | 355 | 191 | 425 |
7 | WBGene00000003 | 341 | 415 | 411 | 387 | 255 | 499 |
8 | WBGene00000004 | 584 | 438 | 518 | 1028 | 541 | 888 |
9 | WBGene00000005 | 383 | 395 | 483 | 119 | 65 | 189 |
10 | WBGene00000006 | 343 | 344 | 334 | 206 | 114 | 220 |
class(list_of_dfs)
+
+'list'
+head(list_of_dfs[[2]])
+
+GeneID | N2_day1_rep2 | |
---|---|---|
<chr> | <int> | |
1 | N_unmapped | 1400596 |
2 | N_multimapping | 1305129 |
3 | N_noFeature | 152183 |
4 | N_ambiguous | 439830 |
5 | WBGene00000003 | 415 |
6 | WBGene00000007 | 513 |
+
+rownames(merged_df) = merged_df$GeneID
+
+
+head(merged_df)
+
+GeneID | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|---|
<chr> | <int> | <int> | <int> | <int> | <int> | <int> | |
WBGene00000001 | WBGene00000001 | 3227 | 2168 | 2589 | 5659 | 2619 | 5239 |
WBGene00000002 | WBGene00000002 | 270 | 203 | 266 | 355 | 191 | 425 |
WBGene00000003 | WBGene00000003 | 341 | 415 | 411 | 387 | 255 | 499 |
WBGene00000004 | WBGene00000004 | 584 | 438 | 518 | 1028 | 541 | 888 |
WBGene00000005 | WBGene00000005 | 383 | 395 | 483 | 119 | 65 | 189 |
WBGene00000006 | WBGene00000006 | 343 | 344 | 334 | 206 | 114 | 220 |
# NULL reserved word representing empty
+merged_df$GeneID = NULL
+head(merged_df)
+
+N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> | <int> | |
WBGene00000001 | 3227 | 2168 | 2589 | 5659 | 2619 | 5239 |
WBGene00000002 | 270 | 203 | 266 | 355 | 191 | 425 |
WBGene00000003 | 341 | 415 | 411 | 387 | 255 | 499 |
WBGene00000004 | 584 | 438 | 518 | 1028 | 541 | 888 |
WBGene00000005 | 383 | 395 | 483 | 119 | 65 | 189 |
WBGene00000006 | 343 | 344 | 334 | 206 | 114 | 220 |
subset_df4test <- merged_df[, c("N2_day1_rep1", "N2_day1_rep2", "N2_day1_rep3")]
+head(subset_df4test)
+
+N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | |
---|---|---|---|
<int> | <int> | <int> | |
WBGene00000001 | 3227 | 2168 | 2589 |
WBGene00000002 | 270 | 203 | 266 |
WBGene00000003 | 341 | 415 | 411 |
WBGene00000004 | 584 | 438 | 518 |
WBGene00000005 | 383 | 395 | 483 |
WBGene00000006 | 343 | 344 | 334 |
Different samples have different total number of counts
+as.data.frame(colSums(merged_df))
+
+colSums(merged_df) | |
---|---|
<dbl> | |
N2_day1_rep1 | 37398898 |
N2_day1_rep2 | 29488709 |
N2_day1_rep3 | 34593136 |
N2_day7_rep1 | 48275683 |
N2_day7_rep2 | 23204449 |
N2_day7_rep3 | 46005617 |
barplot(colSums(merged_df),
+ las = 2,
+ cex.names= 0.6)
+
+# inkscape: an open source software for editing the graph saved in pdf
+pdf("total_count_barplot.pdf")
+barplot(colSums(merged_df),
+ las = 2,
+ cex.names= 0.6)
+dev.off()
+
+pdf: 2
+coldata <- colnames(merged_df)
+coldata_df <- cbind(group = gsub("_rep\\d", "", coldata))
+coldata_df
+
+group |
---|
N2_day1 |
N2_day1 |
N2_day1 |
N2_day7 |
N2_day7 |
N2_day7 |
rownames(coldata_df) = coldata
+coldata_df
+
+group | |
---|---|
N2_day1_rep1 | N2_day1 |
N2_day1_rep2 | N2_day1 |
N2_day1_rep3 | N2_day1 |
N2_day7_rep1 | N2_day7 |
N2_day7_rep2 | N2_day7 |
N2_day7_rep3 | N2_day7 |
dds <- DESeqDataSetFromMatrix(countData = merged_df,
+ colData = coldata_df,
+ design =~ group)
+
+Warning message in DESeqDataSet(se, design = design, ignoreRank):
+“some variables in design formula are characters, converting to factors”
+
+class(dds)
+
+'DESeqDataSet'
+The DESeq()
function normalizes the read counts,estimates dispersions, and fits the linear model, all in one go.
dds <- DESeq(dds)
+
+estimating size factors
+
+estimating dispersions
+
+gene-wise dispersion estimates
+
+mean-dispersion relationship
+
+final dispersion estimates
+
+fitting model and testing
+
+Dispersion is a measure of spread or variability in the data. Variance, standard deviation, IQR, among other measures, can all be used to measure dispersion.
+DESeq2 uses a specific measure of dispersion (α) related to the mean (μ) and variance of the data: Var = μ + α*μ^2.
+sizeFactors(dds)
+
+
+head(counts(dds, normalized = TRUE))
+
+N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
WBGene00000001 | 3219.3986 | 2674.1107 | 2740.3156 | 4313.59368 | 3660.57703 | 3673.8895 |
WBGene00000002 | 269.3640 | 250.3895 | 281.5465 | 270.60006 | 266.96075 | 298.0346 |
WBGene00000003 | 340.1968 | 511.8800 | 435.0211 | 294.99218 | 356.41357 | 349.9276 |
WBGene00000004 | 582.6243 | 540.2493 | 548.2748 | 783.59680 | 756.15585 | 622.7169 |
WBGene00000005 | 382.0978 | 487.2111 | 511.2292 | 90.70819 | 90.85052 | 132.5377 |
WBGene00000006 | 342.1920 | 424.3054 | 353.5208 | 157.02426 | 159.33783 | 154.2767 |
The function plotDispEsts
shows the dispersion by mean of normalized counts. We expect the dispersion to decrease as the mean of normalized counts increases.
The functions shows:
+black per-gene dispersion estimates
+a red trend line representing the global relationship between dispersion and normalized count
+blue 'shrunken' values moderating individual dispersion estimates by the global relationship
+blue-circled dispersion outliers with high gene-wise dispersion that were not adjusted.
+plotDispEsts(dds)
+
+
+
+The function plotCounts
is used to plot normalized counts plus a pseudocount of 0.5 by default.
vsd <- vst(dds, blind=FALSE)
+
+class(vsd)
+
+'DESeqTransform'
+plotPCA(vsd, intgroup = c("group"))
+
+using ntop=500 top features by variance
+
+results
functionres <- results(dds)
+res
+
+log2 fold change (MLE): group N2 day7 vs N2 day1
+Wald test p-value: group N2 day7 vs N2 day1
+DataFrame with 46926 rows and 6 columns
+ baseMean log2FoldChange lfcSE stat pvalue
+ <numeric> <numeric> <numeric> <numeric> <numeric>
+WBGene00000001 3380.314 0.4320800 0.136502 3.165370 1.54886e-03
+WBGene00000002 272.816 0.0620929 0.165747 0.374626 7.07939e-01
+WBGene00000003 381.405 -0.3623262 0.199673 -1.814594 6.95864e-02
+WBGene00000004 638.936 0.3698102 0.151232 2.445312 1.44727e-02
+WBGene00000005 282.439 -2.1244976 0.227771 -9.327337 1.08564e-20
+... ... ... ... ... ...
+WBGene00306078 0.566369 0.947326 3.076775 0.307896 7.58162e-01
+WBGene00306080 0.243919 1.429602 4.042905 0.353608 7.23633e-01
+WBGene00306081 27.265033 -3.108820 0.627823 -4.951747 7.35501e-07
+WBGene00306121 14.219195 -0.210100 0.691162 -0.303981 7.61142e-01
+WBGene00306122 0.000000 NA NA NA NA
+ padj
+ <numeric>
+WBGene00000001 4.06703e-03
+WBGene00000002 7.84660e-01
+WBGene00000003 1.17161e-01
+WBGene00000004 2.96896e-02
+WBGene00000005 1.92160e-19
+... ...
+WBGene00306078 NA
+WBGene00306080 NA
+WBGene00306081 3.70005e-06
+WBGene00306121 8.26621e-01
+WBGene00306122 NA
+
+class(res)
+
+'DESeqResults'
+mcols(res)$description
+
+
+baseMean
: mean of normalized counts for all sampleslog2FoldChange
: log2 fold changelfcSE
: standard errorstat
: Wald statisticpvalue
: Wald test p-valuepadj
: BH adjusted p-valuesIf we used the p-value directly from the Wald test with a significance cut-off of p < 0.05, that means there is a 5% chance it is a false positives. Each p-value is the result of a single test (single gene). The more genes we test, the more we inflate the false positive rate. This is the multiple testing problem. For example, if we test 20,000 genes for differential expression, at p < 0.05 we would expect to find 1,000 genes by chance. If we found 3000 genes to be differentially expressed total, roughly one third of our genes are false positives. We would not want to sift through our “significant” genes to identify which ones are true positives.
+DESeq2 helps reduce the number of genes tested by removing those genes unlikely to be significantly DE prior to testing, such as those with low number of counts and outlier samples (gene-level QC). However, we still need to correct for multiple testing to reduce the number of false positives, and there are a few common approaches:
+Bonferroni: The adjusted p-value is calculated by: p-value * m (m = total number of tests). This is a very conservative approach with a high probability of false negatives, so is generally not recommended.
+FDR/Benjamini-Hochberg: Benjamini and Hochberg (1995) defined the concept of FDR and created an algorithm to control the expected FDR below a specified level given a list of independent p-values. An interpretation of the BH method for controlling the FDR is implemented in DESeq2 in which we rank the genes by p-value, then multiply each ranked p-value by m/rank.
+Q-value / Storey method: The minimum FDR that can be attained when calling that feature significant. For example, if gene X has a q-value of 0.013 it means that 1.3% of genes that show p-values at least as small as gene X are false positives +In DESeq2, the p-values attained by the Wald test are corrected for multiple testing using the Benjamini and Hochberg method by default. There are options to use other methods in the results() function. The p-adjusted values should be used to determine significant genes. The significant genes can be output for visualization and/or functional analysis.
+So what does FDR < 0.05 mean? By setting the FDR cutoff to < 0.05, we’re saying that the proportion of false positives we expect amongst our differentially expressed genes is 5%. For example, if you call 500 genes as differentially expressed with an FDR cutoff of 0.05, you expect 25 of them to be false positives.
+Note on p-values set to NA
: some values in the results table can be set to NA
for one of the following reasons:
If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.
+If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA. These outlier counts are detected by Cook’s distance. Customization of this outlier filtering and description of functionality for replacement of outlier counts and refitting is described below
+If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below
+plotMA(res, ylim=c(-2,2))
+
+
+write.csv(res, file = "BIOI_bulkRNAseq_SE_DESeq2_res.csv")
+
+d <- plotCounts(dds, gene=which.min(res$padj), intgroup="group",
+ returnData=TRUE)
+
+d
+
+count | group | |
---|---|---|
<dbl> | <fct> | |
N2_day1_rep1 | 37716.448 | N2_day1 |
N2_day1_rep2 | 46554.449 | N2_day1 |
N2_day1_rep3 | 45402.524 | N2_day1 |
N2_day7_rep1 | 1544.064 | N2_day7 |
N2_day7_rep2 | 1564.527 | N2_day7 |
N2_day7_rep3 | 1489.270 | N2_day7 |
which.min(res$padj)
+
+465
+library("ggplot2")
+ggplot(d, aes(x=group, y=count)) +
+ geom_point(position=position_jitter(w=0.1,h=0)) +
+ scale_y_log10(breaks=c(25,100,400))
+
+ EnhancedVolcano(res,
+ lab = rownames(res),
+ x = 'log2FoldChange',
+ y = 'pvalue')
+
+Warning message:
+“One or more p-values is 0. Converting to 10^-1 * current lowest non-zero p-value...”
+
+The code below creates a new column called pseudo_reference
that contains the average log-transformed expression value for each gene across all samples. This pseudo-reference is similar to calculating a "reference sample" to compare other samples.
log_data = log(merged_df)
+head(log_data)
+
+N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
WBGene00000001 | 8.079308 | 7.681560 | 7.859027 | 8.641002 | 7.870548 | 8.563886 |
WBGene00000002 | 5.598422 | 5.313206 | 5.583496 | 5.872118 | 5.252273 | 6.052089 |
WBGene00000003 | 5.831882 | 6.028279 | 6.018593 | 5.958425 | 5.541264 | 6.212606 |
WBGene00000004 | 6.369901 | 6.082219 | 6.249975 | 6.935370 | 6.293419 | 6.788972 |
WBGene00000005 | 5.948035 | 5.978886 | 6.180017 | 4.779123 | 4.174387 | 5.241747 |
WBGene00000006 | 5.837730 | 5.840642 | 5.811141 | 5.327876 | 4.736198 | 5.393628 |
library(dplyr)
+library(tibble) # rownames_to_column
+
+log_data = log_data %>%
+ rownames_to_column('gene') %>%
+ mutate (pseudo_reference = rowMeans(log_data))
+
+head(log_data)
+
+gene | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | pseudo_reference | |
---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | WBGene00000001 | 8.079308 | 7.681560 | 7.859027 | 8.641002 | 7.870548 | 8.563886 | 8.115889 |
2 | WBGene00000002 | 5.598422 | 5.313206 | 5.583496 | 5.872118 | 5.252273 | 6.052089 | 5.611934 |
3 | WBGene00000003 | 5.831882 | 6.028279 | 6.018593 | 5.958425 | 5.541264 | 6.212606 | 5.931841 |
4 | WBGene00000004 | 6.369901 | 6.082219 | 6.249975 | 6.935370 | 6.293419 | 6.788972 | 6.453309 |
5 | WBGene00000005 | 5.948035 | 5.978886 | 6.180017 | 4.779123 | 4.174387 | 5.241747 | 5.383699 |
6 | WBGene00000006 | 5.837730 | 5.840642 | 5.811141 | 5.327876 | 4.736198 | 5.393628 | 5.491203 |
table(log_data$pseudo_reference == "-Inf")
+
+FALSE TRUE
+16951 29975
+
+filtered_log_data = log_data %>% filter(pseudo_reference != "-Inf")
+head(filtered_log_data)
+
+gene | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | pseudo_reference | |
---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | WBGene00000001 | 8.079308 | 7.681560 | 7.859027 | 8.641002 | 7.870548 | 8.563886 | 8.115889 |
2 | WBGene00000002 | 5.598422 | 5.313206 | 5.583496 | 5.872118 | 5.252273 | 6.052089 | 5.611934 |
3 | WBGene00000003 | 5.831882 | 6.028279 | 6.018593 | 5.958425 | 5.541264 | 6.212606 | 5.931841 |
4 | WBGene00000004 | 6.369901 | 6.082219 | 6.249975 | 6.935370 | 6.293419 | 6.788972 | 6.453309 |
5 | WBGene00000005 | 5.948035 | 5.978886 | 6.180017 | 4.779123 | 4.174387 | 5.241747 | 5.383699 |
6 | WBGene00000006 | 5.837730 | 5.840642 | 5.811141 | 5.327876 | 4.736198 | 5.393628 | 5.491203 |
filtered_log_data$pseudo_reference = exp(filtered_log_data$pseudo_reference)
+
+head(filtered_log_data)
+
+gene | N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | pseudo_reference | |
---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | WBGene00000001 | 8.079308 | 7.681560 | 7.859027 | 8.641002 | 7.870548 | 8.563886 | 3347.2307 |
2 | WBGene00000002 | 5.598422 | 5.313206 | 5.583496 | 5.872118 | 5.252273 | 6.052089 | 273.6730 |
3 | WBGene00000003 | 5.831882 | 6.028279 | 6.018593 | 5.958425 | 5.541264 | 6.212606 | 376.8478 |
4 | WBGene00000004 | 6.369901 | 6.082219 | 6.249975 | 6.935370 | 6.293419 | 6.788972 | 634.7996 |
5 | WBGene00000005 | 5.948035 | 5.978886 | 6.180017 | 4.779123 | 4.174387 | 5.241747 | 217.8266 |
6 | WBGene00000006 | 5.837730 | 5.840642 | 5.811141 | 5.327876 | 4.736198 | 5.393628 | 242.5487 |
dim(log_data)
+dim(filtered_log_data)
+
+
+This step calculates the fold change between each sample and the pseudo-reference for each gene.
+ratio_data = sweep(exp(filtered_log_data[,2:7]), 1, filtered_log_data[,8], "/")
+head(ratio_data)
+
+N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | 0.9640805 | 0.6476996 | 0.7734752 | 1.6906513 | 0.7824378 | 1.5651745 |
2 | 0.9865787 | 0.7417610 | 0.9719628 | 1.2971683 | 0.6979131 | 1.5529480 |
3 | 0.9048746 | 1.1012403 | 1.0906259 | 1.0269398 | 0.6766657 | 1.3241420 |
4 | 0.9199753 | 0.6899815 | 0.8160055 | 1.6194086 | 0.8522374 | 1.3988666 |
5 | 1.7582795 | 1.8133692 | 2.2173603 | 0.5463062 | 0.2984025 | 0.8676627 |
6 | 1.4141490 | 1.4182718 | 1.3770430 | 0.8493139 | 0.4700087 | 0.9070343 |
+
+The code below computes the median fold change for each sample across all genes.
+
+
+scaling_factors = apply(ratio_data, 2, median, na.rm = TRUE)
+scaling_factors
+
+
+The 2
indicates that the function is applied to columns, i.e., for each sample.
This step below normalizes each sample by its scaling factors, making the data comparable across samples. The result is a normalized gene expression matrix.
+manually_normalized = sweep(merged_df, 2, scaling_factors, "/")
+head(manually_normalized)
+
+N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
WBGene00000001 | 3219.3986 | 2674.1107 | 2740.3156 | 4313.59368 | 3660.57703 | 3673.8895 |
WBGene00000002 | 269.3640 | 250.3895 | 281.5465 | 270.60006 | 266.96075 | 298.0346 |
WBGene00000003 | 340.1968 | 511.8800 | 435.0211 | 294.99218 | 356.41357 | 349.9276 |
WBGene00000004 | 582.6243 | 540.2493 | 548.2748 | 783.59680 | 756.15585 | 622.7169 |
WBGene00000005 | 382.0978 | 487.2111 | 511.2292 | 90.70819 | 90.85052 | 132.5377 |
WBGene00000006 | 342.1920 | 424.3054 | 353.5208 | 157.02426 | 159.33783 | 154.2767 |
hist(manually_normalized$N2_day1_rep1)
+
+The code below shows that the size factors and the normalized read counts calculated by ourselves are the same as what DESeq2 function returns.
+head(counts(dds, normalized = TRUE))
+
+
+N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
WBGene00000001 | 3219.3986 | 2674.1107 | 2740.3156 | 4313.59368 | 3660.57703 | 3673.8895 |
WBGene00000002 | 269.3640 | 250.3895 | 281.5465 | 270.60006 | 266.96075 | 298.0346 |
WBGene00000003 | 340.1968 | 511.8800 | 435.0211 | 294.99218 | 356.41357 | 349.9276 |
WBGene00000004 | 582.6243 | 540.2493 | 548.2748 | 783.59680 | 756.15585 | 622.7169 |
WBGene00000005 | 382.0978 | 487.2111 | 511.2292 | 90.70819 | 90.85052 | 132.5377 |
WBGene00000006 | 342.1920 | 424.3054 | 353.5208 | 157.02426 | 159.33783 | 154.2767 |
sizeFactors(dds)
+
+
+sessionInfo()
+
+R version 4.4.1 (2024-06-14)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats4 stats graphics grDevices utils datasets methods
+[8] base
+
+other attached packages:
+ [1] tibble_3.2.1 EnhancedVolcano_1.22.0
+ [3] ggrepel_0.9.6 ggplot2_3.5.1
+ [5] dplyr_1.1.4 DESeq2_1.44.0
+ [7] SummarizedExperiment_1.34.0 Biobase_2.64.0
+ [9] MatrixGenerics_1.16.0 matrixStats_1.4.1
+[11] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
+[13] IRanges_2.38.1 S4Vectors_0.42.1
+[15] BiocGenerics_0.50.0
+
+loaded via a namespace (and not attached):
+ [1] generics_0.1.3 utf8_1.2.4 SparseArray_1.4.8
+ [4] lattice_0.22-6 magrittr_2.0.3 digest_0.6.37
+ [7] evaluate_1.0.1 grid_4.4.1 pbdZMQ_0.3-13
+[10] fastmap_1.2.0 jsonlite_1.8.9 Matrix_1.7-1
+[13] BiocManager_1.30.25 httr_1.4.7 fansi_1.0.6
+[16] UCSC.utils_1.0.0 scales_1.3.0 codetools_0.2-20
+[19] abind_1.4-8 cli_3.6.3 rlang_1.1.4
+[22] crayon_1.5.3 XVector_0.44.0 munsell_0.5.1
+[25] withr_3.0.1 base64enc_0.1-3 repr_1.1.7
+[28] DelayedArray_0.30.1 S4Arrays_1.4.1 tools_4.4.1
+[31] parallel_4.4.1 uuid_1.2-1 BiocParallel_1.38.0
+[34] colorspace_2.1-1 locfit_1.5-9.10 GenomeInfoDbData_1.2.12
+[37] IRdisplay_1.1 vctrs_0.6.5 R6_2.5.1
+[40] lifecycle_1.0.4 zlibbioc_1.50.0 pkgconfig_2.0.3
+[43] pillar_1.9.0 gtable_0.3.5 glue_1.8.0
+[46] Rcpp_1.0.13 tidyselect_1.2.1 IRkernel_1.3.2
+[49] farver_2.1.2 htmltools_0.5.8.1 labeling_0.4.3
+[52] compiler_4.4.1
+
+
+ You can download the data from the link here
+# https://www.10xgenomics.com/datasets/5k-human-pbmcs-3-v3-1-chromium-controller-3-1-standard
+wget https://cf.10xgenomics.com/samples/cell-exp/7.0.1/SC3pv3_GEX_Human_PBMC/SC3pv3_GEX_Human_PBMC_fastqs.tar
+tar xvf SC3pv3_GEX_Human_PBMC_fastqs.tar
+
+For this class, you can find a copy under the path:
+%%bash
+ls /scratch/zt1/project/bioi611/shared/raw_data/Chromium_3p_GEX_Human_PBMC_fastqs/
+
+Chromium_3p_GEX_Human_PBMC_S1_L001_I1_001.fastq.gz
+Chromium_3p_GEX_Human_PBMC_S1_L001_I2_001.fastq.gz
+Chromium_3p_GEX_Human_PBMC_S1_L001_R1_001.fastq.gz
+Chromium_3p_GEX_Human_PBMC_S1_L001_R2_001.fastq.gz
+
+Read | +Read 1 | +i7 Index | +i5 Index | +Read 2 | +
---|---|---|---|---|
Purpose | +Cell barcode & UMI | +Sample Index | +Sample Index | +Insert | +
Length** | +28 | +10 | +10 | +90 | +
An Unique Molecular Identifier (UMI) is a short sequence tag (usually 8-12 nucleotides long) that is added to each RNA molecule during library preparation. UMIs are critical for accurately quantifying gene expression, as they help to distinguish between unique RNA molecules and technical duplicates that arise from PCR amplification.
+How ow UMIs work in the 10x scRNA-seq workflow:
+Library Preparation: Each RNA molecule is tagged with a UMI as well as cell-specific barcodes. This labeling occurs before PCR amplification, so each original RNA molecule within a single cell is uniquely identifiable.
+Eliminating Amplification Bias: When the tagged molecules are amplified by PCR, each original molecule (regardless of how many duplicates it creates) retains its unique UMI. Later, when sequencing reads are aligned and counted, duplicate reads with the same UMI and gene alignment are considered as representing a single molecule.
+Accurate Quantification: Using UMIs allows for a more accurate measure of gene expression by avoiding overcounting due to PCR duplicates, providing a closer representation of the actual RNA molecules present in each cell.
+%%bash
+zcat /scratch/zt1/project/bioi611/shared/raw_data/Chromium_3p_GEX_Human_PBMC_fastqs/Chromium_3p_GEX_Human_PBMC_S1_L001_R1_001.fastq.gz |head -12
+
+@A00836:523:HJH22DSXY:1:1101:1823:1016 1:N:0:ATGGAGGGAG+AATGGGTTAT
+TNATGGACAAACAGGCCGTTGCACTAAA
++
+F#FFFFFFFFFFF:FFFFFFFFFFFFFF
+@A00836:523:HJH22DSXY:1:1101:1841:1016 1:N:0:ATGGAGGGAG+AATGGGTTAT
+TNGTGATGTTCTTGTTCTCACTCGAGGT
++
+F#FFFFFFFFFFFFFFFFFFFFFFFFFF
+@A00836:523:HJH22DSXY:1:1101:1949:1016 1:N:0:ATGGAGGGAG+AATGGGTTAT
+ANACAGGGTCCTACGGTTCATCTTTGTG
++
+F#FFFFFFFFFFFFFFFFFFFFFFFFFF
+
+%%bash
+zcat /scratch/zt1/project/bioi611/shared/raw_data/Chromium_3p_GEX_Human_PBMC_fastqs/Chromium_3p_GEX_Human_PBMC_S1_L001_R2_001.fastq.gz |head -12
+
+@A00836:523:HJH22DSXY:1:1101:1823:1016 2:N:0:ATGGAGGGAG+AATGGGTTAT
+GGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCAAGACAGGTGAACTGCTCGAGGCCGGGAGTTTGAGACCAGCCTGGACAACATGGC
++
+FFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF,F,FFF:FFFFFFFFFF
+@A00836:523:HJH22DSXY:1:1101:1841:1016 2:N:0:ATGGAGGGAG+AATGGGTTAT
+CAGGGCCTGTTGGGGGTTGGGGGCAAGGAGAGGGAGAGCATTAGGACAAATACCTAATGTGTGTGGGGCTTAAAACCTAGATGACGGGTT
++
+FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF
+@A00836:523:HJH22DSXY:1:1101:1949:1016 2:N:0:ATGGAGGGAG+AATGGGTTAT
+TTTTTTTTGTTCAAATGATTTTAATTATTGGAATGCACAATTTTTTTAATATGCAAATAAAAAGTTTAAAAACCAAAAAAAAAAAAAAGA
++
+FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF,:,:
+
+Cell Ranger is a software suite developed by 10x Genomics for processing and analyzing data from their single-cell RNA-seq (scRNA-seq), single-cell ATAC-seq (scATAC-seq), and other single-cell assays. Cell Ranger performs tasks such as alignment, filtering, UMI counting, and data aggregation, streamlining the analysis of single-cell datasets generated by 10x Genomics platforms.
+Key Functions of cellranger count
:
Preprocessing and Alignment: Cell Ranger aligns the reads to a reference genome and uses cell and UMI barcodes to assign each read to a specific cell and RNA molecule. It leverages the STAR aligner for RNA-seq data.
+UMI Counting: Once the reads are aligned, Cell Ranger aggregates UMIs per gene per cell, providing an accurate gene expression count that minimizes PCR amplification bias.
+Gene Expression Quantification: For scRNA-seq data, Cell Ranger creates a gene expression matrix with rows for genes, columns for cells, and values representing UMI counts. This matrix is foundational for downstream analysis, including cell clustering, differential expression, and pathway analysis.
+Cell Clustering and Visualization: Cell Ranger includes tools to cluster cells based on gene expression patterns and create basic visualizations (e.g., t-SNE or UMAP plots) for exploratory data analysis.
+%%bash
+sbatch /scratch/zt1/project/bioi611/shared/scripts/scRNA_10x_illumina_demo_cellranger.sub
+
+Submitted batch job 8653857
+
+%%bash
+cat /scratch/zt1/project/bioi611/shared/scripts/scRNA_10x_illumina_demo_cellranger.sub
+
+#!/bin/bash
+#SBATCH --partition=standard
+#SBATCH -t 40:00:00
+#SBATCH -n 1
+#SBATCH -c 26
+#SBATCH --mem=250g
+#SBATCH --job-name=scRNA_10x_illumina_demo_cellranger
+#SBATCH --mail-type=FAIL,BEGIN,END
+#SBATCH --error=%x-%J-%u.err
+#SBATCH --output=%x-%J-%u.out
+
+
+## Prepare the input folder
+WORKDIR="/scratch/zt1/project/bioi611/user/$USER/scRNA_10x_illumina_demo/"
+REFERENCE="/scratch/zt1/project/bioi611/shared/reference/refdata-gex-GRCh38-2024-A"
+FASTQ_DIR=/scratch/zt1/project/bioi611/shared/raw_data/Chromium_3p_GEX_Human_PBMC_fastqs/
+
+mkdir -p $WORKDIR
+cd $WORKDIR
+export PATH=/scratch/zt1/project/bioi611/shared/software/cellranger-8.0.1/bin:$PATH
+
+cellranger count --id GEX3p_Human_PBMC \
+ --transcriptome $REFERENCE \
+ --create-bam true \
+ --fastqs $FASTQ_DIR
+
+Input files/folder for cellranger count
:
Raw fastq files
+Reference
+In this class, the pre-built reference has been downloaded from 10x website: +https://www.10xgenomics.com/support/software/cell-ranger/downloads
+%%bash
+ls /scratch/zt1/project/bioi611/shared/reference/refdata-gex-GRCh38-2024-A
+ls /scratch/zt1/project/bioi611/shared/reference/refdata-gex-GRCh38-2024-A/fasta/
+ls /scratch/zt1/project/bioi611/shared/reference/refdata-gex-GRCh38-2024-A/genes/
+ls /scratch/zt1/project/bioi611/shared/reference/refdata-gex-GRCh38-2024-A/star/
+cat /scratch/zt1/project/bioi611/shared/reference/refdata-gex-GRCh38-2024-A/reference.json
+
+fasta
+genes
+reference.json
+star
+genome.fa
+genome.fa.fai
+genes.gtf.gz
+chrLength.txt
+chrNameLength.txt
+chrName.txt
+chrStart.txt
+exonGeTrInfo.tab
+exonInfo.tab
+geneInfo.tab
+Genome
+genomeParameters.txt
+SA
+SAindex
+sjdbInfo.txt
+sjdbList.fromGTF.out.tab
+sjdbList.out.tab
+transcriptInfo.tab
+{
+ "fasta_hash": "b6f131840f9f337e7b858c3d1e89d7ce0321b243",
+ "genomes": [
+ "GRCh38"
+ ],
+ "gtf_hash.gz": "432db3ab308171ef215fac5dc4ca40096099a4c6",
+ "input_fasta_files": [
+ "Homo_sapiens.GRCh38.dna.primary_assembly.fa.modified"
+ ],
+ "input_gtf_files": [
+ "gencode.v44.primary_assembly.annotation.gtf.filtered"
+ ],
+ "mem_gb": 16,
+ "mkref_version": "8.0.0",
+ "threads": 2,
+ "version": "2024-A"
+}
+
+The job you submitted will generate an output folder:
+%%bash
+ls /scratch/zt1/project/bioi611/user/$USER/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/*
+
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_cmdline
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_filelist
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_finalstate
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/GEX3p_Human_PBMC.mri.tgz
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_invocation
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_jobmode
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_log
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_mrosource
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_perf
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_perf._truncated_
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_sitecheck
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_tags
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_timestamp
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_uuid
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_vdrkill
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/_versions
+
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/outs:
+analysis
+cloupe.cloupe
+filtered_feature_bc_matrix
+filtered_feature_bc_matrix.h5
+metrics_summary.csv
+molecule_info.h5
+possorted_genome_bam.bam
+possorted_genome_bam.bam.bai
+raw_feature_bc_matrix
+raw_feature_bc_matrix.h5
+web_summary.html
+
+/scratch/zt1/project/bioi611/user/xie186/scRNA_10x_illumina_demo/GEX3p_Human_PBMC/SC_RNA_COUNTER_CS:
+CELLRANGER_PREFLIGHT
+CELLRANGER_PREFLIGHT_LOCAL
+COPY_CHEMISTRY_SPEC
+fork0
+FULL_COUNT_INPUTS
+GET_AGGREGATE_BARCODES_OUT
+SC_MULTI_CORE
+_STRUCTIFY
+WRITE_GENE_INDEX
+
+You can also find a copy here:
+/scratch/zt1/project/bioi611/shared/output/scRNA_10x_illumina_demo
+
+The summary file in html format is the first file you will check:
+/scratch/zt1/project/bioi611/shared/output/scRNA_10x_illumina_demo/outs/web_summary.html
+
+A detailed documentation can be found here to help you interpret the summary file.
+If there is any metrics that is not within the expectation, a warning or an error message will be shown on the top of the summary.
+The three files below are usually used as input for downstream analysis.
+%%bash
+ls /scratch/zt1/project/bioi611/shared/output/scRNA_10x_illumina_demo/outs/filtered_feature_bc_matrix
+
+barcodes.tsv.gz
+features.tsv.gz
+matrix.mtx.gz
+
+The folder above contains only detected cell-associated +barcodes. Each element of the matrix is the +number of UMIs associated with a feature +(row) and a barcode (column.
+It can be input into third-party packages +and allows users to wrangle the barcodefeature matrix (e.g. to filter outlier cells, run +dimensionality reduction, normalize gene +expression).
+ +id | group |
---|---|
<chr> | <dbl> |
sample01 | 1 |
sample02 | 1 |
sample03 | 1 |
sample04 | 1 |
sample05 | 1 |
sample06 | 1 |
sample07 | 1 |
sample08 | 1 |
sample09 | 1 |
sample10 | 1 |
sample11 | 0 |
sample12 | 0 |
sample13 | 0 |
sample14 | 0 |
sample15 | 0 |
sample16 | 0 |
sample17 | 0 |
sample18 | 0 |
sample19 | 0 |
sample20 | 0 |
geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
10 | XLOC_000010 | TCONS_00000010 | transcript | 10 | 3.193499 | 0.01381576 | 0.10521233 | |
25 | XLOC_000014 | TCONS_00000017 | transcript | 25 | 1.549093 | 0.26773622 | 0.79114975 | |
35 | XLOC_000017 | TCONS_00000020 | transcript | 35 | 4.388626 | 0.01085070 | 0.08951825 | |
41 | XLOC_000246 | TCONS_00000598 | transcript | 41 | 1.440519 | 0.47108019 | 0.90253747 | |
45 | XLOC_000019 | TCONS_00000024 | transcript | 45 | 1.714340 | 0.08402948 | 0.48934813 | |
67 | XLOC_000255 | TCONS_00000613 | transcript | 67 | 2.518524 | 0.27317385 | 0.79114975 |
geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
1225 | XLOC_000440 | TCONS_00001129 | transcript | 1225 | 5.67437122 | 1.035753e-05 | 0.001025395 | |
980 | XLOC_000179 | TCONS_00000452 | transcript | 980 | 6.25344921 | 2.514632e-05 | 0.001244743 | |
469 | XLOC_000101 | TCONS_00000244 | transcript | 469 | 119.29999938 | 2.398681e-04 | 0.007915648 | |
695 | XLOC_000354 | TCONS_00000883 | transcript | 695 | 0.21950959 | 3.302059e-04 | 0.008172596 | |
1012 | XLOC_000409 | TCONS_00001041 | transcript | 1012 | 0.24664434 | 1.527175e-03 | 0.030238073 | |
123 | XLOC_000029 | TCONS_00000059 | transcript | 123 | 0.01603345 | 2.097875e-03 | 0.034614939 | |
961 | XLOC_000176 | TCONS_00000435 | transcript | 961 | 4.96074399 | 2.736075e-03 | 0.038695918 | |
880 | XLOC_000531 | TCONS_00001277 | transcript | 880 | 29.40485236 | 3.272859e-03 | 0.040501628 | |
1063 | XLOC_000197 | TCONS_00000487 | transcript | 1063 | 3.12988187 | 4.555313e-03 | 0.050108442 | |
35 | XLOC_000017 | TCONS_00000020 | transcript | 35 | 4.38862590 | 1.085070e-02 | 0.089518247 |
geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
469 | XLOC_000101 | TCONS_00000244 | transcript | 469 | 119.299999 | 0.0002398681 | 0.007915648 | |
477 | XLOC_000101 | TCONS_00000252 | transcript | 477 | 3.128184 | 0.4511803632 | 0.902537469 |
if (!require("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
+
+BiocManager::install("ballgown")
+
+'getOption("repos")' replaces Bioconductor standard repositories, see
+'help("repositories", package = "BiocManager")' for details.
+Replacement repositories:
+ CRAN: https://cran.rstudio.com
+
+Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)
+
+Warning message:
+“package(s) not installed when version(s) same as or greater than current; use
+ `force = TRUE` to re-install: 'ballgown'”
+Old packages: 'Matrix'
+
+library(ballgown)
+data_directory = system.file('extdata', package='ballgown') # automatically finds ballgown's installation directory
+# examine data_directory:
+data_directory
+
+Attaching package: ‘ballgown’
+
+
+The following object is masked from ‘package:base’:
+
+ structure
+
+'/usr/local/lib/R/site-library/ballgown/extdata'
+list.files(data_directory)
+
+
+# make the ballgown object:
+bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
+bg
+
+Mon Oct 21 17:42:05 2024
+
+Mon Oct 21 17:42:05 2024: Reading linking tables
+
+Mon Oct 21 17:42:05 2024: Reading intron data files
+
+Mon Oct 21 17:42:05 2024: Merging intron data
+
+Mon Oct 21 17:42:05 2024: Reading exon data files
+
+Mon Oct 21 17:42:05 2024: Merging exon data
+
+Mon Oct 21 17:42:05 2024: Reading transcript data files
+
+Mon Oct 21 17:42:05 2024: Merging transcript data
+
+Wrapping up the results
+
+Mon Oct 21 17:42:05 2024
+
+
+
+
+ballgown instance with 100 transcripts and 20 samples
+
+
+
+A ballgown object has six slots: structure, expr, indexes, dirs, mergedDate, and meas.
+Exon, intron, and transcript structures are easily extracted from the main ballgown object:
+structure(bg)$exon
+
+
+GRanges object with 633 ranges and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] 18 24412069-24412331 * | 12 10
+ [2] 22 17308271-17308950 + | 55 25
+ [3] 22 17309432-17310226 + | 56 25
+ [4] 22 18121428-18121652 + | 88 35
+ [5] 22 18138428-18138598 + | 89 35
+ ... ... ... ... . ... ...
+ [629] 22 51221929-51222113 - | 3777 1294
+ [630] 22 51221319-51221473 - | 3782 1297
+ [631] 22 51221929-51222162 - | 3783 1297
+ [632] 22 51221929-51222168 - | 3784 1301
+ [633] 6 31248149-31248334 * | 3794 1312
+ -------
+ seqinfo: 3 sequences from an unspecified genome; no seqlengths
+
+structure(bg)$intron
+
+
+GRanges object with 536 ranges and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] 22 17308951-17309431 + | 33 25
+ [2] 22 18121653-18138427 + | 57 35
+ [3] 22 18138599-18185008 + | 58 35
+ [4] 22 18185153-18209442 + | 59 35
+ [5] 22 18385514-18387397 - | 72 41
+ ... ... ... ... . ... ...
+ [532] 22 51216410-51220615 - | 2750 c(1294, 1297, 1301)
+ [533] 22 51220776-51221928 - | 2756 1294
+ [534] 22 51220780-51221318 - | 2757 1297
+ [535] 22 51221474-51221928 - | 2758 1297
+ [536] 22 51220780-51221928 - | 2759 1301
+ -------
+ seqinfo: 1 sequence from an unspecified genome; no seqlengths
+
+structure(bg)$trans
+
+
+GRangesList object of length 100:
+$`10`
+GRanges object with 1 range and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] 18 24412069-24412331 * | 12 10
+ -------
+ seqinfo: 3 sequences from an unspecified genome; no seqlengths
+
+$`25`
+GRanges object with 2 ranges and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] 22 17308271-17308950 + | 55 25
+ [2] 22 17309432-17310226 + | 56 25
+ -------
+ seqinfo: 3 sequences from an unspecified genome; no seqlengths
+
+$`35`
+GRanges object with 4 ranges and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] 22 18121428-18121652 + | 88 35
+ [2] 22 18138428-18138598 + | 89 35
+ [3] 22 18185009-18185152 + | 90 35
+ [4] 22 18209443-18212080 + | 91 35
+ -------
+ seqinfo: 3 sequences from an unspecified genome; no seqlengths
+
+...
+<97 more elements>
+
+The expr slot is a list that contains tables of expression data for the genomic features. These tables are very similar to the *_data.ctab Tablemaker output files. Ballgown implements the following syntax to access components of the expr slot:
+*expr(ballgown_object_name, <EXPRESSION_MEASUREMENT>)
+
+where * is either e for exon, i for intron, t for transcript, or g for gene, and is an expression-measurement column name from the appropriate .ctab file. Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene. All of the following are valid ways to extract expression data from the bg ballgown object:
+transcript_fpkm = texpr(bg, 'FPKM')
+transcript_cov = texpr(bg, 'cov')
+whole_tx_table = texpr(bg, 'all')
+exon_mcov = eexpr(bg, 'mcov')
+junction_rcount = iexpr(bg)
+whole_intron_table = iexpr(bg, 'all')
+gene_expression = gexpr(bg)
+
+pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))
+
+
+pData(bg)
+
+id | group |
---|---|
<chr> | <dbl> |
sample01 | 1 |
sample02 | 1 |
sample03 | 1 |
sample04 | 1 |
sample05 | 1 |
sample06 | 1 |
sample07 | 1 |
sample08 | 1 |
sample09 | 1 |
sample10 | 1 |
sample11 | 0 |
sample12 | 0 |
sample13 | 0 |
sample14 | 0 |
sample15 | 0 |
sample16 | 0 |
sample17 | 0 |
sample18 | 0 |
sample19 | 0 |
sample20 | 0 |
plotTranscripts(gene='XLOC_000454', gown=bg, samples='sample12',
+ meas='FPKM', colorby='transcript',
+ main='transcripts from gene XLOC_000454: sample 12, FPKM')
+
+It is also possible to plot several samples at once:
+plotTranscripts('XLOC_000454', bg,
+ samples=c('sample01', 'sample06', 'sample12', 'sample19'),
+ meas='FPKM', colorby='transcript')
+
+
+
+You can also make side-by-side plots comparing mean abundances between groups (here, 0 and 1):
+plotMeans('XLOC_000454', bg, groupvar='group', meas='FPKM', colorby='transcript')
+
+Ballgown provides a wide selection of simple, fast statistical methods for testing whether transcripts are differentially expressed between experimental conditions or across a continuous covariate (such as time).
+stat_results = stattest(bg, feature='transcript',
+ meas='FPKM', covariate='group',
+ getFC=TRUE)
+
+
+results_transcripts <- data.frame(geneNames = geneNames(bg),
+ geneIDs = geneIDs(bg),
+ transcriptNames = transcriptNames(bg),
+ stat_results)
+
+head(results_transcripts)
+
+geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
10 | XLOC_000010 | TCONS_00000010 | transcript | 10 | 3.193499 | 0.01381576 | 0.10521233 | |
25 | XLOC_000014 | TCONS_00000017 | transcript | 25 | 1.549093 | 0.26773622 | 0.79114975 | |
35 | XLOC_000017 | TCONS_00000020 | transcript | 35 | 4.388626 | 0.01085070 | 0.08951825 | |
41 | XLOC_000246 | TCONS_00000598 | transcript | 41 | 1.440519 | 0.47108019 | 0.90253747 | |
45 | XLOC_000019 | TCONS_00000024 | transcript | 45 | 1.714340 | 0.08402948 | 0.48934813 | |
67 | XLOC_000255 | TCONS_00000613 | transcript | 67 | 2.518524 | 0.27317385 | 0.79114975 |
results_transcripts <- results_transcripts[order(results_transcripts$qval), ]
+
+head(results_transcripts, 10)
+
+geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
1225 | XLOC_000440 | TCONS_00001129 | transcript | 1225 | 5.67437122 | 1.035753e-05 | 0.001025395 | |
980 | XLOC_000179 | TCONS_00000452 | transcript | 980 | 6.25344921 | 2.514632e-05 | 0.001244743 | |
469 | XLOC_000101 | TCONS_00000244 | transcript | 469 | 119.29999938 | 2.398681e-04 | 0.007915648 | |
695 | XLOC_000354 | TCONS_00000883 | transcript | 695 | 0.21950959 | 3.302059e-04 | 0.008172596 | |
1012 | XLOC_000409 | TCONS_00001041 | transcript | 1012 | 0.24664434 | 1.527175e-03 | 0.030238073 | |
123 | XLOC_000029 | TCONS_00000059 | transcript | 123 | 0.01603345 | 2.097875e-03 | 0.034614939 | |
961 | XLOC_000176 | TCONS_00000435 | transcript | 961 | 4.96074399 | 2.736075e-03 | 0.038695918 | |
880 | XLOC_000531 | TCONS_00001277 | transcript | 880 | 29.40485236 | 3.272859e-03 | 0.040501628 | |
1063 | XLOC_000197 | TCONS_00000487 | transcript | 1063 | 3.12988187 | 4.555313e-03 | 0.050108442 | |
35 | XLOC_000017 | TCONS_00000020 | transcript | 35 | 4.38862590 | 1.085070e-02 | 0.089518247 |
results_transcripts[results_transcripts$geneIDs == "XLOC_000101", ]
+
+geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
469 | XLOC_000101 | TCONS_00000244 | transcript | 469 | 119.299999 | 0.0002398681 | 0.007915648 | |
477 | XLOC_000101 | TCONS_00000252 | transcript | 477 | 3.128184 | 0.4511803632 | 0.902537469 |
plotMeans('XLOC_000101', bg, groupvar='group', meas='FPKM', colorby='transcript')
+
+https://www.bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html
+ +id | group |
---|---|
<chr> | <chr> |
N2_day1_rep1 | young |
N2_day1_rep2 | young |
N2_day1_rep3 | young |
N2_day7_rep1 | old |
N2_day7_rep2 | old |
N2_day7_rep3 | old |
geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
1 | Y74C9A.6 | WBGene00023193 | Y74C9A.6 | transcript | 1 | 0.3087935 | 0.313759448 | 0.52196762 |
2 | homt-1 | WBGene00022277 | Y74C9A.3.1 | transcript | 2 | 0.6960674 | 0.006886113 | 0.08724496 |
3 | nlp-40 | WBGene00022276 | Y74C9A.2a.3 | transcript | 3 | 0.9999961 | 0.948535319 | 0.96924051 |
4 | nlp-40 | WBGene00022276 | Y74C9A.2a.1 | transcript | 4 | 0.4248382 | 0.040862773 | 0.16836530 |
5 | nlp-40 | WBGene00022276 | Y74C9A.2a.2 | transcript | 5 | 4.8007011 | 0.050336942 | 0.18664213 |
6 | nlp-40 | WBGene00022276 | Y74C9A.2b.1 | transcript | 6 | 0.6299300 | 0.112477785 | 0.29368625 |
geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
1940 | F48C1.8 | WBGene00018600 | F48C1.8.1 | transcript | 1940 | 2.1012606 | 5.213184e-06 | 0.01647351 |
3643 | F32H2.15 | WBGene00284858 | F32H2.15 | transcript | 3643 | 0.1912040 | 1.087705e-06 | 0.01647351 |
3811 | lin-41 | WBGene00003026 | C12C8.3a.1 | transcript | 3811 | 12.1472453 | 2.940965e-06 | 0.01647351 |
6887 | WBGene00044425 | F54D12.11 | transcript | 6887 | 0.8775027 | 7.411872e-06 | 0.01647351 | |
8957 | ifb-2 | WBGene00002054 | F10C1.7a.1 | transcript | 8957 | 4.5625027 | 1.180041e-06 | 0.01647351 |
20015 | Y69A2AR.28 | WBGene00022099 | Y69A2AR.28.1 | transcript | 20015 | 0.6076403 | 7.905547e-06 | 0.01647351 |
20798 | str-185 | WBGene00006228 | R08C7.7.1 | transcript | 20798 | 0.9668631 | 8.516315e-06 | 0.01647351 |
23441 | unc-44 | WBGene00006780 | B0350.2a.1 | transcript | 23441 | 3.3319026 | 5.919439e-06 | 0.01647351 |
25280 | plp-1 | WBGene00004046 | F45E4.2.1 | transcript | 25280 | 0.7053267 | 7.087841e-06 | 0.01647351 |
25710 | WBGene00023488 | T26A8.5 | transcript | 25710 | 0.9632862 | 8.470675e-06 | 0.01647351 |
geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
8957 | ifb-2 | WBGene00002054 | F10C1.7a.1 | transcript | 8957 | 4.562503 | 1.180041e-06 | 0.01647351 |
8956 | ifb-2 | WBGene00002054 | F10C1.7c.1 | transcript | 8956 | 3.554543 | 3.838807e-04 | 0.04612121 |
8958 | ifb-2 | WBGene00002054 | F10C1.7e.1 | transcript | 8958 | 1.009672 | 8.838138e-01 | 0.93058010 |
if (!require("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
+
+BiocManager::install("ballgown")
+
+Installing package into ‘/usr/local/lib/R/site-library’
+(as ‘lib’ is unspecified)
+
+'getOption("repos")' replaces Bioconductor standard repositories, see
+'help("repositories", package = "BiocManager")' for details.
+Replacement repositories:
+ CRAN: https://cran.rstudio.com
+
+Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)
+
+Installing package(s) 'BiocVersion', 'ballgown'
+
+also installing the dependencies ‘plogr’, ‘png’, ‘formatR’, ‘abind’, ‘SparseArray’, ‘RSQLite’, ‘KEGGREST’, ‘lambda.r’, ‘futile.options’, ‘S4Arrays’, ‘DelayedArray’, ‘MatrixGenerics’, ‘AnnotationDbi’, ‘annotate’, ‘futile.logger’, ‘snow’, ‘BH’, ‘locfit’, ‘bitops’, ‘Rhtslib’, ‘SummarizedExperiment’, ‘RCurl’, ‘rjson’, ‘BiocGenerics’, ‘XVector’, ‘genefilter’, ‘BiocParallel’, ‘matrixStats’, ‘edgeR’, ‘statmod’, ‘XML’, ‘Biostrings’, ‘zlibbioc’, ‘Rsamtools’, ‘GenomicAlignments’, ‘BiocIO’, ‘restfulr’, ‘UCSC.utils’, ‘GenomeInfoDbData’, ‘GenomicRanges’, ‘IRanges’, ‘S4Vectors’, ‘sva’, ‘limma’, ‘rtracklayer’, ‘Biobase’, ‘GenomeInfoDb’
+
+
+Old packages: 'gtable'
+
+library(ballgown)
+
+
+Attaching package: ‘ballgown’
+
+
+The following object is masked from ‘package:base’:
+
+ structure
+
++++
+Please upload your data generated by `tablemaker. Please refer to the section Running Tablemaker the link here for details.
You can also download a copy from the path below:
+/scratch/zt1/project/bioi611/shared/output/bulkRNA_SE_tablemaker.tar.gz
+
+Or you can download a copy via the link below:
+https://umd0-my.sharepoint.com/:u:/g/personal/xie186_umd_edu/EYLz8khnMeRCmyK_YFDDXaQBP_4hzpAgs_nN-TNXghdQMQ?e=5By9ct
+getwd()
+
+'/content'
+
+system("tar zxvf bulkRNA_SE_tablemaker.tar.gz")
+
+data_directory = file.path(getwd(), "bulkRNA_SE_tablemaker")
+data_directory
+
+'/content/bulkRNA_SE_tablemaker'
+# make the ballgown object:
+bg = ballgown(dataDir = data_directory, samplePattern='N2_day', meas='all')
+bg
+
+Mon Oct 28 10:28:23 2024
+
+Mon Oct 28 10:28:23 2024: Reading linking tables
+
+Mon Oct 28 10:28:24 2024: Reading intron data files
+
+Mon Oct 28 10:28:27 2024: Merging intron data
+
+Mon Oct 28 10:28:27 2024: Reading exon data files
+
+Mon Oct 28 10:28:33 2024: Merging exon data
+
+Mon Oct 28 10:28:34 2024: Reading transcript data files
+
+Mon Oct 28 10:28:38 2024: Merging transcript data
+
+Wrapping up the results
+
+Mon Oct 28 10:28:38 2024
+
+
+
+
+ballgown instance with 60032 transcripts and 6 samples
+
+A ballgown object has six slots: structure, expr, indexes, dirs, mergedDate, and meas.
+Exon, intron, and transcript structures are easily extracted from the main ballgown object:
+structure(bg)$exon
+
+
+GRanges object with 178766 ranges and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] I 3747-3909 - | 1 1
+ [2] I 4116-4358 - | 2 2
+ [3] I 5195-5296 - | 3 2
+ [4] I 6037-6327 - | 4 2
+ [5] I 9727-9846 - | 5 2
+ ... ... ... ... . ... ...
+ [178762] MtDNA 10348-10401 + | 178762 60028
+ [178763] MtDNA 10403-11354 + | 178763 60029
+ [178764] MtDNA 11356-11691 + | 178764 60030
+ [178765] MtDNA 11691-13272 + | 178765 60031
+ [178766] MtDNA 13275-13327 + | 178766 60032
+ -------
+ seqinfo: 7 sequences from an unspecified genome; no seqlengths
+
+structure(bg)$intron
+
+
+GRanges object with 116284 ranges and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] I 4359-5194 - | 1 2
+ [2] I 5297-6036 - | 2 2
+ [3] I 6328-9726 - | 3 2
+ [4] I 9847-10094 - | 4 2
+ [5] I 11562-11617 + | 5 3:4
+ ... ... ... ... . ... ...
+ [116280] X 17715112-17716973 + | 116280 59995:59996
+ [116281] X 17717088-17717170 + | 116281 59995:59996
+ [116282] X 17717279-17717327 + | 116282 59995:59996
+ [116283] X 17717444-17718427 + | 116283 59995
+ [116284] X 17717444-17718434 + | 116284 59996
+ -------
+ seqinfo: 6 sequences from an unspecified genome; no seqlengths
+
+structure(bg)$trans
+
+
+GRangesList object of length 60032:
+$`1`
+GRanges object with 1 range and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] I 3747-3909 - | 1 1
+ -------
+ seqinfo: 7 sequences from an unspecified genome; no seqlengths
+
+$`2`
+GRanges object with 5 ranges and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] I 4116-4358 - | 2 2
+ [2] I 5195-5296 - | 3 2
+ [3] I 6037-6327 - | 4 2
+ [4] I 9727-9846 - | 5 2
+ [5] I 10095-10230 - | 6 2
+ -------
+ seqinfo: 7 sequences from an unspecified genome; no seqlengths
+
+$`3`
+GRanges object with 5 ranges and 2 metadata columns:
+ seqnames ranges strand | id transcripts
+ <Rle> <IRanges> <Rle> | <integer> <character>
+ [1] I 11495-11561 + | 7 3:4
+ [2] I 11618-11689 + | 8 3:5
+ [3] I 14951-15160 + | 9 3:5
+ [4] I 16473-16585 + | 10 c(3, 6)
+ [5] I 16702-16793 + | 11 3
+ -------
+ seqinfo: 7 sequences from an unspecified genome; no seqlengths
+
+...
+<60029 more elements>
+
+The expr slot is a list that contains tables of expression data for the genomic features. These tables are very similar to the *_data.ctab Tablemaker output files. Ballgown implements the following syntax to access components of the expr slot:
+*expr(ballgown_object_name, <EXPRESSION_MEASUREMENT>)
+
+where * is either e for exon, i for intron, t for transcript, or g for gene, and is an expression-measurement column name from the appropriate .ctab file. Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene. All of the following are valid ways to extract expression data from the bg ballgown object:
+transcript_fpkm = texpr(bg, 'FPKM')
+transcript_cov = texpr(bg, 'cov')
+whole_tx_table = texpr(bg, 'all')
+exon_mcov = eexpr(bg, 'mcov')
+junction_rcount = iexpr(bg)
+whole_intron_table = iexpr(bg, 'all')
+gene_expression = gexpr(bg)
+
+Warning message in .normarg_f(f, x):
+“'NROW(x)' is not a multiple of split factor length”
+Warning message in tlengths * tmeas:
+“longer object length is not a multiple of shorter object length”
+
+sampleNames(bg)
+
+
+pData(bg) = data.frame(id=sampleNames(bg), group=rep(c("young","old"), each=3))
+
+pData(bg)
+
+id | group |
---|---|
<chr> | <chr> |
N2_day1_rep1 | young |
N2_day1_rep2 | young |
N2_day1_rep3 | young |
N2_day7_rep1 | old |
N2_day7_rep2 | old |
N2_day7_rep3 | old |
plotTranscripts(gene='WBGene00002054', gown=bg, samples='N2_day1_rep1',
+ meas='FPKM', colorby='transcript',
+ main='transcripts from gene XLOC_000454: sample 12, FPKM')
+
+It is also possible to plot several samples at once:
+plotTranscripts('WBGene00002054', bg,
+ samples=c('N2_day1_rep1', 'N2_day7_rep1'),
+ meas='FPKM', colorby='transcript')
+
+
+
+You can also make side-by-side plots comparing mean abundances between groups (here, 0 and 1):
+plotMeans('WBGene00002054', bg, groupvar='group', meas='FPKM', colorby='transcript')
+
+Ballgown provides a wide selection of simple, fast statistical methods for testing whether transcripts are differentially expressed between experimental conditions or across a continuous covariate (such as time).
+stat_results = stattest(bg, feature='transcript',
+ meas='FPKM', covariate='group',
+ getFC=TRUE)
+
+
+results_transcripts <- data.frame(geneNames = geneNames(bg),
+ geneIDs = geneIDs(bg),
+ transcriptNames = transcriptNames(bg),
+ stat_results)
+
+head(results_transcripts)
+
+geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
1 | Y74C9A.6 | WBGene00023193 | Y74C9A.6 | transcript | 1 | 0.3087935 | 0.313759448 | 0.52196762 |
2 | homt-1 | WBGene00022277 | Y74C9A.3.1 | transcript | 2 | 0.6960674 | 0.006886113 | 0.08724496 |
3 | nlp-40 | WBGene00022276 | Y74C9A.2a.3 | transcript | 3 | 0.9999961 | 0.948535319 | 0.96924051 |
4 | nlp-40 | WBGene00022276 | Y74C9A.2a.1 | transcript | 4 | 0.4248382 | 0.040862773 | 0.16836530 |
5 | nlp-40 | WBGene00022276 | Y74C9A.2a.2 | transcript | 5 | 4.8007011 | 0.050336942 | 0.18664213 |
6 | nlp-40 | WBGene00022276 | Y74C9A.2b.1 | transcript | 6 | 0.6299300 | 0.112477785 | 0.29368625 |
results_transcripts <- results_transcripts[order(results_transcripts$qval), ]
+
+head(results_transcripts, 10)
+
+geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
1940 | F48C1.8 | WBGene00018600 | F48C1.8.1 | transcript | 1940 | 2.1012606 | 5.213184e-06 | 0.01647351 |
3643 | F32H2.15 | WBGene00284858 | F32H2.15 | transcript | 3643 | 0.1912040 | 1.087705e-06 | 0.01647351 |
3811 | lin-41 | WBGene00003026 | C12C8.3a.1 | transcript | 3811 | 12.1472453 | 2.940965e-06 | 0.01647351 |
6887 | WBGene00044425 | F54D12.11 | transcript | 6887 | 0.8775027 | 7.411872e-06 | 0.01647351 | |
8957 | ifb-2 | WBGene00002054 | F10C1.7a.1 | transcript | 8957 | 4.5625027 | 1.180041e-06 | 0.01647351 |
20015 | Y69A2AR.28 | WBGene00022099 | Y69A2AR.28.1 | transcript | 20015 | 0.6076403 | 7.905547e-06 | 0.01647351 |
20798 | str-185 | WBGene00006228 | R08C7.7.1 | transcript | 20798 | 0.9668631 | 8.516315e-06 | 0.01647351 |
23441 | unc-44 | WBGene00006780 | B0350.2a.1 | transcript | 23441 | 3.3319026 | 5.919439e-06 | 0.01647351 |
25280 | plp-1 | WBGene00004046 | F45E4.2.1 | transcript | 25280 | 0.7053267 | 7.087841e-06 | 0.01647351 |
25710 | WBGene00023488 | T26A8.5 | transcript | 25710 | 0.9632862 | 8.470675e-06 | 0.01647351 |
results_transcripts[results_transcripts$geneIDs == "WBGene00002054", ]
+
+geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
8957 | ifb-2 | WBGene00002054 | F10C1.7a.1 | transcript | 8957 | 4.562503 | 1.180041e-06 | 0.01647351 |
8956 | ifb-2 | WBGene00002054 | F10C1.7c.1 | transcript | 8956 | 3.554543 | 3.838807e-04 | 0.04612121 |
8958 | ifb-2 | WBGene00002054 | F10C1.7e.1 | transcript | 8958 | 1.009672 | 8.838138e-01 | 0.93058010 |
plotMeans('WBGene00002054', bg, groupvar='group', meas='FPKM', colorby='transcript')
+
+write.csv(results_transcripts,
+ file = "BIOI611_bulkRNA_SE_ballgown.csv",
+ row.names = FALSE)
+
+sessionInfo()
+
+R version 4.4.1 (2024-06-14)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+other attached packages:
+[1] ballgown_2.36.0
+
+loaded via a namespace (and not attached):
+ [1] IRdisplay_1.1 blob_1.2.4
+ [3] Biostrings_2.72.1 bitops_1.0-9
+ [5] fastmap_1.2.0 RCurl_1.98-1.16
+ [7] GenomicAlignments_1.40.0 XML_3.99-0.17
+ [9] digest_0.6.37 lifecycle_1.0.4
+[11] survival_3.7-0 statmod_1.5.0
+[13] KEGGREST_1.44.1 RSQLite_2.3.7
+[15] genefilter_1.86.0 compiler_4.4.1
+[17] rlang_1.1.4 tools_4.4.1
+[19] utf8_1.2.4 yaml_2.3.10
+[21] rtracklayer_1.64.0 S4Arrays_1.4.1
+[23] bit_4.5.0 curl_5.2.3
+[25] DelayedArray_0.30.1 repr_1.1.7
+[27] RColorBrewer_1.1-3 abind_1.4-8
+[29] BiocParallel_1.38.0 pbdZMQ_0.3-13
+[31] BiocGenerics_0.50.0 grid_4.4.1
+[33] stats4_4.4.1 fansi_1.0.6
+[35] xtable_1.8-4 edgeR_4.2.2
+[37] SummarizedExperiment_1.34.0 cli_3.6.3
+[39] crayon_1.5.3 httr_1.4.7
+[41] rjson_0.2.23 DBI_1.2.3
+[43] cachem_1.1.0 zlibbioc_1.50.0
+[45] splines_4.4.1 parallel_4.4.1
+[47] AnnotationDbi_1.66.0 BiocManager_1.30.25
+[49] XVector_0.44.0 restfulr_0.0.15
+[51] matrixStats_1.4.1 base64enc_0.1-3
+[53] vctrs_0.6.5 Matrix_1.7-1
+[55] jsonlite_1.8.9 sva_3.52.0
+[57] IRanges_2.38.1 S4Vectors_0.42.1
+[59] bit64_4.5.2 locfit_1.5-9.10
+[61] limma_3.60.6 annotate_1.82.0
+[63] glue_1.8.0 codetools_0.2-20
+[65] GenomeInfoDb_1.40.1 BiocIO_1.14.0
+[67] GenomicRanges_1.56.2 UCSC.utils_1.0.0
+[69] pillar_1.9.0 htmltools_0.5.8.1
+[71] IRkernel_1.3.2 GenomeInfoDbData_1.2.12
+[73] R6_2.5.1 evaluate_1.0.1
+[75] lattice_0.22-6 Biobase_2.64.0
+[77] png_0.1-8 Rsamtools_2.20.0
+[79] memoise_2.0.1 Rcpp_1.0.13
+[81] uuid_1.2-1 SparseArray_1.4.8
+[83] nlme_3.1-166 mgcv_1.9-1
+[85] MatrixGenerics_1.16.0
+
+https://www.bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html
+ +group |
---|
N2_day1 |
N2_day1 |
N2_day1 |
N2_day7 |
N2_day7 |
N2_day7 |
group | |
---|---|
N2_day1_rep1 | N2_day1 |
N2_day1_rep2 | N2_day1 |
N2_day1_rep3 | N2_day1 |
N2_day7_rep1 | N2_day7 |
N2_day7_rep2 | N2_day7 |
N2_day7_rep3 | N2_day7 |
N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
group | N2_day1 | N2_day1 | N2_day1 | N2_day7 | N2_day7 | N2_day7 |
V1 | V2 | V3 | V4 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_unmapped | 1332776 | 1332776 | 1332776 |
2 | N_multimapping | 1540190 | 1540190 | 1540190 |
3 | N_noFeature | 157102 | 19017455 | 18933214 |
4 | N_ambiguous | 536422 | 128854 | 120342 |
5 | WBGene00000003 | 341 | 161 | 180 |
V1 | V2 | V3 | V4 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_unmapped | 1400596 | 1400596 | 1400596 |
2 | N_multimapping | 1305129 | 1305129 | 1305129 |
3 | N_noFeature | 152183 | 15009786 | 14975925 |
4 | N_ambiguous | 439830 | 104631 | 98489 |
5 | WBGene00000003 | 415 | 198 | 217 |
V1 | V2 | V3 | V4 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_unmapped | 5887223 | 5887223 | 5887223 |
2 | N_multimapping | 1557570 | 1557570 | 1557570 |
3 | N_noFeature | 184441 | 17612359 | 17574940 |
4 | N_ambiguous | 514559 | 122385 | 115498 |
5 | WBGene00000003 | 411 | 175 | 236 |
V1 | V2 | |
---|---|---|
<chr> | <int> | |
1 | N_unmapped | 1332776 |
2 | N_multimapping | 1540190 |
3 | N_noFeature | 157102 |
4 | N_ambiguous | 536422 |
5 | WBGene00000003 | 341 |
V1 | V2 | |
---|---|---|
<chr> | <int> | |
1 | N_unmapped | 1400596 |
2 | N_multimapping | 1305129 |
3 | N_noFeature | 152183 |
4 | N_ambiguous | 439830 |
5 | WBGene00000003 | 415 |
V1 | V2 | |
---|---|---|
<chr> | <int> | |
1 | N_unmapped | 5887223 |
2 | N_multimapping | 1557570 |
3 | N_noFeature | 184441 |
4 | N_ambiguous | 514559 |
5 | WBGene00000003 | 411 |
V1 | V2.x | V2.y | |
---|---|---|---|
<chr> | <int> | <int> | |
1 | N_ambiguous | 536422 | 439830 |
2 | N_multimapping | 1540190 | 1305129 |
3 | N_noFeature | 157102 | 152183 |
4 | N_unmapped | 1332776 | 1400596 |
5 | WBGene00000001 | 3227 | 2168 |
6 | WBGene00000002 | 270 | 203 |
V1 | V2.x | V2.y | V2 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_ambiguous | 536422 | 439830 | 514559 |
2 | N_multimapping | 1540190 | 1305129 | 1557570 |
3 | N_noFeature | 157102 | 152183 | 184441 |
4 | N_unmapped | 1332776 | 1400596 | 5887223 |
5 | WBGene00000001 | 3227 | 2168 | 2589 |
6 | WBGene00000002 | 270 | 203 | 266 |
V1 | V2.x | V2.y | V2 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_ambiguous | 536422 | 439830 | 514559 |
2 | N_multimapping | 1540190 | 1305129 | 1557570 |
3 | N_noFeature | 157102 | 152183 | 184441 |
4 | N_unmapped | 1332776 | 1400596 | 5887223 |
5 | WBGene00000001 | 3227 | 2168 | 2589 |
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <int> | <dbl> | <int> | <dbl> | <dbl> | <dbl> | <int> | <int> | <int> | <int> |
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
R is a language and environment for statistical computing and graphics.
+Runs on a variaty of operation systerms: Windows, Linux and MacOS
+Generates publication-ready plots
+Owns a large open-source community
+Extends functions as packages
+Variable names can have letters, dots and undercores (e.g. gene_name
, "csvfile.1" and chr1
).
A function is a structured, reusable segment of code designed to carry out a specific set of operations. It can accept zero or more inputs (parameters) and can produce an output (result).
+INPUT --function--> OUTPUT
The way to define a function is:
+read_star_file <- function(file_path){
+ ...code goes here...
+ return(df)
+}
+
+The way to use a function in R is:
+read_star_file(paths)
+
+# assign a number to variable gene_count
+gene1 <- 150
+gene2 <- 200
+gene1
+gene2
+
+150
+200
+# Examples of character value
+gene_name1 <- "KRAS"
+gene_name1
+
+'KRAS'
+# Examples of character value
+"RAS" -> gene_name2
+gene_name2
+
+'RAS'
+# Examples of character value
+gene_name3 <- "KRAS"
+gene_name3
+
+'KRAS'
+# Logical value
+gene1 > gene2
+gene1 < gene2
+bool_val = gene1 < gene2
+
+FALSE
+TRUE
+class(gene1)
+class(gene_name)
+class()
+
+'numeric'
+'character'
+An atomic vector is a collection of multiple values (numeric, character, or logical) stored in a single object. You can create an atomic vector using the c() function.
+sample_names <- c("N2_day1_rep1", "N2_day1_rep2", "N2_day1_rep3",
+ "N2_day7_rep1", "N2_day7_rep2", "N2_day7_rep3")
+sample_names
+
+
+group <- gsub("_rep\\d", "", sample_names)
+group
+
+
+coldata_df <- cbind(group = gsub("_rep\\d", "", sample_names))
+coldata_df
+
+group |
---|
N2_day1 |
N2_day1 |
N2_day1 |
N2_day7 |
N2_day7 |
N2_day7 |
rownames(coldata_df) = sample_names
+coldata_df
+
+group | |
---|---|
N2_day1_rep1 | N2_day1 |
N2_day1_rep2 | N2_day1 |
N2_day1_rep3 | N2_day1 |
N2_day7_rep1 | N2_day7 |
N2_day7_rep2 | N2_day7 |
N2_day7_rep3 | N2_day7 |
t(coldata_df)
+
+N2_day1_rep1 | N2_day1_rep2 | N2_day1_rep3 | N2_day7_rep1 | N2_day7_rep2 | N2_day7_rep3 | |
---|---|---|---|---|---|---|
group | N2_day1 | N2_day1 | N2_day1 | N2_day7 | N2_day7 | N2_day7 |
is.matrix(coldata_df)
+coldata_df <- as.data.frame(coldata_df)
+is.data.frame(coldata_df)
+
+TRUE
+TRUE
+A matrix can contain either character or numeric columns and a dataframe can contain both numeric and character columns.
+A list is an ordered collection of objects, which can be any type of R objects (vectors, matrices, data frames, even lists).
+count_files = list("sample" = 'N2_day1_rep1.ReadsPerGene.out.tab', "sample2"='N2_day1_rep2.ReadsPerGene.out.tab')
+count_files
+
+gene_count = list("gene1" = 10, "gene2"=20)
+
+lapply(gene_count, function(x){log2(x+1)})
+
+log2_transform <- function(x){
+ log2(x+1)
+}
+lapply(gene_count, log2_transform)
+
+
+
+getwd()
+#setwd()
+
+'/content'
+list.files()
+
+
+file_paths <- list.files(pattern = "*..ReadsPerGene.out.tab")
+file_paths
+
+
+tab_N2_day1_rep1 <- read.table('N2_day1_rep1.ReadsPerGene.out.tab')
+head(tab_N2_day1_rep1, 5)
+
+V1 | V2 | V3 | V4 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_unmapped | 1332776 | 1332776 | 1332776 |
2 | N_multimapping | 1540190 | 1540190 | 1540190 |
3 | N_noFeature | 157102 | 19017455 | 18933214 |
4 | N_ambiguous | 536422 | 128854 | 120342 |
5 | WBGene00000003 | 341 | 161 | 180 |
tab_N2_day1_rep2 <- read.table('N2_day1_rep2.ReadsPerGene.out.tab')
+head(tab_N2_day1_rep2,5)
+
+V1 | V2 | V3 | V4 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_unmapped | 1400596 | 1400596 | 1400596 |
2 | N_multimapping | 1305129 | 1305129 | 1305129 |
3 | N_noFeature | 152183 | 15009786 | 14975925 |
4 | N_ambiguous | 439830 | 104631 | 98489 |
5 | WBGene00000003 | 415 | 198 | 217 |
tab_N2_day1_rep3 <- read.table('N2_day1_rep3.ReadsPerGene.out.tab')
+head(tab_N2_day1_rep3,5)
+
+V1 | V2 | V3 | V4 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_unmapped | 5887223 | 5887223 | 5887223 |
2 | N_multimapping | 1557570 | 1557570 | 1557570 |
3 | N_noFeature | 184441 | 17612359 | 17574940 |
4 | N_ambiguous | 514559 | 122385 | 115498 |
5 | WBGene00000003 | 411 | 175 | 236 |
library(dplyr)
+
+Attaching package: ‘dplyr’
+
+
+The following objects are masked from ‘package:stats’:
+
+ filter, lag
+
+
+The following objects are masked from ‘package:base’:
+
+ intersect, setdiff, setequal, union
+
+tab_N2_day1_rep1 <- tab_N2_day1_rep1 %>% select(V1, V2)
+head(tab_N2_day1_rep1, 5)
+tab_N2_day1_rep2 <- tab_N2_day1_rep2[, c("V1", "V2")]
+head(tab_N2_day1_rep2, 5)
+tab_N2_day1_rep3 <- tab_N2_day1_rep3[, c("V1", "V2")]
+head(tab_N2_day1_rep3, 5)
+
+V1 | V2 | |
---|---|---|
<chr> | <int> | |
1 | N_unmapped | 1332776 |
2 | N_multimapping | 1540190 |
3 | N_noFeature | 157102 |
4 | N_ambiguous | 536422 |
5 | WBGene00000003 | 341 |
V1 | V2 | |
---|---|---|
<chr> | <int> | |
1 | N_unmapped | 1400596 |
2 | N_multimapping | 1305129 |
3 | N_noFeature | 152183 |
4 | N_ambiguous | 439830 |
5 | WBGene00000003 | 415 |
V1 | V2 | |
---|---|---|
<chr> | <int> | |
1 | N_unmapped | 5887223 |
2 | N_multimapping | 1557570 |
3 | N_noFeature | 184441 |
4 | N_ambiguous | 514559 |
5 | WBGene00000003 | 411 |
df_merged <- merge(tab_N2_day1_rep1, tab_N2_day1_rep2, by = "V1")
+head(df_merged)
+df_merged <- merge(df_merged, tab_N2_day1_rep3, by = "V1")
+head(df_merged)
+
+V1 | V2.x | V2.y | |
---|---|---|---|
<chr> | <int> | <int> | |
1 | N_ambiguous | 536422 | 439830 |
2 | N_multimapping | 1540190 | 1305129 |
3 | N_noFeature | 157102 | 152183 |
4 | N_unmapped | 1332776 | 1400596 |
5 | WBGene00000001 | 3227 | 2168 |
6 | WBGene00000002 | 270 | 203 |
V1 | V2.x | V2.y | V2 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_ambiguous | 536422 | 439830 | 514559 |
2 | N_multimapping | 1540190 | 1305129 | 1557570 |
3 | N_noFeature | 157102 | 152183 | 184441 |
4 | N_unmapped | 1332776 | 1400596 | 5887223 |
5 | WBGene00000001 | 3227 | 2168 | 2589 |
6 | WBGene00000002 | 270 | 203 | 266 |
The Reduce()
function in R allows us to apply a function repeatedly to a list of elements. Here, we are applying merge()
iteratively to a list of data frames.
df_mer_red <- Reduce(function(x, y) merge(x, y, by = "V1"),
+ list("tab_N2_day1_rep1" = tab_N2_day1_rep1,
+ "tab_N2_day1_rep2" = tab_N2_day1_rep2,
+ "tab_N2_day1_rep3" = tab_N2_day1_rep3))
+head(df_mer_red, 5)
+
+V1 | V2.x | V2.y | V2 | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | N_ambiguous | 536422 | 439830 | 514559 |
2 | N_multimapping | 1540190 | 1305129 | 1557570 |
3 | N_noFeature | 157102 | 152183 | 184441 |
4 | N_unmapped | 1332776 | 1400596 | 5887223 |
5 | WBGene00000001 | 3227 | 2168 | 2589 |
# Define the cars vector with 5 values
+cars <- c(1, 3, 6, 4, 9)
+
+# Graph the cars vector with all defaults
+plot(cars)
+
+Let's add a title, a line to connect the points, and some color:
+# Define the cars vector with 5 values
+cars <- c(1, 3, 6, 4, 9)
+
+# Graph cars using blue points overlayed by a line
+plot(cars, type="o", col="blue")
+
+# Create a title with a red, bold/italic font
+title(main="Autos", col.main="red", font.main=4)
+
+# Define 2 vectors
+cars <- c(1, 3, 6, 4, 9)
+trucks <- c(2, 5, 4, 5, 12)
+
+# Graph cars using a y axis that ranges from 0 to 12
+plot(cars, type="o", col="blue", ylim=c(0,12))
+
+# Graph trucks with red dashed line and square points
+lines(trucks, type="o", pch=22, lty=2, col="red")
+
+# Create a title with a red, bold/italic font
+title(main="Autos", col.main="red", font.main=4)
+
+csv_url <- "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
+
+mtcars <- read.csv(csv_url)
+mtcars
+
+model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <int> | <dbl> | <int> | <dbl> | <dbl> | <dbl> | <int> | <int> | <int> | <int> |
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
plot(mtcars$wt,mtcars$mpg, main="Scatterplot in Base R",
+ xlab="Car Weight", ylab="MPG",
+ pch=4, col = "blue", lwd=1, cex = 2)
+abline(lm(mtcars$mpg~mtcars$wt), col="red")
+text(mtcars$wt, mtcars$mpg, labels=rownames(mtcars), cex=0.5, font=2)
+
+hist(mtcars$hp,
+ prob = TRUE)
+lines(density(mtcars$hp), # density plot
+ lwd = 2, # thickness of line
+ col = "red")
+abline(v=mean(mtcars$hp),
+ lty="dashed",
+ col="blue")
+
+Probability Density
: It tells you how the data is distributed over different ranges (or bins) of values. The height of each bar shows how densely the data is packed in that range of horsepower values.
library(ggplot2)
+
+ggplot(mtcars, aes(x=wt, y=mpg)) +
+ geom_point(size=5, shape=4, color="blue", stroke=1) +
+ geom_smooth(method=lm, color="red") +
+ ggtitle("Scatterplot in ggplot2") +
+ xlab("Car Weight") + # for the x axis label
+ geom_text(label=rownames(mtcars),cex=3)
+
+[1m[22m`geom_smooth()` using formula = 'y ~ x'
+
+ggplot(mtcars, aes(x = hp)) +
+ geom_histogram(aes(y = ..density..),
+ bins = 6, # You can adjust the number of bins
+ fill = "lightblue",
+ color = "black") + # Histogram with probability density
+ geom_density(color = "red", size = 1.5) + # Add the density plot
+ geom_vline(aes(xintercept = mean(hp)),
+ linetype = "dashed",
+ color = "blue",
+ size = 1.2) + # Add dashed blue vertical line at the mean
+ labs(title = "Distribution of Horsepower in mtcars Dataset",
+ x = "Horsepower",
+ y = "Density") + # Add axis labels and title
+ theme_minimal() # Use a clean theme for the plot
+
+
+https://sites.harding.edu/fmccown/R/
+https://jtr13.github.io/cc21fall2/base-r-vs.-ggplot2-visualization.html#base-r-vs.-ggplot2-visualization
+ +In this lab, you are going to analyze the direct RNA data published on Genome Research in 2020. The title of the paper is: The full-length transcriptome of C. elegans using direct RNA sequencing.
+You can go the Data access
section of the paper here: https://genome.cshlp.org/content/30/2/299.full
Go to: https://www.ncbi.nlm.nih.gov/
+Search PRJEB31791
Click SRA
link
Click the first item in the search results
+Click the link: ERP114391
+Right click on the fastq files to obtain the FTP URL
+Then you can use wget
to download the fastq files.
The data for L1
and adult male
samples has been downloaded and saved on the HPC cluster:
/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/L1_rep1.fastq.gz
+/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/L1_rep2.fastq.gz
+/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/male_rep1.fastq.gz
+/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/male_rep2.fastq.gz
+
+wf-transcriptomes
wf-transcriptomes
is a cDNA and RNA sequencing data analysis workflow that leverages long nanopore reads, providing a detailed view of the transcriptome.
https://github.com/epi2me-labs/wf-transcriptomes
+Path: /scratch/zt1/project/bioi611/shared/software/wf-transcriptomes-1.4.0/main.nf
Miniforge is a minimal installer for Conda specific to conda-forge. Miniforge allows users to install the conda package manager with the following features pre-configured: conda-forge set as the default (and only) channel.
+rm -rf miniforge3
+wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
+bash Miniforge3-$(uname)-$(uname -m).sh
+
+wf-transcriptome
on the demo datasetsThe workflow wf-transcriptome
has a demo datasets. This demo datasets can be used to test the workflow and help you undestand the input and output.
The demo data can be found here:
+ls /scratch/zt1/project/bioi611/shared/raw_data/wf-transcriptomes-demo/ |cat
+chr20
+differential_expression_fastq
+gencode.v22.annotation.chr20.gff
+gencode.v22.annotation.chr20.gff3
+gencode.v22.annotation.chr20.gtf
+hg38_chr20.fa
+Homo_sapiens.GRCh38.109.gtf.gz
+Homo_sapiens.GRCh38.cdna.all.fa.gz
+Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
+md5sums.txt
+nextflow.config
+ref_transcriptome.fasta
+sample_sheet.csv
+
+You can analyze the demo data by submitting the job file below:
+# Takes around 8 minutes to finish
+sbatch /scratch/zt1/project/bioi611/shared/scripts/ONT_directRNA_wf_transcriptome_demo.sub
+
+The output folder is:
+/scratch/zt1/project/bioi611/user/$USER/ONT_directRNA_demo
+
+The documentation for the output files can be found here:
+https://github.com/epi2me-labs/wf-transcriptomes?tab=readme-ov-file#outputs
+Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
+Title | +File path | +Description | +Per sample or aggregated | +
---|---|---|---|
workflow report | +wf-transcriptomes-report.html | +a HTML report document detailing the primary findings of the workflow | +aggregated | +
Per file read stats | +fastq_ingress_results/reads/fastcat_stats/per-file-stats.tsv | +A TSV with per file read stats, including all samples. | +aggregated | +
Read stats | +fastq_ingress_results/reads/fastcat_stats/per-read-stats.tsv | +A TSV with per read stats, including all samples. | +aggregated | +
Run ID's | +fastq_ingress_results/reads/fastcat_stats/run_ids | +List of run IDs present in reads. | +aggregated | +
Meta map json | +fastq_ingress_results/reads/metamap.json | +Metadata used in workflow presented in a JSON. | +aggregated | +
Concatenated sequence data | +fastq_ingress_results/reads/{{ alias }}.fastq.gz | +Per sample reads concatenated in to one FASTQ file. | +per-sample | +
Assembled transcriptome | +{{ alias }}_transcriptome.fas | +Per sample assembled transcriptome. | +per-sample | +
Annotated assembled transcriptome | +{{ alias }}_merged_transcriptome.fas | +Per sample annotated assembled transcriptome. | +per-sample | +
Alignment summary statistics | +{{ alias }}_read_aln_stats.tsv | +Per sample alignment summary statistics. | +per-sample | +
GFF compare results. | +{{ alias }}_gffcompare | +All GFF compare output files. | +per-sample | +
Differential gene expression results | +de_analysis/results_dge.tsv | +This is a gene-level result file that describes genes and their probability of showing differential expression between experimental conditions. | +aggregated | +
Differential gene expression report | +de_analysis/results_dge.pdf | +Summary report of differential gene expression analysis as a PDF. | +aggregated | +
Differential transcript usage gene TSV | +de_analysis/results_dtu_gene.tsv | +This is a gene-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression. | +aggregated | +
Differential transcript usage report | +de_analysis/results_dtu.pdf | +Summary report of differential transcript usage results as a PDF. | +aggregated | +
Differential transcript usage TSV | +de_analysis/results_dtu_transcript.tsv | +This is a transcript-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression. | +aggregated | +
Differential transcript usage stageR TSV | +de_analysis/results_dtu_stageR.tsv | +This is the output from StageR and it shows both gene and transcript probabilities of differential expression | +aggregated | +
Differential transcript usage DEXSeq TSV | +de_analysis/results_dexseq.tsv | +The complete output from the DEXSeq-analysis, shows both gene and transcript probabilities of differential expression. | +aggregated | +
Gene counts | +de_analysis/all_gene_counts.tsv | +Raw gene counts created by the Salmon tool, before filtering. | +aggregated | +
Gene counts per million | +de_analysis/cpm_gene_counts.tsv | +This file shows counts per million (CPM) of the raw gene counts to facilitate comparisons across samples. | +aggregated | +
Transcript counts | +de_analysis/unfiltered_transcript_counts_with_genes.tsv | +Raw transcript counts created by the Salmon tool, before filtering. Includes reference to the associated gene ID. | +aggregated | +
Transcript per million counts | +de_analysis/unfiltered_tpm_transcript_counts.tsv | +This file shows transcripts per million (TPM) of the raw counts to facilitate comparisons across samples. | +aggregated | +
Transcript counts filtered | +de_analysis/filtered_transcript_counts_with_genes.tsv | +Filtered transcript counts, used for differential transcript usage analysis. Includes a reference to the associated gene ID. | +aggregated | +
Transcript info table | +{{ alias }}_transcripts_table.tsv | +This file details each isoform that was reconstructed from the input reads. It contains a subset of columns from the .tmap output from gffcompare | +per-sample | +
Final non redundant transcriptome | +de_analysis/final_non_redundant_transcriptome.fasta | +Transcripts that were used for differential expression analysis including novel transcripts with the identifiers used for DE analysis. | +aggregated | +
Index of reference FASTA file | +igv_reference/{{ ref_genome file }}.fai | +Reference genome index of the FASTA file required for IGV config. | +aggregated | +
GZI index of the reference FASTA file | +igv_reference/{{ ref_genome file }}.gzi | +GZI Index of the reference FASTA file. | +aggregated | +
JSON configuration file for IGV browser | +igv.json | +JSON configuration file to be loaded in IGV for visualising alignments against the reference. | +aggregated | +
BAM file (minimap2) | +BAMS/{{ alias }}.reads_aln_sorted.bam | +BAM file generated from mapping input reads to the reference. | +per-sample | +
BAM index file (minimap2) | +BAMS/{{ alias }}.reads_aln_sort.bam.bai | +Index file generated from mapping input reads to the reference. | +per-sample | +
wf-transcriptome
on the direct RNA sequencing data from C. elegansBased on the demo data, you can set up the input folder and run the workflow on the direct RNA from the Genome Research paper.
+# Takes around 14 minutes to finish
+sbatch /scratch/zt1/project/bioi611/shared/scripts/ONT_directRNA_wf_transcriptome.sub
+
+You can find the output files here:
+/scratch/zt1/project/bioi611/user/$USER/ONT_directRNA
+
+In Oxford Nanopore sequencing, raw data captures the electrical signal generated as DNA or RNA molecules pass through a nanopore. This signal reflects variations in ionic current caused by the unique properties of each nucleotide. Key points about raw data:
+Ionic Current Signal: The primary measurement is the change in ionic current as each nucleotide interacts with the nanopore. This signal is captured continuously.
+MinKNOW Software: This software suite manages the sequencing process, capturing raw signals and translating them into "reads."
+File Formats:
+POD5: This is the primary file format used in recent ONT sequencing runs, replacing the older FAST5 format.
+Each read in these files corresponds to a single DNA or RNA strand. +Understanding raw data is crucial because it represents the initial and most unprocessed form of information from ONT sequencing. However, it’s challenging to interpret without further processing.
+Base Calling and File Outputs (BAM/FASTQ)
+After generating raw data, the next essential step is base calling, which translates the electrical signal into nucleotide sequences. This is where machine learning plays a critical role:
+Base Calling Process:
+Signal Processing Techniques: ONT’s basecalling algorithms use advanced machine learning models to interpret the raw signal.
+Output: Each ionic current pattern is mapped to a sequence of nucleotide bases (A, T, C, or G).
+Output File Formats:
+BAM Files: These files contain sequence information along with potential modifications and alignment information. ONT typically structures BAM files with 4,000 reads per file by default.
+FASTQ Files: This is the widely-used format for storing nucleotide sequences and their associated quality scores. Similar to BAM, ONT defaults to 4,000 reads per file in FASTQ format.
+In Roach, et. al., 2020, RNA sequencing on the GridION platform was performed using ONT R9.4 flow cells and the standard MinKNOW protocol script (NC_48Hr_sequencing_FLO-MIN106_SQK-RNA001). The raw data is in FAST5 format.
+Guppy is a data processing toolkit that contains the Oxford Nanopore Technologies' production basecalling algorithms and several bioinformatic post-processing features.
+To basecall reads with Guppy
, you will need to use the following commands:
guppy_basecaller
(or the fully-qualified path if using the archive installer)
--input_path
: Full or relative path to the directory where the raw read files are located. The folder can be absolute or a relative path to the current working directory.
--save_path
: Full or relative path to the directory where the basecall results will be saved. The folder can be absolute or a relative path to the current working directory. This folder will be created if it does not exist using the path you provide. (e.g. if it is a relative path, it will be relative to the current working directory)
Then either:
+--config
: configuration file containing Guppy parametersor
+--flowcell
flow cell version --kit
sequencing kit versionThe kit and flow cell information should be clearly labelled on the corresponding boxes. Flow cells almost always start with "FLO" and kits almost always start with "SQK" or "VSK".
+To see the supported flow cells and kits, run Guppy
with the --print_workflows
option:
/scratch/zt1/project/bioi611/shared/software/ont-guppy-cpu/bin/guppy_basecaller --print_workflows |grep 'FLO-M
+IN106' |grep 'SQK-RNA001'
+flowcell kit barcoding config_name model version
+FLO-MIN106 SQK-RNA001 rna_r9.4.1_70bps_hac 2020-09-07_rna_r9.4.1_minion_256_8f8fc47b
+
+
+Guppy
/scratch/zt1/project/bioi611/shared/software/ont-guppy-cpu/bin/guppy_basecaller \
+ -i test_data/ \
+ --config /scratch/zt1/project/bioi611/sha
+red/software/ont-guppy-cpu/data/rna_r9.4.1_70bps_hac.cfg \
+--save_path temp --recursive
+
+Guppy categorizes reads based on their quality scores, storing them in separate folders for easy access and downstream processing.
+pass/: Contains reads with quality scores above a specified threshold (typically a Phred quality score of 7 or higher). These reads are considered high quality and are commonly used for downstream analyses. +File Format: FASTQ files, each storing sequences and associated quality scores.
+fail/: Contains reads with quality scores below the specified threshold, indicating lower confidence in accuracy. These reads might be filtered out or re-processed depending on the study's goals. +File Format: FASTQ files, similar to those in the pass directory, but typically excluded from final analyses.
+Remember the data you basecalled here is only a test dataset. The number of reads in pass/
and fail/
doesn't reflect the acutual data.
Software dorado
is now the recommended tool to perform basecalling on POD5 files.
FAST5 files can be converted to POD5 files using the tool below:
+https://github.com/nanoporetech/pod5-file-format
+
+
+
+ orig.ident | nCount_RNA | nFeature_RNA | |
---|---|---|---|
<fct> | <dbl> | <int> | |
AAACCCATCAGATGCT-1 | pbmc5k | 11578 | 3378 |
AAACGAAAGTGCTACT-1 | pbmc5k | 5655 | 1966 |
AAACGAAGTCGTAATC-1 | pbmc5k | 14728 | 237 |
AAACGAAGTTGCCAAT-1 | pbmc5k | 10903 | 3000 |
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene |
---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> | <chr> |
0.000000e+00 | 2.612691 | 0.944 | 0.313 | 0.000000e+00 | 0 | FHIT |
0.000000e+00 | 2.516684 | 0.952 | 0.254 | 0.000000e+00 | 0 | LEF1 |
0.000000e+00 | 2.237676 | 0.935 | 0.304 | 0.000000e+00 | 0 | PRKCQ-AS1 |
0.000000e+00 | 1.184164 | 0.998 | 0.940 | 0.000000e+00 | 0 | RPS3A |
0.000000e+00 | 1.149238 | 0.998 | 0.933 | 0.000000e+00 | 0 | RPS13 |
0.000000e+00 | 1.094210 | 0.999 | 0.948 | 0.000000e+00 | 0 | RPL30 |
0.000000e+00 | 1.088689 | 0.998 | 0.948 | 0.000000e+00 | 0 | RPS14 |
0.000000e+00 | 1.086214 | 0.999 | 0.944 | 0.000000e+00 | 0 | RPL34 |
2.282344e-307 | 1.099268 | 0.998 | 0.938 | 5.656789e-303 | 0 | RPL9 |
8.931130e-306 | 1.991463 | 0.976 | 0.360 | 2.213581e-301 | 0 | CAMK4 |
1.918803e-305 | 1.036297 | 0.999 | 0.960 | 4.755752e-301 | 0 | RPL21 |
1.396057e-302 | 1.154045 | 0.998 | 0.945 | 3.460128e-298 | 0 | RPL32 |
7.395468e-297 | 1.153411 | 0.997 | 0.939 | 1.832967e-292 | 0 | RPS6 |
3.766783e-296 | 1.148444 | 0.999 | 0.966 | 9.335970e-292 | 0 | RPS12 |
7.405330e-296 | 1.058430 | 0.999 | 0.947 | 1.835411e-291 | 0 | RPL35A |
6.849496e-295 | 1.108384 | 0.998 | 0.938 | 1.697647e-290 | 0 | RPS25 |
5.810357e-294 | 1.039329 | 0.998 | 0.935 | 1.440097e-289 | 0 | RPS23 |
3.301385e-293 | 2.280375 | 0.819 | 0.199 | 8.182482e-289 | 0 | MAL |
1.125193e-291 | 1.923380 | 0.980 | 0.457 | 2.788792e-287 | 0 | PRKCA |
1.551987e-289 | 1.021773 | 1.000 | 0.943 | 3.846601e-285 | 0 | RPS15A |
3.591209e-289 | 1.039366 | 0.998 | 0.929 | 8.900811e-285 | 0 | RPS16 |
2.927609e-288 | 1.010351 | 0.997 | 0.935 | 7.256080e-284 | 0 | RPL7 |
1.236189e-286 | 1.000003 | 0.998 | 0.947 | 3.063895e-282 | 0 | RPS27A |
2.934025e-286 | 1.031792 | 0.998 | 0.943 | 7.271982e-282 | 0 | RPL11 |
3.006967e-286 | 1.078095 | 0.998 | 0.927 | 7.452768e-282 | 0 | RPS20 |
4.044675e-285 | 1.094725 | 0.998 | 0.924 | 1.002473e-280 | 0 | RPL22 |
2.179611e-283 | 3.093953 | 0.643 | 0.110 | 5.402167e-279 | 0 | TSHZ2 |
1.224196e-280 | 1.028954 | 0.998 | 0.937 | 3.034170e-276 | 0 | RPL38 |
1.691321e-280 | 2.519016 | 0.739 | 0.168 | 4.191939e-276 | 0 | CCR7 |
2.787511e-279 | 1.000468 | 0.998 | 0.951 | 6.908846e-275 | 0 | RPS27 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
0.009116629 | 3.631680 | 0.021 | 0.002 | 1 | 11 | H4C1 |
0.009116629 | 3.608558 | 0.021 | 0.002 | 1 | 11 | ENSG00000289291 |
0.009116629 | 3.605235 | 0.021 | 0.002 | 1 | 11 | VAMP1-AS1 |
0.009116629 | 3.591495 | 0.021 | 0.002 | 1 | 11 | ENSG00000249328 |
0.009116629 | 3.567020 | 0.021 | 0.002 | 1 | 11 | ERICH2-DT |
0.009116629 | 3.550000 | 0.021 | 0.002 | 1 | 11 | NLRP10 |
0.009116629 | 3.518027 | 0.021 | 0.002 | 1 | 11 | ENSG00000270087 |
0.009116629 | 3.496166 | 0.021 | 0.002 | 1 | 11 | ENSG00000253593 |
0.009116629 | 3.393207 | 0.021 | 0.002 | 1 | 11 | ADAM11 |
0.009116629 | 3.272524 | 0.021 | 0.002 | 1 | 11 | ENSG00000228150 |
0.009116629 | 3.266281 | 0.021 | 0.002 | 1 | 11 | LINC03065 |
0.009116629 | 3.221456 | 0.021 | 0.002 | 1 | 11 | STARD13-AS |
0.009116629 | 3.109640 | 0.021 | 0.002 | 1 | 11 | FGGY-DT |
0.009116629 | 2.116185 | 0.021 | 0.002 | 1 | 11 | ENSG00000285751 |
0.009353183 | 1.486986 | 0.146 | 0.057 | 1 | 11 | TRPM2 |
0.009560927 | 1.198107 | 0.083 | 0.024 | 1 | 11 | SYCP3 |
0.009563781 | 1.367524 | 0.062 | 0.015 | 1 | 11 | MAP7 |
0.009595402 | 1.283987 | 0.083 | 0.024 | 1 | 11 | PPP1R13L |
0.009686754 | 2.584749 | 0.042 | 0.008 | 1 | 11 | PRRG2 |
0.009706714 | 2.407445 | 0.042 | 0.008 | 1 | 11 | LINC02185 |
0.009746744 | 2.473951 | 0.042 | 0.008 | 1 | 11 | LINC02901 |
0.009766814 | 2.300928 | 0.042 | 0.008 | 1 | 11 | WNK3 |
0.009766814 | 1.640083 | 0.042 | 0.008 | 1 | 11 | CALCRL-AS1 |
0.009786921 | 2.434878 | 0.042 | 0.008 | 1 | 11 | ENSG00000272112 |
0.009786922 | 2.438294 | 0.042 | 0.008 | 1 | 11 | ENSG00000277589 |
0.009847464 | 2.262934 | 0.042 | 0.008 | 1 | 11 | PCDH15 |
0.009847464 | 2.064514 | 0.042 | 0.008 | 1 | 11 | CIBAR1 |
0.009888011 | 2.173570 | 0.042 | 0.008 | 1 | 11 | SEZ6 |
0.009908340 | 1.746860 | 0.042 | 0.008 | 1 | 11 | RTKN |
0.009993542 | 1.340809 | 0.146 | 0.059 | 1 | 11 | ENSG00000287100 |
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | |
---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
CST7 | 0.000000e+00 | 8.454013 | 0.946 | 0.010 | 0.000000e+00 |
GZMA | 0.000000e+00 | 7.631713 | 0.950 | 0.019 | 0.000000e+00 |
NKG7 | 0.000000e+00 | 7.541945 | 0.991 | 0.064 | 0.000000e+00 |
CCL5 | 0.000000e+00 | 7.416700 | 0.989 | 0.053 | 0.000000e+00 |
MYBL1 | 2.644088e-288 | 5.930514 | 0.853 | 0.055 | 6.553373e-284 |
# # Install the remotes package
+# if (!requireNamespace("remotes", quietly = TRUE)) {
+# install.packages("remotes")
+# }
+# # Install Seurat
+# if (!requireNamespace("Seurat", quietly = TRUE)) {
+# remotes::install_github("satijalab/seurat", "seurat5", quiet = TRUE)
+# }
+# # Install BiocManager
+# if (!require("BiocManager", quietly = TRUE))
+# install.packages("BiocManager")
+
+# # Install SingleR package
+# if (!require("hdf5r", quietly = TRUE)){
+# BiocManager::install("hdf5r")
+# }
+# # Install SingleR package
+# if (!require("presto", quietly = TRUE)){
+# remotes::install_github("immunogenomics/presto")
+# }
+# # Install SingleR package
+# if (!require("SingleR", quietly = TRUE)){
+# BiocManager::install("SingleR")
+# }
+# if (!require("celldex", quietly = TRUE)){
+# BiocManager::install("celldex")
+# }
+# if (!require("SingleCellExperiment", quietly = TRUE)){
+# BiocManager::install("SingleCellExperiment")
+# }
+# if (!require("scater", quietly = TRUE)){
+# BiocManager::install("scater")
+# }
+
+## Installing the R packages could take around 51 minutes
+## To speed up this process, you can download the R lib files
+## saved from a working Google Colab session
+## https://drive.google.com/file/d/1EQvZnsV6P0eNjbW0hwYhz0P5z0iH3bsL/view?usp=drive_link
+system("gdown 1EQvZnsV6P0eNjbW0hwYhz0P5z0iH3bsL")
+
+system("md5sum R_lib4scRNA.tar.gz", intern = TRUE)
+
+'5898c04fca5e680710cd6728ef9b1422 R_lib4scRNA.tar.gz'
+## required by scater package
+system("apt-get install libx11-dev libcairo2-dev") #, intern = TRUE)
+
+system("tar zxvf R_lib4scRNA.tar.gz")
+
+.libPaths(c("/content/usr/local/lib/R/site-library", .libPaths()))
+
+.libPaths()
+
+
+library(Seurat)
+library(dplyr)
+library(SingleR)
+library(celldex)
+library(scater)
+library(SingleCellExperiment)
+
+list.files()
+
+
+# https://drive.google.com/file/d/1-CvmcLvKMYW-OcLuGfFuGQMK2b5_VMFk/view?usp=drive_link
+# Download "filtered_feature_bc_matrix.h5"
+# Output of cellranger
+system("gdown 1-CvmcLvKMYW-OcLuGfFuGQMK2b5_VMFk")
+
+system("md5sum filtered_feature_bc_matrix.h5", intern = TRUE)
+
+'360fc0760ebb9e6dd253d808a427b20d filtered_feature_bc_matrix.h5'
+count_mtx_scrna <- Read10X_h5("filtered_feature_bc_matrix.h5")
+# If you have the filtered_feature_bc_matrix/ folder, you can use
+# Read10X to create 'count_mtx_scrna'
+# system("mkdir filtered_feature_bc_matrix/; mv filtered_feature_bc_matrix.zip filtered_feature_bc_matrix")
+# system("cd filtered_feature_bc_matrix; unzip filtered_feature_bc_matrix.zip")
+# count_mtx_scrna <- Read10X("filtered_feature_bc_matrix/")
+
+class(count_mtx_scrna)
+
+'dgCMatrix'
+The dgCMatrix class is a specific data structure in R's Matrix package, designed to store sparse matrices in a memory-efficient format. Sparse matrices are those with many zeros, making them ideal for high-dimensional data in applications like bioinformatics, where gene expression matrices often contain a lot of zeroes.
+++Why Use dgCMatrix?
+
print(format(object.size(count_mtx_scrna), units = "MB"))
+
+[1] "168.7 Mb"
+
+# Check a few genes in the first 20 cells
+count_mtx_scrna[c("CD3D", "TCL1A", "MS4A1"), 100:140]
+
+ [[ suppressing 41 column names ‘AACCATGCACTCAAGT-1’, ‘AACCATGGTAGCTTGT-1’, ‘AACCATGTCAATCCGA-1’ ... ]]
+
+
+
+
+3 x 41 sparse Matrix of class "dgCMatrix"
+
+CD3D 8 1 . . . 3 . 2 10 . 3 . . . 2 7 1 . . . 1 1 . 8 . . 2 4 . . . 12 11 .
+TCL1A . . . . . . 6 . . . . . . . . . . . . . . . 10 . . . . . . . . . . .
+MS4A1 . . . . . . 9 . . 16 . . . . . . . . . . . . 5 . . . . . . . . . . .
+
+CD3D . 5 . . . 3 .
+TCL1A . . . . . . .
+MS4A1 20 . . . 39 . .
+
+# non-normalized da# Initialize the Seurat object with the raw count matrix
+pbmc <- CreateSeuratObject(counts = count_mtx_scrna,
+ project = "pbmc5k",
+ min.cells = 3, min.features = 200)
+pbmc
+
+An object of class Seurat
+24785 features across 4884 samples within 1 assay
+Active assay: RNA (24785 features, 0 variable features)
+ 1 layer present: counts
+
+++Seurat slots +https://github.com/satijalab/seurat/wiki/seurat
+
str(pbmc)
+
+Formal class 'Seurat' [package "SeuratObject"] with 13 slots
+ ..@ assays :List of 1
+ .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
+ .. .. .. ..@ layers :List of 1
+ .. .. .. .. ..$ counts:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
+ .. .. .. .. .. .. ..@ i : int [1:14449622] 6 17 42 62 79 83 85 94 100 109 ...
+ .. .. .. .. .. .. ..@ p : int [1:4885] 0 3378 5344 5581 8581 10897 13921 16902 20299 22669 ...
+ .. .. .. .. .. .. ..@ Dim : int [1:2] 24785 4884
+ .. .. .. .. .. .. ..@ Dimnames:List of 2
+ .. .. .. .. .. .. .. ..$ : NULL
+ .. .. .. .. .. .. .. ..$ : NULL
+ .. .. .. .. .. .. ..@ x : num [1:14449622] 1 1 4 1 1 2 1 1 1 1 ...
+ .. .. .. .. .. .. ..@ factors : list()
+ .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
+ .. .. .. .. .. ..@ .Data: logi [1:4884, 1] TRUE TRUE TRUE TRUE TRUE TRUE ...
+ .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
+ .. .. .. .. .. .. .. ..$ : chr [1:4884] "AAACCCATCAGATGCT-1" "AAACGAAAGTGCTACT-1" "AAACGAAGTCGTAATC-1" "AAACGAAGTTGCCAAT-1" ...
+ .. .. .. .. .. .. .. ..$ : chr "counts"
+ .. .. .. .. .. ..$ dim : int [1:2] 4884 1
+ .. .. .. .. .. ..$ dimnames:List of 2
+ .. .. .. .. .. .. ..$ : chr [1:4884] "AAACCCATCAGATGCT-1" "AAACGAAAGTGCTACT-1" "AAACGAAGTCGTAATC-1" "AAACGAAGTTGCCAAT-1" ...
+ .. .. .. .. .. .. ..$ : chr "counts"
+ .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
+ .. .. .. .. .. ..@ .Data: logi [1:24785, 1] TRUE TRUE TRUE TRUE TRUE TRUE ...
+ .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
+ .. .. .. .. .. .. .. ..$ : chr [1:24785] "ENSG00000238009" "ENSG00000241860" "ENSG00000290385" "ENSG00000291215" ...
+ .. .. .. .. .. .. .. ..$ : chr "counts"
+ .. .. .. .. .. ..$ dim : int [1:2] 24785 1
+ .. .. .. .. .. ..$ dimnames:List of 2
+ .. .. .. .. .. .. ..$ : chr [1:24785] "ENSG00000238009" "ENSG00000241860" "ENSG00000290385" "ENSG00000291215" ...
+ .. .. .. .. .. .. ..$ : chr "counts"
+ .. .. .. ..@ default : int 1
+ .. .. .. ..@ assay.orig: chr(0)
+ .. .. .. ..@ meta.data :'data.frame': 24785 obs. of 0 variables
+ .. .. .. ..@ misc :List of 1
+ .. .. .. .. ..$ calcN: logi TRUE
+ .. .. .. ..@ key : chr "rna_"
+ ..@ meta.data :'data.frame': 4884 obs. of 3 variables:
+ .. ..$ orig.ident : Factor w/ 1 level "pbmc5k": 1 1 1 1 1 1 1 1 1 1 ...
+ .. ..$ nCount_RNA : num [1:4884] 11578 5655 14728 10903 6174 ...
+ .. ..$ nFeature_RNA: int [1:4884] 3378 1966 237 3000 2316 3024 2981 3397 2370 2811 ...
+ ..@ active.assay: chr "RNA"
+ ..@ active.ident: Factor w/ 1 level "pbmc5k": 1 1 1 1 1 1 1 1 1 1 ...
+ .. ..- attr(*, "names")= chr [1:4884] "AAACCCATCAGATGCT-1" "AAACGAAAGTGCTACT-1" "AAACGAAGTCGTAATC-1" "AAACGAAGTTGCCAAT-1" ...
+ ..@ graphs : list()
+ ..@ neighbors : list()
+ ..@ reductions : list()
+ ..@ images : list()
+ ..@ project.name: chr "pbmc5k"
+ ..@ misc : list()
+ ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
+ .. ..$ : int [1:3] 5 0 2
+ ..@ commands : list()
+ ..@ tools : list()
+
+
+
+slotNames(pbmc)
+
+
+pbmc@active.assay
+
+'RNA'
+class(pbmc@meta.data)
+head(pbmc@meta.data, 4)
+
+'data.frame'
+orig.ident | nCount_RNA | nFeature_RNA | |
---|---|---|---|
<fct> | <dbl> | <int> | |
AAACCCATCAGATGCT-1 | pbmc5k | 11578 | 3378 |
AAACGAAAGTGCTACT-1 | pbmc5k | 5655 | 1966 |
AAACGAAGTCGTAATC-1 | pbmc5k | 14728 | 237 |
AAACGAAGTTGCCAAT-1 | pbmc5k | 10903 | 3000 |
Layers(pbmc)
+
+'counts'
+pbmc@version
+pbmc@commands
+
+[1] ‘5.0.2’
+
+# Use $ operator to add columns to object metadata.
+pbmc$percent.mt <- PercentageFeatureSet(pbmc,
+ pattern = "^MT-")
+
+
+colnames(pbmc@meta.data)
+
+
+# Use violin plot to visualize QC metrics
+VlnPlot(pbmc,
+ features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
+ ncol = 3)
+
+Warning message:
+“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
+
+Shape: Each violin plot shows the distribution of values for each feature across the cells in your dataset. The shape of the plot indicates the density of cells with particular values for that feature.
+Wider sections indicate more cells with those values. +Narrow sections indicate fewer cells with those values.
+Vertical Axis: Represents the range of values for each feature. For instance:
+nFeature_RNA
and nCount_RNA
: Higher values suggest more gene diversity and RNA content, respectively.
percent.mt
: Higher values indicate higher mitochondrial content, which may point to stressed or dying cells.
nFeature_RNA
: The number of unique features (genes) detected per cell.
Extremely high values could suggest potential doublets (two cells mistakenly captured as one), as two cells would have more unique genes combined.
+Low number of detected genes - potential ambient mRNA (not real cells)
+nCount_RNA
: The total number of RNA molecules (or unique molecular identifiers, UMIs) detected per cell.
Higher counts generally indicate higher RNA content, but they could also result from cell doublets. +Cells with very low nCount_RNA might represent poor-quality cells with low RNA capture, while very high counts may also suggest doublets.
+percent.mt
: The percentage of reads mapping to mitochondrial genes.
High mitochondrial content often indicates cell stress or apoptosis, as damaged cells tend to release mitochondrial RNA.
+Filtering cells with high percent.mt
values is common to exclude potentially dying cells.
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
+# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
+
+plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
+plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
+plot1 + plot2
+
+# Load necessary libraries
+library(Seurat)
+library(ggplot2)
+
+# Define the function to calculate median and MAD values
+calculate_thresholds <- function(seurat_obj) {
+ # Extract relevant columns
+ nFeature_values <- seurat_obj@meta.data$nFeature_RNA
+ nCount_values <- seurat_obj@meta.data$nCount_RNA
+ percent_mt_values <- seurat_obj@meta.data$percent.mt
+
+ # Calculate medians and MADs
+ nFeature_median <- median(nFeature_values, na.rm = TRUE)
+ nFeature_mad <- mad(nFeature_values, constant = 1, na.rm = TRUE)
+
+ nCount_median <- median(nCount_values, na.rm = TRUE)
+ nCount_mad <- mad(nCount_values, constant = 1, na.rm = TRUE)
+
+ percent_mt_median <- median(percent_mt_values, na.rm = TRUE)
+ percent_mt_mad <- mad(percent_mt_values, constant = 1, na.rm = TRUE)
+
+ # Calculate thresholds for horizontal lines
+ thresholds <- list(
+ nFeature_upper = nFeature_median + 4 * nFeature_mad,
+ nFeature_lower = nFeature_median - 4 * nFeature_mad,
+ nCount_upper = nCount_median + 4 * nCount_mad,
+ nCount_lower = nCount_median - 4 * nCount_mad,
+ percent_mt_upper = percent_mt_median + 4 * percent_mt_mad
+ )
+
+ return(thresholds)
+}
+
+# Calculate thresholds
+thresholds <- calculate_thresholds(pbmc)
+
+
+thresholds
+
+vplot1 <- VlnPlot(pbmc, features = c("nFeature_RNA"), ncol = 2) +
+ geom_hline(yintercept = thresholds$nFeature_upper,
+ color = "blue", linetype = "solid") +
+ geom_hline(yintercept = thresholds$nFeature_lower,
+ color = "blue", linetype = "solid") +
+ theme(legend.position="none")
+vplot2 <- VlnPlot(pbmc, features = c("percent.mt"), ncol = 2) +
+ geom_hline(yintercept = thresholds$percent_mt_upper,
+ color = "blue", linetype = "solid") +
+ theme(legend.position="none")
+vplot1 + vplot2
+
+Warning message:
+“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
+Warning message:
+“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
+
+pbmc <- subset(pbmc,
+ subset = thresholds$nFeature_lower > 200 &
+ nFeature_RNA < thresholds$nFeature_upper &
+ percent.mt < thresholds$percent_mt_upper)
+
+
+# Use violin plot to visualize QC metrics after QC
+VlnPlot(pbmc,
+ features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
+ ncol = 3)
+
+Warning message:
+“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
+
+Instead of using an arbitrary number, you can also use statistical algorithm to predict doublets and empty droplets to filter the cells, such as DoubletFinder
and EmptyDrops
.
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. In Seurat v5, Normalized values are stored in pbmc[["RNA"]]$data
.
pbmc <- NormalizeData(pbmc) # normalization.method = "LogNormalize", scale.factor = 10000
+
+Normalizing layer: counts
+
+While this method of normalization is standard and widely used in scRNA-seq analysis, global-scaling relies on an assumption that each cell originally contains the same number of RNA molecules.
+Next, we identify a subset of features that show high variation across cells in the dataset—meaning they are highly expressed in some cells and lowly expressed in others. Prior work, including our own, has shown that focusing on these variable genes in downstream analyses can enhance the detection of biological signals in single-cell datasets.
+The approach used in Seurat improves upon previous versions by directly modeling the inherent mean-variance relationship in single-cell data. This method is implemented in the FindVariableFeatures() function, which, by default, selects 2,000 variable features per dataset. These features will then be used in downstream analyses, such as PCA.
+pbmc <- FindVariableFeatures(pbmc,
+ selection.method = "vst",
+ nfeatures = 2000)
+
+
+Finding variable features for layer counts
+
+# Identify the 10 most highly variable genes
+top10 <- head(VariableFeatures(pbmc), 10)
+options(repr.plot.width=10, repr.plot.height= 6)
+
+# plot variable features with and without labels
+plot1 <- VariableFeaturePlot(pbmc)
+plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
+plot1 + plot2
+
+
+When using repel, set xnudge and ynudge to 0 for optimal results
+
+Warning message in scale_x_log10():
+“[1m[22m[32mlog-10[39m transformation introduced infinite values.”
+Warning message in scale_x_log10():
+“[1m[22m[32mlog-10[39m transformation introduced infinite values.”
+
+Next, we apply a linear transformation (scaling
) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData()
function:
Shifts the expression of each gene, so that the mean expression across cells is 0 +Scales the expression of each gene, so that the variance across cells is 1
+This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
+The results of this are stored in pbmc[["RNA"]]$scale.data
By default, only variable features are scaled. +You can specify the features argument to scale additional features.
+all.genes <- rownames(pbmc)
+pbmc <- ScaleData(pbmc, features = all.genes)
+
+Centering and scaling data matrix
+
+pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
+
+PC_ 1
+Positive: CD247, IL32, IL7R, RORA, CAMK4, LTB, INPP4B, STAT4, BCL2, ANK3
+ ZEB1, LEF1, TRBC1, CARD11, THEMIS, BACH2, MLLT3, RNF125, RASGRF2, NR3C2
+ NELL2, PDE3B, LINC01934, ENSG00000290067, PRKCA, TAFA1, PYHIN1, CTSW, CSGALNACT1, SAMD3
+Negative: LYZ, FCN1, IRAK3, SLC8A1, CLEC7A, PLXDC2, IFI30, S100A9, SPI1, CYBB
+ MNDA, LRMDA, FGL2, VCAN, CTSS, RBM47, CSF3R, MCTP1, NCF2, TYMP
+ CYRIA, CST3, HCK, SLC11A1, WDFY3, S100A8, MS4A6A, MPEG1, LST1, CSTA
+PC_ 2
+Positive: CD247, S100A4, STAT4, NKG7, CST7, CTSW, GZMA, SYTL3, RNF125, SAMD3
+ NCALD, MYO1F, MYBL1, KLRD1, PLCB1, TGFBR3, PRF1, GNLY, RAP1GAP2, RORA
+ CCL5, HOPX, FGFBP2, YES1, PYHIN1, FNDC3B, GNG2, SYNE1, KLRF1, SPON2
+Negative: BANK1, MS4A1, CD79A, FCRL1, PAX5, IGHM, AFF3, LINC00926, NIBAN3, EBF1
+ IGHD, BLK, CD22, OSBPL10, HLA-DQA1, COL19A1, GNG7, KHDRBS2, RUBCNL, TNFRSF13C
+ COBLL1, RALGPS2, TCL1A, BCL11A, CDK14, CD79B, PLEKHG1, HLA-DQB1, IGKC, BLNK
+PC_ 3
+Positive: TUBB1, GP9, GP1BB, PF4, CAVIN2, GNG11, NRGN, PPBP, RGS18, PRKAR2B
+ H2AC6, ACRBP, PTCRA, TMEM40, TREML1, CLU, LEF1, GPX1, CMTM5, SMANTIS
+ MPIG6B, CAMK4, MPP1, SPARC, ENSG00000289621, ITGB3, MYL9, MYL4, ITGA2B, F13A1
+Negative: NKG7, CST7, GNLY, PRF1, KLRD1, GZMA, KLRF1, MCTP2, GZMB, FGFBP2
+ HOPX, SPON2, C1orf21, TGFBR3, VAV3, MYBL1, CTSW, SYNE1, NCALD, IL2RB
+ SAMD3, GNG2, BNC2, CEP78, YES1, RAP1GAP2, PDGFD, LINC02384, CARD11, CLIC3
+PC_ 4
+Positive: CAMK4, INPP4B, IL7R, LEF1, PRKCA, PDE3B, MAML2, LTB, ANK3, PLCL1
+ BCL2, CDC14A, THEMIS, FHIT, NELL2, VIM, ENSG00000290067, MLLT3, TSHZ2, NR3C2
+ IL32, CMTM8, ENSG00000249806, ZEB1, SESN3, CSGALNACT1, TAFA1, LEF1-AS1, SLC16A10, LDLRAD4
+Negative: GP1BB, GP9, TUBB1, PF4, CAVIN2, GNG11, PPBP, H2AC6, PTCRA, NRGN
+ ACRBP, TMEM40, PRKAR2B, RGS18, TREML1, MPIG6B, SMANTIS, CMTM5, CLU, SPARC
+ ITGA2B, ITGB3, ENSG00000289621, MYL9, CAPN1-AS1, MYL4, ENSG00000288758, DAB2, PDGFA-DT, CTTN
+PC_ 5
+Positive: CDKN1C, HES4, FCGR3A, PELATON, CSF1R, IFITM3, SIGLEC10, TCF7L2, ZNF703, MS4A7
+ UICLM, ENSG00000287682, NEURL1, RHOC, FMNL2, CKB, FTL, CALHM6, HMOX1, BATF3
+ ACTB, MYOF, CCDC26, IFITM2, PAPSS2, RRAS, LST1, VMO1, SERPINA1, LRRC25
+Negative: LINC02458, AKAP12, CA8, ENSG00000250696, SLC24A3, HDC, IL3RA, EPAS1, ENPP3, OSBPL1A
+ TRPM6, CCR3, CSF2RB, SEMA3C, THSD7A, ATP10D, DACH1, CRPPA, ATP8B4, TMEM164
+ ABHD5, CLC, CR1, ITGB8, LIN7A, TAFA2, MBOAT2, GATA2, DAPK2, GCSAML
+
+You have several useful ways to visualize both cells and features that define the PCA, including VizDimReduction()
, DimPlot()
, and DimHeatmap()
.
DimPlot(pbmc, reduction = "pca") + NoLegend()
+
+DimHeatmap()
draws a heatmap focusing on a principal component. Both cells and genes are sorted by their principal component scores
DimHeatmap(pbmc, dims = 1:3, cells = 500, balanced = TRUE)
+DimHeatmap(pbmc, dims = 20:22, cells = 500, balanced = TRUE)
+
+
+
+
+The elbow plot is a useful tool for determining the number of principal components (PCs) needed to capture the majority of variation in the data. It displays the standard deviation of each PC, with the "elbow" point typically serving as the threshold for selecting the most informative PCs. However, identifying the exact location of the elbow can be somewhat subjective.
+ElbowPlot(pbmc, ndims = 50)
+
+# Determine the percentage of variation associated with each PC
+pct_var <- pbmc[["pca"]]@stdev / sum(pbmc[["pca"]]@stdev) * 100
+
+# Calculate cumulative percentages for each PC
+cumu_pct <- cumsum(pct_var)
+
+# Identify the first PC where cumulative percentage exceeds 90% and individual variance is less than 5%
+pc_number <- which(cumu_pct > 90 & pct_var < 5)[1]
+
+pc_number
+
+41
+
+
+Seurat embeds cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected quasi-cliques
or communities
.
Seurat first constructs a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors()
function, and takes as input the previously defined dimensionality of the dataset.
To cluster the cells, Seurat next applies modularity optimization techniques such as the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the granularity
of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 1 typically returns good results for single-cell datasets of around 5k cells. Optimal resolution often increases for larger datasets. The clusters can be found using the Idents()
function.
pbmc <- FindNeighbors(pbmc, dims = 1:pc_number)
+pbmc <- FindClusters(pbmc, resolution = 0.2)
+
+Computing nearest neighbor graph
+
+Computing SNN
+
+
+
+Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
+
+Number of nodes: 4559
+Number of edges: 184776
+
+Running Louvain algorithm...
+Maximum modularity in 10 random starts: 0.9568
+Number of communities: 12
+Elapsed time: 0 seconds
+
+# Look at cluster IDs of the first 5 cells
+head(Idents(pbmc), 5)
+
+
+To visualize and explore these datasets, Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP.
+The goal of tSNE/UMAP is to learn underlying structure in the dataset, in order to place similar cells together in low-dimensional space. Therefore, cells that are grouped together within graph-based clusters determined above should co-localize on these dimension reduction plots.
+pbmc <- RunUMAP(pbmc, dims = 1:pc_number)
+
+
+12:20:01 UMAP embedding parameters a = 0.9922 b = 1.112
+
+12:20:01 Read 4559 rows and found 41 numeric columns
+
+12:20:01 Using Annoy for neighbor search, n_neighbors = 30
+
+12:20:01 Building Annoy index with metric = cosine, n_trees = 50
+
+0% 10 20 30 40 50 60 70 80 90 100%
+
+[----|----|----|----|----|----|----|----|----|----|
+
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+*
+|
+
+12:20:03 Writing NN index file to temp file /tmp/Rtmp2EFvyF/file14478d5c732
+
+12:20:03 Searching Annoy index using 1 thread, search_k = 3000
+
+12:20:05 Annoy recall = 100%
+
+12:20:06 Commencing smooth kNN distance calibration using 1 thread
+ with target n_neighbors = 30
+
+12:20:08 Found 2 connected components,
+falling back to 'spca' initialization with init_sdev = 1
+
+12:20:08 Using 'irlba' for PCA
+
+12:20:08 PCA: 2 components explained 46.09% variance
+
+12:20:08 Scaling init to sdev = 1
+
+12:20:08 Commencing optimization for 500 epochs, with 188320 positive edges
+
+12:20:18 Optimization finished
+
+Idents(pbmc) = pbmc$seurat_clusters
+
+DimPlot(pbmc, reduction = "umap", label = TRUE)
+
+# find markers for every cluster compared to all remaining cells, report only the positive
+# ones
+pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
+pbmc.markers %>%
+ group_by(cluster) %>%
+ dplyr::filter(avg_log2FC > 1)
+
+Calculating cluster 0
+
+Calculating cluster 1
+
+Calculating cluster 2
+
+Calculating cluster 3
+
+Calculating cluster 4
+
+Calculating cluster 5
+
+Calculating cluster 6
+
+Calculating cluster 7
+
+Calculating cluster 8
+
+Calculating cluster 9
+
+Calculating cluster 10
+
+Calculating cluster 11
+
+p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene |
---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> | <chr> |
0.000000e+00 | 2.612691 | 0.944 | 0.313 | 0.000000e+00 | 0 | FHIT |
0.000000e+00 | 2.516684 | 0.952 | 0.254 | 0.000000e+00 | 0 | LEF1 |
0.000000e+00 | 2.237676 | 0.935 | 0.304 | 0.000000e+00 | 0 | PRKCQ-AS1 |
0.000000e+00 | 1.184164 | 0.998 | 0.940 | 0.000000e+00 | 0 | RPS3A |
0.000000e+00 | 1.149238 | 0.998 | 0.933 | 0.000000e+00 | 0 | RPS13 |
0.000000e+00 | 1.094210 | 0.999 | 0.948 | 0.000000e+00 | 0 | RPL30 |
0.000000e+00 | 1.088689 | 0.998 | 0.948 | 0.000000e+00 | 0 | RPS14 |
0.000000e+00 | 1.086214 | 0.999 | 0.944 | 0.000000e+00 | 0 | RPL34 |
2.282344e-307 | 1.099268 | 0.998 | 0.938 | 5.656789e-303 | 0 | RPL9 |
8.931130e-306 | 1.991463 | 0.976 | 0.360 | 2.213581e-301 | 0 | CAMK4 |
1.918803e-305 | 1.036297 | 0.999 | 0.960 | 4.755752e-301 | 0 | RPL21 |
1.396057e-302 | 1.154045 | 0.998 | 0.945 | 3.460128e-298 | 0 | RPL32 |
7.395468e-297 | 1.153411 | 0.997 | 0.939 | 1.832967e-292 | 0 | RPS6 |
3.766783e-296 | 1.148444 | 0.999 | 0.966 | 9.335970e-292 | 0 | RPS12 |
7.405330e-296 | 1.058430 | 0.999 | 0.947 | 1.835411e-291 | 0 | RPL35A |
6.849496e-295 | 1.108384 | 0.998 | 0.938 | 1.697647e-290 | 0 | RPS25 |
5.810357e-294 | 1.039329 | 0.998 | 0.935 | 1.440097e-289 | 0 | RPS23 |
3.301385e-293 | 2.280375 | 0.819 | 0.199 | 8.182482e-289 | 0 | MAL |
1.125193e-291 | 1.923380 | 0.980 | 0.457 | 2.788792e-287 | 0 | PRKCA |
1.551987e-289 | 1.021773 | 1.000 | 0.943 | 3.846601e-285 | 0 | RPS15A |
3.591209e-289 | 1.039366 | 0.998 | 0.929 | 8.900811e-285 | 0 | RPS16 |
2.927609e-288 | 1.010351 | 0.997 | 0.935 | 7.256080e-284 | 0 | RPL7 |
1.236189e-286 | 1.000003 | 0.998 | 0.947 | 3.063895e-282 | 0 | RPS27A |
2.934025e-286 | 1.031792 | 0.998 | 0.943 | 7.271982e-282 | 0 | RPL11 |
3.006967e-286 | 1.078095 | 0.998 | 0.927 | 7.452768e-282 | 0 | RPS20 |
4.044675e-285 | 1.094725 | 0.998 | 0.924 | 1.002473e-280 | 0 | RPL22 |
2.179611e-283 | 3.093953 | 0.643 | 0.110 | 5.402167e-279 | 0 | TSHZ2 |
1.224196e-280 | 1.028954 | 0.998 | 0.937 | 3.034170e-276 | 0 | RPL38 |
1.691321e-280 | 2.519016 | 0.739 | 0.168 | 4.191939e-276 | 0 | CCR7 |
2.787511e-279 | 1.000468 | 0.998 | 0.951 | 6.908846e-275 | 0 | RPS27 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
0.009116629 | 3.631680 | 0.021 | 0.002 | 1 | 11 | H4C1 |
0.009116629 | 3.608558 | 0.021 | 0.002 | 1 | 11 | ENSG00000289291 |
0.009116629 | 3.605235 | 0.021 | 0.002 | 1 | 11 | VAMP1-AS1 |
0.009116629 | 3.591495 | 0.021 | 0.002 | 1 | 11 | ENSG00000249328 |
0.009116629 | 3.567020 | 0.021 | 0.002 | 1 | 11 | ERICH2-DT |
0.009116629 | 3.550000 | 0.021 | 0.002 | 1 | 11 | NLRP10 |
0.009116629 | 3.518027 | 0.021 | 0.002 | 1 | 11 | ENSG00000270087 |
0.009116629 | 3.496166 | 0.021 | 0.002 | 1 | 11 | ENSG00000253593 |
0.009116629 | 3.393207 | 0.021 | 0.002 | 1 | 11 | ADAM11 |
0.009116629 | 3.272524 | 0.021 | 0.002 | 1 | 11 | ENSG00000228150 |
0.009116629 | 3.266281 | 0.021 | 0.002 | 1 | 11 | LINC03065 |
0.009116629 | 3.221456 | 0.021 | 0.002 | 1 | 11 | STARD13-AS |
0.009116629 | 3.109640 | 0.021 | 0.002 | 1 | 11 | FGGY-DT |
0.009116629 | 2.116185 | 0.021 | 0.002 | 1 | 11 | ENSG00000285751 |
0.009353183 | 1.486986 | 0.146 | 0.057 | 1 | 11 | TRPM2 |
0.009560927 | 1.198107 | 0.083 | 0.024 | 1 | 11 | SYCP3 |
0.009563781 | 1.367524 | 0.062 | 0.015 | 1 | 11 | MAP7 |
0.009595402 | 1.283987 | 0.083 | 0.024 | 1 | 11 | PPP1R13L |
0.009686754 | 2.584749 | 0.042 | 0.008 | 1 | 11 | PRRG2 |
0.009706714 | 2.407445 | 0.042 | 0.008 | 1 | 11 | LINC02185 |
0.009746744 | 2.473951 | 0.042 | 0.008 | 1 | 11 | LINC02901 |
0.009766814 | 2.300928 | 0.042 | 0.008 | 1 | 11 | WNK3 |
0.009766814 | 1.640083 | 0.042 | 0.008 | 1 | 11 | CALCRL-AS1 |
0.009786921 | 2.434878 | 0.042 | 0.008 | 1 | 11 | ENSG00000272112 |
0.009786922 | 2.438294 | 0.042 | 0.008 | 1 | 11 | ENSG00000277589 |
0.009847464 | 2.262934 | 0.042 | 0.008 | 1 | 11 | PCDH15 |
0.009847464 | 2.064514 | 0.042 | 0.008 | 1 | 11 | CIBAR1 |
0.009888011 | 2.173570 | 0.042 | 0.008 | 1 | 11 | SEZ6 |
0.009908340 | 1.746860 | 0.042 | 0.008 | 1 | 11 | RTKN |
0.009993542 | 1.340809 | 0.146 | 0.059 | 1 | 11 | ENSG00000287100 |
# find all markers distinguishing cluster 5 from clusters 0 and 3
+cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3))
+head(cluster5.markers, n = 5)
+
+p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | |
---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
CST7 | 0.000000e+00 | 8.454013 | 0.946 | 0.010 | 0.000000e+00 |
GZMA | 0.000000e+00 | 7.631713 | 0.950 | 0.019 | 0.000000e+00 |
NKG7 | 0.000000e+00 | 7.541945 | 0.991 | 0.064 | 0.000000e+00 |
CCL5 | 0.000000e+00 | 7.416700 | 0.989 | 0.053 | 0.000000e+00 |
MYBL1 | 2.644088e-288 | 5.930514 | 0.853 | 0.055 | 6.553373e-284 |
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
+
+
+VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
+
+
+FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
+ "CD8A"))
+
+pbmc.markers %>%
+ group_by(cluster) %>%
+ dplyr::filter(avg_log2FC > 1) %>%
+ slice_head(n = 10) %>%
+ ungroup() -> top10
+DoHeatmap(pbmc, features = top10$gene) + NoLegend()
+
+SingleR
SingleR is an automated annotation tool designed for single-cell RNA sequencing (scRNA-seq) data. It assigns labels to new cells in a test dataset by comparing their similarity to a reference dataset, which consists of samples (either single-cell or bulk) with known labels. This approach eliminates the need for manually interpreting clusters and identifying marker genes for each new dataset. Instead, the biological insights from the reference dataset can be efficiently applied to annotate new datasets automatically.
+library(SingleCellExperiment )
+
+sce <- as.SingleCellExperiment(pbmc)
+sce <- scater::logNormCounts(sce)
+
+
+# Download and cache the normalized expression values of the data
+# stored in the Human Primary Cell Atlas. The data will be
+# downloaded from ExperimentHub, returning a SummarizedExperiment
+# object for further use.
+hpca <- HumanPrimaryCellAtlasData()
+
+# Obtain human bulk RNA-seq data from Blueprint and ENCODE
+blueprint <- BlueprintEncodeData()
+
+
+pred.hpca <- SingleR(test = sce,
+ ref = hpca, labels = hpca$label.main)
+tab_hpca <- table(pred.hpca$pruned.labels)
+write.csv(sort(tab_hpca, decreasing=TRUE), 'pbmc_annotations_HPCA_general.csv', row.names=FALSE)
+
+Each row of the output DataFrame contains prediction results for a single cell. Labels are shown before (labels) and after pruning (pruned.labels), along with the associated scores.
+head(pred.hpca)
+
+DataFrame with 6 rows and 4 columns
+ scores labels delta.next
+ <matrix> <character> <numeric>
+AAACCCATCAGATGCT-1 0.1409902:0.3257687:0.281000:... T_cells 0.0708860
+AAACGAAAGTGCTACT-1 0.1407268:0.3072562:0.264148:... T_cells 0.6026772
+AAACGAAGTCGTAATC-1 0.0604399:0.0725122:0.184863:... Erythroblast 0.1268946
+AAACGAAGTTGCCAAT-1 0.1585486:0.3228307:0.278787:... T_cells 0.6492489
+AAACGAATCCGAGGCT-1 0.1166524:0.3565152:0.277855:... B_cell 0.0505309
+AAACGAATCGAACGCC-1 0.1437411:0.3427680:0.299201:... NK_cell 0.3155681
+ pruned.labels
+ <character>
+AAACCCATCAGATGCT-1 T_cells
+AAACGAAAGTGCTACT-1 T_cells
+AAACGAAGTCGTAATC-1 Erythroblast
+AAACGAAGTTGCCAAT-1 T_cells
+AAACGAATCCGAGGCT-1 B_cell
+AAACGAATCGAACGCC-1 NK_cell
+
+pred.blueprint <- SingleR(test = sce,
+ ref = blueprint, labels = blueprint$label.main)
+tab_blueprint <- table(pred.blueprint$pruned.labels)
+ write.csv(sort(tab_blueprint, decreasing=TRUE), 'pbmc_annotations_BlueprintENCODE_general.csv', row.names=FALSE)
+
+head(pred.blueprint)
+
+DataFrame with 6 rows and 4 columns
+ scores labels delta.next
+ <matrix> <character> <numeric>
+AAACCCATCAGATGCT-1 0.2145648:0.1136181:0.437872:... CD4+ T-cells 0.0512150
+AAACGAAAGTGCTACT-1 0.2311086:0.1664737:0.393066:... CD4+ T-cells 0.3283657
+AAACGAAGTCGTAATC-1 0.0977069:0.0728724:0.100641:... Erythrocytes 0.0757261
+AAACGAAGTTGCCAAT-1 0.2289000:0.1565938:0.423090:... CD8+ T-cells 0.0620951
+AAACGAATCCGAGGCT-1 0.2353403:0.1291495:0.500443:... B-cells 0.1276654
+AAACGAATCGAACGCC-1 0.2308036:0.1288775:0.417440:... NK cells 0.1486502
+ pruned.labels
+ <character>
+AAACCCATCAGATGCT-1 CD4+ T-cells
+AAACGAAAGTGCTACT-1 CD4+ T-cells
+AAACGAAGTCGTAATC-1 Erythrocytes
+AAACGAAGTTGCCAAT-1 CD8+ T-cells
+AAACGAATCCGAGGCT-1 B-cells
+AAACGAATCGAACGCC-1 NK cells
+
+table(pbmc$seurat_clusters)
+
+ 0 1 2 3 4 5 6 7 8 9 10 11
+934 777 662 604 465 442 224 157 98 94 54 48
+
+pbmc$singleR_hpca = pred.hpca$pruned.labels
+pbmc$singleR_blueprint = pred.blueprint$pruned.labels
+
+Idents(pbmc) = pbmc$singleR_hpca
+DimPlot(pbmc, reduction = "umap",
+ label = TRUE, pt.size = 0.5,
+ repel = TRUE) + NoLegend()
+# Change back to cluster seurat_clusters
+Idents(pbmc) = pbmc$seurat_clusters
+
+Idents(pbmc) = pbmc$singleR_blueprint
+DimPlot(pbmc, reduction = "umap",
+ label = TRUE, pt.size = 0.5,
+ repel = TRUE) + NoLegend()
+# Change back to cluster seurat_clusters
+Idents(pbmc) = pbmc$seurat_clusters
+
+
+
+Although tools like SingleR can automatically annotate the cell types, usually the results will be used as a guidance. You usually need to use the domain knowleges (known marker genes) to help you perform manual annotation.
+The table below lists the marker genes for different cell types expected in PBMC.
+Markers | +Cell Type | +
---|---|
IL7R, CCR7 | +Naive CD4+ T | +
CD14, LYZ | +CD14+ Mono | +
IL7R, S100A4 | +Memory CD4+ | +
MS4A1 | +B | +
CD8A | +CD8+ T | +
FCGR3A, MS4A7 | +FCGR3A+ Mono | +
GNLY, NKG7 | +NK | +
FCER1A, CST3 | +DC | +
PPBP | +Platelet | +
With the current clustering results, there are 10 clusters.
+table(Idents(pbmc))
+
+ 0 1 2 3 4 5 6 7 8 9 10 11
+934 777 662 604 465 442 224 157 98 94 54 48
+
+# IL7R, CCR7 Naive CD4+ T
+FeaturePlot(pbmc, features = c("IL7R", "CCR7"))
+
+#CD14, LYZ CD14+ Mono
+FeaturePlot(pbmc, features = c("CD14", "LYZ"))
+
+## IL7R, S100A4 Memory CD4+
+FeaturePlot(pbmc, features = c("IL7R", "S100A4"))
+
+# MS4A1 B
+FeaturePlot(pbmc, features = c("MS4A1"))
+
+# CD8A CD8+ T
+FeaturePlot(pbmc, features = c("CD8A"))
+
+# FCGR3A, MS4A7 FCGR3A+ Mono
+FeaturePlot(pbmc, features = c("FCGR3A", "MS4A7"))
+
+# GNLY, NKG7 NK
+FeaturePlot(pbmc, features = c("GNLY", "NKG7"))
+
+#FCER1A, CST3 DC
+FeaturePlot(pbmc, features = c("FCER1A", "CST3"))
+
+# PPBP Platelet
+FeaturePlot(pbmc, features = c("PPBP"))
+
+# PPBP Platelet
+FeaturePlot(pbmc, features = c("ITGB1"))
+
+## Cluster 0 and cluster 6 expresses both IL7R and CCR7
+pbmc = RenameIdents(pbmc, "0"="Naive CD4+ T")
+pbmc = RenameIdents(pbmc, "6"="Naive CD4+ T")
+## Cluster 1 expresses both IL7R and S100A4 (Memory CD4+)
+## We manually annotate cluster 1 as "Memory CD4+"
+pbmc = RenameIdents(pbmc, "1"="Memory CD4+")
+
+# The cell includes partially completed steps,
+# and you will need to complete the manual cell annotation
+# section and submit your completed notebook.
+# Detailed explanation of the logic on cell type annotations should be added.
+# Please ignore clusters 8 and 10 for now.
+
+DimPlot(pbmc, reduction = "umap",
+ label = TRUE, pt.size = 0.5,
+ repel = TRUE) + NoLegend()
+
+saveRDS(pbmc, file = "Seurat_object_pbmc_final.rds")
+
+remotes::install_github("10xGenomics/loupeR")
+loupeR::setup()
+
+Skipping install of 'loupeR' from a github remote, the SHA1 (a169417e) has not changed since last install.
+ Use `force = TRUE` to force installation
+
+library(loupeR)
+
+create_loupe_from_seurat(pbmc,
+ output_name = "Seurat_object_pbmc_cloupe",
+ force = TRUE)
+
+
+2024/11/21 12:28:28 extracting matrix, clusters, and projections
+
+2024/11/21 12:28:28 selected assay: RNA
+
+2024/11/21 12:28:28 selected clusters: active_cluster orig.ident RNA_snn_res.0.2 seurat_clusters singleR_hpca singleR_blueprint
+
+2024/11/21 12:28:28 selected projections: umap
+
+2024/11/21 12:28:28 validating count matrix
+
+2024/11/21 12:28:29 validating clusters
+
+2024/11/21 12:28:29 validating projections
+
+2024/11/21 12:28:29 creating temporary hdf5 file: /tmp/Rtmp2EFvyF/file144170cdcf6.h5
+
+2024/11/21 12:28:33 invoking louper executable
+
+2024/11/21 12:28:33 running command: "/root/.local/share/R/loupeR/louper create --input='/tmp/Rtmp2EFvyF/file144170cdcf6.h5' --output='/content/Seurat_object_pbmc_cloupe.cloupe' --force"
+
+https://monashbioinformaticsplatform.github.io/Single-Cell-Workshop/pbmc3k_tutorial.html
+https://bioinformatics.ccr.cancer.gov/docs/getting-started-with-scrna-seq/IntroToR_Seurat/
+https://hbctraining.github.io/scRNA-seq/lessons/elbow_plot_metric.html
+ +used | (Mb) | gc trigger | (Mb) | max used | (Mb) | |
---|---|---|---|---|---|---|
Ncells | 10810136 | 577.4 | 19318554 | 1031.8 | 14556029 | 777.4 |
Vcells | 127080778 | 969.6 | 301745553 | 2302.2 | 301081234 | 2297.1 |
orig.ident | nCount_RNA | nFeature_RNA | |
---|---|---|---|
<fct> | <dbl> | <int> | |
AAACCTGAGACAATAC-1 | 300min | 1630 | 803 |
AAACCTGAGACACTAA-1 | 300min | 3147 | 1365 |
AAACCTGAGACGCTTT-1 | 300min | 892 | 586 |
AAACCTGAGAGGGCTT-1 | 300min | 1666 | 1033 |
orig.ident | nCount_RNA | nFeature_RNA | |
---|---|---|---|
<chr> | <dbl> | <int> | |
AAACCTGAGACAATAC-1 | 300min | 1630 | 803 |
AAACCTGAGACACTAA-1 | 300min | 3147 | 1365 |
AAACCTGAGACGCTTT-1 | 300min | 892 | 586 |
AAACCTGAGAGGGCTT-1 | 300min | 1666 | 1033 |
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene | |
---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> | <chr> | |
F36H1.11 | 0.000000e+00 | 5.450659 | 0.570 | 0.048 | 0.000000e+00 | 0 | F36H1.11 |
egl-21 | 1.563079e-282 | 3.984563 | 0.577 | 0.058 | 3.123813e-278 | 0 | egl-21 |
T10B5.4 | 1.850416e-276 | 3.373848 | 0.792 | 0.138 | 3.698057e-272 | 0 | T10B5.4 |
madd-4 | 1.841569e-242 | 4.579571 | 0.420 | 0.030 | 3.680376e-238 | 0 | madd-4 |
F28E10.1 | 8.224195e-225 | 3.734330 | 0.746 | 0.167 | 1.643605e-220 | 0 | F28E10.1 |
C31H5.5 | 8.048185e-222 | 4.466041 | 0.411 | 0.034 | 1.608430e-217 | 0 | C31H5.5 |