HTStream Validation

PAPER: https://docs.google.com/document/d/1YMfAflWbbfXeZvZ2MNUlwyH6blz8bNIFIZq32S5zvUo/edit?usp=sharing

TODO:

- HTS std call: have stats after every step
- Add more datasets (mRNA-seq is looking pretty clean)
- Double-check adapter trimmer reduction for all file types. Library prep?
- Update HTStream version
- Fix master_parse.sh; it needs to work for all datasets
- Ask about the library prep part to put in the paper
- Add info about memory and time
- Produce a pipeline for each of DNA, RNA, amplicon, and SE (no Super Deduper for SE)
- Show that the algorithms do what they are supposed to do; some are straightforward
- Experimental validation: record parameters and use them to show consistency (MDS plot); statistic for each tool -> info about sample
- Make sure all statements about HTS are applicable to Nanopore/PacBio as well
- Use HTStream before and after each tool (like the other ones)
- Add overlapper to RNA-seq methods
- Detailing applications is boring; intro: some philosophy > apps and what they are meant to do
- Discussion > impact, QA/QC
- Get the BAM file to spit out the name of the gene and isolate it from the non-processed file
- Add the stuff from the github.io pages for s4hts

METHODS:

In the `jupyter_notebooks` and `r_analysis` directories

Executables for analyzing data in the `rna_phix` output directories

`multiqc_data` for storing multiqc reports

More info on the relevant report.

In the `rna_phix` directory

1. ena/SAMPLES (from datasets.txt/phix_datasets.txt) -> runmaster.py runs hts_master.slurm ${type} ${datasets_file}
   - python runmaster.py phix AND/OR
   - python runmaster.py rna
   - output in 01-HTS_preproc
2. Clean up files, since the array doesn't match for phix and rna (whoops). Creates the samples.txt and phix_samples.txt files and tells you the array size needed for step 3 …
   - ./post_hts.sh
3. STAR alignment for rna type
   - adjust array based on output of post_hts.sh
   - sbatch star_proc.slurm
   - ./master_parse.sh should be run; it will call parse_output.py for each of the files to get the .json files for each alignment
   - TODO: fix parse_output.py and see how all json files get to the output directory
   - jupyter notebook analysis for this
4. BWA-MEM alignment for phix type (seq screener)
   - adjust array based on output of post_hts.sh
   - sbatch phix_proc.slurm
   - TODO: something for getting the flagstat stuff
5. Adapter eval.py? (using some bbmap scripts)
   - ./randomreads.sh
   - ./addadapters.sh
6. Deduper eval.py/.R? (deduper but needed for overall methodology talk)
7. Primer eval.py (same as adapter eval?)
8. Overlapper (see `Overlapper.md`)
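
The flagstat TODO in the BWA-MEM step could be covered by something like the sketch below. It is not a script from this repository: the function names `parse_flagstat` and `mapped_fraction` are hypothetical, and it assumes the usual `samtools flagstat` text layout (`<passed> + <failed> <label>`), which may differ slightly between samtools versions.

```python
import re

# One flagstat line looks like: "1900 + 0 mapped (95.00% : N/A)"
_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Return {label: QC-passed count} for each recognised flagstat line."""
    counts = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue
        # Drop a trailing parenthetical such as "(95.00% : N/A)".
        label = re.sub(r"\s*\(.*\)$", "", match.group(3))
        # If two lines reduce to the same label, keep the first one.
        counts.setdefault(label, int(match.group(1)))
    return counts


def mapped_fraction(text):
    """Fraction of reads that mapped (e.g. to PhiX for the seq screener check)."""
    counts = parse_flagstat(text)
    total = counts["in total"]
    return counts["mapped"] / total if total else 0.0
```

Feeding it the captured output of `samtools flagstat` for each phix sample would give one mapped fraction per sample to compare before and after screening.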

CHECKLIST OF APPS/MODULES TESTED

X = done, & = not done

X - adaptereval (Adapter eval above)
& # - deduper (big one!)
X - qtrimmer (STAR alignment above) (multiqc report for the effect on the reads to double check) (maybe deduper noise to …
X - ntrimmer (same as qtrimmer)
X - polyatrim (same as ntrimmer)
X - seqscreener (BWA-MEM alignment)

Extra notes from old markdown

`Overlapper.md` contains some general methods for overlapper evaluation/comparison
- TODO: update with new datasets (multiple)
- Need a gold standard for this dataset, or just go based on mapping like QTrimmer and Adapter Trimmer
`NTrimmer_QTrimmer.md` contains general methods for evaluating the efficiency of NTrimmer and quality trimming.
- TODO: finalize the datasets; dig up code again

    # Print each STAR Log.final.out file name followed by its contents
    for i in SRR6048806_/SRR6048806__Log.final.out; do echo "$i"; cat "$i"; done
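
Rather than reading the STAR logs by eye, the key/value lines could be collected into JSON. The sketch below is an assumption about how that might look, not the repository's parse_output.py: the names `parse_star_log` and `to_json` are hypothetical, and it relies only on STAR's `Log.final.out` convention of separating each field name from its value with a `|`.

```python
import json


def parse_star_log(text):
    """Collect 'name | value' lines from a STAR Log.final.out into a dict.

    Section headers such as 'UNIQUE READS:' contain no '|' and are skipped.
    Values are kept as strings (e.g. '90.00%') so nothing is lost.
    """
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def to_json(text):
    """Serialise the parsed log so it can be loaded in a notebook later."""
    return json.dumps(parse_star_log(text), indent=2)
```

Keeping the values as strings is a deliberate choice here: percentages and timestamps can be converted later, once it is clear which fields the analysis needs.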

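    # First-in-pair reads only (-f 64), excluding secondary and supplementary alignments (-F 2304)
    samtools view -f 64 -F 2304 method1.bam | cut -f1,3,4 | LC_ALL=C sort -t $'\t' -k1,1 > method1.txt
    samtools view -f 64 -F 2304 method2.bam | cut -f1,3,4 | LC_ALL=C sort -t $'\t' -k1,1 > method2.txt
    # Join on read name; both inputs must be sorted in the same (C) locale
    LC_ALL=C join -t $'\t' -1 1 -2 1 method1.txt method2.txt > R1.comparaison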
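
Once the two methods are joined on read name, the placements can be tallied. The following is a minimal sketch under the assumption that each joined row has five tab-separated fields (read name, then reference and position for method 1, then reference and position for method 2); `read_joined` and `compare_placements` are hypothetical names, not scripts from this repository.

```python
import csv
from collections import Counter


def read_joined(path):
    """Yield rows of the tab-separated join output, one list per read."""
    with open(path, newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


def compare_placements(rows):
    """Count how often the two methods place R1 at the same location.

    Reads unmapped by both methods (reference '*', position 0) compare
    equal and are therefore counted under 'same'.
    """
    tally = Counter()
    for _name, ref1, pos1, ref2, pos2 in rows:
        if ref1 == ref2 and pos1 == pos2:
            tally["same"] += 1
        elif ref1 == ref2:
            tally["same_ref_different_pos"] += 1
        else:
            tally["different_ref"] += 1
    return tally
```

For example, `compare_placements(read_joined("R1.comparaison"))` would summarise the file written by the `join` command. Note that `join` only emits reads present in both inputs, so reads missing from one method are not counted here.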

    library(Rsubread)

    # `bams`, `gff`, and `ann` are assumed to be defined earlier in the session
    forward_stranded_counts = featureCounts(
        bams,
        annot.ext = gff,
        isGTFAnnotationFile = TRUE,
        GTF.attrType = 'ID',
        GTF.featureType = 'gene',
        minOverlap = 27,
        allowMultiOverlap = TRUE,
        countMultiMappingReads = TRUE,
        strandSpecific = 1,      # Only reads where R1 is forward W.R.T. the transcript are counted
        isPairedEnd = TRUE,
        nthreads = 7,
        useMetaFeatures = TRUE   # Counts should be summarized by gene
    )

    # Keep the count matrix as a matrix even when there is a single BAM
    gcounts = forward_stranded_counts$counts[, 1:ncol(forward_stranded_counts$counts), drop = FALSE]
    # Strip the literal '.bam' from the sample names
    colnames(gcounts) = gsub('.bam', '', colnames(gcounts), fixed = TRUE)
    fscann = cbind(ann[match(rownames(gcounts), ann$gene), ], gcounts)
    write.table(fscann, file = 'forward_stranded_read_counts_by_gene.tsv', sep = '\t', row.names = FALSE, col.names = TRUE)

gtf -> featureCounts
https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts
htseq (Bradley's project)
Run hts for the phix dataset for Super Deduper to look at the multiqc report
should go up and flatten and curve
Check the Super Deduper code
Doing mapping for the RNA-seq workshop
Suggestion: Post Papers to Workshop on Slack, so we can create separate threads instead of on Zulip.
