Merge pull request #111 from ncsa/develop

Develop
ncsa · May 26, 2024 · b4fb9a3
2 parents 66ab859 + e96fd20
commit b4fb9a3

Showing 56 changed files with 1,179 additions and 43,826 deletions.
diff --git a/.github/workflows/python-app.yml b/.github/workflows/python-app.yml
@@ -17,45 +17,66 @@ jobs:
       - uses: actions/checkout@v3
       - uses: s-weigand/[email protected]
         with:
-          conda-channels: [bioconda, conda-forge]
+          conda-channels: bioconda, conda-forge
           activate-conda: true
           repository: NCSA/NEAT
       - name: Environment Setup
         run: |
           conda env create -f environment.yml -n test_neat
-          conda activate test_neat
+          source activate test_neat
           poetry install
-          cd config_template
 
       - name: Run NEAT Simulation for config_test1
-        run: python -m neat --log-level DEBUG read-simulator -c config_test1.yml -o ../outputs/test1_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test1.yml -o ../outputs/test1_read-simulator
 
       - name: Run NEAT Simulation for config_test2
-        run: python -m neat --log-level DEBUG read-simulator -c config_test2.yml -o ../outputs/test2_read-simulator
+        run: |
+          source activate test_neat 
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test2.yml -o ../outputs/test2_read-simulator
 
       - name: Run NEAT Simulation for config_test3
-        run: python -m neat --log-level DEBUG read-simulator -c config_test3.yml -o ../outputs/test3_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test3.yml -o ../outputs/test3_read-simulator
 
       - name: Run NEAT Simulation for config_test4
-        run: python -m neat --log-level DEBUG read-simulator -c config_test4.yml -o ../outputs/test4_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test4.yml -o ../outputs/test4_read-simulator
 
       - name: Run NEAT Simulation for config_test5
-        run: python -m neat --log-level DEBUG read-simulator -c config_test5.yml -o ../outputs/test5_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test5.yml -o ../outputs/test5_read-simulator
 
       - name: Run NEAT Simulation for config_test6
-        run: python -m neat --log-level DEBUG read-simulator -c config_test6.yml -o ../outputs/test6_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test6.yml -o ../outputs/test6_read-simulator
 
       - name: Run NEAT Simulation for config_test7
-        run: python -m neat --log-level DEBUG read-simulator -c config_test7.yml -o ../outputs/test7_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test7.yml -o ../outputs/test7_read-simulator
 
       - name: Run NEAT Simulation for config_test8
-        run: python -m neat --log-level DEBUG read-simulator -c config_test8.yml -o ../outputs/test8_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test8.yml -o ../outputs/test8_read-simulator
 
       - name: Run NEAT Simulation for config_test9
-        run: python -m neat --log-level DEBUG read-simulator -c config_test9.yml -o ../outputs/test9_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test9.yml -o ../outputs/test9_read-simulator
 
       - name: Run NEAT Simulation for config_test10
-        run: python -m neat --log-level DEBUG read-simulator -c config_test10.yml -o ../outputs/test10_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test10.yml -o ../outputs/test10_read-simulator
 
       - name: Run NEAT Simulation for config_test11
-        run: python -m neat --log-level DEBUG read-simulator -c config_test11.yml -o ../outputs/test11_read-simulator
+        run: |
+          source activate test_neat
+          python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test11.yml -o ../outputs/test11_read-simulator
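
The eleven steps above differ only in the test number, so the same command lines could be generated from one loop. The following is a minimal sketch, assuming the configs stay under `data/test_configs/` and outputs go to `../outputs/` as in this diff. The helper name `build_commands` and its parameters are hypothetical and not part of NEAT. The sketch only builds and prints the commands and does not run them.

```python
import shlex


def build_commands(first=1, last=11,
                   config_dir="data/test_configs",
                   output_dir="../outputs"):
    """Build one read-simulator argument list per test config.

    Hypothetical helper: the flags mirror the workflow steps in this diff,
    but the function itself is not part of NEAT.
    """
    commands = []
    for n in range(first, last + 1):
        commands.append([
            "python", "-m", "neat",
            "--log-level", "DEBUG",
            "read-simulator",
            "-c", f"{config_dir}/config_test{n}.yml",
            "-o", f"{output_dir}/test{n}_read-simulator",
        ])
    return commands


if __name__ == "__main__":
    # Print each command as a shell-safe string instead of executing it.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` inside the activated `test_neat` environment. Keeping the steps separate in the workflow, as this commit does, has the advantage that each config reports its own pass/fail status.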
diff --git a/ChangeLog.md b/ChangeLog.md
@@ -1,6 +1,16 @@
 # NEAT has a new home
 NEAT is now a part of the NCSA github and active development will continue here. Please direct issues, comments, and requests to the NCSA issue tracker. Submit pull requests here instead of the old repo.
 
+# NEAT v4.2
+- After several bug fixes that constituted release 4.1 and some minor releases, we are ready to release an overhauled version of NEAT 4.0.
+- Removed GC bias - it had little to no effect and made implementation nearly impossible
+- Removed fasta creation - we had tweaked this a bit but never got any feedback. It may come back if requested.
+- Improvements/fixes/full implementations of:
+  - heterozygosity
+  - read creation (now with more reads!)
+  - bam alignment/creation
+  - bed tool incorporation
+
 -Updated "master" branch to "main." - please update your repo accordingly
 # NEAT v4.0
 - Rewritten the models. Models generated on old versions of NEAT will have to be redone, due to the restructuring of the codebase. These new models should be smaller and more efficient. We have replicated the previous default models in the new style. There is no straightforward way to convert between these, unfortunately.

diff --git a/README.md b/README.md
@@ -1,13 +1,13 @@
 # The NEAT Project v4.0
-Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.0. This is our first (beta) release of the newest version of NEAT. There is still lots of work to be done. See the [ChangeLog](ChangeLog.md) for notes.
+Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.2. This release of NEAT includes several fixes and a little bit of restructuring. There is still lots of work to be done. See the [ChangeLog](ChangeLog.md) for notes. We have discarded the fasta file writing for now and removed that code. We may add that in as a feature in the future, if users call for it. We also removed GC bias for now. It severely complicated implementation, and had very few noticeable effects. After discussing with some people at the Illinois Institute for Genomic Biology, it sounded like GC bias may be a bit of a non-factor with improved chemistries. These will be reintroduced if needed/called for. 
 
 We are also working on redeveloping NEAT in Rust, a memory and thread safe language that will lend itself well to the way NEAT works, check that out here: https://github.com/ncsa/rusty-neat
 
 Stay tuned over the coming weeks for exciting updates to NEAT, and learn how to [contribute](CONTRIBUTING.md) yourself. If you'd like to use some of our code, no problem! Just review the [license](LICENSE.md), first.
 
 NEAT's read-simulator is a fine-grained read simulator. It simulates real-looking data using models learned from specific datasets. There are several supporting utilities for generating models used for simulation and for comparing the outputs of alignment and variant calling to the golden BAM and golden VCF produced by NEAT.
 
-This is release v4.0 of the software. While it has been tested, it does represent a shift in the software with the introduction of a configuration file. For a stable release using the old command line interface, please see: [NEAT 3.0](https://github.com/ncsa/NEAT/releases/tag/3.3) (or check out older tagged releases)
+This is release v4.2 of the software. While it has been tested, it does represent a shift in the software with the introduction of a configuration file. For a stable release using the old command line interface, please see: [NEAT 3.0](https://github.com/ncsa/NEAT/releases/tag/3.3) (or check out older tagged releases)
 
 To cite this work, please use:
 
@@ -31,7 +31,6 @@ Table of Contents
       * [Large single end reads](#large-single-end-reads)
       * [Parallelizing simulation](#parallelizing-simulation)
   * [Utilities](#utilities)
-    * [compute_gc_bias](#computegcbias)
     * [model_fragment_lengths](#modelfraglen)
     * [gen_mut_model](#genmutmodel)
     * [model_sequencing_error](#modelseqerror)
@@ -40,8 +39,9 @@ Table of Contents
 
 ## Requirements (the most up-to-date requirements are found in the environment.yml file)
 
+* Some version of Anaconda to set up the environment
 * Python == 3.10.*
-* poetry
+* poetry == 1.3.*
 * biopython == 1.79
 * pkginfo
 * matplotlib
@@ -71,13 +71,20 @@ the NEAT repo, after creating the conda environment:
 > poetry install
 ```
 
+Notes: If any packages are struggling to resolve, check the channels and try to manually pip install the package to see if that helps (but note that NEAT is not tested on the pip versions.)
+
 Test your install by running:
 ```
 > neat --help
 ```
 
+You can also try running it using the python command directly:
+```
+> python -m neat --help
+```
+
 ## Usage
-NEAT's core functionality is invoked using the read-simulator command. Here's the simplest invocation of read-simulator using default parameters. This command produces a single ended fastq file with reads of length 101, ploidy 2, coverage 10X, using the default sequencing substitution, GC% bias, and mutation rate models.
+NEAT's core functionality is invoked using the read-simulator command. Here's the simplest invocation of read-simulator using default parameters. This command produces a single-ended fastq file with reads of length 151, ploidy 2, coverage 10X, using the default sequencing substitution and mutation rate models.
 
 Contents of neat_config.yml
 ```
@@ -110,7 +117,6 @@ The default is given:
 
 produce_bam: False
 produce_vcf: False
-produce_fasta: False
 produce_fastq: True
 
 error_model: full path to an error model generated by NEAT. Leave empty to use default model
@@ -119,11 +125,7 @@ mutation_model: full path to a mutation model generated by NEAT. Leave empty to
     model (default model based on human data sequenced by Illumina)
 fragment_model: full path to fragment length model generate by NEAT. Leave empty to use default model
     (default model based on human data sequenced by Illumina)
-gc_model: Full path to model for correlating GC concentration and coverage, produced by NEAT.
-    (default model is based on human data, sequenced by Illumina)
 
-partition_mode: by chromosome ("chrom"), or subdivide the chromosomes ("subdivision").
-    Note: this feature is not yet fully implemented
 threads: The number of threads for NEAT to use.
     Note: this feature is not yet fully implemented
 avg_seq_error: average sequencing error rate for the sequencing machine. Use to increase or
@@ -134,20 +136,14 @@ include_vcf: full path to list of variants in vcf format to include in the simul
     appear in the input VCF into the final VCF, and the corresponding fastq and bam files, if requested.
 target_bed: full path to list of regions in bed format to target. 
     All areas outside these regions will have coverage of 0.
-off_target_scalar: manually set the off-target-scalar when using a target bed (if you want to have some percentage of 
-    reads from outside the targeted regions. Default is 0. (i.e., setting this to 0.02 would mean off-target areas will 
-    have a coverage of ~2% of the total coverage). This is an experimental feature.
 discard_bed: full path to a list of regions to discard, in BED format.
 mutation_rate: Desired rate of mutation for the dataset. Float between 0.0 and 0.3
     (default is determined by the mutation model)
 mutation_bed: full path to a list of regions with a column describing the mutation rate of that region,
     as a float with values between 0 and 0.3. The mutation rate must be in the third column as, e.g., mut_rate=0.00.
-no_coverage_bias: Set to true to produce a dataset free of coverage bias
 rng_seed: Manually enter a seed for the random number generator. Used for repeating runs. Must be an integer.
 min_mutations: Set the minimum number of mutations that NEAT should add, per contig. Default is 0. We recommend setting 
-    this to at least one for small chromosomes, so NEAT will produce at least one mutation per contig.
-fasta_per_ploid: Produce one fasta per ploid. Default behavior is to produce
-    a single fasta showing all variants.                                                                                                                                                                        |
+    this to at least one for small chromosomes, so NEAT will produce at least one mutation per contig. |
 
 The command line options for NEAT are as follows:
 
@@ -156,10 +152,9 @@ Universal options can be applied to any subfunction. The commands should come be
 |---------------------|--------------------------------------|
 | -h, --help          | Displays usage information           |
 | --no-log            | Turn off log file creation           |
-| --log-dir LOG_DIR   | Sets the log directory to custom path (default is current working directory |
-| --log-name LOG_NAME | Custom name for log file (default is timestamped) |
+| --log-name LOG_NAME | Custom name for log file, can be a full path (default is current working directory with a name starting with a timestamp)|
 | --log-level VALUE   | VALUE must be one of [DEBUG, INFO, WARN, WARNING, ERROR] - sets level of log to display |
-| --log-detal VALUE   | VALUE must be one of [LOW, MEDIUM, HIGH] - how much info to write for each log record |
+| --log-detail VALUE   | VALUE must be one of [LOW, MEDIUM, HIGH] - how much info to write for each log record |
 | --silent-mode       | Writes logs, but suppresses stdout messages |
 
 read-simulator command line options
@@ -184,9 +179,8 @@ Features:
 - Can simulate targeted sequencing via BED input specifying regions to sample from
 - Can accurately simulate large, single-end reads with high indel error rates (PacBio-like) given a model
 - Specify simple fragment length model with mean and standard deviation or an empirically learned fragment distribution using utilities/computeFraglen.py
-- Simulates quality scores using either the default model or empirically learned quality scores using utilities/fastq_to_qscoreModel.py
+- Simulates quality scores using either the default model or empirically learned quality scores using `neat gen_mut_model`
 - Introduces sequencing substitution errors using either the default model or empirically learned from utilities/
-- Accounts for GC% coverage bias using model learned from utilities/computeGC.py
 - Output a VCF file with the 'golden' set of true positive variants. These can be compared to bioinformatics workflow output (includes coverage and allele balance information)
 - Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
 - Create paired tumour/normal datasets using characteristics learned from real tumour data
@@ -288,27 +282,6 @@ neat read-simulator                 \
 # Utilities	
 Several scripts are distributed with gen_reads that are used to generate the models used for simulation.
 
-## neat compute_gc_bias
-
-Computes GC% coverage bias distribution from sample (bedrolls genomecov) data.
-Takes .genomecov files produced by BEDtools genomeCov (with -d option).
-(Not yet implemented in NEAT 4.0)
-
-```
-bedtools genomecov
-        -d                          \
-        -ibam normal.bam            \
-        -g reference.fa
-```
-
-```
-neat compute_gc_bias                \
-        -r reference.fa             \
-        -i genomecovfile            \
-        -w [sliding window length]  \
-        -o /path/to/prefix
-```
-
 ## neat model-fraglen
 
 Computes empirical fragment length distribution from sample data.