Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to run MAFFT #204

Closed
poddarharsh15 opened this issue Apr 2, 2024 · 9 comments
Closed

Unable to run MAFFT #204

poddarharsh15 opened this issue Apr 2, 2024 · 9 comments

Comments

@poddarharsh15
Copy link

Hello @ACEnglish
I encountered an error while running Truvari refine with MAFFT, as indicated in the reference below. Despite this issue, I was able to complete the process.
Could you please provide some advice on how to improve my analysis and address the error with MAFFT? Any guidance or recommendations you can offer would be greatly appreciated.
Thank you for your assistance.

truvari refine --regions ~/HG002_WGS_150/no_duplicates/140616_D00360_0028_AHA2RLADXX/bwamem/dysgu/candidate.refine.bed --use-original-vcfs --reference /home/tigem/h.poddar/short_reads_pipeline/reference_genome/GRCh37/GRCh37.p13.genome.fa ~/HG002_WGS_150/no_duplicates/140616_D00360_0028_AHA2RLADXX/bwamem/dysgu/ -t 30 --use-region-coords --align mafft

2024-04-02 11:43:07,577 [INFO] Truvari v4.2.2
2024-04-02 11:43:07,577 [INFO] Command /home/tigem/h.poddar/anaconda3/bin/truvari refine --regions /home/tigem/h.poddar/HG002_WGS_150/no_duplicates/140616_D00360_0028_AHA2RLADXX/bwamem/dysgu/candidate.refine.bed --use-original-vcfs --reference /home/tigem/h.poddar/short_reads_pipeline/reference_genome/GRCh37/GRCh37.p13.genome.fa /home/tigem/h.poddar/HG002_WGS_150/no_duplicates/140616_D00360_0028_AHA2RLADXX/bwamem/dysgu/ -t 30 --use-region-coords --align mafft
2024-04-02 11:43:07,577 [INFO] Params:
{
"benchdir": "/home/tigem/h.poddar/HG002_WGS_150/no_duplicates/140616_D00360_0028_AHA2RLADXX/bwamem/dysgu/",
"reference": "/home/tigem/h.poddar/short_reads_pipeline/reference_genome/GRCh37/GRCh37.p13.genome.fa",
"regions": "/home/tigem/h.poddar/HG002_WGS_150/no_duplicates/140616_D00360_0028_AHA2RLADXX/bwamem/dysgu/candidate.refine.bed",
"use_original_vcfs": true,
"use_region_coords": true,
"recount": false,
"threads": 30,
"align": "mafft",
"mafft_params": "--auto --thread 1",
"debug": false
}
2024-04-02 11:43:07,577 [INFO] Setting up regions
2024-04-02 11:43:08,119 [INFO] 5878 --regions reduced to 5878 after intersecting with 34830 from --includebed
2024-04-02 11:43:08,120 [INFO] Extending the regions by 100 bases
2024-04-02 11:43:09,284 [INFO] 711 regions to be refined
2024-04-02 11:43:09,287 [INFO] Preparing regions
2024-04-02 11:43:09,292 [INFO] Extracting haplotypes
2024-04-02 11:43:09,913 [INFO] Harmonizing variants
2024-04-02 11:43:10,129 [ERROR] Unable to run MAFFT
2024-04-02 11:43:10,130 [ERROR]
2024-04-02 11:43:10,132 [ERROR] Unable to run MAFFT
.............................................................................................

2024-04-02 11:44:05,019 [INFO] Running bench
2024-04-02 11:44:05,035 [INFO] Including 711 bed regions
2024-04-02 11:44:05,242 [INFO] Zipped 11608 variants Counter({'base': 5804, 'comp': 5804})
2024-04-02 11:44:05,242 [INFO] 399 chunks of 11608 variants Counter({'__filtered': 11208, 'base': 396, 'comp': 4})
2024-04-02 11:44:05,454 [INFO] Finished refine

ACEnglish added a commit that referenced this issue Apr 3, 2024
should throw more informative errors. Allows debugging for #204
@ACEnglish
Copy link
Owner

I have put in a commit to develop that should help that blank log line after "Unable to run MAFFT" inform you about what error mafft encountered. Please run from develop and see if that give you helpful additional information for debugging.

@poddarharsh15
Copy link
Author

Hi @ACEnglish
Apologies, I lack experience in running a debug from the development environment. However, I can attempt to run it from my terminal to gather more information.
The blank lines were not generated by truvari; they continued with "UNABLE TO RUN MAFFT" and ended with "Running bench."
Screenshot from 2024-04-03 16-44-29

@ACEnglish
Copy link
Owner

Building directly from develop is documented in the (wiki). Alternatively, you can use pip install git+https://github.com/acenglish/truvari@a8dbd3066f7572e9b7487c470485de34930eee12 to directly install from the current head of the development branch.

@poddarharsh15
Copy link
Author

Hi @ACEnglish, I ran a debug on the development branch, and the outputs indicated the following:

2024-04-04 11:51:40,541 [DEBUG] Comparing chunk 385
2024-04-04 11:51:40,541 [DEBUG] All FN -> False 0 ->
chr9 113491624 . A ACCATCCTGGCTAACAAGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGCGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCCGCAGTCCGGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAATAGCTGGAAATTAAATTTGTATGGGAAAAATTCTACTTTTTCCCCAACAGTTTTCTGTCAACAGTTTCTGTTTTCTGTAAAAAAAAAAAAAAAAAA . . . GT 0/1 0/0
None
2024-04-04 11:51:40,542 [DEBUG] Comparing chunk 386
2024-04-04 11:51:40,542 [DEBUG] All FN -> False 0 ->
chr9 114868921 . G GAGCTCTTTGGAGAGATAATACATGTTAAAAATAAGAAACTGGTAATATAATGCTATACTAAACCAATAACCTAAGAGAAGAGAGAGGGGAAGAAATAAAAGGAAGTTTTAAAAATAAAAGGAGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCGGCTAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCC . . . GT 0/1 0/0
None
2024-04-04 11:51:40,542 [DEBUG] Comparing chunk 387
2024-04-04 11:51:40,542 [DEBUG] All FP -> False 0 ->
None chr9 121869848 . C CGTGTATATATATATATACACACACGTGTATATATATATATACACACACGTG . . . GT 0/0 1/1

2024-04-04 11:51:40,544 [DEBUG] Comparing chunk 388
2024-04-04 11:51:40,544 [DEBUG] All FN -> False 0 ->
chr9 130644938 . G GGGAAGGGAAGAAGGGAAGAAGGGAAGGGAAGAGGGAAGGGAAGAGGGGAAGGGAACAAGGGAAGGAAAGAAGGGAAGGGAAGAAGGGAAGGGAAGGAAAGGGAAGGGAAGAGGGCGAGGGAAGGGAAGGAAAGAGGGGGAGGGAAGGGGGAAGAGAGGGAGGGAAGAGGGGAAGGGAAAAAAGGAAGAGAAGAAGGGAAGGGAAGAAGGAAAGAAGGGAAGGGAAGAAAGGAAGGGAAGAAGGGAAGGGAAGGAGGGAAGGGAAGGGAAGAAAGGAAGGGAAGAGAAGAAGGGAAGAAAAGAAGGGAAGGGAAGGGAAGGGAAAAAAATTCCAAGGCTTGAAAACAAGCTGGCATGGCGGTTGCAGGAGATGTGCCAGGG.. . GT 1/1 0/0
None
2024-04-04 11:51:40,544 [DEBUG] Comparing chunk 389
2024-04-04 11:51:40,544 [DEBUG] All FN -> False 0 ->
chr9 132621534 . A TTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTATTTTATTTTATTTATTTTATTATTTTTTTTAATTTTTTTAATTTTATTTTAATTTTATTTTATTTTAT . . . GT 1/1 0/0
None
2024-04-04 11:51:40,544 [DEBUG] Comparing chunk 390
2024-04-04 11:51:40,544 [DEBUG] All FN -> False 0 ->
chrX 18886932 . A AAATTCTCTTCTGATTTCTGTTGGTTTATTATTTTGCTGTTTTTTTCCCTAATTTCTTTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTTGCTCTGTCGCCCAGGCTGGAGTGCGGTGGCGC . . . GT 1/1 0/0
None
2024-04-04 11:51:40,545 [DEBUG] Comparing chunk 391
2024-04-04 11:51:40,545 [DEBUG] All FN -> False 0 ->
chrX 30133830 . T TTATTCTGTGTCTCAGTTTAGTTTCTTTTGAACCAGAAACCTAAAAAAAGGGGGGGTGGAAATTTGTCTGTTTGGTTGGCTGCTATATGTCCACCACTAGTTGATGTTAGGTATATTGCATA . . . GT 1/1 0/0
None
2024-04-04 11:51:40,545 [DEBUG] Comparing chunk 392
2024-04-04 11:51:40,545 [DEBUG] All FN -> False 0 ->
chrX 40822717 . T TAATGTATGATATATGATATGATATGATATATATCACATATCATATATCATATATGATATGATATATGATATATGATATATTATGTAATATATCATATATTATATATATATTTT . . . GT 0/1 0/0
None
2024-04-04 11:51:40,545 [DEBUG] Comparing chunk 393
2024-04-04 11:51:40,545 [DEBUG] All FN -> False 0 ->
chrX 64113287 . A GATATATATATGATATATATGATATATATATGATATATATATCATATATATGATATATATATCATATATATCATATATATCATATATATATCATATATATCATATATATCTCATATATATGATATATATA . . . GT 1/10/0
None
2024-04-04 11:51:40,546 [DEBUG] Comparing chunk 394
2024-04-04 11:51:40,546 [DEBUG] All FN -> False 0 ->
chrX 81900786 . G GAAACTTATATAACTATATAAGTTTATATAAAACTATAAGTTTATATAAACTTATAGTTTTATAT . . . GT 1/1 0/0
None
2024-04-04 11:51:40,546 [DEBUG] Comparing chunk 395
2024-04-04 11:51:40,546 [DEBUG] All FN -> False 0 ->
chrX 90320097 . A ATATATATATATGTGTATATATATATATATACACATATATATATGTGTGTATATATATATACACATATATATATGTGTATATATATATACACATATATATATATGTGTATATATATATACACATATATATATGTGTATATATATATACACATATATATATATGTGTATATATATATATATACACACACA . . . GT 1/1 0/0
None
2024-04-04 11:51:40,547 [DEBUG] Comparing chunk 396
2024-04-04 11:51:40,547 [DEBUG] All FN -> False 0 ->
chrX 109948659 . A AATGGGGGAAAAGATTCCCTATTTTATATATATATAAATATTTATATATATTATATTTTATATATATATATTTTACATATATTTATATTACATATAAATATTTATATATATATAAAATAGGGAATCTTTTCCCCCATTACTTGT . . . GT 1/1 0/0
None
2024-04-04 11:51:40,547 [DEBUG] Comparing chunk 397
2024-04-04 11:51:40,547 [DEBUG] All FN -> False 0 ->
chrX 116484619 . C CTCCTGACCTCAGATGATCCGCCCGCCTCAGCCTCTGAAAGTGCTGGGATTACAGGTGTGAGCCACCGTGCCCAGTCTGTGTTTCCTTTCAAAGGAACACGCCCAGTTGGAACTCCTCAATTTCCTCTGTTCATGTATTCATGTATTTGTCAAATAACTA . .. GT 1/1 0/0
None
2024-04-04 11:51:40,547 [DEBUG] Comparing chunk 398
2024-04-04 11:51:40,547 [DEBUG] All FN -> False 0 ->
chrX 117605151 . C CTCTGTGTCTAGCTAAAGGATTGTAAATGCACCAATCAGCACTATGTAAAAACATACCAATCAGCGT . . . GT 1/1 0/0
None
2024-04-04 11:51:40,547 [INFO] Zipped 11608 variants Counter({'base': 5804, 'comp': 5804})
2024-04-04 11:51:40,547 [INFO] 399 chunks of 11608 variants Counter({'__filtered': 11208, 'base': 396, 'comp': 4})
2024-04-04 11:51:40,547 [DEBUG] Comparing chunk 399
2024-04-04 11:51:40,547 [DEBUG] All FN -> False 0 ->
chrX 133368071 . T TAATTATATATATAATATTATATATATAATATTATATATATAATTTTATATAATATTATATATATAATATTATATATAT . . . GT 1/1 0/0
None
2024-04-04 11:51:40,772 [INFO] Finished refine

From my perspective, it seems that the VCF file isn't phased before running refine, which could be causing these errors. What are your thoughts?

@ACEnglish
Copy link
Owner

Yes, the vcf should be phased in order for refine to correctly rebuild the samples' haplotypes. However, that doesn't fully explain the "Unable to run MAFFT" error. You would need to run from the development branch and look at the logs as mentioned above to see the problem mafft is encountering.

@poddarharsh15
Copy link
Author

refine.log.txt
Here is the log file for reference there is a presence of unknown character -i

@ACEnglish
Copy link
Owner

ACEnglish commented Apr 4, 2024

Great, now we know what's happening. This give you two options. You can check if the i/e/p characters are inside your VCF or your reference sequence and correct them. These characters are 'contaminating' the haplotype that truvari is building and causing mafft to fail. Alternatively, if you'd prefer to keep these characters, you can use truvari refine --mafft-params '--auto --thread 1 --anysymbol' to force mafft to align them. But I can't promise that they won't cause issues downstream of the msa step.

@poddarharsh15
Copy link
Author

refine.log.txt
I tried the params you suggested but still it won't work I think I need to remove i/e/p characters. However I have different VCF files from another tools I will try to run MAFFT on them.

@poddarharsh15
Copy link
Author

Hi @ACEnglish I was able to remove i/e/p characters from .vcf files but now I am having a problem with invalid GT, will the problem be resolved if I add ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> this line in header of my vcf file?
Please let me know :) I appreciate your help thanks.

Screenshot from 2024-04-10 15-02-21

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants