An error with GTDB-Tk v2.2.5 (TypeError after skipping of ANI classification step) #493

Closed
ilnamkang opened this issue Mar 17, 2023 · 4 comments
Labels
error Help required for a GTDB-Tk error. next version Upcoming feature/fix in staging branch.


@ilnamkang

Environment

  • Installed via pip

Server information

  • RAM: 512 G
  • OS: Ubuntu 18.04

Hi,

I encountered an error with GTDB-Tk v2.2.5 when skipping the ANI classification step.

My command was:
$ gtdbtk classify_wf --genome_dir Input -x fa --out_dir GTDB --cpus 72 --pplacer_cpus 64 --full_tree --skip_ani_screen

An excerpt of gtdbtk.log follows.

If this error is not from a bug, would you let me know how I can avoid this error?

-----
[2023-03-17 20:07:41] TASK: Placing 319 bacterial genomes into reference tree with pplacer using 64 CPUs (be patient).
[2023-03-17 20:07:41] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2023-03-17 21:20:45] INFO: Calculating RED values based on reference tree.
[2023-03-17 21:21:01] TASK: Traversing tree to determine classification method.
[2023-03-17 21:21:01] INFO: ANI classification has been skipped (--genes option used).
[2023-03-17 21:21:01] ERROR: Uncontrolled exit resulting from an unexpected error.

================================================================================
EXCEPTION: TypeError
MESSAGE: stat: path should be string, bytes, os.PathLike or integer, not dict
________________________________________________________________________________

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/gtdbtk/main.py", line 101, in main
    gt_parser.parse_options(args)
  File "/usr/local/lib/python3.8/dist-packages/gtdbtk/main.py", line 1172, in parse_options
    self.classify(options,all_classified_ani= all_classified_ani)
  File "/usr/local/lib/python3.8/dist-packages/gtdbtk/main.py", line 587, in classify
    reports = classify.run(genomes=genomes,
  File "/usr/local/lib/python3.8/dist-packages/gtdbtk/classify.py", line 661, in run
    class_level_classification, classified_user_genomes,warning_counter = self._parse_tree(tree_to_process, genomes, msa_dict, percent_multihit_dict,
  File "/usr/local/lib/python3.8/dist-packages/gtdbtk/classify.py", line 1210, in _parse_tree
    class_level_classification,warning_counter = self._classify_red_topology(tree, msa_dict, percent_multihit_dict,
  File "/usr/local/lib/python3.8/dist-packages/gtdbtk/classify.py", line 850, in _classify_red_topology
    user_genome_ids = set(read_fasta(user_msa_file).keys())
  File "/usr/local/lib/python3.8/dist-packages/gtdbtk/biolib_lite/seq_io.py", line 48, in read_fasta
    if not os.path.exists(fasta_file):
  File "/usr/lib/python3.8/genericpath.py", line 19, in exists
    os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not dict
================================================================================
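
The traceback above ends in `os.path.exists()` receiving a dict instead of a file path. A minimal sketch of that failure mode, independent of GTDB-Tk, is shown here; the `user_msa` value and the `path_exists_error` helper are hypothetical stand-ins, not names or data from the GTDB-Tk code base.

```python
import os

# Hypothetical stand-in for the value that reached read_fasta() in the
# traceback: a dict of sequences rather than a path to a FASTA file.
user_msa = {"genome_1": "MKV-LA"}


def path_exists_error(path):
    """Return the TypeError message raised by os.path.exists(), or None."""
    try:
        os.path.exists(path)
    except TypeError as exc:
        # os.path.exists() swallows OSError and ValueError, but a wrong
        # argument type propagates to the caller as a TypeError.
        return str(exc)
    return None


message = path_exists_error(user_msa)
print(message)
```

On the Python 3.8 interpreter shown in the log, the printed message should match the `MESSAGE:` line of the report, which suggests the crash comes from the argument type rather than from a missing file.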

@ilnamkang ilnamkang added the error Help required for a GTDB-Tk error. label Mar 17, 2023
@ilnamkang
Author

This error was likely introduced in v2.2.5.

I downgraded GTDB-Tk to v2.2.4, which ran successfully without any errors with the same input.

pchaumeil added a commit that referenced this issue Mar 20, 2023
- issue #493 is fixed for --full_tree
- fix issue with mash_db: we regenerate the genome path of rep genomes and mash_db.msh to reflect the current filesystem.
@pchaumeil
Collaborator

Thanks for your feedback. We will release a new version of GTDB-Tk in the coming days to patch this issue.

@pchaumeil pchaumeil added the next version Upcoming feature/fix in staging branch. label Mar 20, 2023
@GabeAl

GabeAl commented Mar 22, 2023

I'm not running --full_tree or --skip_ani_screen, and I get the same error with 2.2.5.

$ gtdbtk classify_wf --cpus 16 -x fa --genome_dir splits --out_dir classy --pplacer_cpus 16 --mash_db mashy

[2023-03-22 03:22:30] INFO: GTDB-Tk v2.2.5
[2023-03-22 03:22:30] INFO: gtdbtk classify_wf --cpus 16 -x fa --genome_dir splits --out_dir classy --pplacer_cpus 16 --mash_db mashy
[2023-03-22 03:22:30] INFO: Using GTDB-Tk reference data version r207: /gdat/db/gtdbtk/release207_v2
[2023-03-22 03:22:30] INFO: Loading reference genomes.
[2023-03-22 03:22:30] INFO: Using Mash version 2.3
[2023-03-22 03:22:30] INFO: Creating Mash sketch file: classy/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2023-03-22 03:22:42] INFO: Completed 1,625 genomes in 11.93 seconds (136.18 genomes/second).
[2023-03-22 03:22:42] INFO: Creating Mash sketch file: mashy.msh                    
[2023-03-22 03:34:42] INFO: Completed 65,703 genomes in 12.00 minutes (5,476.91 genomes/minute).
[2023-03-22 03:34:42] INFO: Calculating Mash distances.                              
[2023-03-22 03:43:49] INFO: Calculating ANI with FastANI v1.32.
[2023-03-22 04:13:27] INFO: Completed 41,972 comparisons in 29.59 minutes (1,418.38 comparisons/minute).
[2023-03-22 04:14:20] INFO: Summary of results saved to: classy/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv
[2023-03-22 04:14:20] INFO: 892 genome(s) have been classified using the ANI pre-screening step.
[2023-03-22 04:14:20] INFO: Done.
[2023-03-22 04:14:20] INFO: Identifying markers in 733 genomes with 16 threads.
[2023-03-22 04:14:20] TASK: Running Prodigal V2.6.3G to identify genes.
[2023-03-22 04:35:29] INFO: Completed 733 genomes in 21.15 minutes (34.67 genomes/minute).
[2023-03-22 04:35:29] TASK: Identifying TIGRFAM protein families.                
[2023-03-22 04:38:38] INFO: Completed 733 genomes in 3.15 minutes (232.44 genomes/minute).
[2023-03-22 04:38:38] TASK: Identifying Pfam protein families.                   
[2023-03-22 04:38:49] INFO: Completed 733 genomes in 10.86 seconds (67.52 genomes/second).
[2023-03-22 04:38:49] INFO: Annotations done using HMMER 3.3.2 (Nov 2020).       
[2023-03-22 04:38:49] TASK: Summarising identified marker genes.
[2023-03-22 04:39:01] INFO: Completed 733 genomes in 11.82 seconds (62.01 genomes/second).
[2023-03-22 04:39:01] INFO: Done.
[2023-03-22 04:39:02] INFO: Aligning markers in 733 genomes with 16 CPUs.
[2023-03-22 04:39:02] INFO: Processing 731 genomes identified as bacterial.
[2023-03-22 04:39:09] INFO: Read concatenated alignment for 62,291 GTDB genomes.
[2023-03-22 04:39:09] TASK: Generating concatenated alignment for each marker.
[2023-03-22 04:39:11] INFO: Completed 731 genomes in 0.82 seconds (894.14 genomes/second).
[2023-03-22 04:39:12] TASK: Aligning 120 identified markers using hmmalign 3.3.2 (Nov 2020).
[2023-03-22 04:39:25] INFO: Completed 120 markers in 12.47 seconds (9.62 markers/second).
[2023-03-22 04:39:26] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2023-03-22 04:41:36] INFO: Completed 63,022 sequences in 2.17 minutes (29,105.35 sequences/minute).
[2023-03-22 04:41:36] INFO: Masked bacterial alignment from 41,084 to 5,036 AAs.
[2023-03-22 04:41:36] INFO: 0 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2023-03-22 04:41:36] INFO: Creating concatenated alignment for 63,022 bacterial GTDB and user genomes.
[2023-03-22 04:41:55] INFO: Creating concatenated alignment for 731 bacterial user genomes.
[2023-03-22 04:41:55] INFO: Processing 2 genomes identified as archaeal.
[2023-03-22 04:41:56] INFO: Read concatenated alignment for 3,412 GTDB genomes.
[2023-03-22 04:41:56] TASK: Generating concatenated alignment for each marker.
[2023-03-22 04:41:59] INFO: Completed 2 genomes in 0.14 seconds (14.57 genomes/second).
[2023-03-22 04:41:59] TASK: Aligning 43 identified markers using hmmalign 3.3.2 (Nov 2020).
[2023-03-22 04:42:01] INFO: Completed 43 markers in 0.87 seconds (49.41 markers/second).
[2023-03-22 04:42:02] TASK: Masking columns of archaeal multiple sequence alignment using canonical mask.
[2023-03-22 04:42:06] INFO: Completed 3,414 sequences in 4.11 seconds (830.05 sequences/second).
[2023-03-22 04:42:06] INFO: Masked archaeal alignment from 13,540 to 10,153 AAs.
[2023-03-22 04:42:06] INFO: 0 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA.
[2023-03-22 04:42:06] INFO: Creating concatenated alignment for 3,414 archaeal GTDB and user genomes.
[2023-03-22 04:42:08] INFO: Creating concatenated alignment for 2 archaeal user genomes.
[2023-03-22 04:42:08] INFO: Done.
[2023-03-22 04:42:08] TASK: Placing 2 archaeal genomes into reference tree with pplacer using 16 CPUs (be patient).
[2023-03-22 04:42:08] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2023-03-22 04:45:20] INFO: Calculating RED values based on reference tree.                           
[2023-03-22 04:45:21] TASK: Traversing tree to determine classification method.
[2023-03-22 04:45:21] INFO: ANI classification has been skipped (--genes option used).
[2023-03-22 04:45:21] ERROR: Uncontrolled exit resulting from an unexpected error.

================================================================================
EXCEPTION: TypeError
  MESSAGE: stat: path should be string, bytes, os.PathLike or integer, not dict
________________________________________________________________________________

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/__main__.py", line 101, in main
    gt_parser.parse_options(args)
  File "/home/ubuntu/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/main.py", line 1172, in parse_options
    self.classify(options,all_classified_ani= all_classified_ani)
  File "/home/ubuntu/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/main.py", line 587, in classify
    reports = classify.run(genomes=genomes,
  File "/home/ubuntu/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/classify.py", line 661, in run
    class_level_classification, classified_user_genomes,warning_counter = self._parse_tree(tree_to_process, genomes, msa_dict, percent_multihit_dict,
  File "/home/ubuntu/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/classify.py", line 1210, in _parse_tree
    class_level_classification,warning_counter = self._classify_red_topology(tree, msa_dict, percent_multihit_dict,
  File "/home/ubuntu/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/classify.py", line 850, in _classify_red_topology
    user_genome_ids = set(read_fasta(user_msa_file).keys())
  File "/home/ubuntu/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/biolib_lite/seq_io.py", line 48, in read_fasta
    if not os.path.exists(fasta_file):
  File "/home/ubuntu/miniconda3/envs/gtdbtk/lib/python3.8/genericpath.py", line 19, in exists
    os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not dict
================================================================================

This is also fixed after I replaced mash.py and classify.py with the versions from the staging branch 👍
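
For readers who hit the same `TypeError` in their own scripts, one defensive pattern is to accept either a FASTA path or an already-parsed dict of sequences when collecting genome IDs. The sketch below illustrates that idea only; it is not the change made in the staging branch, and `read_fasta_ids` and `user_genome_ids` are hypothetical helpers rather than GTDB-Tk functions.

```python
import os
import tempfile
from typing import Dict, Set, Union


def read_fasta_ids(fasta_file) -> Set[str]:
    """Hypothetical minimal reader: collect sequence IDs from FASTA headers."""
    ids = set()
    with open(fasta_file) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # ignore empty header lines
                    ids.add(fields[0])
    return ids


def user_genome_ids(user_msa: Union[str, os.PathLike, Dict[str, str]]) -> Set[str]:
    """Return genome IDs from a FASTA path or from a dict of ID -> sequence."""
    if isinstance(user_msa, dict):
        # Already parsed: the keys are the genome IDs.
        return set(user_msa.keys())
    if isinstance(user_msa, (str, os.PathLike)):
        return read_fasta_ids(user_msa)
    raise TypeError(f"expected a path or a dict, got {type(user_msa).__name__}")


# Usage: both input forms give the same set of IDs.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "user_msa.fasta")
    with open(fasta_path, "w") as handle:
        handle.write(">genome_1\nMKV\n>genome_2\nMRT\n")
    ids_from_path = user_genome_ids(fasta_path)

ids_from_dict = user_genome_ids({"genome_1": "MKV", "genome_2": "MRT"})
print(ids_from_path == ids_from_dict)
```

Checking the type up front turns an opaque `stat:` error deep in the standard library into an explicit message at the call site.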

@pchaumeil
Collaborator

pchaumeil commented Apr 4, 2023

Solved with 2.2.6
