iTOL compatibility #373

Closed
valentynbez opened this issue Apr 3, 2022 · 12 comments
Labels
error Help required for a GTDB-Tk error. next version Upcoming feature/fix in staging branch.

Comments

@valentynbez
Contributor

Output trees from GTDB-Tk are not compatible with iTOL, which is a very popular tool for tree visualization. Standardizing the output format would help a lot.

@valentynbez added the label "error" (Help required for a GTDB-Tk error.) on Apr 3, 2022
pchaumeil pushed a commit that referenced this issue Apr 6, 2022
@pchaumeil
Collaborator

Hello,
A new function, 'remove_labels', will be available in the next version of GTDB-Tk. It removes all internal labels from a Newick tree, making it compatible with iTOL.

@pchaumeil added the label "next version" (Upcoming feature/fix in staging branch.) on Apr 6, 2022
@valentynbez
Contributor Author

I think bootstrap values would also be nice to have. Maybe it would be possible to give the user control over the format.
iTOL accepts bootstrap values in this form:
(A:0.1,(B:0.1,C:0.1)INT1:0.1[90])INT2:0.3[98]

  • A, B, C : leaf names
  • INT1, INT2 : internal node IDs
  • 0.1, 0.3 : branch lengths
  • 90,98 : bootstrap values

A regex could possibly be enough to change it.
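
As a rough illustration of that layout, here is a minimal Python sketch that pulls the internal node IDs, branch lengths, and bootstrap values out of such a string with a regex. The function name `internal_nodes` and the pattern are my own assumptions, not part of GTDB-Tk or iTOL. The example string follows the one above, written with balanced parentheses and a terminating `;`. The pattern only handles unquoted labels.

```python
import re

# Matches one internal node written as ")ID:length[bootstrap]".
# Assumption: labels are unquoted and contain no Newick punctuation.
INTERNAL_NODE = re.compile(r"\)([^\s():,;\[\]]+):([0-9.eE+-]+)\[([0-9.]+)\]")


def internal_nodes(newick):
    """Return (node_id, branch_length, bootstrap) for each internal node."""
    return [
        (node_id, float(length), float(bootstrap))
        for node_id, length, bootstrap in INTERNAL_NODE.findall(newick)
    ]


example = "(A:0.1,(B:0.1,C:0.1)INT1:0.1[90])INT2:0.3[98];"
nodes = internal_nodes(example)
for node_id, length, bootstrap in nodes:
    print(node_id, length, bootstrap)
```

A real Newick parser would be more robust than a regex for quoted labels or comments, so treat this only as a way to check that a file follows the layout.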

@pchaumeil
Collaborator

We have added a command to GTDB-Tk called 'convert_to_itol'. It converts a GTDB-Tk tree to an iTOL-compatible tree (in the format shown above).
This command will be available in the next release of GTDB-Tk.

@Biofarmer

Hi, may I ask how to use 'convert_to_itol' to get an iTOL tree? Or has the iTOL tree already been created in the output folder? Thanks

@pchaumeil
Collaborator

Hi,
The "convert_to_itol" command works as follows:
gtdbtk convert_to_itol --input_tree INPUT_TREE --output_tree OUTPUT_TREE

where INPUT_TREE is the path to the Newick tree generated by GTDB-Tk and OUTPUT_TREE is the path where the converted tree will be written.
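
If you want to call this from a script, a minimal Python sketch could look like the following. It only uses the command and the two flags quoted above. The helper names `convert_command` and `convert_to_itol` and the output file name are assumptions made for illustration.

```python
import shutil
import subprocess


def convert_command(input_tree, output_tree):
    """Build the argument list for the convert_to_itol command quoted above."""
    return [
        "gtdbtk", "convert_to_itol",
        "--input_tree", str(input_tree),
        "--output_tree", str(output_tree),
    ]


def convert_to_itol(input_tree, output_tree):
    """Run the conversion; raises if gtdbtk is not on PATH or the command fails."""
    if shutil.which("gtdbtk") is None:
        raise FileNotFoundError("gtdbtk executable not found on PATH")
    subprocess.run(convert_command(input_tree, output_tree), check=True)


# The output file name here is a made-up example.
print(" ".join(convert_command("gtdbtk.ar53.classify.tree", "gtdbtk.ar53.itol.tree")))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when paths contain spaces.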

@Biofarmer

Hi, thanks for the reply. May I ask which folder contains the Newick tree generated by GTDB-Tk? There are 7 tree files (gtdbtk.bac120.classify.tree.1.tree,...,gtdbtk.bac120.classify.tree.7.tree) for bacteria and 1 tree file (gtdbtk.ar53.classify.tree) for archaea in the 'classify' folder.
By the way, may I ask what the difference is between 'gtdbtk.bac120.user_msa.fasta.gz' and 'gtdbtk.bac120.msa.fasta.gz'? Both are protein sequence alignments, right?

Thanks

@pchaumeil
Collaborator

There are 7 trees files (gtdbtk.bac120.classify.tree.1.tree,...,gtdbtk.bac120.classify.tree.7.tree) for bacteria and 1 tree file (gtdbtk.ar53.classify.tree) for archaea in 'classify' folder.
You can pick whichever tree you want to convert. The classify command produces 7 bacterial trees when using the split approach (backbone + 6 class-level trees), so you will need to pick your tree of interest out of those 7. Those 7 trees are basically the GTDB reference tree split into smaller subtrees.
To get a single tree, you can either run classify_wf with the --full_tree flag (memory intensive) or use the de_novo_wf command.
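
To see which bacterial trees a run produced before picking one, something like this Python sketch may help. It assumes the file naming quoted in the question (gtdbtk.bac120.classify.tree.N.tree inside the 'classify' folder). The function name `bacterial_class_trees` is hypothetical.

```python
import re
from pathlib import Path

# File name pattern taken from the names quoted in this thread.
TREE_NAME = re.compile(r"gtdbtk\.bac120\.classify\.tree\.(\d+)\.tree$")


def bacterial_class_trees(classify_dir):
    """Return the split-approach bacterial tree files, ordered by their index."""
    found = []
    for path in Path(classify_dir).iterdir():
        match = TREE_NAME.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]
```

Sorting on the numeric index keeps the files in numeric order even if a run ever produces ten or more trees.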

By the way, may I ask the difference between 'gtdbtk.bac120.user_msa.fasta.gz' and 'gtdbtk.bac120.msa.fasta.gz'?
'gtdbtk.bac120.user_msa.fasta.gz' is an MSA containing only your genomes; 'gtdbtk.bac120.msa.fasta.gz' is an MSA containing your genomes and the GTDB representatives (a larger file).
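
A quick way to see that difference is to count the sequences in each file. This is a minimal Python sketch that assumes the files are ordinary gzip-compressed FASTA, as the .fasta.gz extension suggests. The function name `count_sequences` is hypothetical.

```python
import gzip


def count_sequences(path):
    """Count FASTA records (header lines starting with '>') in a gzipped file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling it on 'gtdbtk.bac120.user_msa.fasta.gz' and on 'gtdbtk.bac120.msa.fasta.gz' should give a much larger count for the second file, since it also includes the GTDB representatives.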

@Biofarmer

Thanks. If using the --full_tree flag, will bacteria and archaea also be in one tree, or still in separate trees? Is gtdbtk.bac120.user_msa.fasta.gz a protein sequence alignment?

@pchaumeil
Collaborator

Bacteria and archaea will always be on different trees (their MSAs and markers are completely different), and yes, gtdbtk.bac120.user_msa.fasta.gz is a protein sequence alignment file.

@Biofarmer

Thanks. May I further ask whether there are any differences among the 7 trees, and between the 7 trees and the single tree produced with the --full_tree flag?

@pchaumeil
Collaborator

To speed up the classification process, the full reference tree (used by --full_tree) has been split into 6 smaller trees (class-level trees), each of which contains all representative genomes for a different set of classes.
With the new split approach, each genome is first placed in a backbone tree (a tree containing only one genome per family) and, based on its placement and its RED value in this backbone tree, the genome gets a class-rank classification.
Once a genome gets a class (let's say class c__Bacteroidia) from the backbone tree, GTDB-Tk places it on one of the 6 class-level trees (the one containing c__Bacteroidia) to finalise the taxonomy down to the species level.
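
The second step above can be pictured with a toy Python sketch. This is not GTDB-Tk's code: the mapping, the tree indices, and the placeholder class names (other than c__Bacteroidia) are invented purely to illustrate how a class from the backbone tree selects one class-level tree.

```python
# Hypothetical mapping: class-level tree index -> set of classes it contains.
# The indices and the placeholder class names are made up for illustration.
CLASS_LEVEL_TREES = {
    1: {"c__Bacteroidia"},
    2: {"c__PlaceholderA", "c__PlaceholderB"},
}


def class_level_tree_for(assigned_class, trees=CLASS_LEVEL_TREES):
    """Return the index of the class-level tree containing the class, else None."""
    for index, classes in trees.items():
        if assigned_class in classes:
            return index
    return None


# A genome assigned c__Bacteroidia on the backbone tree goes to tree 1 here.
print(class_level_tree_for("c__Bacteroidia"))
```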

@Biofarmer

Many thanks.
