Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide the description of an HMM #76

Closed
oschwengers opened this issue Sep 24, 2024 · 13 comments · Fixed by #78
Closed

Provide the description of an HMM #76

oschwengers opened this issue Sep 24, 2024 · 13 comments · Fixed by #78
Labels
enhancement New feature or request

Comments

@oschwengers
Copy link

oschwengers commented Sep 24, 2024

Hi @althonos and many thanks for maintaining and pushing forward this wonderful project!

I just started to use PyHMMER in Bakta for user-provided HMMs. Here, I'd like to provide the users with an option to feed some extra information on each HMM into the pipeline via a custom format for the description field in the HMM file.

However, there seems no such field in the API of the TopHits object. For NAME, ACC and LENG there are TopHits.query_name, TopHits.query_accession and TopHits.query_length, respectively.

Hence, could you maybe add TopHits.query_description for the related DESC?

Thanks again!

PS: Just in case I overlooked something, could please point me to the right API object?

@jpjarnoux
Copy link

Hi!

I checked and query_length seems to already be in the object. https://pyhmmer.readthedocs.io/en/stable/api/plan7/results.html#pyhmmer.plan7.TopHits.query_length

top_hits = hmmsearch(**kwargs)
query_length = top_hits.query_length

Hope it helps you.

@oschwengers
Copy link
Author

Thanks @jpjarnoux for the prompt reply.
However, there was a typo (wrongly copy-pasted) in the text above. Instead of TopHits.query_length (which indeed already exists), I meant TopHits.query_description for the related DESC field of the HMM files.

I amended my initial query above.

@jpjarnoux
Copy link

jpjarnoux commented Sep 24, 2024

Ok, so a way to do what you want is:

top_hits = hmmsearch(**kwargs)
hit = next(top_hits)
query_desc = hit.query_desc

I have not tested this, but I think it works.

Obviously, an attribute could be really useful here.

@oschwengers
Copy link
Author

oschwengers commented Sep 29, 2024

@jpjarnoux I've just tested it. There is no such attribute hit.query_desc. Can you point me to an entry in the docs: https://pyhmmer.readthedocs.io/en/stable/api/index.html ?

@jpjarnoux
Copy link

Hi!
Sorry about that I made a confusion somehow.
If you want the description of the target https://pyhmmer.readthedocs.io/en/stable/api/plan7/results.html#pyhmmer.plan7.Hit.description :

top_hits = hmmsearch(**kwargs)
hit = next(top_hits)
target_desc = hit.description

This is not possible for the query. I swear I saw something, but I must have been confused with another thing.

@oschwengers
Copy link
Author

Thanks @jpjarnoux for the clarification.

Unfortunately, I also haven't seen anything close to the query description (the DESC line in the HMM files). However, for Bakta we need exactly this field to provide users with an option to feed-in additional information into the workflow via a customized format, just like what, for example, Prokka does (https://github.com/tseemann/prokka?tab=readme-ov-file#fasta-database-format, using the original hmmsearch).

@althonos I'd love to use Pyhmmer for this in Bakta. Hence, would it be possible to add such a query description field TopHits.query_description to the API? I haven't looked into the code, but I could image that it should not be that much of a deal, but a huge advantage for many users.

@althonos
Copy link
Owner

althonos commented Sep 30, 2024

Hi Oliver,

It depends how you are calling the pipeline, but by design the hmmsearch function yields the hits in the same order the HMMs are passed. This means that with the right use of zip and itertools.tee you can get an iterator on both the query HMM and the TopHits:

queries = HMMFile("...")
hmms1, hmms2 = itertools.tee(queries)
for hmm, hits in zip(hmms1, hmmsearch(hmms2, target_sequences)):
   print(hmm.description, len(hits)))

@oschwengers
Copy link
Author

oschwengers commented Oct 2, 2024

Thanks @althonos for the explanation and sorry for being so persistent on this, but wouldn't it be much easier for Pyhmmer users to simply provide the HMM description in the TopHits object?

For instance, in v0.10.5, the TopHits.query_length attribute was added as part of the API. Wouldn't it be possible (and much easier to use) to add a related TopHits.query_description attribute, as well?

Or is there maybe a technical reason why some fields (query_name, query_accession, query_length) are available in the TopHits object, but description is not? It would make life so much easier to use Pyhmmer. If this is not possible, I think I have to go back using the original hmmsearch - though I really don't want to and would love you use Pyhmmer.

@oschwengers
Copy link
Author

OK, I found an acceptable workaround:

with pyhmmer.plan7.HMMFile(user_hmms_path) as hmms_fh:
        hmms = list(hmms_fh)
        hmms_id_desc = { hmm.accession.decode(): hmm.description.decode() for hmm in hmms }
        for hmm_query_hits in pyhmmer.hmmsearch(hmms, proteins):

However, I still feel like having a direct access to the TopHits.query_description field would be a plus.

Please, feel free to just close this, or leave it upon just in case you consider adding the desc field. Anyway, thank you very much for this great Python wrapper (and many others) - very much appreciated!

@althonos
Copy link
Owner

althonos commented Oct 2, 2024

Or is there maybe a technical reason why some fields (query_name, query_accession, query_length) are available in the TopHits object, but description is not?

The main problem is that for every field I am reserving more space in each TopHits object which I was reluctant to do because of memory usage. But I think it may actually better to add a reference to the original query rather than manually handling all the fields. I'll experiment a bit to see if this can be done properly, so that you could do e.g. TopHits.query.description.

@althonos althonos added the enhancement New feature or request label Oct 2, 2024
@oschwengers
Copy link
Author

That would be awesome, thanks a lot!

@oschwengers
Copy link
Author

Awesome! Thank you so much!

@althonos
Copy link
Owner

althonos commented Oct 8, 2024

I'm making a release now 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants