Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bacpop 17 #201

Merged
merged 7 commits into from
Apr 29, 2022
Merged

Bacpop 17 #201

merged 7 commits into from
Apr 29, 2022

Conversation

muppi1993
Copy link
Contributor

assign.py:

  • removed arguments web &json_sketch
  • assign_query() now creates the hd5 db and then calls assign_query_hd5(), which holds most of the code that was previously in assign_query()

web.py:

  • sketch_to_hd5() now accepts multiple json sketches in form of a dictionary
  • query names as supplied as dict keys are prefixed with a "query_" to distinguish them from reference samples
  • summarise_clusters() reflects this change by searching for multiple queries in the result df, clusters and cluster prevalences are returned in lists, and include.txt is created for each cluster separately (not sure if we actually need these)

@muppi1993
Copy link
Contributor Author

Note on sketch_to_hdf5(): Since the hd5 db only takes the attributes sketch_version, codon_phased and densified once at the top level group ("sketches"), I am not sure how we should handle cases, in which the different json_sketches might differ in these attributes. I guess this is most relevant for sketch_version, if the user combines new samples with some they uploaded a while ago (Not sure if codon_phased and densifiedwill ever be different?). Currently, they will be set to the values from the last json_sketch that is processed.
I am not even sure if this is relevant at all?

@muppi1993 muppi1993 requested review from richfitz and johnlees April 27, 2022 10:58
Copy link
Member

@johnlees johnlees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, one function to be renamed, and one question about the query_ prefix.

Maybe we should also get in the habit of increasing the version number with each PR, I haven't done this before but it was useful in dust and friends. I'd probably still just make manual releases on minor version increases.

In answer to your questions:

  • densified I don't expect this to be written to the database, is there an example in the code you can point me to where this is read/written?
  • codon_phased has to match between databases, and this should be enforced by the sketchlib code https://github.com/bacpop/pp-sketchlib/blob/master/src/database/database.hpp#L41.
  • sketch_version mismatch is ignored at the moment, but was implemented to be available as a check in future. At the moment I expect this to change too often to be useful, but at some point it will stabilise and I will probably implement a check as for codon_phased but which just emits a warning

PopPUNK/web.py Outdated
k_spec.attrs['kmer-size'] = kmers[k_index]

for top_key, top_value in sketches_dict.items():
qNames.append("query_" + top_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the rationale for this, the user wants to query without worrying about uniqueness of name versus the references, but:

  • Will this prefix be removed at some point in the process? It can be difficult to match up to names in if it's unexpectedly changed.
  • What about if query_x is already in the database? This could happen if this was run on a database which had already had queries added to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed that, and now summarise_clusters() does the subsetting based on qNames (without prefix) which will be supplied as an additional argument.

@muppi1993
Copy link
Contributor Author

  • densified I don't expect this to be written to the database, is there an example in the code you can point me to where this is read/written?

densified is written to the db here:

sketches.attrs['densified'] = value

Do we ever expect that this would be different for the sketches we generate with the wasm code? If we assume they will all have the same value here, it should not be a problem to just have this value once for the db containing multiple jsons generated at different times.

@johnlees
Copy link
Member

  • densified I don't expect this to be written to the database, is there an example in the code you can point me to where this is read/written?

densified is written to the db here:

sketches.attrs['densified'] = value

Ah I see. I think that shouldn't be written to the DB as it's not used anywhere else. Basically this happens when the sketch size is 'too big' for the genome length, but it is an adaptation of the algorithm not a warning/error. In the sketching code we write to the terminal when this occurs for a sample, but this feels like something you'd only ever want to know from the CLI when you're creating a database manually.

So I'd suggest simply removing that line and ignoring that field (and we can remove it from the web sketch too, if necessary).

Do we ever expect that this would be different for the sketches we generate with the wasm code? If we assume they will all have the same value here, it should not be a problem to just have this value once for the db containing multiple jsons generated at different times.

For any bacterial data this should always be false (and definitely always false for S. pneumoniae), but the sketchlib code is set up to handle differences so I think you can just leave this as-is.

@johnlees
Copy link
Member

@muppi1993 I'm happy for this to be merged if you bump the version btw

@muppi1993 muppi1993 requested a review from johnlees April 29, 2022 15:27
@johnlees johnlees merged commit f2f0fcf into master Apr 29, 2022
@johnlees johnlees deleted the bacpop-17 branch April 29, 2022 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants