-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bacpop 17 #201
Conversation
Note on sketch_to_hdf5(): Since the hd5 db only takes the attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, one function to be renamed, and one question about the query_
prefix.
Maybe we should also get in the habit of increasing the version number with each PR, I haven't done this before but it was useful in dust and friends. I'd probably still just make manual releases on minor version increases.
In answer to your questions:
densified
I don't expect this to be written to the database, is there an example in the code you can point me to where this is read/written?codon_phased
has to match between databases, and this should be enforced by the sketchlib code https://github.com/bacpop/pp-sketchlib/blob/master/src/database/database.hpp#L41.sketch_version
mismatch is ignored at the moment, but was implemented to be available as a check in future. At the moment I expect this to change too often to be useful, but at some point it will stabilise and I will probably implement a check as for codon_phased but which just emits a warning
PopPUNK/web.py
Outdated
k_spec.attrs['kmer-size'] = kmers[k_index] | ||
|
||
for top_key, top_value in sketches_dict.items(): | ||
qNames.append("query_" + top_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the rationale for this, the user wants to query without worrying about uniqueness of name versus the references, but:
- Will this prefix be removed at some point in the process? It can be difficult to match up to names in if it's unexpectedly changed.
- What about if
query_x
is already in the database? This could happen if this was run on a database which had already had queries added to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed that, and now summarise_clusters() does the subsetting based on qNames (without prefix) which will be supplied as an additional argument.
Co-authored-by: John Lees <[email protected]>
Co-authored-by: John Lees <[email protected]>
Line 194 in 46aff5d
Do we ever expect that this would be different for the sketches we generate with the wasm code? If we assume they will all have the same value here, it should not be a problem to just have this value once for the db containing multiple jsons generated at different times. |
Ah I see. I think that shouldn't be written to the DB as it's not used anywhere else. Basically this happens when the sketch size is 'too big' for the genome length, but it is an adaptation of the algorithm not a warning/error. In the sketching code we write to the terminal when this occurs for a sample, but this feels like something you'd only ever want to know from the CLI when you're creating a database manually. So I'd suggest simply removing that line and ignoring that field (and we can remove it from the web sketch too, if necessary).
For any bacterial data this should always be false (and definitely always false for S. pneumoniae), but the sketchlib code is set up to handle differences so I think you can just leave this as-is. |
@muppi1993 I'm happy for this to be merged if you bump the version btw |
assign.py
:web
&json_sketch
web.py
: