It would be nice to have a way of defining queries on samples using the command-line interface in addition to the Python API.
Option 1: use Unix pipes
Queries in my system operate on sets of sample IDs, which are encoded using pyroaring. These sets can easily be serialized to, and deserialized from, bytes.
This suggests a command-line interface that passes sample sets between commands through Unix pipes. For example:
gdi query hasa 'S:D614G' | gdi query hasa 'S:G142D' --summarize
This would select the samples with the D614G mutation, encode their sample IDs using pyroaring, and pass them to the stdin of a second gdi instance, which would deserialize the sample set and then select the samples that also have the G142D mutation. Finally, a summary table of the results would be printed to the user. This would be the equivalent in the Python API of:
This could be combined with building trees/alignments. For example:
gdi query hasa 'S:D614G' | gdi build alignment > d614g.aln
This would build an alignment of all those genomes with a D614G mutation.
The details still need some thought. For example, I need to encode both the present and the unknown sample sets, and I need some way of automatically detecting whether output is being piped to another command or printed to a terminal/file (where I may want to display human-readable results instead of encoded sets of sample identifiers).
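As a rough sketch of the two open points above, the following uses only the Python standard library. The function names (`write_sample_sets`, `read_sample_sets`, `is_pipe`) and the length-prefixed framing are assumptions made for illustration and are not part of gdi. The payloads are treated as opaque bytes; in practice they would presumably be the serialized pyroaring bitmaps.

```python
import os
import stat
import struct


def write_sample_sets(stream, present: bytes, unknown: bytes) -> None:
    """Write two serialized sample sets, each prefixed by a 4-byte big-endian length.

    Hypothetical framing: 'present' first, then 'unknown'.
    """
    for payload in (present, unknown):
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)


def read_sample_sets(stream):
    """Read back the (present, unknown) payloads written by write_sample_sets."""
    payloads = []
    for _ in range(2):
        (length,) = struct.unpack(">I", stream.read(4))
        payloads.append(stream.read(length))
    return tuple(payloads)


def is_pipe(fd: int = 1) -> bool:
    """Return True if the file descriptor (stdout by default) is a pipe (FIFO).

    A terminal or a regular file returns False, so a command could fall
    back to human-readable output in those cases.
    """
    return stat.S_ISFIFO(os.fstat(fd).st_mode)
```

A command could then call `is_pipe()` once at startup and write the binary framing to `sys.stdout.buffer` only when it returns True. Checking for a FIFO rather than calling `isatty()` is a deliberate choice here, because `isatty()` alone cannot tell a pipe apart from a redirect to a file.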
Option 2: use a query language
As an alternative, I could just specify a string query language which can be passed directly to a single instance of gdi:
gdi query 'not hasa:S:D614G or hasa:S:G142D'
The advantage here is that I avoid the overhead of launching a separate gdi process for every step of a query. Plus, this query language could be reused elsewhere, e.g., for web searching.
A disadvantage is that complicated queries become difficult to encode as strings like this. For example, for distance-based queries, how would I specify the units and the type of distance query? With a command-line interface I can expose them as options such as --distance-unit and --distance-type.
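To make the string-query idea more concrete, here is a minimal sketch of how a query such as 'not hasa:S:D614G or hasa:S:G142D' might be parsed and evaluated. Everything in it is an assumption for illustration: the `evaluate` function, the operator precedence (not, then and, then or), the support for parentheses, and the use of plain Python sets in place of pyroaring bitmaps. It also ignores the unknown sample sets, and distance-based queries are out of scope.

```python
import re

# A token is a parenthesis or any run of characters without spaces/parentheses.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def evaluate(query, index, universe):
    """Evaluate a boolean query over sets of sample IDs.

    index:    hypothetical mapping of feature (e.g. 'S:D614G') -> set of sample IDs
    universe: set of all sample IDs (needed to evaluate 'not')
    """
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "or":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "and":
            advance()
            result = result & parse_not()
        return result

    def parse_not():
        if peek() == "not":
            advance()
            return universe - parse_not()
        return parse_atom()

    def parse_atom():
        token = advance()
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("expected ')'")
            return result
        if token is not None and token.startswith("hasa:"):
            # Unknown features select no samples.
            return set(index.get(token[len("hasa:"):], set()))
        raise ValueError(f"unexpected token: {token!r}")

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result
```

With `index = {"S:D614G": {1, 2, 3}, "S:G142D": {3, 4}}` and `universe = {1, 2, 3, 4, 5}`, the query from the example above selects `{3, 4, 5}` under these precedence rules: the samples lacking D614G, plus those with G142D.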
apetkau changed the title from "Pipe sample sets via command-line interface" to "Implement query language in command-line interface" on Jul 27, 2021