Implement query language in command-line interface #74

apetkau · 2021-07-27T21:16:38Z

It would be nice to have a way of defining queries on samples using the command-line interface in addition to the Python API.

Option 1: use Unix pipes

Queries in my system operate on sets of sample IDs, which are encoded using pyroaring. These can be easily serialized/deserialized into bytes.

So, this gives me an idea of a command-line interface using Unix pipes and sample sets. For example:

gdi query hasa 'S:D614G' | gdi query hasa 'S:G142D' --summarize

This would select the samples with the D614G mutation, encode their sample IDs using pyroaring, and pass them to the stdin of another gdi instance, which would deserialize the sample set and then select those with the G142D mutation. Finally, a summary table of the results would be printed to the user. The equivalent in the Python API would be:

db.samples_query().hasa('S:D614G').hasa('S:G142D').summary()

This could be combined with building trees/alignments. For example:

gdi query hasa 'S:D614G' | gdi build alignment > d614g.aln

This would build an alignment of all those genomes with a D614G mutation.

The details still need to be worked out. For example, I need to encode both the present and unknown sample sets, and I need some way of automatically detecting whether data is being piped or printed to a terminal/file (where I may want to display human-readable results instead of encoded sets of sample identifiers).
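As a rough sketch of those two open points (not how gdi does it), the block below frames two already-serialized sample sets into one byte stream and checks whether an output stream is a pipe. The function names `encode_sets`, `decode_sets`, and `is_pipe`, and the length-prefixed framing, are assumptions made for illustration. The payloads are treated as opaque bytes, such as those produced by serializing a bitmap.

```python
import os
import stat
import struct


def encode_sets(present: bytes, unknown: bytes) -> bytes:
    """Frame two serialized sets as: 4-byte length, payload, 4-byte length, payload.

    Hypothetical framing, chosen only so both sets fit in one stream.
    """
    return (struct.pack("<I", len(present)) + present
            + struct.pack("<I", len(unknown)) + unknown)


def decode_sets(data: bytes):
    """Inverse of encode_sets: return (present, unknown) payloads."""
    (n_present,) = struct.unpack_from("<I", data, 0)
    present = data[4:4 + n_present]
    offset = 4 + n_present
    (n_unknown,) = struct.unpack_from("<I", data, offset)
    unknown = data[offset + 4:offset + 4 + n_unknown]
    return present, unknown


def is_pipe(stream) -> bool:
    """True when the stream's file descriptor is a FIFO (i.e. a Unix pipe).

    A terminal or a regular file returns False, so both could get
    human-readable output.
    """
    return stat.S_ISFIFO(os.fstat(stream.fileno()).st_mode)
```

A command could then call `is_pipe(sys.stdout)` and write the encoded sets to `sys.stdout.buffer` only when it returns True. Checking for a FIFO rather than using `isatty()` is deliberate here: `isatty()` alone would treat a redirect to a file the same as a pipe.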


apetkau · 2021-07-27T21:20:58Z

AND/OR/NOT boolean operations could be specified with a command-line option:

gdi query --not hasa 'S:D614G' | gdi query --or hasa 'S:G142D' --summarize

This would be read as "find all samples that do NOT have a D614G mutation OR have a G142D mutation".
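To make the intended semantics concrete, here is a minimal sketch of how one stage of such a pipeline might combine its own matches with the set received from the previous stage. Plain Python sets stand in for the bitmaps, and `apply_stage` and its parameters are hypothetical names, not part of gdi. Note that NOT needs the full set of sample IDs (the "universe") to complement against.

```python
def apply_stage(piped, matches, universe, op="and", negate=False):
    """Combine this stage's matches with the set piped in from the previous stage.

    piped:    set received on stdin, or None for the first stage
    matches:  samples matching this stage's own criterion
    universe: all sample IDs, needed to compute a complement for --not
    """
    if negate:
        matches = universe - matches
    if piped is None:
        return matches
    if op == "and":
        return piped & matches
    if op == "or":
        return piped | matches
    raise ValueError(f"unknown operation: {op!r}")


# Illustrative data only: sample IDs 1-4, with made-up mutation membership.
universe = {1, 2, 3, 4}
has_d614g = {1, 2}
has_g142d = {2, 3}

# gdi query --not hasa 'S:D614G' | gdi query --or hasa 'S:G142D'
stage1 = apply_stage(None, has_d614g, universe, negate=True)
stage2 = apply_stage(stage1, has_g142d, universe, op="or")
```

With this data `stage1` is the samples without D614G and `stage2` adds every sample with G142D. How the unknown sample sets should behave under NOT is left open here.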

apetkau · 2021-07-27T21:23:49Z

Option 2: Decode queries from a single string

As an alternative, I could just specify a string query language which can be passed directly to a single instance of gdi:

gdi query 'not hasa:S:D614G or hasa:S:G142D'

The advantage here is that I avoid the overhead of launching a separate gdi process for every step of the query. Plus, this query language could be reused elsewhere, e.g., for web searching.

A disadvantage is that it becomes difficult to encode complicated queries using strings like this. For example, for distance-based queries, how would I specify the units and the type of distance query? With command-line options I can include them as --distance-unit or --distance-type.
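For illustration, the single-string form could be handled by a small recursive-descent evaluator like the sketch below. The grammar (whitespace-separated tokens, `not` binding tighter than `and`, which binds tighter than `or`, and only `hasa:` terms) is an assumption, as are the `evaluate` function and its `hasa` callback. It does not attempt the distance-style queries, which is exactly where a string syntax gets harder.

```python
from typing import Callable, Set


def evaluate(query: str, universe: Set[int],
             hasa: Callable[[str], Set[int]]) -> Set[int]:
    """Evaluate a query such as 'not hasa:S:D614G or hasa:S:G142D'."""
    tokens = query.split()
    pos = 0

    def parse_or() -> Set[int]:
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            result = result | parse_and()
        return result

    def parse_and() -> Set[int]:
        nonlocal pos
        result = parse_not()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            result = result & parse_not()
        return result

    def parse_not() -> Set[int]:
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "not":
            pos += 1
            return universe - parse_not()
        return parse_term()

    def parse_term() -> Set[int]:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        # Split only on the first colon so 'hasa:S:D614G' yields 'S:D614G'.
        kind, _, value = token.partition(":")
        if kind != "hasa" or not value:
            raise ValueError(f"unsupported term: {token!r}")
        return hasa(value)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result
```

Here `hasa` would be whatever function looks up the samples carrying a mutation; in a test it can be a dictionary lookup over made-up data.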

apetkau added the enhancement New feature or request label Jul 27, 2021

apetkau changed the title ~~Pipe sample sets via command-line interface~~ Implement query language in command-line interface Jul 27, 2021
