Implement query language in command-line interface #74

Open
apetkau opened this issue Jul 27, 2021 · 2 comments
Labels
enhancement New feature or request

Comments


apetkau commented Jul 27, 2021

It would be nice to have a way of defining queries on samples using the command-line interface in addition to the Python API.

Option 1: use Unix pipes

Queries in my system operate on sets of sample IDs, which are encoded using pyroaring. These sets can easily be serialized to, and deserialized from, bytes.
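To make the byte round trip concrete, here is a minimal sketch. It deliberately uses only the standard library as a stand-in for pyroaring, so it is not how gdi stores its sets; `encode_ids` and `decode_ids` are hypothetical helper names, and the fixed 4-byte little-endian layout is an assumption chosen for illustration.

```python
import struct


def encode_ids(sample_ids):
    """Pack a set of integer sample IDs into bytes (4 bytes per ID, little-endian)."""
    ordered = sorted(sample_ids)
    return struct.pack(f"<{len(ordered)}I", *ordered)


def decode_ids(payload):
    """Rebuild the set of sample IDs from bytes produced by encode_ids."""
    count = len(payload) // 4
    return set(struct.unpack(f"<{count}I", payload))
```

A real implementation would presumably rely on the serialization that pyroaring already provides, which is far more compact for large, dense sets than this fixed-width layout.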

This suggests a command-line interface built on Unix pipes and sample sets. For example:

gdi query hasa 'S:D614G' | gdi query hasa 'S:G142D' --summarize

This would select the samples with the D614G mutation, encode their sample IDs using pyroaring, and pass them to the stdin of a second gdi instance, which would deserialize the sample set and then select those samples that also have the G142D mutation. Finally, a summary table of the results would be printed. This is equivalent to the following in the Python API:

db.samples_query().hasa('S:D614G').hasa('S:G142D').summary()

This could be combined with building trees/alignments. For example:

gdi query hasa 'S:D614G' | gdi build alignment > d614g.aln

This would build an alignment of all those genomes with a D614G mutation.

The details still need to be worked out. For example, I need to encode both the present and the unknown sample sets, and I need some way of automatically detecting whether output is being piped or printed to a terminal/file (where I may want to display human-readable results instead of encoded sets of sample identifiers).
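Both open questions can be sketched with the standard library. The sketch below is only one possible approach, and every name in it (`pack_sets`, `unpack_sets`, `is_pipe`) is hypothetical. It assumes each sample set has already been serialized to bytes, frames the two payloads with a length prefix, and checks whether a file descriptor is a pipe. Note that `isatty()` alone would not be enough here, since output redirected to a file is not a terminal either.

```python
import os
import stat
import struct


def pack_sets(present: bytes, unknown: bytes) -> bytes:
    """Frame two serialized sets, each prefixed by its length (unsigned 64-bit, little-endian)."""
    return b"".join(struct.pack("<Q", len(p)) + p for p in (present, unknown))


def unpack_sets(frame: bytes):
    """Split a frame produced by pack_sets back into (present, unknown)."""
    payloads = []
    offset = 0
    for _ in range(2):
        (length,) = struct.unpack_from("<Q", frame, offset)
        offset += 8
        payloads.append(frame[offset:offset + length])
        offset += length
    return tuple(payloads)


def is_pipe(fd: int) -> bool:
    """True if the file descriptor refers to a pipe (FIFO) rather than a terminal or file."""
    return stat.S_ISFIFO(os.fstat(fd).st_mode)
```

A command could then call `is_pipe(sys.stdout.fileno())` to decide between writing the encoded frame and printing a human-readable table, and `is_pipe(sys.stdin.fileno())` to decide whether there is an incoming sample set to read.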

@apetkau apetkau added the enhancement New feature or request label Jul 27, 2021

apetkau commented Jul 27, 2021

AND/OR/NOT boolean operations could be specified with a command-line option:

gdi query --not hasa 'S:D614G' | gdi query --or hasa 'S:G142D' --summarize

This would be read as "find all samples that do NOT have a D614G mutation OR have a G142D mutation".
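As a rough sketch of how each pipeline stage might apply these options, the hypothetical `combine` function below merges the set received from the previous stage with the matches of the current stage. It assumes that `--not` negates the current stage's matches relative to the set of all samples, that stages are combined with AND unless `--or` is given, and it ignores unknown sets entirely. None of this is settled behaviour.

```python
def combine(incoming, matched, universe, op="and", negate=False):
    """Combine the piped-in sample set with this stage's matches.

    incoming: set from the previous stage, or None for the first stage.
    matched:  samples matching this stage's criterion (e.g. hasa 'S:D614G').
    universe: all sample IDs, needed to evaluate NOT.
    """
    current = universe - matched if negate else set(matched)
    if incoming is None:
        return current
    if op == "and":
        return incoming & current
    if op == "or":
        return incoming | current
    raise ValueError(f"unknown operation: {op!r}")


# Toy data: four samples, two mutations.
universe = {1, 2, 3, 4}
d614g = {1, 2}
g142d = {2, 3}

# Stage 1: --not hasa 'S:D614G'; stage 2: --or hasa 'S:G142D'.
stage1 = combine(None, d614g, universe, negate=True)
stage2 = combine(stage1, g142d, universe, op="or")
```

With this toy data, `stage2` holds every sample that lacks D614G or has G142D.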


apetkau commented Jul 27, 2021

Option 2: Decode queries from a single string

As an alternative, I could define a string-based query language that is passed directly to a single instance of gdi:

gdi query 'not hasa:S:D614G or hasa:S:G142D'

The advantage here is that I avoid the overhead of launching multiple instances of gdi. Plus, this query language could be reused elsewhere, e.g., for web searching.

A disadvantage is that complicated queries become difficult to encode as strings like this. For example, for distance-based queries, how would I specify the units and the type of distance query? With a command-line interface I can include them as options such as --distance-unit or --distance-type.
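To give a feel for what Option 2 would involve, here is a minimal sketch of a parser and evaluator for strings such as 'not hasa:S:D614G or hasa:S:G142D'. The grammar is an assumption: whitespace-separated tokens, `not` binding tighter than `and`, `and` binding tighter than `or`, and no parentheses. The `evaluate` function and the `index` mapping from feature to sample IDs are hypothetical stand-ins for the real database lookup.

```python
def evaluate(query, index, universe):
    """Evaluate a query string against a feature -> sample-ID-set index."""
    tokens = query.split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        result = parse_and()
        while peek() == "or":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "and":
            advance()
            result = result & parse_not()
        return result

    def parse_not():
        if peek() == "not":
            advance()
            return universe - parse_not()
        return parse_atom()

    def parse_atom():
        token = peek()
        if token is None or not token.startswith("hasa:"):
            raise ValueError(f"expected 'hasa:<feature>', got {token!r}")
        advance()
        # Everything after the 'hasa:' prefix is the feature, e.g. 'S:D614G'.
        return set(index.get(token[len("hasa:"):], ()))

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result
```

Even this small grammar shows the trade-off: boolean structure is easy to express, but anything that needs extra parameters, such as a distance unit or distance type, would require extending the token syntax.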

@apetkau apetkau changed the title Pipe sample sets via command-line interface Implement query language in command-line interface Jul 27, 2021