Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: ska map to only report positions of variable split-kmers #65

Closed
rderelle opened this issue Feb 12, 2024 · 7 comments · Fixed by #66
Closed

Suggestion: ska map to only report positions of variable split-kmers #65

rderelle opened this issue Feb 12, 2024 · 7 comments · Fixed by #66
Assignees
Labels
enhancement New feature or request

Comments

@rderelle
Copy link
Collaborator

Not issue, a suggestion:

ska map generates a VCF or alignment file corresponding to all positions of the reference genome.

Would it be interesting to have an option to restrict the mapping only to variable split-kmers? This could generate smaller output files and would avoid to manually filter the output if only variable sites (SNPs) are of interest.

@johnlees johnlees self-assigned this Feb 12, 2024
@johnlees johnlees added the enhancement New feature or request label Feb 12, 2024
@johnlees johnlees changed the title ska map to only report positions of variable split-kmers? Suggestion: ska map to only report positions of variable split-kmers Feb 12, 2024
@johnlees
Copy link
Member

The VCF should be variant only, I think (let me know if not right, as this is what it says in the docs).

I could do this for .aln files too, but there is also snp-sites which offers the same functionality

@rderelle
Copy link
Collaborator Author

From what I remember, the VCF output all positions (I would need to confirm this).

But I'm possibly now encountering a bug. The ska map function generates an alignment without any issue, but SKA2 crashes when '-f vcf' is specified using the same skf file and same reference genome:

./ska_0.3.5 map -v -f vcf Spn_ATCC_700669.fna D39V__out0.skf
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-02-12T10:51:25.285Z INFO [ska] Loading skf as dictionary
2024-02-12T10:51:25.285Z INFO [ska::io_utils] Single file as input, trying to load as skf 64-bits
2024-02-12T10:51:26.612Z INFO [ska] Making skf of reference k=31 rc=true
2024-02-12T10:51:26.660Z INFO [ska::generic_modes] Converting skf to dictionary
2024-02-12T10:51:26.854Z INFO [ska::generic_modes] Mapping
2024-02-12T10:51:27.276Z INFO [ska::generic_modes] Generating VCF with 1 threads
2024-02-12T10:51:27.404Z INFO [ska::ska_ref] Converting to VCF
thread 'main' panicked at src/ska_ref.rs:451:32:
Could not add contig to header: Invalid
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

@johnlees
Copy link
Member

Looks like it could be an issue with noodles_vcf. Which version of that and noodles is your ska built with, do you know?

Also, what do you get if you run grep ">" Spn_ATCC_700669.fna

@rderelle
Copy link
Collaborator Author

grep ">" Spn_ATCC_700669.fna
NC_011900.1 Streptococcus pneumoniae ATCC 700669, complete sequence

I'm afraid I do not know about noodles.

@johnlees
Copy link
Member

Can you attach the .fna file here so I can take a look

For the versions, look in Cargo.lock for noodles-core and noodles-vcf
e.g.

[[package]]
name = "noodles-vcf"
version = "0.22.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "631a8103ca24338a93d0ed4b14c03a3d11d0290c2c643f7dc28327603ca3eb40"
dependencies = [
 "indexmap",
 "memchr",
 "nom",
 "noodles-bgzf",
 "noodles-core",
 "noodles-csi",
 "noodles-tabix",
 "percent-encoding",
]

@rderelle
Copy link
Collaborator Author

I can't check this as I only keep the binaries after compiling.
I put the skf file and reference genomes are on my google drive: https://drive.google.com/drive/folders/1OWEQtq0UdnzZMPEIi-X0O_wxqEPn1mOo?usp=drive_link

@johnlees
Copy link
Member

I don't think it likes the gaps in the chromsome name in the header, if I change the title in the ref fasta to:

>NC_011900.1

It works (and outputs only variant sites)

I will add a fix which deals with these formats of titles

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants