Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seqkit sana fails on valid FASTQ #429

Closed
mw55309 opened this issue Dec 24, 2023 · 10 comments
Closed

Seqkit sana fails on valid FASTQ #429

mw55309 opened this issue Dec 24, 2023 · 10 comments
Labels

Comments

@mw55309
Copy link

mw55309 commented Dec 24, 2023

e.g. this is valid FASTQ from the SRA:

@SRR2175749.1/1
GCCATACGTATCGATGTAAGCTGGATCCTCGGTTGTTGTTCCGCCAAAACCGCTCTTTGTAAACGGACTGTAAACGGTGGCACTTGCCGGCATATAGATAG
+SRR2175749.1/1
BBBFFFFFFFFFFFFFIIFFFFFIFFFFFFBFBFFFFIFFFBFFFF7<BFFIIFFFFFFB<BFBBBBBFB<<BBB<BB7<BBFFBF<7<BBBB<BFBBFFB
@SRR2175749.10/1
GGTTTGGATAGACCGCATCAACTTTGACAAGGTTATCTCCAACCTGCTCTCCAATGCCATGAAGTATACCTTTGATGGCGGCGAGATTACTGTTATCATAG
+SRR2175749.10/1
BBBFFFFFFFFFFIIIIIIIIIIIIFIIIIFIFFFIIIIIIIIIIIIIIIIIIFFIIIFFIIIIBFFFFFFFFFFFFFFFFFBFBFBBFFF<BBB<BBFBB

output:

[INFO] File: test1.fastq        Discarded line: Invalid line states!    1:       @SRR2175749.1/1
[INFO] File: test1.fastq        Discarded line: Invalid line states!    2:       GCCATACGTATCGATGTAAGCTGGATCCTCGGTTGTTGTTCCGCCAAAACCGCTCTTTGTAAACGGACTGTAAACGGTGGCACTTGCCGGCATATAGATAG
[INFO] File: test1.fastq        Discarded line: Invalid line states!    3:       +SRR2175749.1/1
[INFO] File: test1.fastq        Discarded line: Invalid line states!    4:       BBBFFFFFFFFFFFFFIIFFFFFIFFFFFFBFBFFFFIFFFBFFFF7<BFFIIFFFFFFB<BFBBBBBFB<<BBB<BB7<BBFFBF<7<BBBB<BFBBFFB
[INFO] File: test1.fastq        Discarded line: Invalid line states!    5:       @SRR2175749.10/1
[INFO] File: test1.fastq        Discarded line: Invalid line states!    6:       GGTTTGGATAGACCGCATCAACTTTGACAAGGTTATCTCCAACCTGCTCTCCAATGCCATGAAGTATACCTTTGATGGCGGCGAGATTACTGTTATCATAG
[INFO] File: test1.fastq        Discarded line: Invalid line states!    7:       +SRR2175749.10/1
[INFO] File: test1.fastq        Discarded line: Invalid line states!    8:       BBBFFFFFFFFFFIIIIIIIIIIIIFIIIIFIFFFIIIIIIIIIIIIIIIIIIFFIIIFFIIIIBFFFFFFFFFFFFFFFFFBFBFBBFFF<BBB<BBFBB
[INFO] File: test1.fastq        Pass records: 0 Discarded lines: 8

However, remove cany characters after the + sign, and it validates fine:

@SRR2175749.1/1
GCCATACGTATCGATGTAAGCTGGATCCTCGGTTGTTGTTCCGCCAAAACCGCTCTTTGTAAACGGACTGTAAACGGTGGCACTTGCCGGCATATAGATAG
+
BBBFFFFFFFFFFFFFIIFFFFFIFFFFFFBFBFFFFIFFFBFFFF7<BFFIIFFFFFFB<BFBBBBBFB<<BBB<BB7<BBFFBF<7<BBBB<BFBBFFB
@SRR2175749.10/1
GGTTTGGATAGACCGCATCAACTTTGACAAGGTTATCTCCAACCTGCTCTCCAATGCCATGAAGTATACCTTTGATGGCGGCGAGATTACTGTTATCATAG
+
BBBFFFFFFFFFFIIIIIIIIIIIIFIIIIFIFFFIIIIIIIIIIIIIIIIIIFFIIIFFIIIIBFFFFFFFFFFFFFFFFFBFBFBBFFF<BBB<BBFBB

output:

[INFO] File: test1.fastq        Pass records: 2 Discarded lines: 0
@SRR2175749.1/1
GCCATACGTATCGATGTAAGCTGGATCCTCGGTTGTTGTTCCGCCAAAACCGCTCTTTGTAAACGGACTGTAAACGGTGGCACTTGCCGGCATATAGATAG
+
BBBFFFFFFFFFFFFFIIFFFFFIFFFFFFBFBFFFFIFFFBFFFF7<BFFIIFFFFFFB<BFBBBBBFB<<BBB<BB7<BBFFBF<7<BBBB<BFBBFFB
@SRR2175749.10/1
GGTTTGGATAGACCGCATCAACTTTGACAAGGTTATCTCCAACCTGCTCTCCAATGCCATGAAGTATACCTTTGATGGCGGCGAGATTACTGTTATCATAG
+
BBBFFFFFFFFFFIIIIIIIIIIIIFIIIIFIFFFIIIIIIIIIIIIIIIIIIFFIIIFFIIIIBFFFFFFFFFFFFFFFFFBFBFBBFFF<BBB<BBFBB

If you want the actual files, they are here:

https://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmwgsqc/v2/SRS1041116.tar.bz2

@shenwei356
Copy link
Owner

This was also reported here: #408.

I just checked the code (written by @botond-sipos ) and found the method did not consider cases where the + line would be followed by some contents, which is a valid format. While character + would also appear in the quality line, it's hard to distinguish them just according to a single line. So I think it needs some big changes with the code, however, @botond-sipos is on his Christmas vacation.

I might rewrite it If I have time (not now, sorry).

@shenwei356 shenwei356 added the bug label Dec 24, 2023
@mw55309
Copy link
Author

mw55309 commented Dec 24, 2023

Thanks!

If you point me to the relevant code, I can have a quick look to see how it might be fixed. I see where the error is thrown, but where is the matching/checking code?

@mw55309
Copy link
Author

mw55309 commented Dec 24, 2023

Note also that #golang FASTQ parsers exist e.g. https://pkg.go.dev/github.com/TimothyStiles/[email protected]/io/fastq

@shenwei356
Copy link
Owner

The code is here: https://github.com/shenwei356/seqkit/blob/v2.6.1/seqkit/cmd/sana.go#L274-L279

Other subcommands in seqkit use my parser here: https://github.com/shenwei356/bio/blob/master/seqio/fastx/reader.go . It looks complicated because I want to seamlessly parse both FASTA and FASTQ with single- or multiple-line sequence/quality.

@botond-sipos
Copy link
Contributor

@mw55309 My apologies for the delayed response! Unfortunately, sana currently doesn't support FASTQ files with identifiers in the separator lines. I'll update the help message in a PR to clarify this. As @shenwei356 mentioned, it's difficult to reliably distinguish separator lines with identifiers from quality lines within sana's parsing system.

While other Golang FASTQ parsers exist, they typically stop at the first error. Sana's parser is designed to be more forgiving, skipping malformed lines and continuing to process the rest of the file.

I hope this explains sana's current limitations!

@mw55309
Copy link
Author

mw55309 commented Mar 4, 2024

I think the logic that makes sense to me is:

  1. If the line starts with + and it is 1 character long, it is a plus line
    (by the way this will fail for sequences of length 1 which have a quality of + but this is an extreme edge case)

  2. If the line starts with + and it is more than 1 character long, if the previous line was a sequence line, then it is the plus line, if the previous line was plus line, it is quality line, if the previous line is quality line, it is quality line

So can fix the problem by keeping a record of what the previous line category is?

@shenwei356
Copy link
Owner

Yes. We have to think about it. It's feasible.

@shenwei356 shenwei356 reopened this Mar 4, 2024
@botond-sipos
Copy link
Contributor

@mw55309 I started working on parts of the suggested logic. It is a quick hack, but I implemented the lookback to decide the state of the separator lines. You can find it here, I would appreciate it if you tested it and let me know if it meets the expectations. If all goes well I will submit a PR after implementing some tests.

What we definitely cannot support in the current parsing framework is multiline sequences and qualities. The parser architecture is complicated by the fact that is an "online" parser capable of streaming records from files, which are being written. This feature is used in the scat subcommand.
That said, a robust parser could be simpler and we discussed the possibility of a rewrite with @shenwei356 .

@shenwei356
Copy link
Owner

@shenwei356
Copy link
Owner

Fixed in v2.8.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants