Trimming adaptor sequences counts #83

Open
cnluzon opened this issue Nov 1, 2020 · 3 comments

@cnluzon
Collaborator

cnluzon commented Nov 1, 2020

I have a question about the way cutadapt removes adaptor sequences. I see this table in the results of one recent run:

Overview of removed sequences
length  count    expect     max.err  error counts
3       1380352  1004680.1  0        1380352
4       510013   251170.0   0        510013
5       188447   62792.5    0        188447
6       129181   15698.1    0        57978 71203
7       351126   3924.5     1        25914 325212

My question is whether length means length of the sequence trimmed, and if so, isn't that a very short match for a 34bp search?

@marcelm
Collaborator

marcelm commented Nov 1, 2020

If the adapter is specified as a “regular 3’ adapter” (using option -A), then partial matches are allowed. Since the sequence that is searched for is TTTTTCTTTTCTTTTTTCTTTTCCTTCCTTCTAA, we get an adapter match if a read ends with TTT, TTTT, TTTTT, TTTTTC, TTTTTCT and so on. This is meant for finding adapters ligated to the 3’ end of a variable-length molecule, because then the read isn’t guaranteed to cover the entire adapter. If I remember correctly, the nature of that contaminant sequence was a bit unclear, so it’s possible this should be handled differently.
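To illustrate the partial matching described above, here is a minimal Python sketch. It is not how cutadapt works internally (cutadapt uses an error-tolerant alignment and also finds the adapter inside the read); it only checks, without allowing errors, whether a read ends with a prefix of the adapter. The function name `partial_3prime_match` and the example reads are made up for illustration.

```python
ADAPTER = "TTTTTCTTTTCTTTTTTCTTTTCCTTCCTTCTAA"  # sequence quoted in this issue


def partial_3prime_match(read: str, adapter: str = ADAPTER, min_overlap: int = 3) -> int:
    """Return the length of the longest adapter prefix that the read ends with.

    Exact matching only (an illustrative simplification). Returns 0 when no
    prefix of at least `min_overlap` bases matches the end of the read.
    """
    longest_possible = min(len(read), len(adapter))
    # Try the longest candidate first so the longest match wins.
    for k in range(longest_possible, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return k
    return 0


if __name__ == "__main__":
    for read in ("ACGTACGTTT", "ACGTTTTTC", "ACGTACGTAC"):
        print(read, partial_3prime_match(read))
```

With the default `min_overlap=3`, a read that merely ends in `TTT` already counts as a match, which is consistent with the many length-3 rows in the table.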

Considering that we use --discard-trimmed, which means that we throw away any read that has a match, even a short partial one, we should probably increase the minimum overlap length (option -O). Depending on what you want to achieve, it can be set to the length of the sequence to avoid partial matches entirely.
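As a rough way to see what a larger minimum overlap would change, the sketch below uses only the rows quoted in the table above (the full report may contain more lengths). It assumes that the `expect` column is roughly the number of reads times 0.25 per matched base, which fits the factor of four between consecutive rows. The read count `N_READS_ASSUMED` is back-calculated from that column and is not stated anywhere in this issue; the function names are hypothetical.

```python
# Rows quoted in the issue: match length -> number of reads with a match of that length.
OBSERVED = {3: 1380352, 4: 510013, 5: 188447, 6: 129181, 7: 351126}

# ASSUMPTION: back-calculated from the "expect" column (251170.0 * 4**4),
# not a number reported in the issue.
N_READS_ASSUMED = 64_299_520


def expected_random_matches(n_reads: int, length: int) -> float:
    """Chance matches expected at `length`, assuming each base matches with p = 0.25."""
    return n_reads * 0.25 ** length


def no_longer_discarded(min_overlap: int, observed: dict = OBSERVED) -> int:
    """Reads in the quoted rows whose match is shorter than `min_overlap`.

    With --discard-trimmed these reads are currently thrown away; with a
    minimum overlap of `min_overlap` their short matches would be ignored.
    """
    return sum(count for length, count in observed.items() if length < min_overlap)


if __name__ == "__main__":
    for length in sorted(OBSERVED):
        print(length, OBSERVED[length], round(expected_random_matches(N_READS_ASSUMED, length), 1))
    print("kept with minimum overlap 6:", no_longer_discarded(6))
```

This only counts the quoted rows, so it is a lower bound on how many reads a higher minimum overlap would retain.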

@marcelm
Collaborator

marcelm commented Dec 2, 2021

@cnluzon Do you think we should increase the minimum overlap length as I suggested or should this issue just be closed?

@cnluzon
Collaborator Author

cnluzon commented Oct 3, 2023

Somehow I missed this issue for a very long time. So far it seems to have worked just fine as it is, but I wonder if @simonelsasser has some comments on this?
