Q: --replace on a subset of fasta records #348

Closed
2 tasks done
sklages opened this issue Nov 18, 2022 · 3 comments

Comments

@sklages

sklages commented Nov 18, 2022

Prerequisites

  • make sure you're using the latest version by running seqkit version
  • read the usage

Great piece of software. It's my personal Swiss Army knife for sequencing data ;-)

Is it possible to use replace on a given subset (fasta IDs) of a larger file?

E.g. I have a fasta file with 1000 contigs: I want to mask the first five bases of just ten (known by ID) contigs and leave the other contigs untouched.

seqkit replace \
  --by-seq \
  --pattern "^.{5}" \
  --replacement 'nnnnn' \
  input.fa > output_mod.fa

This acts on all contigs in input.fa. What I am missing is a filter option such as --file <str> that supplies the FASTA record IDs the replace command should act on. All other contigs should be printed unaltered.

I could achieve that with separate seqkit commands. But as far as I can see, not in a (simple) pipe and not with a single seqkit command.

Did I miss something? Any idea how to achieve this with seqkit only in a simple way?

@shenwei356
Owner

I get it. It may be useful for others too.

> I could achieve that with separate seqkit commands.

Yes, it could be achieved by:

seqkit grep -f ids.txt input.fa    -o to_edit.fa
seqkit grep -f ids.txt input.fa -v -o left_seqs.fa

seqkit replace -p xx -r xx  to_edit.fa -o edited.fa

cat edited.fa left_seqs.fa > result.fa
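
Note that the final cat writes the edited records ahead of the untouched ones, so result.fa generally does not keep the record order of input.fa. As a rough illustration of the behaviour being asked for (edit only the listed IDs, pass everything else through in place), here is a minimal sketch in plain awk rather than seqkit. The file names and demo records are hypothetical, and it assumes that ids.txt is non-empty and that the first sequence line of each selected record holds at least five bases.

```shell
# Hypothetical demo data: three short records and one ID to mask.
printf '>c1 first\nACGTACGTAC\n>c2 second\nGGGGGCCCCC\n>c3 third\nTTTTTAAAAA\n' > input.fa
printf 'c2\n' > ids.txt

# First file (ids.txt): remember the IDs to edit.
# Second file (input.fa): on the first sequence line after a selected
# header, overwrite the first five bases with "nnnnn"; print every
# other line unchanged, so the record order of input.fa is preserved.
awk '
    NR == FNR { ids[$1] = 1; next }    # still reading ids.txt (must be non-empty)
    /^>/      { id = substr($1, 2); todo = (id in ids); print; next }
    todo      { $0 = "nnnnn" substr($0, 6); todo = 0 }
              { print }
' ids.txt input.fa > output_mod.fa
```

With the demo data above, only the sequence of c2 changes (to nnnnnCCCCC); c1 and c3 are written out as they were read.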

@shenwei356
Owner

Added. Please test it (I've run some tests myself).

      --f-by-name                [target filter] match by full name instead of just ID
      --f-by-seq                 [target filter] search subseq on seq, both positive and negative strand are searched, and mismatch allowed using flag -m/--max-mismatch
      --f-ignore-case            [target filter] ignore case
      --f-invert-match           [target filter] invert the sense of matching, to select non-matching records
      --f-only-positive-strand   [target filter] only search on positive strand
      --f-pattern strings        [target filter] search pattern (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"'))
      --f-pattern-file string    [target filter] pattern file (one record per line)
      --f-use-regexp             [target filter] patterns are regular expression
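
Going by the flag descriptions above, the command from the original question could probably be restricted to a list of IDs as sketched below. This is an untested assumption about how the flags combine, not a confirmed recipe: it presumes a seqkit build that ships the --f-* flags, and ids.txt (one record ID per line) and the demo records are hypothetical.

```shell
# Hypothetical demo data: three short records and one ID to edit.
printf '>c1 first\nACGTACGTAC\n>c2 second\nGGGGGCCCCC\n>c3 third\nTTTTTAAAAA\n' > input.fa
printf 'c2\n' > ids.txt

# Same replacement as in the question, but limited to the records whose
# IDs appear in ids.txt via the [target filter] flag --f-pattern-file.
# Assumption: records that do not match the filter are written unaltered.
seqkit replace \
  --by-seq \
  --pattern "^.{5}" \
  --replacement 'nnnnn' \
  --f-pattern-file ids.txt \
  input.fa > output_mod.fa
```

If the filter behaves as described, this replaces the two grep calls, the separate replace, and the final cat with a single command, and it should keep the records in their original order.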

@sklages
Author

sklages commented Mar 19, 2023

@shenwei356 - some quick tests showed that it works fine. Thank you!
