Q: --replace on a subset of fasta records #348

sklages · 2022-11-18T11:35:24Z

Prerequisites

make sure you're using the latest version by seqkit version
read the usage

Great piece of software. It's my personal Swiss Army knife for sequencing data ;-)

Is it possible to use replace on a given subset (fasta IDs) of a larger file?

E.g. I have a fasta file with 1000 contigs: I want to mask the first five bases of just ten (known by ID) contigs and leave the other contigs untouched.

seqkit replace \
  --by-seq \
  --pattern "^.{5}" \
  --replacement 'nnnnn' \
  input.fa > output_mod.fa

This acts on all contigs in input.fa. What I am actually missing is a kind of filter option, like --file <str>, that provides the FASTA record IDs on which the replace command should work. All other contigs should be printed unaltered.

I could achieve that with separate seqkit commands. But not in a (simple) pipe AFAICS .. and not with a single seqkit command.

Did I miss something? Any idea how to achieve this with seqkit only in a simple way?


shenwei356 · 2022-11-18T12:49:48Z

I get it. It may be useful for others too.

"I could achieve that with separate seqkit commands."

Yes, it could be achieved by:

seqkit grep -f ids.txt input.fa    -o to_edit.fa
seqkit grep -f ids.txt input.fa -v -o left_seqs.fa

seqkit replace -p xx -r xx  to_edit.fa -o edited.fa

cat edited.fa left_seqs.fa > result.fa


shenwei356 · 2023-03-14T14:15:16Z

Added. Please run some tests (I've done some).

      --f-by-name                [target filter] match by full name instead of just ID
      --f-by-seq                 [target filter] search subseq on seq, both positive and negative strand are searched, and mismatch allowed using flag -m/--max-mismatch
      --f-ignore-case            [target filter] ignore case
      --f-invert-match           [target filter] invert the sense of matching, to select non-matching records
      --f-only-positive-strand   [target filter] only search on positive strand
      --f-pattern strings        [target filter] search pattern (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"'))
      --f-pattern-file string    [target filter] pattern file (one record per line)
      --f-use-regexp             [target filter] patterns are regular expression
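
For the use case in the original question, the new flags could be combined roughly as follows. This is a minimal sketch, not a tested recipe from the author: it assumes a seqkit build that already includes the target-filter flags listed above (presumably v2.4.0 or later, going by the bioconda update referenced in this thread), and that records not selected by the filter are written out unchanged, as requested in the question. The function name mask_subset is hypothetical; ids.txt is assumed to hold one record ID per line.

```shell
# Sketch: mask the first five bases of only those records whose IDs are
# listed in an ID file. All flags are taken from this thread; the helper
# name "mask_subset" is made up for illustration.
mask_subset() {
  local ids="$1" in="$2" out="$3"
  seqkit replace \
    --by-seq \
    --pattern '^.{5}' \
    --replacement 'nnnnn' \
    --f-pattern-file "$ids" \
    "$in" -o "$out"
}

# Example call with the file names used earlier in the thread:
# mask_subset ids.txt input.fa output_mod.fa
```

Compared with the grep/replace/cat workaround, a single command like this would not need intermediate files, and the result would not depend on the order in which the pieces are concatenated.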

sklages · 2023-03-19T17:15:08Z

@shenwei356 - some quick tests showed that it works fine. Thank you!

shenwei356 added the new feature label Nov 18, 2022

shenwei356 added a commit that referenced this issue Mar 14, 2023

replace: add some flags similar to those in "seqkit grep" to choose p…

ff417fa

…artly records to edit. #348

shenwei356 closed this as completed Mar 15, 2023

shenwei356 mentioned this issue Mar 17, 2023

Update SeqKit to v2.4.0 bioconda/bioconda-recipes#39957

Merged
