Skip to content

Commit

Permalink
[MISC] Remove field::alignment from SAM file.
Browse files Browse the repository at this point in the history
  • Loading branch information
smehringer committed Nov 1, 2022
1 parent 334b790 commit f19b5cb
Show file tree
Hide file tree
Showing 37 changed files with 329 additions and 1,290 deletions.
9 changes: 1 addition & 8 deletions doc/cookbook/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -137,14 +137,7 @@ To search for either 1 insertion or 1 deletion you can use the seqan3::search_cf

\snippet doc/tutorial/09_search/search_small_snippets.cpp error_search

# Reading the CIGAR information into an actual alignment

In SeqAn, the conversion from a CIGAR string to an alignment (two `seqan3::aligned_sequence`s) is done automatically for
you. You can access the resulting alignment via seqan3::sam_record::alignment:

\snippet doc/tutorial/10_sam_file/sam_file_alignments_without_ref.cpp main

# Combining sequence and alignment files
# Reading the CIGAR information from a SAM file and constructing an alignment

This recipe can be used to:
1. Read in a FASTA file with the reference and a SAM file with the alignment
Expand Down
22 changes: 19 additions & 3 deletions doc/fragments/io_sam_file_input.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,13 +9,14 @@ of the file name. Constructing from a stream can be used if you have a non-file
std::istringstream. It also comes in handy, if you cannot use file-extension based detection, but know that
your input file has a certain format.
<br><br>
The reference information is specific to the SAM format. The SAM format only stores a "semi-alignment" meaning that
it has the query sequence and the cigar string representing the gap information but not the reference information.
If you want to retrieve valid/full alignments, you need to pass the corresponding reference information:
Passing reference information, e.g.

- ref_ids: The name of the references, e.g. "chr1", "chr2", ...
- ref_sequences: The reference sequence information **in the same order as the ref_ids**.

comes in handy once you want to convert the CIGAR string, read from your file, into an actual alignment.
This will be covered in the section \ref transform_cigar "Transforming the CIGAR information into an actual alignment".

In most cases the template parameters are deduced automatically:

\include test/snippet/io/sam_file/sam_file_input_construction_from_filename.cpp
Expand Down Expand Up @@ -80,6 +81,21 @@ of seqan3::sam_file_input::mapq_type.

\note But beware: with structured bindings you do need to get the order of elements correctly!

#### Transforming the CIGAR information into an actual alignment
\anchor transform_cigar

In SeqAn, we represent an alignment as a tuple of two `seqan3::aligned_sequence`s.

The conversion from a CIGAR string to an alignment can be done with the function `seqan3::alignment_from_cigar`.
You need to pass the reference sequence with the position the read was aligned to and the read sequence. All
of it is already in the `record` when reading a SAM file:

\include snippet/alignment/cigar_conversion/alignment_from_cigar_io.cpp

The code will print the following:

\include snippet/alignment/cigar_conversion/alignment_from_cigar_io.err

#### Views on files

Since SeqAn files are ranges, you can also create views over files. A useful example is to filter the records
Expand Down
4 changes: 2 additions & 2 deletions doc/fragments/io_sam_file_output.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ opening from stream it would not have.

\include test/snippet/io/sam_file/record_based_writing.cpp

The easiest way to write to an alignment file is to use the `push_back()` member functions. These
The easiest way to write to a SAM/BAM file is to use the `push_back()` member functions. These
work similarly to how they work on a std::vector.
You may also use a tuple like interface or the `emplace_back()`
function but this is not recommended since one would have to keep track of the
Expand All @@ -57,7 +57,7 @@ writing to another, because you don't have to configure the output file to match

\include test/snippet/io/sam_file/sam_file_output_custom_fields.cpp

This will copy the \ref seqan3::field "seqan3::field::flag" and \ref seqan3::field "seqan3::field::ref_offset" value
This will copy the \ref seqan3::field "seqan3::field::flag" and \ref seqan3::field "seqan3::field::ref_offset" value
into the new output file.

\note Note that the other SAM columns in the output file will have a default value, so unless you specify to read
Expand Down
61 changes: 28 additions & 33 deletions doc/tutorial/10_sam_file/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,19 +58,19 @@ differences in most use cases.

To make things clearer, here is the table of SAM columns and the corresponding fields of a SAM file record:

| # | SAM Column ID | FIELD name | seqan3::field |
|:--:|:--------------|:-----------------------------------------------------------------------------------------------------|:-------------------------------------------------|
| 1 | QNAME | seqan3::sam_record::id | seqan3::field::id |
| 2 | FLAG | seqan3::sam_record::flag | seqan3::field::flag |
| 3 | RNAME | seqan3::sam_record::reference_id | seqan3::field::ref_id |
| 4 | POS | seqan3::sam_record::reference_position | seqan3::field::ref_offset |
| 5 | MAPQ | seqan3::sam_record::mapping_quality | seqan3::field::mapq |
| 6 | CIGAR | implicitly stored in seqan3::sam_record::alignment <br> explicitly stored in seqan3::sam_record::cigar_sequence | seqan3::field::alignment <br> seqan3::field::cigar |
| 7 | RNEXT | seqan3::sam_record::mate_reference_id | seqan3::field::mate |
| 8 | PNEXT | seqan3::sam_record::mate_position | seqan3::field::mate |
| 9 | TLEN | seqan3::sam_record::template_length | seqan3::field::mate |
| 10 | SEQ | seqan3::sam_record::sequence | seqan3::field::seq |
| 11 | QUAL | seqan3::sam_record::base_qualities | seqan3::field::qual |
| # | SAM Column ID | FIELD name | seqan3::field |
|:--:|:--------------|:---------------------------------------|:------------------------- |
| 1 | QNAME | seqan3::sam_record::id | seqan3::field::id |
| 2 | FLAG | seqan3::sam_record::flag | seqan3::field::flag |
| 3 | RNAME | seqan3::sam_record::reference_id | seqan3::field::ref_id |
| 4 | POS | seqan3::sam_record::reference_position | seqan3::field::ref_offset |
| 5 | MAPQ | seqan3::sam_record::mapping_quality | seqan3::field::mapq |
| 6 | CIGAR | seqan3::sam_record::cigar_sequence | seqan3::field::cigar |
| 7 | RNEXT | seqan3::sam_record::mate_reference_id | seqan3::field::mate |
| 8 | PNEXT | seqan3::sam_record::mate_position | seqan3::field::mate |
| 9 | TLEN | seqan3::sam_record::template_length | seqan3::field::mate |
| 10 | SEQ | seqan3::sam_record::sequence | seqan3::field::seq |
| 11 | QUAL | seqan3::sam_record::base_qualities | seqan3::field::qual |

SAM files provide following additional fields:
* seqan3::sam_record::sequence_position (seqan3::field::offset)
Expand Down Expand Up @@ -138,14 +138,12 @@ Average: 27.4

\endsolution

# Alignment representation in the SAM format

In SeqAn, we represent an alignment as a tuple of two `seqan3::aligned_sequence`s.
# Alignment representation in SAM/BAM files

The SAM format is the common output format of read mappers where you align short read sequences
to one or more large reference sequences.
In fact, the SAM format stores those alignment information only partially:
It **does not store the reference sequence** but only the read sequence and a *CIGAR* string
It **does not store the reference sequence** but only the query/read sequence and a *CIGAR* string
representing the alignment based on the read.

Take this SAM record as an example:
Expand Down Expand Up @@ -183,26 +181,17 @@ into a std::vector\<seqan3::cigar\>:

\snippet doc/tutorial/10_sam_file/sam_file_read_cigar.cpp main

## Reading the CIGAR information into an actual alignment

In SeqAn, the conversion from a CIGAR string to an alignment (two seqan3::aligned_sequence's) is done automatically for you.
You can access it by accessing seqan3::sam_record::alignment from the record:

\snippet doc/tutorial/10_sam_file/sam_file_alignments_without_ref.cpp main
## Transforming the CIGAR information into an alignment

In the example above, you can only safely access the aligned read.

\attention The **unknown** aligned reference sequence at the first position in the alignment tuple cannot be accessed
(e.g. via the `operator[]`). It is represented by a dummy type that throws on access.
In SeqAn, we represent an alignment as a tuple of two `seqan3::aligned_sequence`s.

Although the SAM format does not handle reference sequence information,
you can provide these information to the seqan3::sam_file_input which automatically fills the alignment object.
You can pass reference ids and reference sequences as additional constructor parameters:
The conversion from a CIGAR string to an alignment can be done with the function `seqan3::alignment_from_cigar`.
You need to pass the reference sequence with the position the read was aligned to and the read sequence:

\snippet doc/tutorial/10_sam_file/sam_file_alignments_with_ref.cpp main
\include snippet/alignment/cigar_conversion/alignment_from_cigar_io.cpp

The code will print the following:
\include doc/tutorial/10_sam_file/sam_file_sam_record.err
\include snippet/alignment/cigar_conversion/alignment_from_cigar_io.err

\assignment{Assignment 2: Combining sequence and alignment files}

Expand All @@ -218,14 +207,20 @@ Only use
* seqan3::sam_record::id,
* seqan3::sam_record::reference_id,
* seqan3::sam_record::mapping_quality, and
* seqan3::sam_record::alignment.
* seqan3::sam_record::cigar.

With that information do the following:
* Filter the alignment records and only take those with a mapping quality >= 30.
(Take a look at the tutorial \ref sequence_file_section_fun_with_ranges for a reminder about using views on files)
* For the resulting alignments, print which read was mapped against which reference id and
the number of `seqan3::gap`s in each sequence (aligned reference and read sequence).

\hint
Given your reference sequences, you need to know which reference sequence to use for the alignment.
For this purpose, you can access `record.reference_id()` which gives you the position of the respective reference
sequence.
\endhint

Your program should print the following:

```
Expand Down
30 changes: 0 additions & 30 deletions doc/tutorial/10_sam_file/sam_file_alignments_with_ref.cpp

This file was deleted.

5 changes: 0 additions & 5 deletions doc/tutorial/10_sam_file/sam_file_alignments_with_ref.err

This file was deleted.

25 changes: 0 additions & 25 deletions doc/tutorial/10_sam_file/sam_file_alignments_without_ref.cpp

This file was deleted.

5 changes: 0 additions & 5 deletions doc/tutorial/10_sam_file/sam_file_alignments_without_ref.err

This file was deleted.

10 changes: 8 additions & 2 deletions doc/tutorial/10_sam_file/sam_file_solution2.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ r003 2064 chr2 18 10 5M * 0 0 TAGGC *
#include <string>
#include <vector>

#include <seqan3/alignment/cigar_conversion/alignment_from_cigar.hpp>
#include <seqan3/alphabet/gap/gap.hpp>
#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/core/debug_stream.hpp>
Expand Down Expand Up @@ -59,14 +60,19 @@ int main()

for (auto & record : mapping_file | mapq_filter)
{
auto alignment = seqan3::alignment_from_cigar(record.cigar_sequence(),
reference_sequences[record.reference_id().value()],
record.reference_position().value(),
record.sequence());

// as loop
size_t sum_reference{};
for (auto const & char_reference : std::get<0>(record.alignment()))
for (auto const & char_reference : std::get<0>(alignment))
if (char_reference == seqan3::gap{})
++sum_reference;

// or via std::ranges::count
size_t sum_read = std::ranges::count(std::get<1>(record.alignment()), seqan3::gap{});
size_t sum_read = std::ranges::count(std::get<1>(alignment), seqan3::gap{});

// The reference_id is ZERO based and an optional. -1 is represented by std::nullopt (= reference not known).
std::optional reference_id = record.reference_id();
Expand Down
22 changes: 7 additions & 15 deletions doc/tutorial/10_sam_file/sam_file_writing.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5,35 +5,27 @@ seqan3::test::create_temporary_snippet_file example_sam{"out.sam", ""};
//![main]
#include <seqan3/io/sam_file/all.hpp>

using aligned_sequence_type = std::vector<seqan3::gapped<seqan3::dna5>>;
using alignment_type = std::pair<aligned_sequence_type, aligned_sequence_type>;

using types = seqan3::type_list<std::vector<seqan3::dna5>, std::string, alignment_type>;
using fields = seqan3::fields<seqan3::field::seq, seqan3::field::id, seqan3::field::alignment>;

int main()
{
using namespace seqan3::literals;

auto filename = std::filesystem::current_path() / "out.sam";

seqan3::sam_file_output fout{filename};

using types = seqan3::type_list<std::vector<seqan3::dna5>, std::string, std::vector<seqan3::cigar>>;
using fields = seqan3::fields<seqan3::field::seq, seqan3::field::id, seqan3::field::cigar>;
using sam_record_type = seqan3::sam_record<types, fields>;

// write the following to the file
// r001 0 * 0 0 4M2I2M2D * 0 0 ACGTACGT *
sam_record_type record{};
record.id() = "r001";
record.sequence() = "ACGTACGT"_dna5;
auto & [reference_sequence, read_sequence] = record.alignment();

// ACGT--GTTT
seqan3::assign_unaligned(reference_sequence, "ACGTGTTT"_dna5);
seqan3::insert_gap(reference_sequence, reference_sequence.begin() + 4, 2);

// ACGTACGT--
seqan3::assign_unaligned(read_sequence, record.sequence());
seqan3::insert_gap(read_sequence, read_sequence.end(), 2);
record.cigar_sequence() = {{4, 'M'_cigar_operation},
{2, 'I'_cigar_operation},
{2, 'M'_cigar_operation},
{2, 'D'_cigar_operation}};

fout.push_back(record);
}
Expand Down
7 changes: 4 additions & 3 deletions doc/tutorial/11_read_mapper/read_mapper_step4.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
# include <fstream>
# include <span>

# include <seqan3/alignment/cigar_conversion/cigar_from_alignment.hpp>
# include <seqan3/alignment/configuration/all.hpp>
# include <seqan3/alignment/pairwise/align_pairwise.hpp>
# include <seqan3/argument_parser/all.hpp>
Expand Down Expand Up @@ -53,7 +54,7 @@ void map_reads(std::filesystem::path const & query_path,
seqan3::field::id,
seqan3::field::ref_id,
seqan3::field::ref_offset,
seqan3::field::alignment,
seqan3::field::cigar,
seqan3::field::qual,
seqan3::field::mapq>{}};
//! [sam_file_output]
Expand All @@ -80,15 +81,15 @@ void map_reads(std::filesystem::path const & query_path,

for (auto && alignment : seqan3::align_pairwise(std::tie(text_view, query), align_config))
{
auto aligned_seq = alignment.alignment();
auto cigar = seqan3::cigar_from_alignment(alignment.alignment());
size_t ref_offset = alignment.sequence1_begin_position() + 2 + start;
size_t map_qual = 60u + alignment.score();

sam_out.emplace_back(query,
record.id(),
storage.ids[result.reference_id()],
ref_offset,
aligned_seq,
cigar,
record.base_qualities(),
map_qual);
}
Expand Down
4 changes: 1 addition & 3 deletions include/seqan3/alphabet/cigar/cigar.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -84,12 +84,10 @@ class cigar : public alphabet_tuple_base<cigar, uint32_t, exposition_only::cigar
* Example usage:
* \include test/snippet/alphabet/cigar/cigar_operation.cpp
*
* \if DEV
* \note Usually you do not want to manipulate cigar elements and vectors on
* your own but convert an alignment to a cigar and back. See
* seqan3::detail::get_cigar_vector for how to convert two aligned sequences into
* seqan3::cigar_from_alignment for how to convert two aligned sequences into
* a cigar_vector.
* \endif
*
* \sa https://samtools.github.io/hts-specs/SAMv1.pdf#page=8
*
Expand Down
Loading

0 comments on commit f19b5cb

Please sign in to comment.