Map reads to rodrep #63

Open
cnluzon opened this issue Sep 25, 2020 · 2 comments

Comments

@cnluzon
Collaborator

cnluzon commented Sep 25, 2020

Sometimes we are interested in mapping reads to repetitive references to get an idea of repetitive element representation in the sample.

However, this differs from mapping to a genomic reference: no bigwig files or other downstream outputs are generated. It is just a step that maps the reads and runs idxstats/flagstat on the resulting BAM file. Ideally, these values would be included in the mapping report (as the global number of reads mapped to a given reference) and in an extra report consisting of a table of counts per sample and reference (as in the idxstats file).
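As a rough illustration of the "table of counts per sample and reference" idea, here is a minimal Python sketch. It assumes the usual four tab-separated columns of `samtools idxstats` output (reference name, sequence length, mapped read-segments, unmapped read-segments). The function names `parse_idxstats` and `counts_table` are made up for this example and are not part of the pipeline.

```python
import csv
import io
from collections import defaultdict


def parse_idxstats(text):
    """Return {reference_name: mapped_reads} from `samtools idxstats` output."""
    counts = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        name, _length, mapped, _unmapped = row
        if name == "*":
            continue  # the "*" line only carries unplaced unmapped reads
        counts[name] = int(mapped)
    return counts


def counts_table(idxstats_by_sample):
    """Combine per-sample idxstats text into a header and one row per reference."""
    samples = sorted(idxstats_by_sample)
    table = defaultdict(dict)
    for sample in samples:
        for name, mapped in parse_idxstats(idxstats_by_sample[sample]).items():
            table[name][sample] = mapped
    header = ["reference"] + samples
    # References missing from a sample are reported as 0 (an assumption).
    rows = [[name] + [table[name].get(s, 0) for s in samples] for name in sorted(table)]
    return header, rows
```

The per-reference sums of such a table would also give the global number of mapped reads for the mapping report.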

This can be done by adding optional extra reference indexes to the config.yaml and adding the necessary extra steps.
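One way this could look, sketched in Python: the configuration is assumed to be already loaded into a dict, and the key `extra_references`, its `name`/`index` fields, and the `extra/<reference>/<sample>.bam` layout are all hypothetical. The mapping command itself is left out, since the mapper is not specified here; only the `samtools idxstats` and `samtools flagstat` calls are spelled out.

```python
import shlex


def extra_reference_jobs(config, samples):
    """Yield one job description per (extra reference, sample) pair."""
    # "extra_references" is a hypothetical optional key; if it is absent,
    # no extra steps are planned and the pipeline behaves as before.
    for ref in config.get("extra_references", []):
        for sample in samples:
            bam = f"extra/{ref['name']}/{sample}.bam"  # assumed output layout
            yield {
                "sample": sample,
                "reference": ref["name"],
                "index": ref["index"],
                "bam": bam,
                # idxstats expects the BAM file to be indexed beforehand.
                "stats_commands": [
                    f"samtools idxstats {shlex.quote(bam)}",
                    f"samtools flagstat {shlex.quote(bam)}",
                ],
            }
```

Keeping the key optional means existing configurations would not need to change.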

@marcelm
Collaborator

marcelm commented Oct 8, 2020

So if I understand correctly, "rodrep" refers to the part of RepBase that covers repeats in rodents.

If we add such a step, we have to exclude it from automated testing because of the way RepBase is licensed; see this Bioinformatics Stack Exchange question for a discussion.

Due to the licensing problems, we should IMO invest a little bit of time into investigating whether it would be possible to use some alternative as discussed in the answers to the SE question. Perhaps Dfam as suggested there works.

@cnluzon
Collaborator Author

cnluzon commented Oct 9, 2020

> Due to the licensing problems, we should IMO invest a little bit of time into investigating whether it would be possible to use some alternative as discussed in the answers to the SE question. Perhaps Dfam as suggested there works.

I agree. I have had Dfam on my radar for a while because of this.

Keep in mind that the functionality itself (a mapping step to a reference that is not necessarily a genome, where we are only interested in counts and not in bigwig files) can be conceived independently of where the data comes from.

It is true that we need some kind of dataset for testing, and if there is no useful open alternative (in case Dfam doesn't work), maybe there is no point in implementing it. But I guess we could also come up with a useful self-generated annotation.

There are other situations where this mapping option could be of use:

  • If set to another genome, it allows accounting for some degree of cross-contamination. We have had experimental settings before where this was meaningful because of how the cells were grown: we really wanted to map to hg38 but also account for mm9 mappings, for instance.
  • This may be redundant with the cutadapt step anyway, but: decoy-like sequences such as the ones present in Analysis sets. In this case they would not work as decoys; it would be some kind of QC.
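For QC uses like these, the per-reference counts could be turned into mapped fractions. The following is only a sketch: the function names and the 5% threshold are arbitrary assumptions. Since reads would be mapped to each reference independently, the fractions can overlap and need not sum to 1.

```python
def mapped_fractions(total_reads, mapped_by_reference):
    """Return {reference: fraction of total_reads mapped to it}."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {ref: mapped / total_reads for ref, mapped in mapped_by_reference.items()}


def flag_secondary(fractions, primary, threshold=0.05):
    """List non-primary references whose mapped fraction reaches the threshold."""
    # The default threshold is an arbitrary placeholder, not a recommendation.
    return sorted(
        ref for ref, frac in fractions.items()
        if ref != primary and frac >= threshold
    )
```

For example, with hg38 as the primary reference, a sample where a noticeable share of reads also maps to mm9 would be listed by `flag_secondary`.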
