Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to interpret bedGraph files after conversion from bigWig #4

Open
sean-la opened this issue Jun 7, 2017 · 3 comments
Open

How to interpret bedGraph files after conversion from bigWig #4

sean-la opened this issue Jun 7, 2017 · 3 comments

Comments

@sean-la
Copy link
Collaborator

sean-la commented Jun 7, 2017

Hi all,

I've converted the mapability bigWig files to bedGraph format to prepare them for liftover to h38 and I'm doing some manual sanity checks to make sure I converted them properly. They don't look right, however. For example, this is an excerpt of the 50mer mapability data for chromosome 17 of the human genome.

chr17   0       145     1
chr17   145     153     0.5
chr17   153     200     1
chr17   200     208     0.5
chr17   208     881     1
chr17   881     882     0.166667
chr17   882     883     0.333333
chr17   883     884     0.5
chr17   884     886     0.25
chr17   886     887     0.1
chr17   887     890     0.333333
chr17   890     893     0.25
chr17   893     894     0.125
chr17   894     895     0.166667
chr17   895     896     0.125
chr17   896     898     0.166667
chr17   898     900     0.1

If I understand the correctly, the start position and stop positions should have a difference of 50 bp since this from the 50mer mapability file. However, as you can see, many of the start and stop positions have differences that are not equal to 50. Am I interpretting this correctly?

@fritzsedlazeck
Copy link
Collaborator

Hi,
50mer mappability means that the kmer is 50bp. So these scores are with respect to the uniqueness of 50bp sequences.
Cheers
Fritz

@sean-la
Copy link
Collaborator Author

sean-la commented Jun 7, 2017

Could you explain what you mean by "these scores are with respect to the uniqueness of the 50bp sequences"?

@fritzsedlazeck
Copy link
Collaborator

Sure sorry for the short reply.
So mappability is basically coming from the mapping world, where it represents a measurement of how uniquely a read can be placed. For simplicity think about the distance of the best scoring and the second best scoring alignment of a given read in comparison to the reference genome. If that distance is small the mappability will be small, which means its not clear from which of these regions the read is coming from. Now it is more likely to find a 50bp or smaller stretch of sequence identical compared to a e.g. 100bp stretch. From a naive point: 4^50 << 4^100 which is the probability to see the exact sequence (given some naive assumptions)

To make this more comparable and identify hard to map to regions, people computed such mappability tracks. Now that must always be with respect of some read or sequence length. In this case the 50mer mappability means it is computed with respect to a sequence length of 50bp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants