Public mean coverage data #15

Open
adriafarres opened this issue Jun 12, 2023 · 4 comments

Comments

@adriafarres

Hello,

Is there any public dataset or website where one can download an already pre-computed mean_coverage.txt file? I have a very small dataset for which I'm trying to compute CNVs.

Thank you

adriafarres changed the title from "Public coverage data" to "Public mean coverage data" on Jun 12, 2023
@tf2
Owner

tf2 commented Jun 12, 2023

Sure - I don't think it would make sense to use one computed on a different dataset from your own. The file is only used to exclude positions from CNV calling (based on the variability of each position). CNest is not really designed to operate on very small datasets; having said that, it probably will work quite well. How many samples do you have?

There would be a way to skip this and just use all positions... or, if you have a fair number of samples, it will probably still work OK.
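To illustrate the idea of excluding positions based on their variability across samples, here is a minimal sketch. It is not CNest's actual implementation: the function name `variable_positions`, the use of the coefficient of variation, and the `max_cv` threshold are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def variable_positions(coverage, max_cv=0.3):
    """Return indices of positions whose coverage varies too much across samples.

    coverage[s][p] is the normalised coverage of sample s at position p.
    max_cv is a hypothetical cut-off on the coefficient of variation.
    """
    n_positions = len(coverage[0])
    excluded = []
    for p in range(n_positions):
        values = [sample[p] for sample in coverage]
        mu = mean(values)
        # Positions with zero mean coverage are treated as maximally variable.
        cv = pstdev(values) / mu if mu > 0 else float("inf")
        if cv > max_cv:
            excluded.append(p)
    return excluded


# Toy data: 4 samples x 3 positions; only the middle position is noisy.
coverage = [
    [1.00, 0.20, 1.05],
    [0.95, 1.80, 1.00],
    [1.05, 1.00, 0.95],
    [1.00, 0.10, 1.00],
]
print(variable_positions(coverage))  # -> [1]
```

With very few samples, a per-position variability estimate like this is unreliable, which is why a file computed on someone else's dataset is not a meaningful substitute.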

@adriafarres
Author

adriafarres commented Jun 12, 2023

Thank you for your reply, Tomas.

As of right now I'm interested in finding CNVs for 3 genomes. It can't even be considered a dataset haha. One of these genomes was sequenced by Dante Labs. They provide a list of CNVs that are obtained with Dragen, Illumina's caller (the other ones didn't come with CNVs). After annotating those CNVs and filtering them by haploinsufficiency, I noticed that there's a bunch of them in genes that are highly haploinsufficient and in regions that are extremely conserved (according to gnomAD and the data from this study).

Furthermore, I annotated the others with CNVpytor (without mean coverage) and the CNVs are vastly distinct, which is highly suspicious considering those genomes belong to siblings. So at this point I don't know if the callers (or CNV callers in general) are very imprecise, if the lack of mean coverage really affects the output, or if Dragen's CNVs are actually correct. I was hoping I could get a second opinion on those CNVs by running CNest.

Maybe you can shed some light on this.

Thank you.

@tf2
Owner

tf2 commented Jun 15, 2023

I'm afraid to say that with only 3 genomes CNest is not really appropriate to use. It needs to estimate a baseline and certain other noise characteristics, and 3 samples is just not going to be enough. Another complication is that these genomes seem to be related (siblings, right?). This is going to be tricky because CNest (and, I believe, most CNV callers) uses the other samples in the set to create a baseline. If the set contains only related samples, it is very likely that real CNV events will be normalised out, i.e. a deletion seen in most of the samples will look like normal copy number (2 copies).
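The normalisation effect described above can be shown with a toy example. This is only a sketch and not how CNest computes its baseline: it assumes a per-bin median across samples as the baseline, and the function `log2_ratios` and all depth values are made up for illustration.

```python
from math import log2
from statistics import median


def log2_ratios(depths):
    """Return log2(depth / baseline) per sample and bin.

    depths[s][b] is the normalised depth of sample s in bin b.
    The baseline is assumed to be the per-bin median across all samples.
    """
    n_bins = len(depths[0])
    baseline = [median([sample[b] for sample in depths]) for b in range(n_bins)]
    return [
        [log2(sample[b] / baseline[b]) for b in range(n_bins)]
        for sample in depths
    ]


# Three siblings sharing a heterozygous deletion in bin 1 (depth 0.5 instead of 1.0).
siblings = [[1.0, 0.5, 1.0] for _ in range(3)]
# Five hypothetical unrelated samples with normal copy number everywhere.
unrelated = [[1.0, 1.0, 1.0] for _ in range(5)]

# Baseline built from the siblings alone: the shared deletion disappears.
print(log2_ratios(siblings)[0][1])              # -> 0.0
# Baseline built with unrelated samples added: the deletion is visible.
print(log2_ratios(siblings + unrelated)[0][1])  # -> -1.0
```

A log2 ratio of 0 corresponds to the baseline copy number, so in the sibling-only case the shared deletion is indistinguishable from 2 copies.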

Is there any way for you to obtain a set of e.g. 5 unrelated genomes from the same sequencing platform to help create a baseline?

@adriafarres
Author

Thank you again for your reply. I will try to do that. Can the results vary substantially when using different sequencing machines? I apologize if these questions are very basic. I have never worked with CNVs.
