This is the collection of code used in the analysis of European POPRES data
(see Nelson et al., http://www.ncbi.nlm.nih.gov/pubmed/18760391;
obtain access to the data at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000145.v1.p1 )
for the paper "The geography of recent genetic ancestry across Europe",
by Peter Ralph and Graham Coop,
published May 7, 2013, in PLoS Biology, at
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001555
To see what makes what, see the makefile.
Most of the files are fairly clean, but a few contain some clutter.
None of these will do anything without first obtaining the POPRES data and running BEAGLE (http://faculty.washington.edu/browning/beagle/beagle.html) on them.
Please use these if they are useful, and credit us where appropriate.
--------------
If you are looking to use the inference method, look for the functions sinv() and squoosh().
Also, linv() gives the least-squares approximation, useful for finding a good starting point.
For instance, see the block in inversion-writeup-plots.R that sets things up to infer histories for all the pairs of populations:
# Set-up: the operator L and the false positive rate estimates:
gens <- cumsum( rep(1:36, each=10) )
L <- error.disc.trans( lenbins=lenbins, gens=gens, chrlens=.chrlens[setdiff(seq_along(.chrlens),omitchroms)] ) # this takes a while!! do it once and save.
fp <- disc.fp(lenbins, chrlens=.chrlens)
fp <- runmed(fp,k=15)
# Set-up for initial least-squares estimate:
S <- as.vector( smooth( hist( blocks$maplen, breaks=c(lenbins,100), plot=FALSE )$counts ) )
S <- S/mean(S)
S[lenbins>20] <- exp( predict( lm( log(S) ~ lenbins + I(lenbins^2), subset=(lenbins>20) & (S>0) ), newdata=list(lenbins=lenbins[lenbins>20]) ) )
est.Sigma <- function ( np, X, alpha=100 ) {
# estimate Sigma by combining the population average with the smoothed, observed data
# where Sigma is the covariance matrix of X/np
smX <- as.vector( smooth(X) )
ppp <- (alpha*np/(2543640+alpha*np)) # 2543640 = total # pairs
Sigma <- ppp*smX + (1-ppp)*sum(smX)*S/sum(S)
return( Sigma/np^2 )
}
Ksvd <- svd( 1/sqrt(as.vector(S)) * L ) # pre-compute SVD of L
total.npairs <- choose( length(unique(blocks$id1)), 2 )
linverse <- linv( X=lendist/total.npairs, S=as.vector(S), Sigma=est.Sigma(np=total.npairs,X=lendist), Ksvd=Ksvd, maxk=30, fp=fp )
sminverse <- smoothed.nonneg( lS=linverse, Ksvd=Ksvd, bestk=15, lam=1 )
# Here is where we get the actual estimate:
ans <- sinv( lendist, sminverse, L, fp, npairs=total.npairs, lam=0, gamma=1e6 )
# squoosh iterates sinv, looking for the smoothest solution that drops no more than 2 units of log-likelihood.
# (Named 'smoothed' rather than 'smooth' so as not to mask stats::smooth(), used above.)
smoothed <- squoosh( lendist, xinit=ans, L, fp, npairs=ans$npairs, minlen=minlen, gamma=100, lam=0, fitfn="loglik", relthresh=2/ans$npairs, twid=.05 )
--------------
Notes:
* The .gmap files provide a mapping from the endpoint indices that BEAGLE uses to genetic distances (e.g. in centiMorgans). Important: BEAGLE uses 0-based indices.
* BEAGLE 4.0 has since been released, and is improved over the version we used (3.3).
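To illustrate the 0-based index point above, here is a minimal R sketch of looking up genetic positions from an index-to-cM table. The two-column layout (endpoint index, cumulative position in cM) and the example values are assumptions for illustration, not the actual .gmap format; check the files themselves.

```r
# Hypothetical lookup table: 'index' holds BEAGLE's 0-based endpoint indices,
# 'cM' the cumulative genetic position in centiMorgans.
# (Both the layout and the numbers are illustrative assumptions.)
gmap <- data.frame(
    index = 0:4,
    cM    = c(0.0, 0.12, 0.37, 0.80, 1.25)
)

# Genetic map length of a block spanning BEAGLE endpoints 1 through 3.
# BEAGLE's indices are 0-based but R vectors are 1-based, so add 1 when indexing.
start <- 1
end   <- 3
maplen <- gmap$cM[end + 1] - gmap$cM[start + 1]
maplen  # 0.68 cM
```

Forgetting the +1 shift silently returns the length of the wrong interval, which is why the note above flags the 0-based convention.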