analysis.Rmd

---
title: "Unclear Evidence for a Limit to Human Lifespan"
author: "Philipp Berens and Tom Wallis"
date: "October 14, 2016"
output: github_document
# output:
#   tufte::tufte_handout: default
#   # tufte::tufte_html: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(ggplot2)
library(dplyr)
library(rstan)
library(rstanarm)
library(BayesFactor)
library(XML)
library(stringr) 
```

Dong et al. claim to present statistical evidence in favor of an absolute limit to the human lifespan. Here we present a reanalysis of central figures in their paper showing that (1) the main data of their paper does not provide evidence one way or another and (2) data analyzed in the Extended Data may be incomplete but completing it showed some evidence for their claims. Nevertheless, modeling maxima of distributions by assuming Gaussian noise makes inappropriate assumptions. A new analysis using extreme value theory is required to determine whether the data provide evidence for the authors' claims. 

# The model by the authors (Figure 2a)

The authors graph the maximum age reported at death (MRAD) for each year between 1968 and 2006. We acquired the data using WebPlotDigitizer and rounded the numbers to full years (which is what likely was the case for the original data). Originally the data came from the [IDL Database](http://www.supercentenarians.org/). 

Here is the raw data, as presented by the authors, fitting separate regression for years up to 1994 and after 1995.  


```{r, echo=FALSE}
tbl <- read.csv('lifeexpectancy.csv')
tbl <- round(tbl)
tbl$Group <- factor(tbl$Year>=1995, levels = c("FALSE", "TRUE"), labels = c("<1995", ">=1995"))

```

The plot shows the raw data points in black and separate linear fits with 95%-CIs for years before and after 1995. It is not clear from the paper why the authors chose 1995 as a point to separate models. In a [response](https://publons.com/review/480517/) to a critical review on publons, they provide an analysis with a segmented regression package, choosing 1998 as break point.

We can also obtain the statistics for this model by fitting a linear model with the additional group-variable as predictor including interactions, allowing for a changes slope and offset for the data after 1995.  

```{r author_model, fig.width=4.5, fig.height=3.2}

mdl1 <- lm(Age~Year*Group,tbl)
summary.lm(mdl1)

tbl2 <- tbl
tbl2$yhat <- predict(mdl1)

plt <- ggplot(tbl2, aes(x = Year, y = Age, colour = Group)) +
  geom_point() + 
  geom_line(aes(y = yhat)) + 
  ylab('Yearly maximum reported age at death (years)')

plt
```

Consistent with the paper, the fitted model has a slope of `r signif(mdl1$coefficients['Year'],digits = 3)` years for years before 1995 and one of `r signif(mdl1$coefficients['Year']+mdl1$coefficients['Year:Group>=1995'],digits = 3)` for years afterwards (compare their Figure 2a).
We also plot the regression line from this combined model to show that it is the same as the authors' separate regression fits.

# A linear model

A simple alternative hypothesis to the claim of the authors would be that MRAD actually keeps increasing and therefore, that there is no limit to human lifespan. We are agnostic as to whether or not this model makes plausible predictions (apparently people have argued both ways), but it does represent a simple comparison model:

```{r, linear_model, echo=FALSE, fig.width=3.5, fig.height=3.2}

ggplot(tbl,aes(x=Year,y=Age)) + geom_point() + 
  geom_smooth(method=lm) + 
  ylab('Yearly maximum reported age at death (years)')  

```

The plots shows the raw data points again, with a linear regression with 95% CIs fitted to all the data. 

```{r}

mdl2 <- lm(Age~Year,tbl)
summary.lm(mdl2)

```

In this case, MRAD increases slightly by `r signif(mdl2$coefficients['Year'],digits = 2)` years per year. 

# Showing the models side-by-side

```{r combined_plot, fig.width=6, fig.height=3.2}

# prepare data:
tmp1 <- tbl
tmp2 <- tbl
tmp1$model <- "Trend break"
tmp1$yhat <- predict(mdl1)
tmp2$model <- "Linear"
tmp2$yhat <- predict(mdl2)

combined_data <- rbind(tmp1, tmp2)

plt <- ggplot(combined_data, aes(x = Year, y = Age, colour = Group)) + 
  facet_grid(~ model) + 
  geom_point() +
  geom_line(aes(y = yhat))

# appearance:
plt <- plt + 
  scale_y_continuous(name = "Yearly maximum reported age at death (years)",
                     breaks = seq(110, 126, by = 2)) + 
  ylab("Yearly maximum reported age at death (years)") +
  theme_minimal(base_size = 8) + 
  theme(panel.grid.major = element_line(colour = "grey90", size = 0.5)) + 
  theme(panel.margin = unit(2, "lines")) + 
  scale_color_manual(values = c("#4A87CB", "#E76826"), name = "") + 
  theme(legend.position = "bottom")

plt

#ggsave("combined_data_plot.pdf", width = 3.5, height = 3.2)
#ggsave("combined_data_plot.eps", width = 3.5, height = 3.2)
```


# Model comparison

Which model is better? In the paper, the authors fail to provide evidence for their model, they seem to argue that the data looks like there is a saturation effect or a decline in MRAD after 1995.

One can do better and objectively compare the two fitted models. If we look at the output of the models above, the model by the authors explains a little more variance in the data than the linear model (0.42 vs. 0.29). On the other hand, the model also uses four parameters to do so, compared to only two in the linear model.

We can therefore ask if the increase in explained variance is "worth" the additional parameters.
A number of model comparison metrics exist; in general these weigh the tradeoff between model fit and complexity differently. 
We present the results of several classical model comparison metrics below.

## Classical ANOVA

First, we consider the two models as a nested set and compare them using classical ANOVA.

```{r}
anova(mdl2, mdl1)
```

The two extra degrees of freedom in the Dong et al model does not lead to a statistically-significant improvement in residual error over the simple linear model (following the traditional $p < .05$ cutoff for significance).

## AIC

Another comparison metric with appealing theoretical links to information theory (see Burnham & Anderson, 2002) is the so-called Akaike Information Criterion (AIC). 
In the AIC, smaller values denote better models.

```{r}

AIC(mdl1)
AIC(mdl2)

```

Here, the model of Dong et al has the lower AIC and is thus the preferred model. 
However, following the heuristics suggested by Burnham & Anderson (2002, p.70), an AIC difference of `r signif(AIC(mdl1) - AIC(mdl2), digits=2)` indicates that the data provide "substantial" support for the simpler linear model.

## BIC

A related model comparison metric is the Bayesian Information Criterion (BIC), which is more conservative than AIC because it additionally penalises models with more parameters.

```{r}

BIC(mdl1)
BIC(mdl2)

```

Following Raftery (1995), a BIC difference of `r signif(BIC(mdl1) - BIC(mdl2), digits=2)` is not worth mentioning, providing no evidence of one versus the other model. 

## Bayes Factors

Finally, Bayes Factors (the ratio of the posterior model evidence), can be easily computed to compare simple linear models using the BayesFactor package (Morey & Rouder, 2015).

```{r}
bf1 <- lmBF(Age~Year*Group, tbl)
bf2 <- lmBF(Age~Year, tbl)

bf1 / bf2
```

Under these default priors (assuming for example that both models are equally likely *a priori*), the models receive approximately equal support from the data (the Dong et al model is favoured with an odds ratio of 1.24-to-1).

```{r test_bf_effects, echo=FALSE, results="hide"}
bf1_wide <- lmBF(Age~Year*Group, tbl, rscaleFixed = "ultrawide", rscaleCont = "ultrawide")
bf2_wide <- lmBF(Age~Year, tbl, rscaleFixed = "ultrawide", rscaleCont = "ultrawide")

bf1_wide / bf2_wide
```

Changing the priors of BayesFactor from "medium" to "ultrawide" on the standardised effect size scale did not appreciably affect these conclusions; with ultrawide effect-size priors the simpler model is now preferred with odds of 1.09-to-1.

Note that in the above mentioned [response](https://publons.com/review/480517/) the authors argue that the segmented regression model provides a better AIC than the linear model. We will add the segmented model here later.

## Bayesian estimation of model parameters

Here we take a Bayesian approach to model estimation, and fit the full linear model including interaction terms.
We employ a Student-t prior with a mean of zero, standard deviation of 2.5 and five degrees of freedom, which yields modest shrinkage of the coefficients towards zero, i.e. enforcing some conservatism in inference.  

We fit the models using the package `rstanarm`, which allows relatively straightforward use of Bayesian methods.  

```{r fit_stan_model}

bmdl <- stan_glm(Age~Year*Group, tbl, 
                 prior = student_t(5, 0, 2.5), 
                 prior_intercept = student_t(5, 0, 50),
                 family = gaussian(), 
                 adapt_delta = 0.99)

```

We can summarize the fitted model and plot the posterior density over the parameters:

```{r, fig.width=5, fig.height=5}

plot(bmdl,'dens')

bmdl
```

Comparing the fitted model to the frequentist models above shows that the posterior median of the linear effect of Year (`r signif(bmdl$coefficients['Year'],digits=3)`) is similar to the estimated value above (`r signif(mdl1$coefficients['Year'],digits=3)`), but shrunken towards zero by the prior. 
The posterior density on the interaction term is centered around zero during inference, arguing that there is little evidence of a different slope after 1995. 
There is a small effect of the interaction term on the y-intercept, increasing the estimated y-intercept by `r signif(mdl1$coefficients['Group>=1995'],digits=1)`. This is likely an artefact of the model parametrization.


```{r plot_bayes_preds, fig.width=3.5, fig.height=3.2}
draws <- as.data.frame(as.matrix(bmdl))
X <- draws[1:200,1:4]
foomdl <- mdl1
tbl2 <- tbl["Year"]

base <-ggplot(tbl, aes(x = Year, y = Age))
for (i in 1:200){
  foomdl$coefficients <- c(X[i,1], X[i,2], X[i,3], X[i,4]) 
  tbl2["Pred"] <- predict.lm(foomdl,tbl)
  
  base <- base + geom_line(data=tbl2,mapping = aes(x=Year, y=Pred), color="skyblue", alpha=0.5,size=1.1)
}

X = coef(bmdl)
foomdl$coefficients <- c(X[1], X[2], X[3], X[4]) 
tbl2["Pred"] <- predict.lm(foomdl,tbl)


base + 
  geom_point() + 
  geom_line(data=tbl2,mapping = aes(x=Year, y=Pred),size=1.1)

```


# Extended data figure

The authors present a similar dataset from an independent source, the Gerontological Research Group, in Extended Data Figure 6.
They find similar results to those reported for the main analysis, and they argue that this provides independent evidence for their central analysis. 

We again aquired these data using WebPlotDigitizer, and present them below.

```{r extended_data_import, echo=FALSE}
ext_tbl <- read.csv('extended_data_6.csv')
ext_tbl <- round(ext_tbl)
ext_tbl$Group <- factor(ext_tbl$Year>=1995, levels = c("FALSE", "TRUE"), labels = c("<1995", ">=1995"))

```

```{r author_model_extended, echo=FALSE, fig.width=3.5, fig.height=3.2}

ext_mdl1 <- lm(Age~Year*Group,ext_tbl)
summary.lm(ext_mdl1)

# additionally show that you get the same model fit from this combined regression:
blah <- ext_tbl
blah$yhat <- predict(ext_mdl1)

plt <- ggplot(blah, aes(x = Year, y = Age, colour = Group)) +
  geom_point() + 
  geom_line(aes(y = yhat)) + 
  ylab('Yearly maximum reported age at death (years)')  

plt
```

The fitted model has a slope of `r signif(ext_mdl1$coefficients['Year'],digits = 3)` years for years before 1995 (their slope = 0.1194) and one of `r signif(ext_mdl1$coefficients['Year']+ext_mdl1$coefficients['Year:Group>=1995'],digits = 3)` (their slope = -0.1367) for years afterwards (compare their Extended Data Figure 6).
Differences to their model fit likely reflect data uncertainty from the digitisation process (owing to the poorer resolution of this figure in the paper).

```{r extended_data_lm, echo=FALSE}
ext_mdl2 <- lm(Age~Year,ext_tbl)

anova(ext_mdl2, ext_mdl1)

ext_bf1 <- lmBF(Age~Year*Group, ext_tbl)
ext_bf2 <- lmBF(Age~Year, ext_tbl)

ext_bf1 / ext_bf2

```

Here, the ANOVA analysis does provide support for the Dong et al model.
What about the other model comparison metrics?
The AIC difference between the models is `r signif(AIC(ext_mdl1) - AIC(ext_mdl2), digits=2)`, which corresponds to "considerably less" support for the simple linear model relative to the authors' model.
The BIC difference is `r signif(BIC(ext_mdl1) - BIC(ext_mdl2), digits=2)`, which provides "positive evidence" (Raftery, 1995) in support of the authors' model.
Finally, the Bayes factor computed as above is about 8.8, which also provides positive support for the authors' model.

An important caveat for these data is that they are missing observations between 1989-1996. Since this data spans the key years of the "break", they could have an important influence on these conclusions. The Dong et al paper does not clarify why this data is missing from the authors' plot (compare to data table online [here](http://www.grg.org/Adams/A.HTM)).

# Recovering missing data

In collaboration with [Adam Lenart](http://www.sdu.dk/staff/alenart), we recovered the full dataset from the Gerontology Research Group [website](http://www.grg.org/Adams/A.HTM), assuming that this was the source of the data the authors used.
This is a dataset of "verified supercentenarians" as of January 1, 2014.

```{r pull_grg_data, echo=FALSE, warning=FALSE}
# This code thanks to Adam Lenart

# read data in
theurl <- "http://www.grg.org/Adams/A_files/sheet001.htm"
tables <- readHTMLTable(theurl,header=TRUE,skip.rows=1:12)
dat <- tables[[1]][1:1627,]

# format data set
dat$Birthpl <- as.character(dat$Birthplace)
dat$Deathpl <- as.character(dat$"Residence/Place of death")
removeParsQM <- function(x)  unlist(strsplit(x,split=" \\(|\\?"))[1]

dat$BirthC <- sapply(FUN=removeParsQM,X=dat$Birthpl)
dat$DeathC <- sapply(FUN=removeParsQM,X=dat$Deathpl)
dat$Dead <- ifelse(dat$Died=="",yes=1,no=0)
dat$Age <- as.numeric(as.character(dat$Years))+(as.numeric(as.character(dat$Days))+0.5)/365.25
dat2 <- dat

# ---------------- Date of Death ------------------- #

Sys.setlocale("LC_TIME", 'English')
dD <- as.character(dat$Died)
dD <- gsub("Sept.", "September", dD)
deathDate <- as.Date(as.character(dD),"%b. %d, %Y")
deathDate[is.na(deathDate)==TRUE] <-
    as.Date(as.character(dD)[is.na(deathDate)==TRUE],"%B %d, %Y")
dat$deathDate <- deathDate
dat2$deathDatePX <- as.POSIXlt(deathDate)
dat2$deathYear <- dat2$deathDatePX$year+1900 

grg <- dat2

# create group code:
grg$Group <- factor(grg$deathYear>=1995, levels = c("FALSE", "TRUE"), labels = c("<1995", ">=1995"))

# save to a binary:
save(grg, file = "GRG_data.RData")
write.csv2(grg, file = "GRG_data.csv")
```

Here's all the data. 
The smoothing curve shows a local polynomial regression, and indicates an *upward* trend in age-at-death in recent years:

```{r plot_all_grg, echo=FALSE}

plt <- ggplot(grg, aes(x = deathYear, y = Age)) +
  geom_point(alpha = 0.5) + 
  stat_smooth(method = "loess")
  
plt
```

Now we can do the appropriate transformations and get the MRAD data. 
Note that unlike in their analysis, here we're not rounding to the nearest year.

```{r compute_mrad_grg}
grg_mrad <- grg %>% 
  select(-deathDatePX) %>% 
  rename(Year = deathYear) %>% 
  filter(Year >= 1950) %>% 
  group_by(Year) %>% 
  summarise(Age = max(Age)) %>% 
  mutate(Group = factor(Year>=1995, 
                        levels = c("FALSE", "TRUE"), 
                        labels = c("<1995", ">=1995")))
```

Here's a plot of that MRAD data:

```{r plot_grg_MRAD, echo=FALSE}
plt <- ggplot(grg_mrad, aes(x = Year, y = Age)) +
  geom_point(alpha = 0.5) + 
  stat_smooth(method = "loess") + 
  ylab("Yearly maximum reported age at death (years)")
  
plt
```

There seem to be no shortage of datapoints between the years missing in the authors' plot.
Note that the smooth indicates a downward trend in the maximum ages.

The authors' Figure 2a shows MRAD computed from the [IDL Database](http://www.supercentenarians.org/), only for countries France, Japan, UK and US.
The GRG dataset is worldwide. 
How similar are the two datasets? 

```{r compare_datasets, echo=FALSE}

merge1 <- grg_mrad
merge1$source <- "GRG"

merge2 <- tbl
merge2$source <- "IDL"

merged <- rbind(merge1, merge2)

# merged$Age <- round(merged$Age)

ggplot(merged, aes(x = Year, y = Age, colour = source)) + 
  geom_point(alpha = 0.7) + 
  stat_smooth(method = "loess") + 
  ylab("Yearly maximum reported age at death (years)")


```

Note the lack of rounding and wider range of years for the GRG set.

## Fitting models to the GRG dataset

We can now fit and compare the two candidate models for this dataset.

```{r author_model_grg}

grg_m1 <- lm(Age~Year*Group, grg_mrad)
summary.lm(grg_m1)
  
# additionally show that you get the same model fit from this combined regression:
blah <- grg_mrad
blah$yhat <- predict(grg_m1)

plt <- ggplot(blah, aes(x = Year, y = Age, colour = Group)) +
  geom_point() + 
  geom_line(aes(y = yhat)) + 
  ylab('Yearly maximum reported age at death (years)') + 
  ggtitle("Author model fit to GRG data")

plt
```

```{r lin_model_grg, echo=FALSE}
grg_m2 <- lm(Age~Year, grg_mrad)
summary(grg_m2)
anova(grg_m2, grg_m1)

grg_bf1 <- lmBF(Age~Year*Group, as.data.frame(grg_mrad))
grg_bf2 <- lmBF(Age~Year, as.data.frame(grg_mrad))
# 
grg_bf1 / grg_bf2

```

Including the missing data from the table substantially increases support for the authors' model over a simple linear model.
The ANOVA is highly significant; the AIC difference is `r signif(AIC(grg_m1) - AIC(grg_m2), digits=2)`, which corresponds to "essentially no" support for the simple linear model relative to the authors' model.
The BIC difference is `r signif(BIC(grg_m1) - BIC(grg_m2), digits=2)`, which provides "very strong" (Raftery, 1995) support for the authors' model over the simple linear one.
Finally, the Bayes factor computed as above is about 150, which also provides very strong positive support for the authors' model.


# Conclusion

The model comparison metrics presented here, using both Frequentist, information theoretic and Bayesian approaches, suggest that the main data reported in the paper provide no support for or against the idea that there is a limit to human lifespan. A simple linear model showing a positive relationship between year and lifespan is just as plausible given the data from Figure 2a. However, the data we recovered from GRG website do provide strong support for a segmented "trend-break" model over a linear one. 

Nevertheless, there are deeper statistical issues regarding these models and the analysis presented by Dong et al. For example, it is questionable to model the extreme values of a distribution such as age-at-death by assuming linearity and Gaussian noise (Coles, 2001). Even if one accepts this as a valid modeling framework however, a model comparison like we performed here should have clearly been part of the Dong et al. paper. 


# References

* Burnham, K. P., & Anderson, D. R. (2002): Model selection and multimodel inference a practical information-theoretic approach. New York: Springer. 

* Coles, S. (2001): An introduction to statistical modeling of extreme values. London: Springer.

* Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.

* Morey and Rouder (2015). BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.12-2. https://CRAN.R-project.org/package=BayesFactor

* Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 111–163.

* Vehtari, Gelman and Gabry (2016): Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC, arxiv [link](https://arxiv.org/pdf/1507.04544v5.pdf)