104-appendix-aoa.Rmd

# Estimating Age of Acquisition {#appendix-aoa}

It is frequently useful to have an estimate of the age at which children produce a particular word with a probability greater than some threshold; these are commonly referred to as the word's age of acquisition [AoA; @goodman2008]. In this Appendix, we compare methods for estimating age of acquisition, using the English Words & Sentences data as a case study. 

```{r appaoa-load_data}
# eng_ws <- read_feather("data/psychometrics/eng_ws_raw_data.feather")
load("data/psychometrics/eng_ws_raw_data.Rds")
```

```{r appaoa-means}
ms <- eng_ws %>%
  group_by(definition, age, category) %>%
  summarise(prop = mean(value == "produces", na.rm = TRUE), 
            num_true = sum(value == "produces", na.rm = TRUE), 
            num_false = sum(value != "produces", na.rm = TRUE), 
            n = sum(c(num_true, num_false))) %>%
  filter(!is.na(category))
```

```{r appaoa-empirical}
empirical_aoas <- ms %>%
  group_by(definition, category) %>%
  summarise(empirical_aoa = min(age[prop > .5]))

# qplot(empirical_aoa, data = empirical_aoas)
# qplot(empirical_aoa, facets = ~ category, data = empirical_aoas)
```

The simplest and most obvious measure is to use the empirically-determined first month at which the proportion producing a word exceeds the threshold. We will use 50% of children producing as our threshold in all subsequent discussion, following previous literature [@goodman2008]. This approach is simple, but results in the exclusion of a large number of words. Of the total, `r roundp(mean(!is.finite(empirical_aoas$empirical_aoa)) * 100, 0)`% do not reach 50% production by the ceiling of the form and must be discarded. Further, this method is very sensitive to sparse data. A dataset with highly clustered data will show clustered AoAs. For example, in the Swedish data, there are `r sum(admins$language == "Swedish" & admins$age == 22)` 22-month-olds and `r sum(admins$language == "Swedish" & admins$age == 21)` 21-month-olds. Thus, there will be no 21-month AoAs in this dataset. For these reasons, a model-based approach is likely to be more robust. 

We initially investigated three model-based methods. Each of these models was fit to data from each word (proportion of children producing at each age) individually, resulting in a continuous curve that can be used to predict AoA more precisely. The three models we examined were:

1. Standard generalized linear model with a logistic link (GLM)
2. Robust GLM
3. Bayesian GLM with hand-tuned prior parameters [@gelman2008]

For the Bayesian GLM, we took an ad-hoc approach, experimenting substantially with prior setting in order to incorporate information about the slopes and intercepts we expected for words. We adopted default, long-tailed (Cauchy) priors over coefficients. We then used the empirical distribution of GLM slopes to set a strict prior over the age slopes we expected (guided in part by our investigations in Chapter \@ref(psychometrics), which indicated that most items should show positive developmental change). We then set a much weaker prior over intercepts. While these choices are somewhat arbitrary, in practice, in larger datasets only the most extreme words (e.g. *mommy*, for which there is a ceiling effect for nearly every age) were affected by this choice. 

In addition to these models, we investigated a hierarchical Bayesian model with shared distributional components across words. This model obviated the ad-hoc prior determination that we performed for the individual Bayesian GLMs, but it appeared to perform very similarly (at least in the presence of sufficient data) and was quite expensive to fit in terms of computation, so we do not discuss it here. 


```{r appaoa-glm}
fit_glm <- function(data) {
  model <- glm(cbind(num_true, num_false) ~ age, family = "binomial", 
               data = data)
  fit <- predict(model, newdata = data.frame(age = 0:36), se.fit = TRUE)
  aoa <- -model$coefficients[["(Intercept)"]] / model$coefficients[["age"]]
  
  tibble(definition = data$definition[1],
             category = data$category[1],
             glm_slope = model$coefficients[["age"]],
             glm_aoa = aoa)
}

glm_aoas <- ms %>%
  split(.$definition) %>%
  map(fit_glm) %>%
  bind_rows
```


```{r appaoa-rglm}
fit_rglm <- function(data) {
  model <- robustbase::glmrob(cbind(num_true, num_false) ~ age, family = "binomial", 
               data = data)
  fit <- predict(model, newdata = data.frame(age = 0:36), se.fit = TRUE)
  aoa <- -model$coefficients[["(Intercept)"]] / model$coefficients[["age"]]
  
  tibble(definition = data$definition[1],
             category = data$category[1],
             rglm_slope = model$coefficients[["age"]],
             rglm_aoa = aoa)
}

rglm_aoas <- ms %>%
  split(.$definition) %>%
  map(fit_rglm) %>%
  bind_rows
```

```{r appaoa-bayes_all}
fit_bglm <- function(data) {
  model <- arm::bayesglm(cbind(num_true, num_false) ~ age, 
                         family = "binomial", 
                         prior.mean = .3,
                         prior.scale = c(.01),
                         prior.mean.for.intercept = 0,
                         prior.scale.for.intercept = 2.5,
                         prior.df = 1,
                         data = data)
  aoa <- -model$coefficients[["(Intercept)"]] / model$coefficients[["age"]]
  
  tibble(definition = data$definition[1],
         category = data$category[1],
         bglm_slope = model$coefficients[["age"]],
         bglm_aoa = aoa)
}

bglm_aoas <- ms %>%
  split(.$definition) %>%
  map(fit_bglm) %>%
  bind_rows
```

```{r appaoa-aoas}
aoas <- left_join(glm_aoas, empirical_aoas) %>%
  left_join(rglm_aoas) %>%
  left_join(bglm_aoas) %>%
  gather(measure, aoa, ends_with("aoa")) %>%
  mutate(measure = factor(measure, 
                          levels = c("empirical_aoa", 
                                     "glm_aoa",
                                     "rglm_aoa",
                                     "bglm_aoa"),
                          labels = c("Empirical",
                                     "GLM",
                                     "Robust GLM",
                                     "Bayesian GLM")))
```

Figure \@ref(fig:appaoa-compare-dist) shows a comparison of these methods, in the form of histograms of recovered 50% values for American English AoA data. The empirical AoAs are clearly clumpy in precisely the way we describe above, even with a substantial amount of data in the analysis (*N* = `r n_distinct(eng_ws$data_id)`). In contrast, all three models smooth the AoA distribution substantially, which is likely beneficial to downstream analyses. Although there are some subtle differences in the shape of the main distribution between models, the main action is found on the tails. The different models treat floor and ceiling items differently. 

Both the GLM and robust GLM recover two AoAs that are below zero, which is logically impossible (*mommy* and *daddy* are both presumed learned before birth). In contrast, the priors of the Bayesian GLM regularize these AoAs to be `r roundp(filter(aoas, definition == "mommy*", measure == "Bayesian GLM")$aoa, 0)` and `r roundp(filter(aoas, definition == "daddy*", measure == "Bayesian GLM")$aoa, 0)` months respectively. Further, the Bayesian GLM estimates `r roundp(mean(filter(aoas, measure == "Bayesian GLM")$aoa >30) * 100, 0)`% of AoAs above 30 months (the max value in the data), while the other two methods estimate slightly fewer: `r roundp(mean(filter(aoas, measure == "GLM")$aoa >30) * 100, 0)`% and `r roundp(mean(filter(aoas, measure == "Robust GLM")$aoa >30) * 100, 0)`% respectively. These Bayesian GLM results strike us as more reasonable than those returned by the other methods (although they only affect a small minority of words).

```{r appaoa-compare-dist, fig.cap="Histogram of English (American) age of acquisition values as estimated via a variety of statistical methods (panels).", fig.height=6}
ggplot(aoas, aes(x = aoa)) +
  facet_wrap(~measure) +
  geom_histogram(binwidth = 1) +
  labs(x = "Age of acquisition", y = "Count")
```

To further validate the Bayesian GLM approach, we tested the accuracy of the method in recovering AoAs for much smaller datasets. We did this by taking a subsample of only 100 children from the full English (American) WS dataset. We then fit standard and Bayesian GLMs to this sparse subsample. The resulting AoA estimates are plotted in Figure \@ref(fig:appaoa-sparsity). The Bayesian GLM shows the same minor bias to lower AoAs for hard words that the regular GLM does (a slightly below-diagonal slope), but the GLM shows noisier estimates for the earliest and (especially) the latest-learned words, suggesting that the regularization from the prior values in the Bayesian model is allowing it to deal with sparse data more effectively.

```{r appaoa-sparsity, fig.height=6, fig.cap="Recovered AoAs from a sparse subsample (100 children), plotted by the Bayesian GLM AoAs from the full dataset. Left panel shows standard GLM, right panel shows Bayesian GLM. Differences of AoA > 4 months between methods are labeled."}
set.seed(12)

ids <- eng_ws %>%
  group_by(data_id) %>%
  count %>%
  ungroup %>%
  sample_n(size = 100, replace = TRUE) %>%
  pull(data_id)

ms_sparse <- eng_ws %>%
  filter(data_id %in% ids) %>%
  group_by(definition, age, category) %>%
  summarise(prop = mean(value == "produces", na.rm = TRUE),
            num_true = sum(value == "produces", na.rm = TRUE),
          num_false = sum(value != "produces", na.rm = TRUE),
            n = sum(c(num_true,num_false))) %>%
  filter(!is.na(category))

bglm_aoas_sparse <- ms_sparse %>%
  split(.$definition) %>%
  map(fit_bglm) %>%
  bind_rows %>%
  rename(bglm_aoa_sparse = bglm_aoa, 
         bglm_slope_sparse = bglm_slope)

glm_aoas_sparse <- ms_sparse %>%
  split(.$definition) %>%
  map(fit_glm) %>%
  bind_rows %>%
  rename(glm_aoa_sparse = glm_aoa, 
         glm_slope_sparse = glm_slope)

aoas <- bglm_aoas %>%
  left_join(bglm_aoas_sparse) %>%
  left_join(glm_aoas_sparse) %>%
  select(definition, bglm_aoa, glm_aoa_sparse, bglm_aoa_sparse) %>%
  gather(model, aoa_sparse, glm_aoa_sparse, bglm_aoa_sparse) %>%
  mutate(model = model %>%
           fct_relevel("glm_aoa_sparse", "bglm_aoa_sparse") %>%
           fct_recode("GLM" = "glm_aoa_sparse",
                      "Bayes GLM" = "bglm_aoa_sparse"),
         definition = definition %>% str_remove("\\*"))

ggplot(aoas, aes(x = bglm_aoa, y = aoa_sparse)) +
  facet_wrap(~model) +
  coord_fixed() +
  geom_abline(linetype = .refline, colour = .grey) +
  geom_point(alpha = 0.3) +
  ggrepel::geom_label_repel(
    data = aoas %>% filter(abs(bglm_aoa - aoa_sparse) > 4),
    aes(label = definition), family = .font, size = 3
  ) +
  labs(x = "Full dataset AoA (Bayes GLM)", y = "Sparse dataset AoA")
```

In sum, our analyses here suggest that a Bayesian approach is useful for estimating AoA values.