01-overview.Rmd

---
title: "Landscape Study - General Overview"
author: Thomas Klebel
date: "Last changed `r lubridate::today()`"
output: 
  html_document:
    df_print: paged
    keep_md: true
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, dev.args = list(type = "cairo"))
```


```{r, message=FALSE}
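# Assumed packages for the functions used below; project helpers are loaded elsewhere.
library(tidyverse)
library(drake)      # readd()
library(visdat)     # vis_miss()
library(tidytext)   # unnest_tokens(), stop_words
library(ggridges)   # geom_density_ridges()
library(ggchicklet) # geom_chicklet()

theme_set(hrbrmisc::theme_hrbrmstr())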

# import data
refined <- readd(clean_data)
refined_with_areas <- readd(clean_areas)
```

# Summary
The general impression is that most journals either do not have a specific
policy on the aspect in question, or that it was unclear to our reviewers
whether a policy existed and how to interpret it.

# Missing data
```{r, fig.height=10, fig.width=10}
refined %>% 
  vis_miss()
```

A considerable amount of data is missing, which should not be the case. For
example, the field U30 in Excel (table RAW), which holds the opr-responses for
the journal "Advanced Materials", is missing. It should probably be
"Not specified".

The following plot shows all variables where implicit missings should be
checked and converted to explicit ones (as for opr_responses), or fixed (as for
publishers).

```{r, fig.height=6, fig.width=8}
# only columns with missing data
with_missings <- refined %>% 
  select(-starts_with("G_"), -ends_with("clean")) %>% 
  select_if(~any(is.na(.))) %>% 
  # remove a few columns where missing data is ok
  select(-starts_with("review date"),
         -starts_with("To dis"),
         -starts_with("Time ta"),
         -starts_with("Review"),
         -starts_with("Free"),
         # Top journals are being removed as well, since these are legit 
         # missings
         -starts_with("Top journals in")
         )

with_missings %>% 
  vis_miss() +
  theme(plot.margin = margin(r = 50))
```
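
One way to make implicit missings explicit is to replace `NA` with the label
the coders would otherwise have used. The following is a minimal sketch on toy
data; the journal names and the choice of "Not specified" as the replacement
label are assumptions for illustration, not the project's actual cleaning step.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): one journal has an implicit missing value
toy <- tibble(
  journal = c("Journal A", "Journal B", "Journal C"),
  opr_responses = c("Yes", NA, "Not specified")
)

# Replace NA in the chosen column with an explicit category
toy_explicit <- toy %>%
  replace_na(list(opr_responses = "Not specified"))

toy_explicit
```

Whether "Not specified" is the right label has to be decided per variable by
checking the raw table.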

# Publisher versus subject area
The question is whether it is reasonable to split results by publisher.
I think it does not make much sense:

- There are not many publishers that have multiple journals
- Except for Springer Nature and Elsevier, most publishers are confined to one
or two subject areas. Looking at publishers would misinterpret differences as
being due to the publisher when they are due to the subject area.
- The only remaining comparison would thus be Springer Nature vs. Elsevier


```{r, fig.width=12, fig.height=6}
refined %>% 
  count(publisher_clean, sort = TRUE) %>% 
  filter(n > 4) %>% 
  # join explicitly on the publisher to avoid relying on the guessed key
  left_join(refined_with_areas, by = "publisher_clean") %>% 
  select(-n) %>% 
  count(publisher_clean, area) %>% 
  ggplot(aes(publisher_clean, str_wrap(area, 20), colour = n, size = n)) +
  geom_point() +
  viridis::scale_color_viridis() +
  coord_flip() +
  labs(x = NULL, y = NULL, title = "Journals by Publisher and subject area",
       colour = NULL, size = NULL) +
  theme(legend.position = "top")
```
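
The second point in the list above can be checked directly by counting how many
distinct subject areas each publisher covers. A minimal sketch on toy data: the
publisher and area values are invented, and only the column names
`publisher_clean` and `area` follow the data used here.

```r
library(dplyr)

# Toy data (hypothetical): one row per journal
toy <- tibble(
  publisher_clean = c("Publisher A", "Publisher A", "Publisher A",
                      "Publisher B", "Publisher B"),
  area = c("Physics", "Medicine", "Physics", "Chemistry", "Chemistry")
)

# Number of journals and number of distinct subject areas per publisher
areas_per_publisher <- toy %>%
  group_by(publisher_clean) %>%
  summarise(n_journals = n(), n_areas = n_distinct(area))

areas_per_publisher
```

Publishers with a low `n_areas` cannot be separated from their subject area in
any comparison.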


# Peer Review
```{r}
# small tweaks
refined_with_areas <- refined_with_areas %>% 
  order_pr_type()
```


```{r, fig.width=10.5, fig.height=5}
ggplot(refined_with_areas, aes(str_wrap(area, 20), fill = pr_type_clean)) +
  geom_bar(position = "fill", width = .7) +
  # coord_flip() +
  scale_fill_brewer(palette = "Set3") +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "top") +
  labs(fill = NULL, x = NULL, y = NULL, 
       title = "What type of peer review is used?")
```


```{r, fig.width=9, fig.height=7}
ggplot(refined_with_areas, aes(fct_rev(area), fill = area)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~pr_type_clean) +
  coord_flip() +
  labs(x = NULL, y = NULL, 
       title = "What type of peer review is used?")
```


```{r}
ggplot(refined_with_areas, aes(`h5-index`, pr_type_clean)) +
  geom_density_ridges()
```


```{r peer-type-publisher}
peer_type_publisher <- refined %>% 
  order_pr_type() %>% 
  mutate(has_society = case_when(
         str_detect(
           publisher, 
           "Society|Associa|Union|Academy|College|Instit|School"
           ) ~ "Published by Society/Association",
         TRUE ~ "General publisher"
  )) %>% 
  make_proportion(pr_type_clean, has_society, order_string = "Double|Single") %>% 
  drop_na()

p_cols <- c("Single blind" = "#1B9E77", "Double blind" =  "#D95F02",
            "Not blinded" =  "#7570B3",
            "Unsure" = "#A6761D", "Other" = "#666666")

ggplot(peer_type_publisher, aes(has_society, prop, 
                                fill = fct_rev(pr_type_clean))) +
  geom_chicklet(position = "fill", width = .6) +
  coord_flip() +
  scale_fill_manual(values = p_cols) +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "top") +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(fill = NULL, x = NULL, y = NULL, 
       title = "What type of peer review is used?")
```

I don't think this distinction is very interesting, since it masks disciplinary
differences, which are more fundamental.

```{r}
refined <- refined %>% 
  mutate(pr_database_clean = case_when(
    str_detect(pr_database, "^Yes") ~ "Yes",
    str_detect(pr_database, "Unsure") ~ "Unsure",
    str_detect(pr_database, "No") ~ "No"
  ))

refined %>% 
  plot_univariate(pr_database_clean) +
  coord_flip() +
  labs(title = "Is peer reviewer activity deposited into\nopen databases? (like Publons or ORCID)")
```


```{r, fig.width=10.5, fig.height=6}
refined_with_areas <- refined_with_areas %>% 
  mutate(pr_database_clean = case_when(
    str_detect(pr_database, "^Yes") ~ "Yes",
    str_detect(pr_database, "Unsure") ~ "Unsure",
    str_detect(pr_database, "No") ~ "No"
  ))

refined_with_areas %>% 
  mutate(pr_database_clean = fct_relevel(
    pr_database_clean,
    "Yes"
  )) %>% 
  plot_with_areas(pr_database_clean,
                  title = "Is peer reviewer activity deposited into open databases? (like Publons or ORCID)")

```


## Peer review transfer policy

```{r pr-transfer-compute}
pr_transfer <- refined %>% 
  select(pr_transfer_policy) %>% 
  filter(!is.na(pr_transfer_policy),
         !(pr_transfer_policy %in% c("Not specified", "Not found")))
```

Only `r nrow(pr_transfer)` out of `r nrow(refined)` journals have a 
policy for manuscript transfer after rejection.
Because policies are similar across journals of certain 
publishers, there are `r nrow(distinct(pr_transfer))`
distinct policies in our dataset.

The following table displays stemmed words from the distinct policies, sorted
by frequency.
```{r pr-transfer-table}
pr_transfer %>% 
  distinct() %>%
  unnest_tokens(word, pr_transfer_policy) %>% 
  anti_join(stop_words) %>% 
  mutate(word = SnowballC::wordStem(word)) %>% 
  count(word, sort = TRUE)
```


The following graph shows the relationships between the most common bigrams
(only bigrams that occur at least three times).

```{r pr-transfer-plot, fig.width=7, fig.height=7}
pr_transfer %>% 
  make_bigram_analysis(pr_transfer_policy, cutoff = 3)
```
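
The helper `make_bigram_analysis()` is defined elsewhere in the project and is
not shown here. As a rough sketch of the counting step behind such a graph,
assuming it follows the usual tidytext approach (the toy policies below are
invented, and the helper's actual implementation may differ):

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy policies (hypothetical text)
toy <- tibble(
  policy = c(
    "Rejected manuscripts can be transferred to a partner journal",
    "Authors may transfer rejected manuscripts to a partner journal",
    "Rejected manuscripts are offered a transfer to a partner journal"
  )
)

cutoff <- 3  # keep only bigrams that occur at least this often

bigram_counts <- toy %>%
  # split each policy into overlapping pairs of words
  unnest_tokens(bigram, policy, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # drop pairs in which either word is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n >= cutoff)

bigram_counts
```

The resulting word pairs can be read as the edges of a graph, with `n` as the
edge weight.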


# Open Peer Review
```{r, fig.width=8, fig.height=7}
pdata <- refined %>% 
  select(opr_reports:opr_interaction) %>% 
  gather(var, val) %>% 
  mutate(val_clean = case_when(
    str_detect(val, "Conditional") ~ "Conditional",
    str_detect(str_to_lower(val), "not spec") ~ "Not specified",
    TRUE ~ val
  ))

labels <- pdata %>% 
  distinct(var) %>% 
  mutate(label = case_when(
    var == "opr_reports" ~ "Are peer review reports being published?",
    var == "opr_responses" ~ "Are author responses to reviews being published?",
    var == "opr_letters" ~ "Are editorial decision letters being published?",
    var == "opr_versions" ~ "Are previous versions of the manuscript being published?",
    var == "opr_identities_published" ~ "Are reviewer identities being published?",
    var == "opr_indenties_author" ~ "Are reviewer identities revealed to the author (even if not published)?",
    var == "opr_comments" ~ "Is there public commenting during formal peer review?",
    var == "opr_interaction" ~ "Is there open interaction (reviewers consult with one another)?"
  )) %>% 
  mutate(label = str_wrap(label, 40))

pdata %>% 
  left_join(labels) %>% 
  ggplot(., aes(label, fill = val_clean)) +
  geom_bar(position = "fill", width = .7) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Set3") +
  theme(legend.position = "top") +
  labs(x = NULL, y = NULL, fill = NULL) 
```

```{r, fig.height=12, fig.width=9}
pdata <- refined_with_areas %>% 
  select(area, opr_reports:opr_interaction) %>% 
  gather(var, val, -area) %>% 
  mutate(val_clean = case_when(
    str_detect(val, "Conditional") ~ "Conditional",
    str_detect(str_to_lower(val), "not spec") ~ "Not specified",
    TRUE ~ val
  ))

labels <- pdata %>% 
  distinct(var) %>% 
  mutate(label = case_when(
    var == "opr_reports" ~ "Are peer review reports being published?",
    var == "opr_responses" ~ "Are author responses to reviews being published?",
    var == "opr_letters" ~ "Are editorial decision letters being published?",
    var == "opr_versions" ~ "Are previous versions of the manuscript being published?",
    var == "opr_identities_published" ~ "Are reviewer identities being published?",
    var == "opr_indenties_author" ~ "Are reviewer identities revealed to the author (even if not published)?",
    var == "opr_comments" ~ "Is there public commenting during formal peer review?",
    var == "opr_interaction" ~ "Is there open interaction (reviewers consult with one another)?"
  )) %>% 
  mutate(label = str_wrap(label, 40))

pdata %>% 
  left_join(labels) %>% 
  ggplot(., aes(label, fill = val_clean)) +
  geom_bar(position = "fill", width = .7) +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  facet_wrap(~area) +
  theme(legend.position = "top") +
  labs(x = NULL, y = NULL, fill = NULL) 
# display via plotly was removed, so it could be rendered on GitHub
```


```{r, eval=FALSE}
# these plots have been incorporated above
refined %>% 
  plot_univariate(opr_identities_published) +
  coord_flip() 

refined %>% 
  plot_univariate(opr_indenties_author) +
  coord_flip()

refined %>% 
  plot_univariate(opr_comments) +
  coord_flip()

refined %>% 
  plot_univariate(opr_interaction) +
  coord_flip()
```

# Co-Review
Let's look at the co-review policy.

```{r}
coreview_policies <- refined %>% 
  select(coreview_policy) %>% 
  filter(!is.na(coreview_policy),
         !(coreview_policy %in% c("Not specified", "Not found")))
```

Only `r nrow(coreview_policies)` out of `r nrow(refined)` journals have a 
co-review policy. Because policies are similar across journals of certain 
publishers, there are `r nrow(distinct(coreview_policies))`
distinct policies in our dataset.

The following table displays stemmed words from the distinct policies, sorted
by frequency.
```{r}
coreview_policies %>% 
  distinct() %>%
  unnest_tokens(word, coreview_policy) %>% 
  anti_join(stop_words) %>% 
  mutate(word = SnowballC::wordStem(word)) %>% 
  count(word, sort = TRUE)
```


The following graph shows the relationships between the most common bigrams
(only bigrams that occur at least three times).

```{r, fig.width=7, fig.height=7}
coreview_policies %>% 
  make_bigram_analysis(coreview_policy)
```


```{r, eval=FALSE}
# trigram analysis

trigrams <- coreview_policies %>% 
  distinct() %>%
  unnest_tokens(trigram, coreview_policy, token = "ngrams", n = 3)

trigrams_separated <- trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

# drop trigrams that contain any stop word
trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word)

# trigram counts
trigram_counts <- trigrams_filtered %>% 
  count(word1, word2, word3, sort = TRUE)

trigram_counts
```


```{r}
refined %>% 
  plot_univariate(coreview_email) +
  coord_flip() +
  labs(title = "Can co-reviewers contribute?")
```

```{r, fig.width=10.5, fig.height=5}
refined_with_areas %>% 
  mutate(coreview_email = case_when(
    coreview_email == "unsure" ~ "Unsure",
    TRUE ~ coreview_email
  )) %>% 
  ggplot(aes(str_wrap(area, 20), fill = coreview_email)) +
  geom_bar(position = "fill", width = .7) +
  scale_fill_brewer(palette = "Set3") +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "top") +
  labs(fill = NULL, x = NULL, y = NULL, 
       title = "Can co-reviewers contribute?")
```


This doesn't look interesting.
```{r}
refined %>% 
  plot_univariate(coreview_form) +
  coord_flip()
```

This doesn't look interesting.
```{r}
refined %>% 
  plot_univariate(coreview_database) +
  coord_flip()
```

# Preprints
```{r preprint-version}
refined <- refined %>% 
  mutate(preprint_version_clean = case_when(
    str_detect(preprint_version, "Unsure") ~ 
      "Unsure (preprints are allowed, but it's not clear which version)",
    str_detect(preprint_version, "Any .*? or") ~ "Other",
    str_detect(preprint_version, "Other|Unclear") ~ "Other",
    str_detect(preprint_version, "None") ~ "None",
    str_detect(preprint_version, "First sub") ~ 
      "First submission only (before peer review)",
    str_detect(preprint_version, "After peer re") ~ "After peer review",
    str_detect(preprint_version, "Any|any") ~ "Any",
    str_detect(preprint_version, "No") ~ "No preprint policy",
  )) 


refined %>% 
  plot_univariate(preprint_version_clean) +
  coord_flip() +
  labs(title = "What version of a preprint\ncan be posted?")
```

```{r, fig.width=10.5, fig.height=6}
refined_with_areas <- refined_with_areas %>% 
  mutate(preprint_version_clean = case_when(
    str_detect(preprint_version, "Unsure") ~ 
      "Unsure (preprints are allowed, but it's not clear which version)",
    str_detect(preprint_version, "Any .*? or") ~ "Other",
    str_detect(preprint_version, "Other|Unclear") ~ "Other",
    str_detect(preprint_version, "None") ~ "None",
    str_detect(preprint_version, "First sub") ~ 
      "First submission only (before peer review)",
    str_detect(preprint_version, "After peer re") ~ "After peer review",
    str_detect(preprint_version, "Any|any") ~ "Any",
    str_detect(preprint_version, "No") ~ "No preprint policy",
  )) 


refined_with_areas %>% 
  mutate(preprint_version_clean = fct_relevel(preprint_version_clean,
                                              "Any", "After peer review") %>% 
           fct_relevel("None", "No preprint policy", "Other", after = 4))  %>% 
  plot_with_areas(preprint_version_clean, 
                  "What version of a preprint can be posted?")

```


```{r}
preprint_servers <- refined %>% 
  select(preprint_servers) %>% 
  filter(!is.na(preprint_servers),
         !(preprint_servers %in% c("Not specified", "Not found")))
```


```{r}
preprint_servers %>% 
  distinct() %>%
  unnest_tokens(word, preprint_servers) %>% 
  anti_join(stop_words) %>% 
  # mutate(word = SnowballC::wordStem(word)) %>% 
  count(word, sort = TRUE)
```

```{r, fig.width=10, fig.height=10}
preprint_servers %>% 
  make_bigram_analysis(preprint_servers, 3, .distinct = TRUE)
```


```{r}
refined <- refined %>% 
  mutate(preprint_citation_clean = case_when(
    str_detect(preprint_citation, "Not spec") ~ "Not specified",
    str_detect(preprint_citation, "Unsure") ~ "Unsure",
    str_detect(preprint_citation, "No") ~ "No",
    str_detect(preprint_citation, "only in the text") ~ "Yes, but only in the text",
    str_detect(preprint_citation, "reference list") ~ "Yes, in the reference list",
    is.na(preprint_citation) ~ NA_character_,
    TRUE ~ "Other"
  )) 

refined %>% 
  plot_univariate(preprint_citation_clean) +
  coord_flip() +
  labs(title = "Can preprints be cited?")
```

```{r, fig.width=9, fig.height=5}
refined_with_areas <- refined_with_areas %>% 
  mutate(preprint_citation_clean = case_when(
    str_detect(preprint_citation, "Not spec") ~ "Not specified",
    str_detect(preprint_citation, "Unsure") ~ "Unsure",
    str_detect(preprint_citation, "No") ~ "No",
    str_detect(preprint_citation, "only in the text") ~ "Yes, but only in the text",
    str_detect(preprint_citation, "reference list") ~ "Yes, in the reference list",
    is.na(preprint_citation) ~ NA_character_,
    TRUE ~ "Other"
  )) 

refined_with_areas %>% 
    mutate(preprint_citation_clean = 
             fct_relevel(preprint_citation_clean,
                         "Yes, in the reference list", "Yes, but only in the text",
                         "No", "Not specified", "Unsure", "Other") %>% 
             fct_explicit_na()) %>% 
    group_by(area) %>%
    count(preprint_citation_clean) %>%
    mutate(prop = n/sum(n)) %>% 
    ggplot(aes(area, prop, fill = preprint_citation_clean)) +
    geom_chicklet(position = "fill", width = .6) +
    # geom_bar(position = "fill", width = .7) +
    scale_fill_brewer(palette = "Set3") +
    scale_y_continuous(labels = scales::percent) +
    theme(legend.position = "top") +
    labs(fill = NULL, x = NULL, y = NULL,
         title = "Can preprints be cited?") +
    coord_flip()
# TODO: order areas according to Proportion of Yes
```


```{r}
refined %>% 
  plot_univariate(preprint_link) +
  coord_flip() +
  labs(title = "Is a link provided to the preprint version of a paper?")
```


# Cross tabulations
Results so far have revealed that in many cases policies are unclear. But in
which ways are policies related to each other? Do journals that allow co-review
also allow preprints? Is there a gradient between journals that are pioneers in
regard to open science, and others that lag behind? Or are there certain groups
of journals, open in one area, reluctant in the second and maybe unclear in the
third?

To answer these questions, we employ Multiple Correspondence Analysis (MCA). 
The technique allows us to explore the different policies jointly
(Greenacre & Blasius 2006: 27) and thus paint
a landscape of open science practices among journals.

To facilitate interpretation of the figures, variables had to be recoded. 
Categories with low counts were merged. Where feasible, we focused on whether 
certain policies were clear or not, thus omitting the subtle differences within
the policies. For example, whether journals allowed citations of preprints was 
simplified to whether the policy was clear (references allowed in the text, 
allowed in the reference list, or not allowed) versus unclear (unsure about 
the policy, no policy, or other).
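
As an illustration of this kind of simplification, the sketch below collapses
the preprint citation categories into clear versus unclear. The toy values
mirror the category labels used earlier in this document; the new column name
`policy_clarity` and the mapping of "Not specified" to "Unclear" are
assumptions, and the recoding used for the figures may differ in detail.

```r
library(dplyr)

# Toy data (hypothetical rows, category labels as used above)
toy <- tibble(
  preprint_citation_clean = c(
    "Yes, in the reference list", "Yes, but only in the text", "No",
    "Not specified", "Unsure", "Other", "No"
  )
)

recoded <- toy %>%
  mutate(policy_clarity = case_when(
    # an explicit yes or no counts as a clear policy
    preprint_citation_clean %in% c("Yes, in the reference list",
                                   "Yes, but only in the text",
                                   "No") ~ "Clear",
    # keep missing values missing
    is.na(preprint_citation_clean) ~ NA_character_,
    # everything else: unsure, not specified, other
    TRUE ~ "Unclear"
  ))

count(recoded, policy_clarity)
```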

It should be noted that the procedure is strictly exploratory. We are exploring,
not testing any hypothesis.


<!-- Results have until now been examined separately from each other. To bring them -->
<!-- all together, we will employ Correspondence Analysis as a means of visualising -->
<!-- and investigating adjacent sets of tables.  -->

<!-- Different variables could serve as dependent variables: disciplinary area is -->
<!-- the first contender. But this could have an overly strong influence. We could -->
<!-- thus also look at peer review type, and add discipline only as a supplementary  -->
<!-- variable. -->

<!-- Or we could do multiple correspondence analysis... -->

<!-- The question we have is: are there certain journals that have unclear policies -->
<!-- in general, or not. Thus: are there pioneers that are open to new concepts and -->
<!-- clear in their policies and then the others that lag behind. -->

<!-- From Greenacre & Blasius 2006: 27, we should conclude that MCA is the preferred -->
<!-- method for this question, because it allows us to investigate the relationship -->
<!-- between different aspects of open science. CA would only allow investigation of -->
<!-- these aspects relative to the disciplines, but not relative to one another. -->


```{r}
ca_first <- refined %>% 
  mutate(coreview_email = case_when(
    coreview_email == "unsure" ~ "Unsure",
    # lump the two not specified to unsure, since this is similar in this 
    # instance
    coreview_email == "Not specified" ~ "Unsure",
    TRUE ~ coreview_email
  )) %>% 
  select(pr_type_clean, coreview_email, opr_indenties_author) %>% 
  mutate(pr_type_clean = paste0("pr_type-", pr_type_clean),
         coreview_email = paste0("coreview-", coreview_email),
         opr = paste0("reviewer_identities-", opr_indenties_author)) %>% 
  select(-opr_indenties_author) %>% 
  ca::mjca()


ca_first %>% 
  plot_ca()

```


```{r, fig.width=10, fig.height=8}
ca_second <- refined %>% 
    mutate(preprint_version_clean = case_when(
              str_detect(preprint_version, "Unsure") ~ "Unsure",
              str_detect(preprint_version, "Any .*? or") ~ "Other",
              str_detect(preprint_version, "Other|Unclear") ~ "Other",
              str_detect(preprint_version, "None") ~ "None",
              str_detect(preprint_version, "First sub") ~ "Yes",
              str_detect(preprint_version, "After peer re") ~ "Yes",
              str_detect(preprint_version, "Any|any") ~ "Yes",
              str_detect(preprint_version, "No") ~ "No policy"),
           preprint_citation_clean = case_when(
              str_detect(preprint_citation, "Not spec") ~ "Not specified",
              str_detect(preprint_citation, "Unsure") ~ "Unsure",
              str_detect(preprint_citation, "No") ~ "No",
              str_detect(preprint_citation, "only in the text") ~ "Yes",
              str_detect(preprint_citation, "reference list") ~ "Yes",
              is.na(preprint_citation) ~ NA_character_,
              TRUE ~ "Other"),
           opr_indenties_author_clean = case_when(
              str_detect(opr_indenties_author, "Conditional") ~ "Conditional",
              str_detect(str_to_lower(opr_indenties_author), "not spec") ~ "Not specified",
              str_detect(opr_indenties_author, "Optional") ~ "Optional",
              TRUE ~ opr_indenties_author),
          coreview_email = case_when(
              coreview_email == "unsure" ~ "Unsure",
              # lump the two not specified to unsure, since this is similar in this 
              # instance
              coreview_email == "Not specified" ~ "Unsure",
              TRUE ~ coreview_email)
          ) %>% 
  select(pr_type_clean, coreview_email, opr_indenties_author_clean, 
         preprint_version_clean, preprint_citation_clean) %>% 
  # Filter out these journals since they are very special
  filter(pr_type_clean != "Not blinded") %>% 
  mutate(pr_type_clean = paste0("pr_type-", pr_type_clean),
         coreview_email = paste0("coreview-", coreview_email),
         opr_indenties_author_clean = paste0("reviewer_identities-", opr_indenties_author_clean),
         preprint_version_clean = paste0("preprint_version-", preprint_version_clean),
         preprint_citation_clean = paste0("preprint_citation-", preprint_citation_clean)) %>% 
  ca::mjca()
summary(ca_second)


ca_second %>% 
  plot_ca(dimensions = c(1, 2)) +
  coord_fixed()


```
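
The recoding at the top of the chunk above recurs in the chunks that follow.
One way to avoid the repetition would be a small helper per variable. The
sketch below is only a suggestion: the function name
`recode_preprint_version()` is hypothetical, the toy values are invented, and
the three patterns that map to "Yes" are merged into one regular expression,
which keeps the original matching order.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: collapses the free-text preprint version field
# into the coarse categories used for the MCA
recode_preprint_version <- function(x) {
  case_when(
    str_detect(x, "Unsure") ~ "Unsure",
    str_detect(x, "Any .*? or") ~ "Other",
    str_detect(x, "Other|Unclear") ~ "Other",
    str_detect(x, "None") ~ "None",
    str_detect(x, "First sub|After peer re|Any|any") ~ "Yes",
    str_detect(x, "No") ~ "No policy"
  )
}

# Toy input (hypothetical values)
toy <- tibble(
  preprint_version = c("Any version", "None", "No preprint policy", "Unsure", NA)
)

toy_recoded <- toy %>%
  mutate(preprint_version_clean = recode_preprint_version(preprint_version))

toy_recoded
```

Each chunk could then call the helper inside `mutate()` instead of repeating
the `case_when()` block.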


```{r, fig.width=10, fig.height=8}
refined %>% 
    mutate(preprint_version_clean = case_when(
              str_detect(preprint_version, "Unsure") ~ "Unsure",
              str_detect(preprint_version, "Any .*? or") ~ "Other",
              str_detect(preprint_version, "Other|Unclear") ~ "Other",
              str_detect(preprint_version, "None") ~ "None",
              str_detect(preprint_version, "First sub") ~ "Yes",
              str_detect(preprint_version, "After peer re") ~ "Yes",
              str_detect(preprint_version, "Any|any") ~ "Yes",
              str_detect(preprint_version, "No") ~ "No policy"),
           preprint_citation_clean = case_when(
              str_detect(preprint_citation, "Not spec") ~ "Not specified",
              str_detect(preprint_citation, "Unsure") ~ "Unsure",
              str_detect(preprint_citation, "No") ~ "No",
              str_detect(preprint_citation, "only in the text") ~ "Yes",
              str_detect(preprint_citation, "reference list") ~ "Yes",
              is.na(preprint_citation) ~ NA_character_,
              TRUE ~ "Other"),
           opr_indenties_author_clean = case_when(
              str_detect(opr_indenties_author, "Conditional") ~ "Conditional",
              str_detect(str_to_lower(opr_indenties_author), "not spec") ~ "Not specified",
              str_detect(opr_indenties_author, "Optional") ~ "Optional",
              TRUE ~ opr_indenties_author),
          coreview_email = case_when(
              coreview_email == "unsure" ~ "Unsure",
              # lump the two not specified to unsure, since this is similar in this 
              # instance
              coreview_email == "Not specified" ~ "Unsure",
              TRUE ~ coreview_email)
          ) %>% 
  select(pr_type_clean, coreview_email, opr_indenties_author_clean, 
         preprint_version_clean, preprint_citation_clean) %>% 
  # Filter out these journals since they are very special
  filter(pr_type_clean != "Not blinded") %>% 
  mutate(pr_type_clean = paste0("pr_type-", pr_type_clean),
         coreview_email = paste0("coreview-", coreview_email),
         opr_indenties_author_clean = paste0("reviewer_identities-", opr_indenties_author_clean),
         preprint_version_clean = paste0("preprint_version-", preprint_version_clean),
         preprint_citation_clean = paste0("preprint_citation-", preprint_citation_clean)) %>% 
  select(-opr_indenties_author_clean) %>% 
  ca::mjca() %>% 
  plot_ca(dimensions = c(1, 2)) +
  coord_fixed()


```

```{r, fig.width=9, fig.height=13}
multiple_ca <- refined_with_areas %>% 
    mutate(preprint_version_clean = case_when(
              str_detect(preprint_version, "Unsure") ~ "Unsure",
              str_detect(preprint_version, "Any .*? or") ~ "Other",
              str_detect(preprint_version, "Other|Unclear") ~ "Other",
              str_detect(preprint_version, "None") ~ "None",
              str_detect(preprint_version, "First sub") ~ "Yes",
              str_detect(preprint_version, "After peer re") ~ "Yes",
              str_detect(preprint_version, "Any|any") ~ "Yes",
              str_detect(preprint_version, "No") ~ "No policy"),
           preprint_citation_clean = case_when(
              str_detect(preprint_citation, "Not spec") ~ "Not specified",
              str_detect(preprint_citation, "Unsure") ~ "Unsure",
              str_detect(preprint_citation, "No") ~ "No",
              str_detect(preprint_citation, "only in the text") ~ "Yes",
              str_detect(preprint_citation, "reference list") ~ "Yes",
              is.na(preprint_citation) ~ NA_character_,
              TRUE ~ "Other"),
           opr_indenties_author_clean = case_when(
              str_detect(opr_indenties_author, "Conditional") ~ "Conditional",
              str_detect(str_to_lower(opr_indenties_author), "not spec") ~ "Not specified",
              str_detect(opr_indenties_author, "Optional") ~ "Optional",
              TRUE ~ opr_indenties_author),
          coreview_email = case_when(
              coreview_email == "unsure" ~ "Unsure",
              # lump the two not specified to unsure, since this is similar in this 
              # instance
              coreview_email == "Not specified" ~ "Unsure",
              TRUE ~ coreview_email)
          ) %>% 
  select(pr_type_clean, coreview_email, 
         preprint_version_clean, preprint_citation_clean, area) %>% 
  # Filter out these journals since they are very special
  filter(pr_type_clean != "Not blinded") %>% 
  mutate(pr_type_clean = paste0("pr_type-", pr_type_clean),
         coreview_email = paste0("coreview-", coreview_email),
         # opr_indenties_author_clean = paste0("reviewer_identities-", opr_indenties_author_clean),
         preprint_version_clean = paste0("preprint_version-", preprint_version_clean),
         preprint_citation_clean = paste0("preprint_citation-", preprint_citation_clean)) %>% 
  ca::mjca(supcol = c(5))

summary(multiple_ca)  


multiple_ca %>% 
  plot_ca(dimensions = c(1, 2)) +
  coord_fixed()

```

This looks intelligible, but let's try one more time with this crude variable.

```{r, fig.width=10, fig.height=8}
multiple_ca <- refined_with_areas %>% 
    mutate(preprint_version_clean = case_when(
              str_detect(preprint_version, "Unsure") ~ "Unsure",
              str_detect(preprint_version, "Any .*? or") ~ "Other",
              str_detect(preprint_version, "Other|Unclear") ~ "Other",
              str_detect(preprint_version, "None") ~ "None",
              str_detect(preprint_version, "First sub") ~ "Yes",
              str_detect(preprint_version, "After peer re") ~ "Yes",
              str_detect(preprint_version, "Any|any") ~ "Yes",
              str_detect(preprint_version, "No") ~ "No policy"),
           preprint_citation_clean = case_when(
              str_detect(preprint_citation, "Not spec") ~ "Not specified",
              str_detect(preprint_citation, "Unsure") ~ "Unsure",
              str_detect(preprint_citation, "No") ~ "No",
              str_detect(preprint_citation, "only in the text") ~ "Yes",
              str_detect(preprint_citation, "reference list") ~ "Yes",
              is.na(preprint_citation) ~ NA_character_,
              TRUE ~ "Other"),
           opr_indenties_author_clean = case_when(
              str_detect(opr_indenties_author, "Conditional") ~ "Conditional",
              str_detect(str_to_lower(opr_indenties_author), "not spec") ~ "Not specified",
              str_detect(opr_indenties_author, "Optional") ~ "Optional",
              TRUE ~ opr_indenties_author),
          coreview_email = case_when(
              coreview_email == "unsure" ~ "Unsure",
              # lump the two not specified to unsure, since this is similar in this 
              # instance
              coreview_email == "Not specified" ~ "Unsure",
              TRUE ~ coreview_email)
          ) %>% 
  select(pr_type_clean, coreview_email, opr_indenties_author_clean,
         preprint_version_clean, preprint_citation_clean, area) %>% 
  # Filter out these journals since they are very special
  filter(pr_type_clean != "Not blinded") %>% 
  mutate(pr_type_clean = paste0("pr_type-", pr_type_clean),
         coreview_email = paste0("coreview-", coreview_email),
         opr_indenties_author_clean = paste0("reviewer_identities-", opr_indenties_author_clean),
         preprint_version_clean = paste0("preprint_version-", preprint_version_clean),
         preprint_citation_clean = paste0("preprint_citation-", preprint_citation_clean)) %>% 
  ca::mjca(supcol = c(6))

summary(multiple_ca)  


multiple_ca %>% 
  plot_ca(dimensions = c(1, 2)) +
  coord_fixed()

```


```{r, fig.width=10, fig.height=8}
# try with ca
start_data <- refined %>% 
    mutate(preprint_version_clean = case_when(
              str_detect(preprint_version, "Unsure") ~ "Unsure",
              str_detect(preprint_version, "Any .*? or") ~ "Other",
              str_detect(preprint_version, "Other|Unclear") ~ "Other",
              str_detect(preprint_version, "None") ~ "None",
              str_detect(preprint_version, "First sub") ~ "Yes",
              str_detect(preprint_version, "After peer re") ~ "Yes",
              str_detect(preprint_version, "Any|any") ~ "Yes",
              str_detect(preprint_version, "No") ~ "No policy"),
           preprint_citation_clean = case_when(
              str_detect(preprint_citation, "Not spec") ~ "Not specified",
              str_detect(preprint_citation, "Unsure") ~ "Unsure",
              str_detect(preprint_citation, "No") ~ "No",
              str_detect(preprint_citation, "only in the text") ~ "Yes",
              str_detect(preprint_citation, "reference list") ~ "Yes",
              is.na(preprint_citation) ~ NA_character_,
              TRUE ~ "Other"),
           opr_indenties_author_clean = case_when(
              str_detect(opr_indenties_author, "Conditional") ~ "Conditional",
              str_detect(str_to_lower(opr_indenties_author), "not spec") ~ "Not specified",
              str_detect(opr_indenties_author, "Optional") ~ "Optional",
              TRUE ~ opr_indenties_author),
          coreview_email = case_when(
              coreview_email == "unsure" ~ "Unsure",
              # lump the two not specified to unsure, since this is similar in this 
              # instance
              coreview_email == "Not specified" ~ "Unsure",
              TRUE ~ coreview_email)
          ) %>% 
  select(pr_type_clean, coreview_email, opr_indenties_author_clean, 
         preprint_version_clean, preprint_citation_clean) %>% 
  # Drop the "Not blinded" journals, since they are special cases
  # (this filter also drops rows where pr_type_clean is NA)
  filter(pr_type_clean != "Not blinded") %>% 
  mutate(pr_type_clean = paste0("pr_type-", pr_type_clean),
         coreview_email = paste0("coreview-", coreview_email),
         opr_indenties_author_clean = paste0("reviewer_identities-", opr_indenties_author_clean),
         preprint_version_clean = paste0("preprint_version-", preprint_version_clean),
         preprint_citation_clean = paste0("preprint_citation-", preprint_citation_clean))


obj1 <- with(start_data, table(coreview_email, pr_type_clean))
obj2 <- with(start_data, table(opr_indenties_author_clean, pr_type_clean))
obj3 <- with(start_data, table(preprint_version_clean, pr_type_clean))
obj4 <- with(start_data, table(preprint_citation_clean, pr_type_clean))

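# rows 13 and 17 enter as supplementary points; the indices are hard-coded
# positions in the stacked table and need re-checking if categories change
rbind(obj1, obj2, obj3, obj4) %>% 
  ca::ca(suprow = c(13, 17)) %>% 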
  plot_ca() +
  coord_fixed()
```


```{r, fig.width=10, fig.height=8}
# same approach, but cross-tabulating the policies against subject area
start_data <- refined_with_areas %>% 
    mutate(preprint_version_clean = case_when(
              str_detect(preprint_version, "Unsure") ~ "Unsure",
              str_detect(preprint_version, "Any .*? or") ~ "Other",
              str_detect(preprint_version, "Other|Unclear") ~ "Other",
              str_detect(preprint_version, "None") ~ "None",
              str_detect(preprint_version, "First sub") ~ "Yes",
              str_detect(preprint_version, "After peer re") ~ "Yes",
              str_detect(preprint_version, "Any|any") ~ "Yes",
              str_detect(preprint_version, "No") ~ "No policy"),
           preprint_citation_clean = case_when(
              str_detect(preprint_citation, "Not spec") ~ "Not specified",
              str_detect(preprint_citation, "Unsure") ~ "Unsure",
              str_detect(preprint_citation, "No") ~ "No",
              str_detect(preprint_citation, "only in the text") ~ "Yes",
              str_detect(preprint_citation, "reference list") ~ "Yes",
              is.na(preprint_citation) ~ NA_character_,
              TRUE ~ "Other"),
           opr_indenties_author_clean = case_when(
              str_detect(opr_indenties_author, "Conditional") ~ "Conditional",
              str_detect(str_to_lower(opr_indenties_author), "not spec") ~ "Not specified",
              str_detect(opr_indenties_author, "Optional") ~ "Optional",
              TRUE ~ opr_indenties_author),
          coreview_email = case_when(
              # "Not specified" contains "No" and would otherwise match the
              # "Yes|No" pattern below, so it is caught first
              str_detect(coreview_email, "Not spec") ~ "Has no policy",
              str_detect(coreview_email, "Yes|No") ~ "Has policy",
              # everything else ("Unsure", "unsure", NA) counts as no policy
              TRUE ~ "Has no policy")
          ) %>% 
  select(pr_type_clean, coreview_email, opr_indenties_author_clean, 
         preprint_version_clean, preprint_citation_clean, area) %>% 
  # Drop the "Not blinded" journals, since they are special cases
  # (this filter also drops rows where pr_type_clean is NA)
  filter(pr_type_clean != "Not blinded") %>% 
  mutate(pr_type_clean = paste0("pr_type-", pr_type_clean),
         coreview_email = paste0("coreview-", coreview_email),
         opr_indenties_author_clean = paste0("reviewer_identities-", opr_indenties_author_clean),
         preprint_version_clean = paste0("preprint_version-", preprint_version_clean),
         preprint_citation_clean = paste0("preprint_citation-", preprint_citation_clean))


obj1 <- with(start_data, table(coreview_email, area))
obj2 <- with(start_data, table(opr_indenties_author_clean, area))
obj3 <- with(start_data, table(preprint_version_clean, area))
obj4 <- with(start_data, table(preprint_citation_clean, area))
obj5 <- with(start_data, table(pr_type_clean, area))

# stack the contingency tables once (obj2, reviewer identities, is left out)
stacked <- rbind(obj1, obj3, obj4, obj5)

# look at chisq differences
res <- chisq.test(stacked)
res

questionr::cramer.v(stacked)

ca_out <- ca::ca(stacked)


ca_out %>% 
  plot_ca(dimensions = c(1, 2)) +
  coord_fixed() +
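  # the keys are German group labels (presumably assigned by plot_ca()):
  # "Zeilenprofil" = row profile, "Spaltenprofil" = column profile
  scale_color_manual(values = c("Zeilenprofil" = "#1B9E77", "Spaltenprofil" = "#D95F02"))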

```

The first axis separates SSH from the sciences. The second axis is harder to
interpret. It also accounts for little inertia and is therefore not very
influential.
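
How much each axis matters can be read from the principal inertias. The 
following is a minimal sketch on a made-up contingency table (`toy`, its counts
and its labels are purely illustrative, not data from this study). It computes
the inertias in base R via the singular value decomposition that correspondence
analysis builds on. For a fitted object, the same quantities should be 
available as `ca_out$sv^2`.

```{r}
# toy contingency table: policy categories (rows) by subject area (columns)
toy <- matrix(c(20,  5,  3,
                 6, 15,  4,
                 4,  8, 18),
              nrow = 3, byrow = TRUE,
              dimnames = list(policy = c("A", "B", "C"),
                              area = c("SSH", "Life", "Physical")))

P <- toy / sum(toy)     # correspondence matrix
r_mass <- rowSums(P)    # row masses
c_mass <- colSums(P)    # column masses

# standardised residuals from the independence model
S <- diag(1 / sqrt(r_mass)) %*% (P - r_mass %o% c_mass) %*% diag(1 / sqrt(c_mass))

# squared singular values are the principal inertias of the axes
inertia <- svd(S)$d^2
share <- inertia / sum(inertia)
round(share, 3)

# total inertia equals the chi-squared statistic divided by the table total
total_inertia <- sum(inertia)
chisq_over_n <- unname(chisq.test(toy)$statistic) / sum(toy)
```

An axis with a small share adds little to the picture, which is the sense in
which the second axis above is called not very influential.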


```{r, fig.width=8, fig.height=12}
ca_out %>% 
  plot_ca(dimensions = c(1, 2), map = "colprincipal") +
  coord_fixed() +
  scale_color_manual(values = c("Zeilenprofil" = "#1B9E77", "Spaltenprofil" = "#D95F02"))

```

```{r}
summary(ca_out)
ca_out
```

The axes are mainly determined by the type of peer review. It might help to 
remove this variable to bring out more subtle and potentially more meaningful 
patterns regarding the key topics (that SSH journals use double-blind and 
science journals single-blind review seems to be well established).
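
Which categories drive an axis can be checked via their contributions. As a 
minimal sketch on an invented stacked table (`toy_pr`, its counts and its labels
are illustrative assumptions, not study data): the contribution of a row to an
axis is its mass times its squared principal coordinate, divided by the 
principal inertia of that axis. For the fitted object, `summary(ca_out)` should
report the same quantity in the `ctr` columns (in permills).

```{r}
# invented stacked table: two peer review rows, two co-review rows
toy_pr <- matrix(c(30,  4,  6,
                    5, 25,  9,
                   10, 12,  6,
                   25, 17,  9),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(
                   category = c("pr_type-Double blind", "pr_type-Single blind",
                                "coreview-Has policy", "coreview-Has no policy"),
                   area = c("SSH", "Life", "Physical")))

P <- toy_pr / sum(toy_pr)
r_mass <- rowSums(P)
c_mass <- colSums(P)
S <- diag(1 / sqrt(r_mass)) %*% (P - r_mass %o% c_mass) %*% diag(1 / sqrt(c_mass))
dec <- svd(S)

k <- 1:2  # a 4 x 3 table has at most two axes with non-zero inertia

# principal row coordinates
F_rows <- diag(1 / sqrt(r_mass)) %*% dec$u[, k] %*% diag(dec$d[k])

# contribution of each row to each axis: mass * coordinate^2 / inertia
ctr <- sweep(r_mass * F_rows^2, 2, dec$d[k]^2, "/")
dimnames(ctr) <- list(rownames(toy_pr), c("axis1", "axis2"))
round(ctr, 3)
```

Within each axis the contributions sum to one, so a few rows with large values
dominate the orientation of that axis.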

```{r, fig.width=10, fig.height=8}
# try without peer review type (obj5); as before, obj2 is left out
ca_out <- rbind(obj1, obj3, obj4) %>%
  ca::ca()


ca_out %>% 
  plot_ca(dimensions = c(1, 2), map = "rowprincipal") +
  coord_fixed() +
  scale_color_manual(values = c("Zeilenprofil" = "#1B9E77", "Spaltenprofil" = "#D95F02"))

```


```{r, fig.width=9, fig.height=12}
summary(ca_out)
ca_out %>% 
  plot_ca(dimensions = c(1, 2)) +
  coord_fixed() +
  scale_color_manual(values = c("Zeilenprofil" = "#1B9E77", "Spaltenprofil" = "#D95F02"))

```

Once the type of peer review is removed, there seems to be little variation
left to explain.
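
One way to make this concrete is to compare the total inertia (the chi-squared
statistic divided by the table total) of a stacked table with and without the
peer review rows. The following is only a sketch with invented counts: 
`total_inertia()` is a hypothetical helper, and `pr_block`, `other_block` and 
the area labels are made up and say nothing about the actual journals.

```{r}
# total inertia of a contingency table: chi-squared statistic / table total
total_inertia <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected) / sum(tab)
}

areas <- c("SSH", "Life", "Physical")  # hypothetical area labels

# invented block with a strong link to area (stands in for peer review type)
pr_block <- rbind("pr_type-Double blind" = c(30, 4, 6),
                  "pr_type-Single blind" = c(5, 25, 9))
# invented block with a weak link to area (stands in for the other policies)
other_block <- rbind("coreview-Has policy"    = c(10, 12, 6),
                     "coreview-Has no policy" = c(25, 17, 9))
colnames(pr_block) <- colnames(other_block) <- areas

with_pr <- total_inertia(rbind(pr_block, other_block))
without_pr <- total_inertia(other_block)
round(c(with_pr = with_pr, without_pr = without_pr), 3)
```

Applied to the real tables, the same comparison would use `rbind(obj1, obj3, 
obj4, obj5)` and `rbind(obj1, obj3, obj4)`.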


```{r, fig.width=9, fig.height=12}
# explore by starting from peer review type (obj5) and adding co-review
# (obj1) and preprint version (obj3)
ca_out <- rbind(obj5, obj1, obj3) %>%
  ca::ca()


ca_out %>% 
  plot_ca(dimensions = c(1, 2), map = "colprincipal") +
  coord_fixed() +
  scale_color_manual(values = c("Zeilenprofil" = "#1B9E77", "Spaltenprofil" = "#D95F02"))

```

```{r}
summary(ca_out)
```

```{r}
rbind(obj5, obj1) %>% chisq.test() 
```