Merge pull request #5 from openwashdata/dev

second review
openwashdata · Apr 15, 2024 · c7a15fb
2 parents ef4899d + 29fd059
Showing 215 changed files with 9,916 additions and 3,391 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,16 +1,23 @@
 Package: washopenresearch
 Title: Dataset about open research data information in Water, Sanitation, and Hygiene
 Version: 0.0.1
-Authors@R: 
+Authors@R: c(
     person("Mian", "Zhong", , "[email protected]", role = c("aut", "cre"),
-           comment = c(ORCID = "0009-0009-4546-7214"))
+           comment = c(ORCID = "0009-0009-4546-7214")),
+    person("Ludwig", "Luz", , "[email protected]", role = "aut",
+           comment = c(ORCID = "0009-0007-9248-3204")),
+    person("Lars", "Schöbitz", , "[email protected]", role = "aut",
+           comment = c(ORCID = "0000-0003-2196-5015"))
+  )
 Description: The goal of washopenresearch is to provide an overview of open research data related to Water Sanitation and Hygiene (WASH). The package provides access to two datasets `washdev` and `uncnewsletter`. Each dataset collects information on scientific articles about (1) article metadata (e.g. title, first author, correspondence author), (2) supplementary material information, (3) data availability statement, and (4) semantic information (e.g. keywords).
 License: CC BY 4.0
 Encoding: UTF-8
 Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.2.3
+RoxygenNote: 7.3.1
 Depends: 
     R (>= 2.10)
 LazyData: true
 Config/Needs/website: rmarkdown
 Date: 2024-03-01
+URL: https://github.com/openwashdata/washopenresearch
+BugReports: https://github.com/openwashdata/washopenresearch/issues
diff --git a/R/uncnewsletter.R b/R/uncnewsletter.R
@@ -1,8 +1,13 @@
+#' Dataset about data availability in the UNC Water Newsletter
+#'
+#' @format ## `uncnewsletter`
+#'
 #' \describe{
+#'   \item{url_source}{Publisher website of the paper}
 #'   \item{paperid}{ID number of the paper on the journal website}
-#'   \item{volume}{Volume number of the journal}
-#'   \item{issue}{Issue number of the journal}
-#'   \item{url}{Official website url of the paper}
+#'   \item{issue_url}{Website url of the newsletter issue}
+#'   \item{paper_url}{Official website url of the paper}
+#'   \item{url_source}{Publisher website of the paper}
 #'   \item{journal}{Full name of the journal}
 #'   \item{title}{Title of the paper}
 #'   \item{published_year}{Year of publication}
@@ -13,17 +18,17 @@
 #'   \item{num_authors}{Number of the authors}
 #'   \item{first_author_name}{Name of the first author}
 #'   \item{first_author_affiliation}{Academic affiliation of the first author}
-#'   \item{first_author_affiliation_region}{Country or region of the first author parsed from first_author_affiliation variable}
+#'   \item{first_author_affiliation_country}{Country of the first author, parsed directly from the first_author_affiliation variable and encoded with United Nations names}
 #'   \item{first_author_email}{Email of the first author}
 #'   \item{first_author_orcid}{ORCID of the first author}
 #'   \item{correspondence_author_name}{Name of the correspondence author}
 #'   \item{correspondence_author_affiliation}{Academic affiliation of the correspondence author}
-#'   \item{correspondence_author_affiliation_region}{Country or region of the correspondence author parsed from correspondence_author_affiliation variable}
+#'   \item{correspondence_author_affiliation_country}{Country or region of the correspondence author, parsed directly from the correspondence_author_affiliation variable and encoded with United Nations names}
 #'   \item{correspondence_author_email}{Email of the correspondence author}
 #'   \item{correspondence_author_orcid}{ORCID of the correspondence author}
 #'   \item{has_das}{Whether the paper has a data availability statement}
-#'   \item{das}{Original data availability statement of the paper}
-#'   \item{das_type}{Type of the data availability statement #todo}
+#'   \item{das}{Original data availability statement of the paper. NA if it does not have a data availability statement.}
+#'   \item{das_type}{Type of the data availability statement, one of: in paper (data within the full paper scope, such as supplementary material, appendix, or main content); on request (data available on request to the authors); available in online repository (data is shared in a public online repository); not shareable (data is not shareable). NA if the paper does not have a data availability statement.}
 #'   \item{das_repo_url}{Website url of the data if the relevant data of the paper is shared on a public repository}
 #'   \item{keywords}{List of keywords of the paper}
 #' }

diff --git a/R/washdev.R b/R/washdev.R
@@ -1,8 +1,12 @@
+#' Dataset about data availability in the Journal of Water, Sanitation and Hygiene for Development
+#'
+#' @format ## `washdev`
+#'
 #' \describe{
 #'   \item{paperid}{ID number of the paper on the journal website}
 #'   \item{volume}{Volume number of the journal}
 #'   \item{issue}{Issue number of the journal}
-#'   \item{url}{Official website url of the paper}
+#'   \item{paper_url}{Official website url of the paper}
 #'   \item{journal}{Full name of the journal}
 #'   \item{title}{Title of the paper}
 #'   \item{published_year}{Year of publication}
@@ -22,8 +26,8 @@
 #'   \item{correspondence_author_email}{Email of the correspondence author}
 #'   \item{correspondence_author_orcid}{ORCID of the correspondence author}
 #'   \item{has_das}{Whether the paper has a data availability statement}
-#'   \item{das}{Original data availability statement of the paper}
-#'   \item{das_type}{Type of the data availability statement #todo}
+#'   \item{das}{Original data availability statement of the paper. NA if it does not have a data availability statement.}
+#'   \item{das_type}{Type of the data availability statement, one of: in paper (data within the full paper scope, such as supplementary material, appendix, or main content); on request (data available on request to the authors); available in online repository (data is shared in a public online repository); not shareable (data is not shareable). NA if the paper does not have a data availability statement.}
 #'   \item{das_repo_url}{Website url of the data if the relevant data of the paper is shared on a public repository}
 #'   \item{keywords}{List of keywords of the paper}
 #' }
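
Review note: the updated `das` and `das_type` documentation implies two checkable rules: both fields are NA when `has_das` is false, and `das_type` takes one of the four listed values. Below is a minimal sketch of such a check, written in Python because the collection scripts under `inst/python` are Python. The helper name `check_das_row`, the boolean encoding of `has_das`, the use of `None` for NA, and the sample rows are all assumptions for illustration, and treating the four types as exhaustive is also an assumption.

```python
# Allowed values, copied from the das_type documentation; treating the list
# as exhaustive is an assumption.
DAS_TYPES = {
    "in paper",
    "on request",
    "available in online repository",
    "not shareable",
}


def check_das_row(row):
    """Return a list of problems found in one record (hypothetical helper)."""
    problems = []
    if not row["has_das"]:
        # Documented rule: das and das_type are NA without a statement.
        for field in ("das", "das_type"):
            if row[field] is not None:
                problems.append(f"{field} should be NA when has_das is false")
    elif row["das_type"] is not None and row["das_type"] not in DAS_TYPES:
        problems.append(f"unexpected das_type: {row['das_type']!r}")
    return problems


# Hypothetical records; None stands in for NA.
rows = [
    {"has_das": True, "das": "Data are available on request.", "das_type": "on request"},
    {"has_das": False, "das": None, "das_type": None},
    {"has_das": False, "das": None, "das_type": "in paper"},
]
for i, row in enumerate(rows):
    print(i, check_das_row(row))
```

A check like this could run over the exported CSV before the `.rda` files are rebuilt, so documentation and data stay in step.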

diff --git a/README.Rmd b/README.Rmd
@@ -91,12 +91,12 @@ library(washopenresearch)
 
 The dataset `washdev` contains data on open access articles of the
 *Journal of Water, Sanitation & Hygiene for Development* (Vol.1 Issue
-1 - Vol.13 Issue 11). It has `r nrow(washdev)` observations from March 2011 to
-November 2023.
+1 - Vol.13 Issue 11). It has `r nrow(washdev)` observations from March
+2011 to November 2023.
 
 ```{r}
 washdev |> 
-  head() |> 
+  head(3) |> 
   gt::gt() |>
   gt::as_raw_html()
 ```
@@ -108,8 +108,8 @@ readr::read_csv("data-raw/dictionary.csv") |>
   dplyr::filter(file_name == "washdev.rda") |>
   dplyr::select(variable_name:description) |> 
   knitr::kable() |> 
-  kableExtra::kable_styling() |> 
-  kableExtra::scroll_box(height = "400px")
+  kableExtra::kable_styling("striped") |> 
+  kableExtra::scroll_box(height = "200px")
 ```
 
 ### uncnewsletter
@@ -120,7 +120,7 @@ News. It has `r nrow(uncnewsletter)` observations from 2020 to 2023.
 
 ```{r}
 uncnewsletter |> 
-  head() |> 
+  head(3) |> 
   gt::gt() |>
   gt::as_raw_html()
 ```
@@ -132,28 +132,36 @@ readr::read_csv("data-raw/dictionary.csv") |>
   dplyr::filter(file_name == "uncnewsletter.rda") |>
   dplyr::select(variable_name:description) |> 
   knitr::kable() |> 
-  kableExtra::kable_styling() |> 
-  kableExtra::scroll_box(height = "400px")
+  kableExtra::kable_styling("striped") |> 
+  kableExtra::scroll_box(height = "200px")
 ```
 
+
 ## Example
 
+### washdev
+
 1.  What are the top 10 countries (or regions) of first authors in
     the *Journal of Water, Sanitation and Hygiene for Development*?
 
 ```{r}
 library(washopenresearch)
 
 washdev |> 
-  group_by(first_author_affiliation_region) |>
+  filter(!is.na(first_author_affiliation_country)) |>
+  group_by(first_author_affiliation_country) |>
   summarise(count=n()) |>
   arrange(desc(count)) |>
   head(10) |>
   ggplot() +
-    geom_bar(aes(x = reorder(first_author_affiliation_region, count), y = count), stat = "identity") +
-   labs(title = "Top 10 countries of first author",
+    geom_col(aes(x = reorder(first_author_affiliation_country, count), 
+                 y = count)) +
+    labs(title = "Top 10 countries of first author",
         subtitle = "in the Journal of Water, Sanitation and Hygiene for Development",
-        x = "First Author Region", y = "Count")
+        x = "First Author Country", y = "Count") +
+    scale_x_discrete(labels = scales::label_wrap(15))+
+    coord_flip() +
+    theme_classic()
 ```
 
 2.  What are the top choices of keywords in WASH Dev?
@@ -162,24 +170,24 @@ Each publication may provide a list of keywords, typically 5-7, to
 summarize the topics of the article. Here we compile all keywords and
 count how often each one is used.
 
-```{r, echo=TRUE}
+```{r washdev_keyword_frequency, echo=TRUE}
 keywords_freq <- washdev$keywords |>
-    purrr::map(function(x) str_extract_all(x, pattern = "(?<=')[^',]*?(?='\\s*)")[[1]]) |>
     unlist() |>
     str_to_lower() |>
   table() |>
   as.data.frame() |>
   as_tibble() |>
   arrange(desc(Freq))
 
-# Top 30 keywords
+# Top 20 keywords
 ggplot(data = head(keywords_freq, 20)) +
   geom_bar(aes(x = reorder(Var1, Freq), y=Freq), stat = "identity") +
   coord_flip() +
-  labs(title = "Top 20 Keywords in WASH Dev Journal", x = "Keywords", y = "Count")
+  labs(title = "Top 20 Keywords in WASH Dev Journal", x = "Keywords", y = "Count") +
+  theme_bw()
 ```
 
-```{r, echo=FALSE, eval=FALSE}
+```{r washdev_wordcloud, echo=FALSE, eval=FALSE}
 
 keywords_freq <- keywords_freq |>
   rename(word=Var1, freq=Freq) |>
@@ -193,10 +201,68 @@ saveWidget(wc, "man/figures/wc.html", selfcontained = F)
 webshot("man/figures/wc.html", "man/figures/washdev_wordcloud.png", delay = 3)
 ```
 
+### uncnewsletter
+
+1.  What are the top 10 source websites of the publications selected by
+    the newsletter?
+
+```{r}
+uncnewsletter |> 
+  group_by(url_source) |>
+  summarise(count=n()) |>
+  arrange(desc(count)) |>
+  head(10) |>
+  ggplot() +
+    geom_col(aes(x = reorder(url_source, count), 
+                 y = count)) +
+   labs(title = "Top 10 publication websites",
+        subtitle = "in the selection of North Carolina Water News",
+        x = "Website URL", y = "Count") +
+   scale_x_discrete(labels = scales::label_wrap(15))+
+   coord_flip() +
+   theme_classic()
+```
+
+## Method
+
+We describe the raw data collection procedure of each dataset in this
+section. To reproduce the collection, you need Python 3 installed and
+the required Python libraries, which can be installed with:
+
+```         
+pip install -r requirements.txt
+```
+
+### washdev
+
+The `washdev` dataset is collected by web scraping with Python. The script
+can be found in `inst/python/washdev_scraping.py`. First, each
+publication link is scraped by iterating over the table of contents of all
+volumes. This step delivers a table containing the variables paper ID,
+volume number, issue number, publication url, journal title, publication
+title, and published year. This table is later merged with the
+per-publication variables to produce the final dataset.
+
+Then, for each publication, we retrieve the needed variables from the
+publication's html file using the publication url. The retrieval is
+rule-based to find the relevant fields (e.g. supplementary materials)
+and extract the value.
+
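
Review note: the rule-based retrieval described above could look roughly like the sketch below. It is not the code in `inst/python/washdev_scraping.py`. It is a standard-library illustration that assumes each field sits in the first `<p>` after an `<h2>`-`<h4>` heading containing a known label. The class name, function name, heading levels, and sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = ("h2", "h3", "h4")  # assumed heading levels


class SectionExtractor(HTMLParser):
    """Collect the text of the first <p> after a heading containing `label`."""

    def __init__(self, label):
        super().__init__()
        self.label = label.lower()
        self.in_heading = False
        self.heading_text = []
        self.armed = False      # True right after a matching heading closes
        self.capturing = False  # True while inside the target <p>
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading = True
            self.heading_text = []
        elif tag == "p" and self.armed and not self.done:
            self.capturing = True

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.in_heading:
            self.in_heading = False
            # Rule: the heading text must contain the label (case-insensitive).
            self.armed = self.label in "".join(self.heading_text).lower()
        elif tag == "p" and self.capturing:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text.append(data)
        elif self.capturing:
            self.chunks.append(data)


def extract_section(html, label):
    """Return the normalized paragraph text, or None if the rule never fires."""
    parser = SectionExtractor(label)
    parser.feed(html)
    text = " ".join("".join(parser.chunks).split())
    return text or None


# Hypothetical page fragment, not taken from the journal website.
sample = """
<h2>Abstract</h2><p>Some abstract.</p>
<h2>Data availability statement</h2>
<p>All relevant data are included in the paper or its <i>Supplementary Information</i>.</p>
<h2>References</h2><p>...</p>
"""
print(extract_section(sample, "data availability"))
print(extract_section(sample, "supplementary material"))
```

Returning `None` when no heading matches lines up with the NA convention documented for `das`.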
+### uncnewsletter
+
+The collection of `uncnewsletter` is a combination of web scraping and
+manual annotation. We first use the newsletter archive to scrape all
+publication website links. The code can be found at
+`inst/python/uncnewsletter_scraping.py`. Two annotators then manually
+extracted the needed variables from these publications. For
+each publication, an annotator follows the guide to fill in the values in
+a collaborative spreadsheet. The guide is converted into the data
+dictionary for this dataset.
+
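
Review note: the link-scraping step described above can be sketched with the standard library alone. This is an illustration, not the contents of `inst/python/uncnewsletter_scraping.py`. The helper names, the sample archive HTML, and the `example.org` URLs are hypothetical. Deriving a `url_source`-like value from the link's host name is an assumption about how that variable relates to the paper URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip relative links such as site navigation.
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def source_of(url):
    """Host name without a leading 'www.' (assumed stand-in for url_source)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


# Hypothetical archive fragment.
archive = """
<ul>
  <li><a href="https://www.example.org/articles/1">Paper one</a></li>
  <li><a href="https://journal.example.com/doi/2">Paper two</a></li>
  <li><a href="/about">About the newsletter</a></li>
</ul>
"""
links = collect_links(archive)
for url in links:
    print(source_of(url), url)
```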
 ## License
 
 Data are available as
-[CC-BY](https://github.com/openwashdata/wasteskipsblantyre/blob/main/LICENSE.md).
+[CC-BY](https://github.com/openwashdata/washopenresearch/blob/main/LICENSE.md).
 
 ## Citation