Commit
Merge pull request #5 from openwashdata/dev
second review
Mian authored Apr 15, 2024
2 parents ef4899d + 29fd059 commit c7a15fb
Showing 215 changed files with 9,916 additions and 3,391 deletions.
13 changes: 10 additions & 3 deletions DESCRIPTION
@@ -1,16 +1,23 @@
Package: washopenresearch
Title: Dataset about open research data information in Water, Sanitation, and Hygiene
Version: 0.0.1
Authors@R:
Authors@R: c(
person("Mian", "Zhong", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0009-0009-4546-7214"))
comment = c(ORCID = "0009-0009-4546-7214")),
person("Ludwig", "Luz", , "[email protected]", role = "aut",
comment = c(ORCID = "0009-0007-9248-3204")),
person("Lars", "Schöbitz", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0003-2196-5015"))
)
Description: The goal of washopenresearch is to provide an overview of open research data related to Water, Sanitation, and Hygiene (WASH). The package provides access to two datasets, `washdev` and `uncnewsletter`. Each dataset collects the following information on scientific articles: (1) article metadata (e.g. title, first author, correspondence author), (2) supplementary material information, (3) data availability statement, and (4) semantic information (e.g. keywords).
License: CC BY 4.0
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
RoxygenNote: 7.3.1
Depends:
R (>= 2.10)
LazyData: true
Config/Needs/website: rmarkdown
Date: 2024-03-01
URL: https://github.com/openwashdata/washopenresearch
BugReports: https://github.com/openwashdata/washopenresearch/issues
19 changes: 12 additions & 7 deletions R/uncnewsletter.R
@@ -1,8 +1,13 @@
#' Dataset about data availability in the UNC Water Newsletter
#'
#' @format ## `uncnewsletter`
#'
#' \describe{
#' \item{url_source}{Publisher website of the paper}
#' \item{paperid}{ID number of the paper on the journal website}
#' \item{volume}{Volume number of the journal}
#' \item{issue}{Issue number of the journal}
#' \item{url}{Official website url of the paper}
#' \item{issue_url}{Website url of the newsletter issue}
#' \item{paper_url}{Official website url of the paper}
#' \item{url_source}{Publisher website of the paper}
#' \item{journal}{Full name of the journal}
#' \item{title}{Title of the paper}
#' \item{published_year}{Year of publication}
@@ -13,17 +18,17 @@
#' \item{num_authors}{Number of the authors}
#' \item{first_author_name}{Name of the first author}
#' \item{first_author_affiliation}{Academic affiliation of the first author}
#' \item{first_author_affiliation_region}{Country or region of the first author parsed from first_author_affiliation variable}
#' \item{first_author_affiliation_country}{Country of the first author, parsed directly from the first_author_affiliation variable and encoded with United Nations names}
#' \item{first_author_email}{Email of the first author}
#' \item{first_author_orcid}{ORCID of the first author}
#' \item{correspondence_author_name}{Name of the correspondence author}
#' \item{correspondence_author_affiliation}{Academic affiliation of the correspondence author}
#' \item{correspondence_author_affiliation_region}{Country or region of the correspondence author parsed from correspondence_author_affiliation variable}
#' \item{correspondence_author_affiliation_country}{Country or region of the correspondence author, parsed directly from the correspondence_author_affiliation variable and encoded with United Nations names}
#' \item{correspondence_author_email}{Email of the correspondence author}
#' \item{correspondence_author_orcid}{ORCID of the correspondence author}
#' \item{has_das}{Whether the paper has a data availability statement}
#' \item{das}{Original data availability statement of the paper}
#' \item{das_type}{Type of the data availability statement #todo}
#' \item{das}{Original data availability statement of the paper. NA if the paper does not have a data availability statement.}
#' \item{das_type}{Type of the data availability statement, one of: in paper (data in full paper scope, such as supplementary material, appendix, or main content); on request (data available on request from the authors); available in online repository (data is shared in a public online repository); not shareable (data is not shareable). NA if the paper does not have a data availability statement.}
#' \item{das_repo_url}{Website url of the data if the relevant data of the paper is shared on a public repository}
#' \item{keywords}{List of keywords of the paper}
#' }
10 changes: 7 additions & 3 deletions R/washdev.R
Original file line number Diff line number Diff line change
@@ -1,8 +1,12 @@
#' Dataset about data availability in the Journal of Water, Sanitation and Hygiene for Development
#'
#' @format ## `washdev`
#'
#' \describe{
#' \item{paperid}{ID number of the paper on the journal website}
#' \item{volume}{Volume number of the journal}
#' \item{issue}{Issue number of the journal}
#' \item{url}{Official website url of the paper}
#' \item{paper_url}{Official website url of the paper}
#' \item{journal}{Full name of the journal}
#' \item{title}{Title of the paper}
#' \item{published_year}{Year of publication}
@@ -22,8 +26,8 @@
#' \item{correspondence_author_email}{Email of the correspondence author}
#' \item{correspondence_author_orcid}{ORCID of the correspondence author}
#' \item{has_das}{Whether the paper has a data availability statement}
#' \item{das}{Original data availability statement of the paper}
#' \item{das_type}{Type of the data availability statement #todo}
#' \item{das}{Original data availability statement of the paper. NA if the paper does not have a data availability statement.}
#' \item{das_type}{Type of the data availability statement, one of: in paper (data in full paper scope, such as supplementary material, appendix, or main content); on request (data available on request from the authors); available in online repository (data is shared in a public online repository); not shareable (data is not shareable). NA if the paper does not have a data availability statement.}
#' \item{das_repo_url}{Website url of the data if the relevant data of the paper is shared on a public repository}
#' \item{keywords}{List of keywords of the paper}
#' }
102 changes: 84 additions & 18 deletions README.Rmd
@@ -91,12 +91,12 @@ library(washopenresearch)

The dataset `washdev` contains data on open access articles of the
*Journal of Water, Sanitation & Hygiene for Development* (Vol.1 Issue
1 - Vol.13 Issue 11). It has `r nrow(washdev)` observations from March 2011 to
November 2023.
1 - Vol.13 Issue 11). It has `r nrow(washdev)` observations from March
2011 to November 2023.

```{r}
washdev |>
head() |>
head(3) |>
gt::gt() |>
gt::as_raw_html()
```
@@ -108,8 +108,8 @@ readr::read_csv("data-raw/dictionary.csv") |>
dplyr::filter(file_name == "washdev.rda") |>
dplyr::select(variable_name:description) |>
knitr::kable() |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "400px")
kableExtra::kable_styling("striped") |>
kableExtra::scroll_box(height = "200px")
```

### uncnewsletter
@@ -120,7 +120,7 @@ News. It has `r nrow(uncnewsletter)` observations from 2020 to 2023.

```{r}
uncnewsletter |>
head() |>
head(3) |>
gt::gt() |>
gt::as_raw_html()
```
@@ -132,28 +132,36 @@ readr::read_csv("data-raw/dictionary.csv") |>
dplyr::filter(file_name == "uncnewsletter.rda") |>
dplyr::select(variable_name:description) |>
knitr::kable() |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "400px")
kableExtra::kable_styling("striped") |>
kableExtra::scroll_box(height = "200px")
```


## Example

### washdev

1. What are the top 10 countries (or regions) of first authors in
the *Journal of Water, Sanitation and Hygiene for Development*?

```{r}
library(washopenresearch)
washdev |>
group_by(first_author_affiliation_region) |>
filter(!is.na(first_author_affiliation_country)) |>
group_by(first_author_affiliation_country) |>
summarise(count=n()) |>
arrange(desc(count)) |>
head(10) |>
ggplot() +
geom_bar(aes(x = reorder(first_author_affiliation_region, count), y = count), stat = "identity") +
labs(title = "Top 10 countries of first author",
geom_col(aes(x = reorder(first_author_affiliation_country, count),
y = count)) +
labs(title = "Top 10 countries of first author",
subtitle = "in the Journal of Water, Sanitation and Hygiene for Development",
x = "First Author Region", y = "Count")
x = "First Author Country", y = "Count") +
scale_x_discrete(labels = scales::label_wrap(15))+
coord_flip() +
theme_classic()
```

2. What are the top choices of keywords in WASH Dev?
@@ -162,24 +170,24 @@ Each publication may provide a list of keywords, typically 5-7, to
summarize the topics of the article. Here we compile all keywords and
calculate how often each one is used.

```{r, echo=TRUE}
```{r washdev_keyword_frequency, echo=TRUE}
keywords_freq <- washdev$keywords |>
purrr::map(function(x) str_extract_all(x, pattern = "(?<=')[^',]*?(?='\\s*)")[[1]]) |>
unlist() |>
str_to_lower() |>
table() |>
as.data.frame() |>
as_tibble() |>
arrange(desc(Freq))
# Top 30 keywords
# Top 20 keywords
ggplot(data = head(keywords_freq, 20)) +
geom_bar(aes(x = reorder(Var1, Freq), y=Freq), stat = "identity") +
coord_flip() +
labs(title = "Top 20 Keywords in WASH Dev Journal", x = "Keywords", y = "Count")
labs(title = "Top 20 Keywords in WASH Dev Journal", x = "Keywords", y = "Count") +
theme_bw()
```

```{r, echo=FALSE, eval=FALSE}
```{r washdev_wordcloud, echo=FALSE, eval=FALSE}
keywords_freq <- keywords_freq |>
rename(word=Var1, freq=Freq) |>
@@ -193,10 +201,68 @@ saveWidget(wc, "man/figures/wc.html", selfcontained = F)
webshot("man/figures/wc.html", "man/figures/washdev_wordcloud.png", delay = 3)
```

### uncnewsletter

1. What are the top 10 source websites of the publications selected by
the newsletter?

```{r}
uncnewsletter |>
group_by(url_source) |>
summarise(count=n()) |>
arrange(desc(count)) |>
head(10) |>
ggplot() +
geom_col(aes(x = reorder(url_source, count),
y = count)) +
labs(title = "Top 10 publication websites",
subtitle = "in the selection of North Carolina Water News",
x = "Website URL", y = "Count") +
scale_x_discrete(labels = scales::label_wrap(15))+
coord_flip() +
theme_classic()
```

## Method

We describe the raw data collection procedure of each dataset in this
section. To reproduce the collection, you need Python 3 installed, along
with the required Python libraries:

```
pip install -r requirements.txt
```

### washdev

The `washdev` dataset is collected by web scraping with Python. The
script can be found in `inst/python/washdev_scraping.py`. First, each
publication link is scraped by iterating over the table of contents of
all volumes. This step produces a table containing the variables paper
ID, volume number, issue number, publication url, journal title,
publication title, and published year. This table is later merged with
the per-publication variables to produce the final dataset.

Then, for each publication, we retrieve the needed variables from the
publication's html file using the publication url. The retrieval is
rule-based: it finds the relevant fields (e.g. supplementary materials)
and extracts their values.
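
As an illustration of what such a rule can look like, the sketch below
pulls the text under a "Data availability statement" heading out of an
HTML page using only the Python standard library. It is not the code in
`inst/python/washdev_scraping.py`: the class `SectionTextParser`, the
function `extract_das`, and the assumption that the statement sits under
a heading tag are all hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3", "h4")


class SectionTextParser(HTMLParser):
    """Collect the text that follows a heading matching a given title.

    Hypothetical helper for illustration; real journal pages may need
    different rules.
    """

    def __init__(self, heading_text):
        super().__init__()
        self.heading_text = heading_text.lower()
        self._in_heading = False
        self._capture = False
        self._heading_buf = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # A new heading ends any section currently being captured.
            self._capture = False
            self._in_heading = True
            self._heading_buf = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            title = "".join(self._heading_buf).strip().lower()
            # Start capturing only if the heading matches the rule.
            self._capture = title == self.heading_text

    def handle_data(self, data):
        if self._in_heading:
            self._heading_buf.append(data)
        elif self._capture and data.strip():
            self.chunks.append(data.strip())


def extract_das(html):
    """Return the data availability statement, or None if it is absent."""
    parser = SectionTextParser("data availability statement")
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return text or None
```

A page without a matching heading yields `None`, which corresponds to
the `NA` value documented for the `das` variable.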

### uncnewsletter

The `uncnewsletter` dataset is collected through a combination of web
scraping and manual annotation. We first use the newsletter archive to
scrape all publication website links. The code can be found at
`inst/python/uncnewsletter_scraping.py`. Two annotators then manually
extracted the needed variables from these publications. For each
publication, an annotator follows an annotation guide to fill in the
values on a collaborative spreadsheet. The guide is converted into the
data dictionary for this dataset.
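
The link-scraping step can be sketched as follows, again with the
standard library only. This is a minimal sketch, not the contents of
`inst/python/uncnewsletter_scraping.py`: the names `LinkCollector`,
`collect_links`, and `url_source` are hypothetical, and deriving a
source website from the host part of a link is an assumption about how
a variable like `url_source` could be filled.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from anchor tags (illustrative)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip mailto:, relative, and empty links.
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def collect_links(html):
    """Return the unique links of a page in first-seen order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(collector.links))


def url_source(url):
    """Return the host part of a link (assumed meaning of url_source)."""
    return urlparse(url).netloc
```

Each collected link would then be handed to an annotator, so the sketch
covers only the automated half of the procedure.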

## License

Data are available as
[CC-BY](https://github.com/openwashdata/wasteskipsblantyre/blob/main/LICENSE.md).
[CC-BY](https://github.com/openwashdata/washopenresearch/blob/main/LICENSE.md).

## Citation
