---
title: oolong - Create Validation Tests for Automated Content Analysis
format:
html:
embed-resources: true
gfm: default
---
## Description
<!-- - Provide a brief and clear description of the method, its purpose, and what it aims to achieve. Add a link to a related paper from social science domain and show how your method can be applied to solve that research question. -->
oolong creates standard human-in-the-loop validity tests for typical automated content analysis methods such as topic modeling and dictionary-based methods. The package offers a standard workflow with functions to prepare, administer, and evaluate a human-in-the-loop validity test. It provides functions for validating topic models using word intrusion and topic intrusion (Chang et al. 2009, <https://papers.nips.cc/paper/3700-reading-tea-leaves-how-humans-interpret-topic-models>) as well as word set intrusion (Ying et al. 2021, [doi:10.1017/pan.2021.33](https://doi.org/10.1017/pan.2021.33)) tests. It also provides functions for generating gold-standard data, which are useful for validating dictionary-based methods. The default settings of all generated tests match those suggested in Chang et al. (2009) and Song et al. (2020, [doi:10.1080/10584609.2020.1723752](https://doi.org/10.1080/10584609.2020.1723752)).
## Keywords
* Validity
* Text Analysis
* Topic Model
## Science Usecase(s)
<!-- - Include usecases from social sciences that would make this method applicable in a certain scenario. -->
<!-- The use cases or research questions mentioned should arise from the latest social science literature cited in the description. -->
This package has been used in the literature to validate topic models and prediction models trained on text data, e.g. [Rauchfleisch et al. (2023)](https://doi.org/10.1080/17512786.2022.2110928), [Rothut et al. (2023)](https://doi.org/10.1177/14614448231164409), and [Eisele et al. (2023)](https://doi.org/10.1080/19312458.2023.2230560).
## Repository structure
This repository follows [the standard structure of an R package](https://cran.r-project.org/doc/FAQ/R-exts.html#Package-structure).
## Environment Setup
With R installed:
```r
install.packages("oolong")
```
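Alternatively, a development version can be installed from GitHub. A minimal sketch, assuming the `remotes` package is installed and using the repository location from the issue tracker below:
```r
# install the development version from GitHub (assumes remotes is installed)
remotes::install_github("gesistsa/oolong")
```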
<!-- ## Hardware Requirements (Optional) -->
<!-- - The hardware requirements may be needed in specific cases when a method is known to require more memory/compute power. -->
<!-- - The method need to be executed on a specific architecture (GPUs, Hadoop cluster etc.) -->
## Input Data
<!-- - The input data has to be a Digital Behavioral Data (DBD) Dataset -->
<!-- - You can provide link to a public DBD dataset. GESIS DBD datasets (https://www.gesis.org/en/institute/digital-behavioral-data) -->
The input data has to be a topic model or prediction model trained on text data. For example, one can train a topic model from the text data included in the package (tweets from Donald Trump, the `trump2k` dataset):
```r
library(oolong)    # provides the trump2k example dataset
library(seededlda)
library(quanteda)

# build a corpus and tokenize it, removing noise tokens
trump_corpus <- corpus(trump2k)
trump_toks <- tokens(trump_corpus, remove_punct = TRUE, remove_numbers = TRUE,
                     remove_symbols = TRUE, split_hyphens = TRUE, remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en")) %>%
  tokens_remove("@*")   # drop @mentions

# fit an 8-topic LDA model
model <- textmodel_lda(x = dfm(trump_toks), k = 8, verbose = TRUE)
```
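The fitted model can then be passed to the test-generating functions described below, for example (the `userid` here is a hypothetical coder name):
```r
# generate a word intrusion test from the fitted model
oolong_test <- wi(model, userid = "coder1")
```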
## Sample Input and Output Data
<!-- - Show how the input data looks like through few sample instances -->
<!-- - Providing a sample output on the sample input to help cross check -->
A sample input is a model trained on text data, e.g.
```{r}
#| message: false
library(oolong)
library(seededlda)
abstracts_seededlda
```
The sample output is an oolong [R6 object](https://r6.r-lib.org/).
## How to Use
Please refer to the [overview of this package](https://gesistsa.github.io/oolong/articles/overview.html) for a comprehensive introduction to all test types.
Suppose there is a topic model called `abstracts_seededlda`, trained on some text data and included in the package.
```{r}
library(oolong)
abstracts_seededlda
```
Suppose one would like to conduct a word intrusion test (Chang et al. 2009) to validate this topic model. This test can be generated by the `wi()` function.
```{r}
oolong_test <- wi(abstracts_seededlda, userid = "Hadley")
oolong_test
```
One can then conduct the test following the instruction displayed, i.e. `oolong_test$do_word_intrusion_test()`.
```r
oolong_test$do_word_intrusion_test()
```
One should see a graphical interface like the following, in which the test can be conducted.
<img src="man/figures/oolong_demo.gif" align="center" height="400" />
After taking the test, one can finalize it by locking the test object.
```{r}
#| include: false
### Mock this process
oolong_test$.__enclos_env__$private$test_content$wi$answer <- oolong_test$.__enclos_env__$private$test_content$wi$intruder
oolong_test$.__enclos_env__$private$test_content$wi$answer[1] <- "wronganswer"
```
```{r}
oolong_test$lock()
```
The result of the test can then be obtained. For example:
```{r}
oolong_test
```
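The other test types mentioned in the description follow the same generate, administer, lock workflow. A minimal sketch, assuming the `ti()`, `wsi()`, and `gs()` functions and the bundled `abstracts` example data described in the package overview:
```r
# topic intrusion test: needs both the model and the underlying corpus
oolong_ti <- ti(abstracts_seededlda, abstracts$text, userid = "Hadley")

# word set intrusion test (Ying et al. 2021)
oolong_wsi <- wsi(abstracts_seededlda, userid = "Hadley")

# gold-standard generation for validating dictionary-based methods
oolong_gs <- gs(input_corpus = abstracts$text, userid = "Hadley")
```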
## Contact Details
Maintainer: Chung-hong Chan <[email protected]>
Issue Tracker: [https://github.com/gesistsa/oolong/issues](https://github.com/gesistsa/oolong/issues)
## Publication
1. Chan, C. H., & Sältzer, M. (2020). oolong: An R package for validating automated content analysis tools. Journal of Open Source Software, 5(55), 2461. https://doi.org/10.21105/joss.02461
<!-- ## Acknowledgements -->
<!-- - Acknowledgements if any -->
<!-- ## Disclaimer -->
<!-- - Add any disclaimers, legal notices, or usage restrictions for the method, if necessary. -->