-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathmethodshub.qmd
112 lines (72 loc) · 4.01 KB
/
methodshub.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
---
title: grafzahl - Supervised Machine Learning for Textual Data Using Transformers and 'Quanteda'
format:
html:
embed-resources: true
gfm: default
---
## Description
<!-- - Provide a brief and clear description of the method, its purpose, and what it aims to achieve. Add a link to a related paper from social science domain and show how your method can be applied to solve that research question. -->
Duct tape the 'quanteda' ecosystem (Benoit et al., 2018) [doi:10.21105/joss.00774](https://doi.org/10.21105/joss.00774) to modern Transformer-based text classification models (Wolf et al., 2020) [doi:10.18653/v1/2020.emnlp-demos.6](https://doi.org/10.18653/v1/2020.emnlp-demos.6), in order to facilitate supervised machine learning for textual data. This package mimics the behaviors of 'quanteda.textmodels' and provides a function to setup the 'Python' environment to use the pretrained models from 'Hugging Face' <https://huggingface.co/>. More information: [doi:10.5117/CCR2023.1.003.CHAN](https://doi.org/10.5117/CCR2023.1.003.CHAN).
## Keywords
<!-- EDITME -->
* Deep Learning
* Supervised machine learning
* Text analysis
## Science Usecase(s)
<!-- - Include usecases from social sciences that would make this method applicable in a certain scenario. -->
<!-- The use cases or research questions mentioned should arise from the latest social science literature cited in the description. -->
<!-- This is an example -->
This package can be used in any typical supervised machine learning usecase involving text data. In the software paper ([Chan et al.](https://doi.org/10.5117/CCR2023.1.003.CHAN)), several cases were presented, e.g. Prediction of incivility based on tweets ([Theocharis et al., 2020](https://doi.org/10.1177/2158244020919447)).
## Repository structure
This repository follows [the standard structure of an R package](https://cran.r-project.org/doc/FAQ/R-exts.html#Package-structure).
## Environment Setup
With R installed:
```r
install.packages("grafzahl")
```
## Hardware Requirements (Optional)
A GPU that supports CUDA is optional.
## Input Data
<!-- - The input data has to be a Digital Behavioral Data (DBD) Dataset -->
<!-- - You can provide link to a public DBD dataset. GESIS DBD datasets (https://www.gesis.org/en/institute/digital-behavioral-data) -->
<!-- This is an example -->
`grafzahl` accepts text data as either character vector or the `corpus` data structure of `quanteda`.
## Sample Input and Output Data
<!-- - Show how the input data looks like through few sample instances -->
<!-- - Providing a sample output on the sample input to help cross check -->
A sample input is a `corpus`. This is an example dataset:
```{r}
#| message: false
library(grafzahl)
library(quanteda)
unciviltweets
```
The output is an S3 object.
## How to Use
Before training, please setup the conda environment.
```r
setup_grafzahl(cuda = TRUE) ## if you have GPU(s)
```
A typical way to train and make predictions.
```r
input <- corpus(ecosent, text_field = "headline")
training_corpus <- corpus_subset(input, !gold)
```
Use the `x` (text data), `y` (label, in this case a [`docvar`](https://quanteda.io/reference/docvars.html)), and `model_name` (Model name, from Hugging Face) parameters to control how the supervised machine learning model is trained.
```r
model2 <- grafzahl(x = training_corpus,
y = "value",
model_name = "GroNLP/bert-base-dutch-cased")
test_corpus <- corpus_subset(input, gold)
predict(model2, test_corpus)
```
## Contact Details
Maintainer: Chung-hong Chan <[email protected]>
Issue Tracker: [https://github.com/gesistsa/grafzahl/issues](https://github.com/gesistsa/grafzahl/issues)
## Publication
1. Chan, C. H. (2023). grafzahl: fine-tuning Transformers for text data from within R. Computational Communication Research, 5(1), 76. <https://doi.org/10.5117/CCR2023.1.003.CHAN>
<!-- ## Acknowledgements -->
<!-- - Acknowledgements if any -->
<!-- ## Disclaimer -->
<!-- - Add any disclaimers, legal notices, or usage restrictions for the method, if necessary. -->