grafzahl

The goal of grafzahl (Gracious R Analytical Framework for Zappy Analysis of Human Languages [1]) is to duct tape the quanteda ecosystem to modern Transformer-based text classification models, e.g. BERT, RoBERTa, etc. The model object looks and feels like the textmodel S3 object from the package quanteda.textmodels.

If you don’t know what I am talking about, don’t worry, this package is gracious. You don’t need to know a lot about Transformers to use this package. See the examples below.

Please cite this software as:

Chan, C. (2023). grafzahl: fine-tuning Transformers for text data from within R. Computational Communication Research 5(1): 76-84. https://doi.org/10.5117/CCR2023.1.003.CHAN

Installation: Local environment

Install the CRAN version

install.packages("grafzahl")

After that, you need to set up your conda environment

require(grafzahl)
setup_grafzahl(cuda = TRUE) ## if you have GPU(s)

On remote environments, e.g. Google Colab

On Google Colab, you need to enable non-Conda mode

install.packages("grafzahl")
require(grafzahl)
use_nonconda()

Please refer to the vignette.

Usage

Suppose you have a bunch of tweets in the quanteda corpus format, and the corpus has exactly one docvar that denotes the labels you want to predict. The data is from this repository (Theocharis et al., 2020).

unciviltweets
#> Corpus consisting of 19,982 documents and 1 docvar.
#> text1 :
#> "@ @ Karma gave you a second chance yesterday.  Start doing m..."
#> 
#> text2 :
#> "@ With people like you, Steve King there's still hope for we..."
#> 
#> text3 :
#> "@ @ You bill is a joke and will sink the GOP. #WEDESERVEBETT..."
#> 
#> text4 :
#> "@ Dream on. The only thing trump understands is how to enric..."
#> 
#> text5 :
#> "@ @ Just like the Democrat taliban party was up front with t..."
#> 
#> text6 :
#> "@ you are going to have more of the same with HRC, and you a..."
#> 
#> [ reached max_ndoc ... 19,976 more documents ]
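If your own data is not yet in this shape, the sketch below shows one way to build a corpus with exactly one docvar using quanteda. The texts, the labels, and the docvar name uncivil are made up for illustration; they are not part of the unciviltweets data.

```r
library(quanteda)

# Hypothetical toy data: two short texts and their labels
toy_texts <- c("You are a disgrace.", "Thank you for your service.")
toy_labels <- c(1, 0)

# The labels go into a single docvar (the name "uncivil" is made up)
toy_corpus <- corpus(toy_texts, docvars = data.frame(uncivil = toy_labels))

# Check that the corpus has exactly one docvar
ncol(docvars(toy_corpus))
```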

To train a Transformer model, please select the model_name from Hugging Face’s list. The table below lists some common choices. Most of the time, providing model_name is sufficient; there is no need to provide model_type.

Suppose you want to train a Transformer model using “bertweet” (Nguyen et al., 2020) because it matches your domain of usage. By default, the model is saved in the output directory under the current directory. You can save it elsewhere using the output_dir parameter.

model <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")
### If you are a hardcore quanteda user:
## model <- textmodel_transformer(unciviltweets,
##                                model_type = "bertweet", model_name = "vinai/bertweet-base")
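To save the model somewhere other than the default location, pass output_dir. The following is a minimal sketch, not a definitive recipe: the wrapper function train_bertweet and the directory name bertweet_uncivil are hypothetical, and the call is wrapped in a function so that nothing is trained until you run it yourself.

```r
# Hypothetical wrapper: defining it does not trigger any training
train_bertweet <- function(x, output_dir = "bertweet_uncivil") {
  grafzahl::grafzahl(x,
                     model_type = "bertweet",
                     model_name = "vinai/bertweet-base",
                     output_dir = output_dir)  # where the fine-tuned model is saved
}

## Training requires the environment set up during installation:
## model <- train_bertweet(unciviltweets)
```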

Make predictions

predict(model)

That is it.
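If you also have labelled documents that were not used for training, you can compare the predictions against the known labels. The sketch below assumes that predict() accepts a newdata argument, as predict methods in R usually do (please check the package documentation); the corpus heldout and the helper accuracy are hypothetical names, and the numbers in the toy check are made up.

```r
# Share of predictions that match the known labels
accuracy <- function(predicted, truth) {
  mean(as.character(predicted) == as.character(truth))
}

# Hypothetical usage, assuming `model` was trained as above and `heldout`
# is a quanteda corpus with the same single docvar:
## pred <- predict(model, newdata = heldout)
## table(predicted = pred, truth = quanteda::docvars(heldout)[[1]])
## accuracy(pred, quanteda::docvars(heldout)[[1]])

# Toy check with made-up vectors: three of four predictions match
accuracy(c(1, 0, 1, 1), c(1, 0, 0, 1))
```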

Extended examples

Several extended examples are also available.

Examples	file
van Atteveldt et al. (2021)	paper/vanatteveldt.md
Dobbrick et al. (2021)	paper/dobbrick.md
Theocharis et al. (2020)	paper/theocharis.md
OffensEval-TR (2020)	paper/coltekin.md
Amharic News Text classification Dataset (2021)	paper/azime.md

Some common choices of `model_name`

Your data	model_type	model_name
English tweets	bertweet	vinai/bertweet-base
Lightweight	mobilebert	google/mobilebert-uncased
	distilbert	distilbert-base-uncased
Long Text	longformer	allenai/longformer-base-4096
	bigbird	google/bigbird-roberta-base
English (General)	bert	bert-base-uncased
	bert	bert-base-cased
	electra	google/electra-small-discriminator
	roberta	roberta-base
Multilingual	xlm	xlm-mlm-17-1280
	xlm	xlm-mlm-100-1280
	bert	bert-base-multilingual-cased
	xlmroberta	xlm-roberta-base
	xlmroberta	xlm-roberta-large

References

Theocharis, Y., Barberá, P., Fazekas, Z., & Popa, S. A. (2020). The dynamics of political incivility on Twitter. Sage Open, 10(2), 2158244020919447.
Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.

[1] Yes, I totally made up the meaningless long name. Actually, it is the German name of the Sesame Street character Count von Count, meaning “Count (the noble title) Number”. And it seems to be compulsory to name absolutely everything related to Transformers after Sesame Street characters.
