Error in makeTopMatrix(prevalence, data) #290

Open
violetmv opened this issue May 16, 2024 · 0 comments
violetmv commented May 16, 2024

I am receiving the following error message:

Error in makeTopMatrix(prevalence, data) : Error creating model matrix.
This could be caused by many things including
explicit calls to a namespace within the formula.
Try a simpler formula.

after running the following block of code:

topic_model_subject <- stm(documents = corpus_stm$documents,
                           vocab = corpus_stm$vocab,
                           K = 25,
                           prevalence = ~Subject,
                           max.em.its = 100,
                           data = corpus_stm$meta,
                           init.type = "Spectral")

I ran the same code last week (May 9, 2024) and this error did not appear. When I remove the prevalence argument, stm runs as designed, even though the 'Subject' variable is present in the corpus metadata. This leads me to believe the issue could be due to a package update, especially since the error message points to explicit namespace calls within the formula as a possible cause. Any help with this issue would be much appreciated!
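For reference, here is a minimal way to reproduce the model-matrix step outside of stm (a sketch assuming the prevalence formula was ~Subject; the object names mirror those in the full code below):

#Sanity checks on the prevalence covariate (assumes prevalence = ~Subject)
str(corpus_stm$meta$Subject)   #confirm the covariate exists and check its class
anyNA(corpus_stm$meta$Subject) #NAs in a covariate can break the model matrix
#Rebuild the design matrix that the error says stm failed to create
mm <- model.matrix(~Subject, data = corpus_stm$meta)
dim(mm)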

Here are subsets of the data:
WoS_text:
WoS_text.zip
Scopus_texts:
Scopus_texts.zip

Here is the total relevant code:

#install packages --------------------------------------------------------

#Installs
install.packages("rJava")
install.packages("tabula")
install.packages("tidyverse")
install.packages("tabulizer")
install.packages("tabulizerjars")
install.packages("qdapRegex")
install.packages("quanteda.textstats")
install.packages("tidytext")
install.packages("dplyr")
install.packages("tm")
install.packages("stm")
install.packages("furrr")
install.packages("reshape2")
install.packages("ggthemes")
install.packages("kableExtra")
install.packages("nord")
install.packages("colorr")
install.packages("readtext")

library(pdftools)
library(tabula)
library(rJava)
library(tidyverse)
library(qdapRegex)
library(stringr)
library(quanteda)
library(quanteda.textstats)
library(tm)
library(stm)
library(furrr)
library(tidytext)
library(reshape2)
library(ggthemes)
library(kableExtra)
library(nord)
library(filesstrings)
library(xfun)
library(colorr)
library(readtext)

#read in data ------------------------------------------------------------

setwd("C:/Users/viole/OneDrive/Desktop/CCAL/stm")

#WOS Read texts
wos_texts <- readtext("WoS_text",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Author", "Year", "Journal", "Topic", "Methods", "Scale", "Subject", "Database"))

#split into paragraphs
wos_paragraphs <- lapply(wos_texts$text, function(txt) unlist(strsplit(txt, "\n\n")))

#Convert paragraphs to data frames
wos_paragraphs_df <- data.frame(text_field = unlist(wos_paragraphs), stringsAsFactors = FALSE)

#SCOPUS Read texts
scopus_texts <- readtext("Scopus_texts",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Author", "Year", "Journal", "Topic", "Methods", "Scale", "Subject", "Database"))

#split into paragraphs
scopus_paragraphs <- lapply(scopus_texts$text, function(txt) unlist(strsplit(txt, "\n\n")))

#Convert paragraphs to data frames
scopus_paragraphs_df <- data.frame(text_field = unlist(scopus_paragraphs), stringsAsFactors = FALSE)

#create corpus -----------------------------------------------------------

paragraphs_df <- rbind(scopus_paragraphs_df, wos_paragraphs_df)

#Create a corpus from the data frame
corpus <- corpus(paragraphs_df$text_field)

#Remove numbers, punct, symbols, separators
corpus_tokens <- tokens(corpus, what = "word",
                        remove_numbers = TRUE,
                        remove_punct = TRUE,
                        remove_symbols = TRUE,
                        remove_separators = TRUE,
                        remove_url = TRUE)

#Remove stopwords
corpus_tokens <- tokens_select(corpus_tokens, stopwords('english'),
                               selection = 'remove')

#Remove custom tokens (e.g. "et", "al")
corpus_tokens <- tokens_select(corpus_tokens,
                               c("et", "al", "fig"),
                               selection = "remove",
                               case_insensitive = TRUE)

#Remove words shorter than 2 characters
corpus_tokens <- tokens_remove(corpus_tokens, min_nchar = 2)

#Compound bigrams and trigrams
corpus_tokens <- tokens_compound(corpus_tokens,
                                 phrase(c("adaptive capacity",
                                          "affordable housing",
                                          "air conditioning",
                                          "air temperature",
                                          "building code*",
                                          "climat* chang*",
                                          "developing countr*",
                                          "developed countr*",
                                          "extreme heat",
                                          "extreme weather",
                                          "flood insurance",
                                          "flood risk",
                                          "food security",
                                          "global warming",
                                          "gray infrastructure",
                                          "grey infrastructure",
                                          "green infrastructure",
                                          "green roof",
                                          "heat island",
                                          "heat pump",
                                          "heat stress",
                                          "heat wave",
                                          "home owner*",
                                          "hous* price*",
                                          "human rights",
                                          "hurricane isabel",
                                          "hurricane katrina",
                                          "hurricane sandy",
                                          "informal housing",
                                          "informal settlement*",
                                          "local government*",
                                          "managed retreat",
                                          "mental health",
                                          "natural disaster*",
                                          "passive hous*",
                                          "planning polic*",
                                          "property owner*",
                                          "real estate",
                                          "risk management",
                                          "risk reduction",
                                          "sea level",
                                          "social capital",
                                          "social housing",
                                          "social vulnerabilit*",
                                          "socio economic",
                                          "solar radiation",
                                          "spatial planning",
                                          "storm surge",
                                          "stormwater management",
                                          "thermal comfort",
                                          "tree cover*",
                                          "urban development",
                                          "urban planning",
                                          "urban poor",
                                          "urban resilience",
                                          "urban sprawl",
                                          "disaster risk management",
                                          "disaster risk reduction",
                                          "flood risk management",
                                          "flood risk reduction",
                                          "urban heat island",
                                          "sea level rise",
                                          "extreme weather event*")))

#Stem words
corpus_tokens <- tokens_wordstem(corpus_tokens,
                                 language = quanteda_options("language_stemmer"))

#Tokens to lowercase
corpus_tokens <- tokens_tolower(corpus_tokens)

#Convert to dfm (with stopwords removed) DOCUMENT FEATURE MATRIX
corpus_dfm <- dfm(corpus_tokens)

#Trim
corpus_dfm <- dfm_trim(corpus_dfm, min_termfreq = 3)

#Convert to STM
corpus_stm <- convert(x = corpus_dfm, to = "stm", docvars = docvars(corpus), omit_empty = FALSE)
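
#Optional sanity check before fitting (a sketch, not part of the original script):
#the stm documents and metadata must stay aligned, and omit_empty = FALSE keeps
#any empty documents that dfm_trim() may have created.
length(corpus_stm$documents) == nrow(corpus_stm$meta)
sum(lengths(corpus_stm$documents) == 0)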

#run stm with covariate "Subject" (takes around 90 iterations) -----------

topic_model_subject <- stm(documents = corpus_stm$documents,
                           vocab = corpus_stm$vocab,
                           K = 25,
                           prevalence = ~Subject,
                           max.em.its = 100,
                           data = corpus_stm$meta,
                           init.type = "Spectral")

#inspect topics in stm ---------------------------------------------------

topics_content <- labelTopics(topic_model_subject, c(1:25))
