-
Notifications
You must be signed in to change notification settings - Fork 22
/
Copy pathREADME.Rmd
376 lines (289 loc) · 16.2 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
---
output: github_document
---
```{r global_options, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# LexisNexisTools
[![R-CMD-check](https://github.com/JBGruber/LexisNexisTools/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/JBGruber/LexisNexisTools/actions/workflows/R-CMD-check.yaml)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version-ago/LexisNexisTools)](https://cran.r-project.org/package=LexisNexisTools)
[![CRAN_Download_Badge](https://cranlogs.r-pkg.org/badges/grand-total/LexisNexisTools)](https://cran.r-project.org/package=LexisNexisTools)
[![Codecov test coverage](https://codecov.io/gh/JBGruber/LexisNexisTools/branch/master/graph/badge.svg)](https://app.codecov.io/gh/JBGruber/LexisNexisTools?branch=master)
## Motivation
My PhD supervisor once told me that everyone doing newspaper analysis starts by writing code to read in files from the 'LexisNexis' newspaper archive.
However, while I do recommend this exercise, not everyone has the time.
This package provides functions to read in TXT, RTF, DOC and PDF files downloaded from the old 'LexisNexis' or DOCX from the new Nexis Uni, Lexis Advance and similar services.
The package also comes with a few other features that should be useful while working with data from the popular newspaper archive.
**Did you experience any problems, have questions or an idea about a great new feature?** Then please don't hesitate to file an [**issue report**](https://github.com/JBGruber/LexisNexisTools/issues).
## Installation
Install via:
```{r, eval=FALSE}
install.packages("LexisNexisTools")
```
Or get the development version by installing directly from GitHub (if you do not have `remotes` yet install it via `install.packages("remotes")` first):
```{r, eval=FALSE}
remotes::install_github("JBGruber/LexisNexisTools")
```
## Demo
### Load Package
```{r, message=FALSE}
library("LexisNexisTools")
```
If you do not yet have files from 'LexisNexis' but want to test the package, you can use `lnt_sample()` to copy a sample file with mock data into your current working directory:
```{r, eval=FALSE}
lnt_sample()
```
### Rename Files
'LexisNexis' does not give its files proper names.
The function `lnt_rename()` renames files to a standard format:
For TXT files this format is "searchTerm_startDate-endDate_documentRange.txt" (e.g., "Obama_20091201-20100511_1-500.txt") (for other file types the format is similar but depends on what information is available).
Note, that this will not work if your files lack a cover page with this information.
Currently, it seems, like 'LexisNexis' only delivers those cover pages when you first create a link to your search ("link to this search" on the results page), follow this link, and then download the TXT files from there (see here for a [visual explanation](https://github.com/JBGruber/LexisNexisTools/wiki/Downloading-Files-From-Nexis)).
If you do not want to rename files, you can skip to the next section.
The rest of the package's functionality stays untouched by whether you rename your files or not.
However, in a larger database, you will profit from a consistent naming scheme.
There are three ways in which you can rename the files:
- Run lnt_rename() directly in your working directory without the x argument, which will prompt an option to scan for TXT files in your current working directory:
```{r, eval=FALSE}
report <- lnt_rename()
```
- Provide a folder path (and set `recursive = TRUE` if you want to scan for files recursively):
```{r, eval=FALSE}
report <- lnt_rename(x = getwd(), report = TRUE)
```
- Provide a character object with file names.
Use `list.files()` to search for files in a certain path.
```{r, eval=FALSE}
my_files <- list.files(pattern = ".txt", path = getwd(),
full.names = TRUE, recursive = TRUE, ignore.case = TRUE)
report <- lnt_rename(x = my_files, report = TRUE)
report
```
```{r, echo=FALSE}
library(kableExtra)
temp <- paste0(tempfile(), ".TXT")
silent <- file.copy(
from = system.file("extdata", "sample.TXT", package = "LexisNexisTools"),
to = temp,
overwrite = TRUE
)
report <- lnt_rename(x = temp, simulate = FALSE, report = TRUE, verbose = FALSE)
report$name_orig <- "sample.TXT"
newfile <- report$name_new
report$name_new <- basename(report$name_new)
kable(report, format = "markdown")
```
Using `list.files()` instead of the built-in mechanism allows you to specify a file pattern.
This might be a preferred option if you have a folder in which only some of the TXT files contain newspaper articles from 'LexisNexis' but other files have the ending TXT as well.
If you are unsure what the TXT files in your chosen folder might contain, use the option `simulate = TRUE` (which is the default).
The argument `report = TRUE` indicates that the output of the function in `R` will be a data.frame containing a report on which files have been changed on your drive and how.
### Read in 'LexisNexis' Files to Get Meta, Articles and Paragraphs
The main function of this package is `lnt_read()`.
It converts the raw files into three different `data.frames` nested in a special S4 object of class `LNToutput`.
The three data.frames contain (1.) the metadata of the articles, (2.) the articles themselves, and (3.) the paragraphs.
There are several important keywords that are used to split up the raw articles into article text and metadata.
Those need to be provided in some form but can be left to 'auto' to use 'LexisNexis' defaults in several languages.
All keywords can be regular expressions and need to be in most cases:
- `start_keyword`: The English default is "\\d+ of \\d+ DOCUMENTS$" which stands for, for example, "1 of 112 DOCUMENTS". It is used to split up the text in the TXT files into individual articles. You will not have to change anything here, except you work with documents in languages other than the currently supported.
- `end_keyword`: This keyword is used to remove unnecessary information at the end of an article. Usually, this is "^LANGUAGE:". Where the keyword isn't found, the additional information ends up in the article text.
- `length_keyword`: This keyword, which is usually just "^LENGTH:" (or its equivalent in other languages) finds the information about the length of an article.
However, since this is always the last line of the metadata, it is used to separate metadata and article text.
There seems to be only one type of cases where this information is missing:
if the article consists only of a graphic (which 'LexisNexis' does not retrieve).
The final output from `lnt_read()` has a column named `Graphic`, which indicates if this keyword was missing.
The article text then contains all metadata as well.
In these cases, you should remove the whole article after inspecting it.
(Use `View(LNToutput@articles$Article[LNToutput@meta$Graphic])` to view these articles in a spreadsheet like viewer.)
<p align="center">
<img src="man/figures/LN.png" width="100%" border="1" />
</p>
To use the function, you can again provide either file name(s), folder name(s) or nothing---to search the current working directory for relevant files---as `x` argument:
```{r eval=FALSE}
LNToutput <- lnt_read(x = getwd())
```
```{r echo=FALSE, message=FALSE}
LNToutput <- lnt_read(x = newfile)
LNToutput@meta$Source_File <- basename(LNToutput@meta$Source_File)
```
## Creating LNToutput from 1 file...
## ...files loaded [0.0016 secs]
## ...articles split [0.0089 secs]
## ...lengths extracted [0.0097 secs]
## ...newspapers extracted [0.01 secs]
## ...dates extracted [0.012 secs]
## ...authors extracted [0.013 secs]
## ...sections extracted [0.014 secs]
## ...editions extracted [0.014 secs]
## ...headlines extracted [0.016 secs]
## ...dates converted [0.023 secs]
## ...metadata extracted [0.026 secs]
## ...article texts extracted [0.029 secs]
## ...paragraphs extracted [0.041 secs]
## ...superfluous whitespace removed from articles [0.044 secs]
## ...superfluous whitespace removed from paragraphs [0.046 secs]
## Elapsed time: 0.047 secs
The returned object of class `LNToutput` is intended to be an intermediate container.
As it stores articles and paragraphs in two separate data.frames, nested in an S4 object, the relevant text data is stored twice in almost the same format.
This has the advantage, that there is no need to use special characters, such as "\\n".
However, it makes the files rather big when you save them directly.
The object can, however, be easily converted to regular data.frames using `@` to select the data.frame you want:
```{r, eval=FALSE}
meta_df <- LNToutput@meta
articles_df <- LNToutput@articles
paragraphs_df <- LNToutput@paragraphs
# Print meta to get an idea of the data
head(meta_df, n = 3)
```
```{r, echo=FALSE}
meta_df <- LNToutput@meta
articles_df <- LNToutput@articles
paragraphs_df <- LNToutput@paragraphs
meta_df$Source_File <- basename(meta_df$Source_File)
# Print meta to get an idea of the data
kable(head(meta_df, n = 3), format = "markdown")
```
If you want to keep only one data.frame including metadata and text data you can easily do so:
```{r, message=FALSE}
meta_articles_df <- lnt_convert(LNToutput, to = "data.frame")
# Or keep the paragraphs
meta_paragraphs_df <- lnt_convert(LNToutput, to = "data.frame", what = "Paragraphs")
```
Alternatively, you can convert LNToutput objects to formats common in other packages using the function `lnt_convert`:
```{r eval=FALSE}
rDNA_docs <- lnt_convert(LNToutput, to = "rDNA")
quanteda_corpus <- lnt_convert(LNToutput, to = "quanteda")
tCorpus <- lnt_convert(LNToutput, to = "corpustools")
tidy <- lnt_convert(LNToutput, to = "tidytext")
Corpus <- lnt_convert(LNToutput, to = "tm")
dbloc <- lnt_convert(LNToutput, to = "SQLite")
```
See `?lnt_convert` for details and comment in [this issue](https://github.com/JBGruber/LexisNexisTools/issues/2) if you want a format added to the convert function.
### Identify Highly Similar Articles
In 'LexisNexis' itself, there is an option to group highly similar articles.
However, experience shows that this feature does not always work perfectly.
One common problem when working with 'LexisNexis' data is thus that many articles appear to be delivered twice or more times.
While direct duplicates can be filtered out using, for example, `LNToutput <- LNToutput[!duplicated(LNToutput@articles$Article), ]` this does not work for articles with small differences.
Hence when one comma or white space is different between two articles, they are treated as different.
The function `lnt_similarity()` combines the fast similarity measure from [quanteda](https://github.com/quanteda/quanteda) with the much slower but more accurate relative [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to compare all articles published on the same day.
Calculating the Levenshtein distance might be very slow though if you have many articles published each day in your data set.
If you think the less accurate similarity measure might be sufficient in your case, simply turn this feature off with `rel_dist = FALSE`.
The easiest way to use `lnt_similarity()` is to input an `LNToutput` object directly.
However, it is also possible to provide texts, dates and IDs separately:
```{r, eval=FALSE}
# Either provide a LNToutput
duplicates_df <- lnt_similarity(LNToutput = LNToutput,
threshold = 0.97)
```
```{r, results='hide', message=FALSE}
# Or the important parts separatley
duplicates_df <- lnt_similarity(texts = LNToutput@articles$Article,
dates = LNToutput@meta$Date,
IDs = LNToutput@articles$ID,
threshold = 0.97)
```
## Checking similiarity for 10 articles over 4 dates...
## ...quanteda dfm construced for similarity comparison [0.063 secs].
## ...processing date 2010-01-08: 0 duplicates found [0.064 secs].
## ...processing date 2010-01-09: 0 duplicates found [0.064 secs].
## ...processing date 2010-01-10: 0 duplicates found [0.25 secs].
## ...processing date 2010-01-11: 5 duplicates found [3.05 secs].
## Threshold = 0.97; 4 days processed; 5 duplicates found; in 3.05 secs
Now you can inspect the results using the function `lnt_diff()`:
```{r, eval=FALSE}
lnt_diff(duplicates_df, min = 0, max = Inf)
```
<p align="center">
<img src="man/figures/diff.png" alt="diff" border="1">
</p>
By default, 25 randomly selected articles are displayed one after another, ordered by least to most different within the min and max limits.
After you have chosen a good cut-off value, you can subset the `duplicates_df` data.frame and remove the respective articles:
```{r}
duplicates_df <- duplicates_df[duplicates_df$rel_dist < 0.2]
LNToutput <- LNToutput[!LNToutput@meta$ID %in% duplicates_df$ID_duplicate, ]
```
Note, that you can subset LNToutput objects almost like you would in a regular data.frame using the square brackets.
```{r}
LNToutput[1, ]
```
In this case, writing `[1, ]` delivers an LNToutput object which includes only the first article and the metadata and paragraphs belonging to it.
Now you can extract the remaining articles or convert them to a format you prefer.
```{r, eval=FALSE}
#' generate new dataframes without highly similar duplicates
meta_df <- LNToutput@meta
articles_df <- LNToutput@articles
paragraphs_df <- LNToutput@paragraphs
# Print e.g., meta to see how the data changed
head(meta_df, n = 3)
```
```{r, echo=FALSE}
meta_df <- LNToutput@meta
articles_df <- LNToutput@articles
paragraphs_df <- LNToutput@paragraphs
kable(head(meta_df, n = 3), format = "markdown")
```
### Lookup Keywords
While downloading from 'LexisNexis', you have already used keywords to filter relevant articles from a larger set.
However, while working with the data, your focus might change or you might want to find the different versions of your keyword in the set.
Both can be done using `lnt_lookup`:
```{r}
lnt_lookup(LNToutput, pattern = "statistical computing")
```
The output shows that the keyword pattern was only found in the article with ID 9, all other values are `NULL`, which means the keyword wasn't found.
If your focus shifts and you want to subset your data to only include articles which mention this keyword, you could append this information to the meta information in the LNToutput object and then subset it to articles where the list entry is different from `NULL`.
```{r}
LNToutput@meta$stats <- lnt_lookup(LNToutput, pattern = "statistical computing")
LNToutput <- LNToutput[!sapply(LNToutput@meta$stats, is.null), ]
LNToutput
```
Another use of the function is to find out which versions of your keyword are in the set.
You can do so by using regular expressions.
The following looks for words starting with the 'stat', followed by more characters, up until the end of the word (the pattern internally always starts and ends at a word boundary).
```{r}
lnt_lookup(LNToutput, pattern = "stat.*?")
```
You can use `table()` to count the different versions of patterns:
```{r}
table(unlist(lnt_lookup(LNToutput, pattern = "stat.+?\\b")))
```
## Issues and questions?
These are the common functions in *LexisNexisTools*. Do you feel like anything is missing from the package or is one of the functions not doing its job? File an issue here to let me know: [issue tracker](https://github.com/JBGruber/LexisNexisTools/issues).
```{r make vignette from readme, echo=FALSE, results='hide', message=FALSE}
library("magrittr")
readLines("README.Rmd") %>%
.[grep("^## Demo", .):(grep("^## Issues", .) - 1)] %>%
gsub("man/", "../man/", .) %>%
c(
'---',
'title: "Basic Usage"',
'author: "Johannes B. Gruber"',
'date: "`r Sys.Date()`"',
'output: rmarkdown::html_vignette',
'vignette: >',
' %\\VignetteIndexEntry{Basic Usage}',
' %\\VignetteEngine{knitr::rmarkdown}',
' %\\VignetteEncoding{UTF-8}',
'---',
' ',
'```{r setup, include = FALSE}',
'knitr::opts_chunk$set(',
' collapse = TRUE,',
' comment = "#>"',
')',
'library("kableExtra")',
'```',
.,
'```{r echo = FALSE}',
'unlink("sample.TXT")',
'```'
) %>%
writeLines("./vignettes/demo.Rmd")
rmarkdown::render("./vignettes/demo.Rmd")
unlink(c("sample.TXT",
"./vignettes/SampleFile_20091201-20100511_1-10.txt",
"SampleFile_20091201-20100511_1-10.txt"))
```