---
title: "0b. Training sample"
author: '100385774'
date: "2023-05-15"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# 0. Loading the data
The data were previously compiled into a single .csv file in 'Preliminar.Rmd'.
```{r}
library(tidyverse)
library(data.table)
library(ggplot2)
library(stringr)
library(openxlsx)
library(lubridate)
library(emo)
library(stopwords)
```
```{r}
df <- fread("data/tweets.csv")
```
Extract the tweet IDs and save them to a text file, so the dataset can be replicated:
```{r}
# Keep only the id column
ids <- df %>%
  select(id) %>%
  as.vector()

# Write the IDs to a plain-text file, without row names
write.table(ids, file = "data/tweetsIDS.txt", row.names = FALSE)
```
# 1. Manual annotation
Remove duplicate rows and keep only tweets in Spanish:
```{r}
df <- df %>%
  distinct() %>%
  filter(lang == "es")
```
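As a minimal illustration of this step, the sketch below applies the same two calls to a small invented data frame (`toy` and its contents are hypothetical, not part of the dataset). It is shown as a plain code block so it does not run when the document is knitted.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  id   = c(1, 1, 2, 3),
  text = c("hola", "hola", "hello", "adios"),
  lang = c("es", "es", "en", "es")
)

toy_clean <- toy %>%
  distinct() %>%        # rows 1 and 2 are identical, so only one is kept
  filter(lang == "es")  # the English tweet (id 2) is dropped

nrow(toy_clean)
```

Note that `distinct()` with no arguments compares every column, so two tweets with the same text but different `id` values are both kept.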
## Manual annotations
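Filter the tweets by hate-related search terms and draw a random sample of 500 tweets for each group of terms, plus a random sample of 2,000 tweets from the whole dataset.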
```{r}
# Fix the seed so the random samples are reproducible
set.seed(123)

# One sample of 500 tweets per group of search terms.
# str_detect() matches substrings and is case-sensitive.
annotation_1 <- df %>%
  filter(str_detect(text, "moro")) %>%
  sample_n(500, replace = FALSE)

annotation_2 <- df %>%
  filter(str_detect(text, "negro|mono")) %>%
  sample_n(500, replace = FALSE)

annotation_3 <- df %>%
  filter(str_detect(text, "chino|sudaca|panchito|gitano")) %>%
  sample_n(500, replace = FALSE)

annotation_4 <- df %>%
  filter(str_detect(text, "maricon|maricón|travelo|transexual|travesti")) %>%
  sample_n(500, replace = FALSE)

annotation_5 <- df %>%
  filter(str_detect(text, "a tu pais|a tu país|a su pais|a su país")) %>%
  sample_n(500, replace = FALSE)

annotation_6 <- df %>%
  filter(str_detect(text, "puta|zorra|a la cocina|chupapollas")) %>%
  sample_n(500, replace = FALSE)

# Random sample from the whole dataset, regardless of search terms
annotation_random <- df %>%
  sample_n(2000, replace = FALSE)
```
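Because the patterns above are matched as case-sensitive substrings, "moro" also matches words such as "enamoro". The sketch below shows one possible stricter variant, using word boundaries and case-insensitive matching on invented toy data. `sample_matches` and `toy` are hypothetical names introduced only for this illustration; this is not the procedure used to build the annotation sample, and it uses `slice_sample()` in place of `sample_n()`.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: sample up to `n` rows whose text matches `pattern`
# as a whole word, ignoring case.
sample_matches <- function(data, pattern, n) {
  rx <- regex(paste0("\\b(", pattern, ")\\b"), ignore_case = TRUE)
  data %>%
    filter(str_detect(text, rx)) %>%
    # slice_sample() returns all matching rows when fewer than n exist
    slice_sample(n = n)
}

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  id   = 1:4,
  text = c("un Moro dijo", "me enamoro de ti", "el moro", "nada que ver")
)

set.seed(123)
hits <- sample_matches(toy, "moro", n = 500)
sort(hits$id)
```

Here only the tweets with ids 1 and 3 are matched, since "enamoro" does not contain "moro" as a separate word. Unlike `sample_n(500, replace = FALSE)`, which raises an error when fewer than 500 rows are available, `slice_sample()` silently returns the rows it has.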
```{r}
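# Combine all samples, drop tweets sampled more than once,
# and add a default label (hate = 0) before manual annotation
annotation <- bind_rows(
  annotation_1, annotation_2, annotation_3,
  annotation_4, annotation_5, annotation_6,
  annotation_random
) %>%
  distinct() %>%
  mutate(hate = 0) %>%
  select(text, hate, id)

# Export the sample to Excel
write.xlsx(annotation, file = "annotationbis.xlsx")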
```
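The combination step can be sketched on toy data as follows. The two small samples are invented for illustration (`sample_a`, `sample_b` and `annotation_toy` are hypothetical names), and the block is not evaluated when knitting. It shows that a tweet drawn in more than one sample appears only once in the final table.

```r
library(dplyr)

# Hypothetical toy samples; the tweet with id 2 appears in both
sample_a <- tibble(id = c(1, 2), text = c("uno", "dos"))
sample_b <- tibble(id = c(2, 3), text = c("dos", "tres"))

annotation_toy <- bind_rows(sample_a, sample_b) %>%
  distinct() %>%          # the repeated tweet is kept once
  mutate(hate = 0) %>%    # every row starts with the default label 0
  select(text, hate, id)  # same column order as the exported file

annotation_toy
```

Since the keyword samples and the random sample are drawn independently from the same data, overlaps are possible, so the final table may contain fewer rows than the sum of the sample sizes.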