---
title: "0b. Training sample"
author: '100385774'
date: "2023-05-15"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# 0. Loading the data

The data were previously compiled into a single .csv file in 'Preliminar.Rmd'.

```{r}
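library(tidyverse)   # attaches dplyr, ggplot2 and stringr, among others
library(data.table)  # fread() for fast CSV import
library(ggplot2)
library(stringr)
library(openxlsx)    # write.xlsx() for the annotation spreadsheet
library(lubridate)
library(emo)         # not on CRAN; install with remotes::install_github("hadley/emo")
library(stopwords)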
```


```{r}
df <- fread("data/tweets.csv")
```

Extract the tweet IDs for replicability:

```{r}
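# Keep only the ID column; select() already returns a one-column table,
# which write.table() writes under an `id` header
ids <- df %>% 
  select(id)

write.table(ids, file = "data/tweetsIDS.txt", row.names = FALSE)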
```
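
Tweet IDs are typically 19-digit integers, which is more than a double can represent exactly, so it is worth checking that they survive a round trip to disk unchanged. The chunk below is a minimal sketch, assuming the IDs can be handled as strings; `toy_ids` and the temporary file are hypothetical and stand in for `ids` and `data/tweetsIDS.txt`.

```{r}
library(data.table)

# Hypothetical toy IDs, stored as strings so that no digits are lost
toy_ids <- data.table(id = c("1658000000000000001", "1658000000000000002"))

tmp <- tempfile(fileext = ".txt")
fwrite(toy_ids, tmp)

# Force the column to character on read instead of letting fread() guess a numeric type
ids_back <- fread(tmp, colClasses = list(character = "id"))

# TRUE when every ID comes back exactly as it was written
identical(ids_back$id, toy_ids$id)
```

The same `colClasses` argument could be passed when reading `data/tweets.csv`, if the `id` column there turns out to be parsed as a number.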


# 1. Manual annotation

Remove duplicate rows and keep only Spanish-language tweets:

```{r}
df <- df %>% 
  distinct() %>% 
  filter(lang == "es") 
```
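
Note that `distinct()` without arguments only drops rows that are identical in every column. The sketch below, on hypothetical toy data, shows how that differs from deduplicating on the tweet ID alone; the `likes` column is an assumption used to make two copies of the same tweet differ.

```{r}
library(dplyr)

toy_tweets <- tibble(
  id    = c("1", "1", "2", "3"),
  text  = c("hola", "hola", "adiós", "hello"),
  lang  = c("es", "es", "es", "en"),
  likes = c(0, 5, 1, 2)  # hypothetical column that differs between two copies of tweet "1"
)

# Full-row deduplication keeps both copies of tweet "1" because `likes` differs
by_row <- toy_tweets %>% 
  distinct() %>% 
  filter(lang == "es")

# Deduplicating on the ID keeps only the first copy of each tweet
by_id <- toy_tweets %>% 
  distinct(id, .keep_all = TRUE) %>% 
  filter(lang == "es")

nrow(by_row)
nrow(by_id)
```

Whether ID-level deduplication is preferable depends on how the original .csv was compiled, which this file does not show.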

## Sampling for manual annotation

Filter the tweets by groups of hate-related terms, draw 500 tweets from each group, and add a random sample of 2,000 tweets from the whole data set.

```{r}
set.seed(123)

# One sample of 500 tweets per term group. str_detect() is case-sensitive and
# matches substrings; sample_n() errors if a group has fewer than 500 matches.
annotation_1 <- df %>% 
  filter(str_detect(text, "moro")) %>% 
  sample_n(500, replace = FALSE)

annotation_2 <- df %>% 
  filter(str_detect(text, "negro|mono")) %>% 
  sample_n(500, replace = FALSE)

annotation_3 <- df %>% 
  filter(str_detect(text, "chino|sudaca|panchito|gitano")) %>% 
  sample_n(500, replace = FALSE)

annotation_4 <- df %>% 
  filter(str_detect(text, "maricon|maricón|travelo|transexual|travesti")) %>% 
  sample_n(500, replace = FALSE)

annotation_5 <- df %>% 
  filter(str_detect(text, "a tu pais|a tu país|a su pais|a su país")) %>% 
  sample_n(500, replace = FALSE)

annotation_6 <- df %>% 
  filter(str_detect(text, "puta|zorra|a la cocina|chupapollas")) %>% 
  sample_n(500, replace = FALSE)

# Random sample drawn from the whole filtered data set
annotation_random <- df %>% 
  sample_n(2000, replace = FALSE)
```
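
Because the patterns above are plain substrings, a term such as "moro" also matches inside longer words and misses upper-case spellings. The chunk below is a minimal sketch of a stricter alternative on hypothetical toy data, assuming whole-word, case-insensitive matching is what is wanted; it also caps the sample size at the number of matches. Note that the whole-word pattern would no longer match inflected forms such as plurals unless they are added explicitly.

```{r}
library(dplyr)
library(stringr)

toy_text <- tibble(
  id   = as.character(1:5),
  text = c("ese moro de ahí", "me enamoro", "MORO", "amoroso saludo", "nada que ver")
)

# Plain substring match: also hits "enamoro" and "amoroso", and misses "MORO"
substring_hits <- toy_text %>% 
  filter(str_detect(text, "moro"))

# Whole-word, case-insensitive match
word_hits <- toy_text %>% 
  filter(str_detect(text, regex("\\bmoro\\b", ignore_case = TRUE)))

# Cap the sample size so sampling never asks for more rows than exist
set.seed(123)
capped <- word_hits %>% 
  sample_n(min(500, nrow(word_hits)), replace = FALSE)

substring_hits$text
word_hits$text
```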

```{r}

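# Pool all samples; distinct() drops tweets that were drawn more than once
annotation <- bind_rows(
  annotation_1, annotation_2, annotation_3,
  annotation_4, annotation_5, annotation_6,
  annotation_random
) %>% 
  distinct() %>% 
  mutate(hate = 0) %>%   # placeholder label, to be filled in by hand in the spreadsheet
  select(text, hate, id)

write.xlsx(annotation, file = "annotationbis.xlsx")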

```
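
If it is useful to know later which term group each annotated tweet came from, `bind_rows()` can record this through its `.id` argument. The chunk below is a sketch on hypothetical toy samples (`s1`, `s2`, `s_random`), not the procedure used above; the `source` column and the group names are assumptions introduced for illustration.

```{r}
library(dplyr)

# Hypothetical toy samples standing in for the term-group and random samples
s1       <- tibble(id = c("1", "2"), text = c("a", "b"))
s2       <- tibble(id = c("2", "3"), text = c("b", "c"))
s_random <- tibble(id = c("3", "4"), text = c("c", "d"))

toy_annotation <- bind_rows(
  list(group_1 = s1, group_2 = s2, random = s_random),
  .id = "source"                       # records which sample each row came from
) %>% 
  distinct(id, .keep_all = TRUE) %>%   # a tweet drawn twice is kept once, under its first source
  mutate(hate = 0) %>% 
  select(text, hate, id, source)

toy_annotation
```

With a `source` column present, a plain `distinct()` would no longer remove tweets drawn by two different groups, which is why the sketch deduplicates on `id` instead.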