20022018-talk.Rmd

---
title: "Quantifying lexical contact in Daghestan"
author: "M. Daniel, I. Chechuro, S. Verhees"
date: "20.02.2018 <br> Linguistic Convergence Lab seminar"
output:
  xaringan::moon_reader:
    css: ["default", "my_theme.css"]
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(echo = FALSE,
                      message=FALSE,
                      warning= FALSE,
                      fig.width=10.4,
                      fig.height=7)
library(tidyverse)
library(lingtypology)
library(phangorn)
library(knitr)
```

## Outline of the talk

- Comparison of our shortlist and the WOLD questionnaire

--

- Data from the field: Rutul region

  + Similarity sets and distance matrices
  + Neighbournet visualization
  
--

- Dictionary data: neighbournet

--

- Avar area

--

- Further plans

---


## World Loanword Database project: a brief overview

- [WOLD](http://wold.clld.org/meaning) questionnaire - 1814 items 

--

- The aim of the [WOLD](http://wold.clld.org/meaning) project: empirical test of lexical borrowability

-- 

- East Caucasian in [WOLD](http://wold.clld.org/meaning): two languages in central Daghestan: Archi (Lezgic) and Bezhta (Tsezic)

--

- [WOLD](http://wold.clld.org/meaning) includes all parts of speech (including 905 nouns, 334 verbs, 120 adjectives ...)

--

- Empirically confirms the lexical borrowability hierarchy, with the Noun on top

--

- Conclusion: to easily quantify lexical contact, focus on nouns

---

## Shortlist

- We decided to also compile a short list that can be quickly collected in the field from several speakers - WOLD is too long

--

- The short list was supposed to show a high ratio of horyzontally borrowed items (higher than full WOLD), lower ratio of vertical borrowings

--

- Thus: nouns coming from mid-range of the WOLD list (from 30 to 45) - to avoid conservative lexical items and lexical items typical in vertical borrowings; names of locally non-existent objects were avoided; some assumedly locally important nouns were added

--

- Resulting shortlist: some 225 nouns

---


## Shortlist versus WOLD longlist

```{r, echo = F, results = 'asis'}

count <- read_tsv("loancount.csv")
colnames(count) <- c("Archi", "WOLD", "%", "short", "%",
                     "Bezhta", "WOLD", "%", "short", "%")

kable(count[,1:5], format = "html")


```

---

## Shortlist versus WOLD longlist

```{r, echo = F, results = 'asis'}

count <- read_tsv("loancount.csv")
colnames(count) <- c("Archi", "WOLD", "%", "short", "%",
                     "Bezhta", "WOLD", "%", "short", "%")

kable(count[,6:10], format = "html")


```

---

## Shortlist versus WOLD longlist

- The shortlist does not increase the percentage of horyzontal loans; neither does it reduce the percentage of vertical loans - instead, it captures percentages similar to the long list

--

- For Archi, they concur with data on multilingualism (many people speak Avar, some speak Lak)

--

- Relative rise of Georgian borrowings in Bezhta seems to be due to the domain of everyday objects (e.g. 'spoon', 'candle').

---


class: inverse, center, middle

## The rationale for the fieldwork

- Even for minority languages, dictionaries tend to be on the conservative (pure language) side

--

- Collect more naturalistic data on the usage of loanwords from speakers

--

- Collect comparable data from several speakers in a partly controlled experimental setting

--

- Downs: no strict transcription of the data, similarity judgements without etymological investigation

---

## Fieldwork

```{r, fig.height=7}

# general area map

loan <- read_csv("loan_villages.csv")

map.feature(loan$Language,
            features = loan$Language,
            tile = c("Esri.OceanBasemap"),
            color = loan$color,
            latitude = loan$Lat,
            longitude = loan$Lon,
            title = "Language",
            label = loan$EngNames,
            label.hide = F,
            zoom.control = T)

```

---

## Simple counts for Turkic in Rutul

Proof of concept:

- Ikhrek: 
94 loans in at least one speaker, 82 in at least two, 77 in three, 66 in all four
- Kina: 
93 loans in at least one speaker, 81 in at least two, 62 in all three
- Kiche: 
81 loans in at least one speaker, 61 in both

Results seem to be pretty stable
---
## Simple counts for Turkic in Rutul

- Cross-village comparison of loan inventory stability.

How many loans only occur in one village? 
- Ikhrek: 
5 loans occur in at least one speaker; 2 in two, 2 in 3, 2 in all 4
- Kina: 
3 loans occur in in at least one speaker; 2 in 2; 2 in 3
- Kiche: 
6 loans occur in at least one speaker; 2 in both

69 common loans attested in at least one speaker out of each village
54 in at least two speakers out of each village
---
## Simple counts for Turkic in Rutul

- Cross-section of inventory sets between the villages

Common in: Ikhrek^Kina 30; Kina^Kiche 20; Ikhrek^Kiche 10

Present in Ikhrek+ but absent in Kina- 56; Kina+Ikhrek- 34
Ikhrek+Kiche- 87; Kiche+Ikhrek-99
Kina+Kiche- 47; Kiche+Kina- 87

---
## Similarity sets

One meaning may correspond to one or more lexemes in a language. Each distinct lexeme forms a set - similar lexemes belong to the same set.

**SV style**

```{r}
sim <- read_tsv("similarity2.csv")
kable(sim, format = "html")
```

---

## Similarity sets

One meaning may correspond to one or more lexemes in a language. Each distinct lexeme forms a set - similar lexemes belong to the same set.

**MD style**

```{r}
sim <- read_tsv("similarity.csv")
kable(sim, format = "html")
```

---

## From similarity sets to distance matrices

- Convert shared loans into similarity models (neighbornet): Number of times two languages share a similarity set is converted into a distance matrix

- Simplistic method: Count similarities between lects by counting how many similarity sets two lects share.

- Issue: similarities by cognate are counted (to be excluded?). 

- Issue: NOT having the same loan is counted as similarity, even if different items are used (Are 0's indeed counted?)

- Complification: A similarity sets is only counted if it goes beyond a set of closely related languages.
Lines where similarity sets are contained in e.g. Turkic only or Rutul only are excluded from the counts. (MD-style below)

- Complification: One meaning is counted twice. Two languages are deemed similar within a meaning if the share at least one similarity set for this meaning.

---

## Similarity to distance matrix

- The number of times two languages share a similarity set is converted into a distance matrix 

- Distance - number of shared sets divided by the number of meanings (minus the meanings for which one of the lects had not data)

--

- Distance matrix forms the input for a neighbournet visualization

--

```{r}

dist <- read.csv("rutul_dist.csv", sep = ";")
kable(dist[1:4,1:3], format = "html")

```

---

class: inverse, center, middle

## Rutul area

---

## Rutul area (SV style)

```{r}

Rutul_SV <- read.csv("Rutul_SV.csv", header = TRUE, sep = ";", na.strings = "NA")
Rutul_SV <- as.dist(Rutul_SV)
nnet <- neighborNet(Rutul_SV)
plot(nnet, "2D")

```


---


## Rutul area (MD style)

```{r}

Rutul_MD <- read.csv("Rutul_MD.csv", header = TRUE, sep = ";", na.strings = "NA")
Rutul_MD <- as.dist(Rutul_MD)
nnet <- neighborNet(Rutul_MD)
plot(nnet, "2D")

```


---

class: inverse, center, middle

## Without "stop words"
If a meaning has a different lexeme in each language, remove it.
The idea is to measure distance based on loans only.

---

## Rutul area (SV style)

```{r}

Rutul_No_Stopwords_SV <- read.csv("Rutul_No_Stopwords_SV.csv", header = TRUE, sep = ";", na.strings = "NA")
Rutul_No_Stopwords_SV <- as.dist(Rutul_No_Stopwords_SV)
nnet <- neighborNet(Rutul_No_Stopwords_SV)
plot(nnet, "2D")

```

---


## Rutul area (MD style)

```{r}

Rutul_No_Stopwords_MD <- read.csv("Rutul_No_Stopwords_MD.csv", header = TRUE, sep = ";", na.strings = "NA")
Rutul_No_Stopwords_MD <- as.dist(Rutul_No_Stopwords_MD)
nnet <- neighborNet(Rutul_No_Stopwords_MD)
plot(nnet, "2D")

```

---


class: inverse, center, middle

## Distance between major languages
Based on a larger subset of the WOLD questionnaire (~600 items).
Data gathered from dictionaries.
Visualizations based on the same variables as the Rutul data.

---

## Major languages (SV style)

```{r}

Big_LGS_SV <- read.csv("Big_LGS_SV.csv", header = TRUE, sep = ";", na.strings = "NA")
Big_LGS_SV <- as.dist(Big_LGS_SV)
nnet <- neighborNet(Big_LGS_SV)
plot(nnet, "2D")

```


---


## Major languages (MD style)

```{r}

Big_LGS_MD <- read.csv("Big_LGS_MD.csv", header = TRUE, sep = ";", na.strings = "NA")
Big_LGS_MD <- as.dist(Big_LGS_MD)
nnet <- neighborNet(Big_LGS_MD)
plot(nnet, "2D")

```

---

class: inverse, center, middle

## Without "stop words"
If a meaning has a different lexeme in each language, remove it.
The idea is to measure distance based on loans only.

---

## Major languages (SV style)

```{r}

Big_LGS_No_Stopwords_SV <- read.csv("Big_LGS_No_Stopwords_SV.csv", header = TRUE, sep = ";", na.strings = "NA")
Big_LGS_SV <- as.dist(Big_LGS_No_Stopwords_SV)
nnet <- neighborNet(Big_LGS_No_Stopwords_SV)
plot(nnet, "2D")

```

---


## Major languages (MD style)

```{r}

Big_LGS_No_Stopwords_MD <- read.csv("Big_LGS_No_Stopwords_MD.csv", header = TRUE, sep = ";", na.strings = "NA")
Big_LGS_MD <- as.dist(Big_LGS_No_Stopwords_MD)
nnet <- neighborNet(Big_LGS_No_Stopwords_MD)
plot(nnet, "2D")

```

---

class: inverse, center, middle

##  Avar area

---

## Avar area (MD style)

```{r}

avar <- read.csv("avar_dist.csv", header = TRUE, sep = ";", na.strings = "NA")
avar <- as.dist(avar)
nnet <- neighborNet(avar)
plot(nnet, "2D")

```
Only one Avar word common to all: *ʁalbac'* 'lion'.

---

## Further plans

- Converge the Rutul and the Andi branches of the project
--
- Focus on one type of data analysis / visualization
--
- Learn more about neighbornets
---

## Thank you!

We used

[**lingtypology**](https://ropensci.github.io/lingtypology/), [**phangorn**](https://cran.r-project.org/web/packages/phangorn/index.html) and [**xaringan**](https://github.com/yihui/xaringan) for this presentation.