-
Notifications
You must be signed in to change notification settings - Fork 0
/
20022018-talk.Rmd
468 lines (279 loc) · 10.1 KB
/
20022018-talk.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
---
title: "Quantifying lexical contact in Daghestan"
author: "M. Daniel, I. Chechuro, S. Verhees"
date: "20.02.2018 <br> Linguistic Convergence Lab seminar"
output:
xaringan::moon_reader:
css: ["default", "my_theme.css"]
lib_dir: libs
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(echo = FALSE,
message=FALSE,
warning= FALSE,
fig.width=10.4,
fig.height=7)
library(tidyverse)
library(lingtypology)
library(phangorn)
library(knitr)
```
## Outline of the talk
- Comparison of our shortlist and the WOLD questionnaire
--
- Data from the field: Rutul region
+ Similarity sets and distance matrices
+ Neighbournet visualization
--
- Dictionary data: neighbournet
--
- Avar area
--
- Further plans
---
## World Loanword Database project: a brief overview
- [WOLD](http://wold.clld.org/meaning) questionnaire - 1814 items
--
- The aim of the [WOLD](http://wold.clld.org/meaning) project: empirical test of lexical borrowability
--
- East Caucasian in [WOLD](http://wold.clld.org/meaning): two languages in central Daghestan: Archi (Lezgic) and Bezhta (Tsezic)
--
- [WOLD](http://wold.clld.org/meaning) includes all parts of speech (including 905 nouns, 334 verbs, 120 adjectives ...)
--
- Empirically confirms the lexical borrowability hierarchy, with the Noun on top
--
- Conclusion: to easily quantify lexical contact, focus on nouns
---
## Shortlist
- We decided to also compile a short list that can be quickly collected in the field from several speakers - WOLD is too long
--
- The short list was supposed to show a high ratio of horyzontally borrowed items (higher than full WOLD), lower ratio of vertical borrowings
--
- Thus: nouns coming from mid-range of the WOLD list (from 30 to 45) - to avoid conservative lexical items and lexical items typical in vertical borrowings; names of locally non-existent objects were avoided; some assumedly locally important nouns were added
--
- Resulting shortlist: some 225 nouns
---
## Shortlist versus WOLD longlist
```{r, echo = F, results = 'asis'}
count <- read_tsv("loancount.csv")
colnames(count) <- c("Archi", "WOLD", "%", "short", "%",
"Bezhta", "WOLD", "%", "short", "%")
kable(count[,1:5], format = "html")
```
---
## Shortlist versus WOLD longlist
```{r, echo = F, results = 'asis'}
count <- read_tsv("loancount.csv")
colnames(count) <- c("Archi", "WOLD", "%", "short", "%",
"Bezhta", "WOLD", "%", "short", "%")
kable(count[,6:10], format = "html")
```
---
## Shortlist versus WOLD longlist
- The shortlist does not increase the percentage of horyzontal loans; neither does it reduce the percentage of vertical loans - instead, it captures percentages similar to the long list
--
- For Archi, they concur with data on multilingualism (many people speak Avar, some speak Lak)
--
- Relative rise of Georgian borrowings in Bezhta seems to be due to the domain of everyday objects (e.g. 'spoon', 'candle').
---
class: inverse, center, middle
## The rationale for the fieldwork
- Even for minority languages, dictionaries tend to be on the conservative (pure language) side
--
- Collect more naturalistic data on the usage of loanwords from speakers
--
- Collect comparable data from several speakers in a partly controlled experimental setting
--
- Downs: no strict transcription of the data, similarity judgements without etymological investigation
---
## Fieldwork
```{r, fig.height=7}
# general area map
loan <- read_csv("loan_villages.csv")
map.feature(loan$Language,
features = loan$Language,
tile = c("Esri.OceanBasemap"),
color = loan$color,
latitude = loan$Lat,
longitude = loan$Lon,
title = "Language",
label = loan$EngNames,
label.hide = F,
zoom.control = T)
```
---
## Simple counts for Turkic in Rutul
Proof of concept:
- Ikhrek:
94 loans in at least one speaker, 82 in at least two, 77 in three, 66 in all four
- Kina:
93 loans in at least one speaker, 81 in at least two, 62 in all three
- Kiche:
81 loans in at least one speaker, 61 in both
Results seem to be pretty stable
---
## Simple counts for Turkic in Rutul
- Cross-village comparison of loan inventory stability.
How many loans only occur in one village?
- Ikhrek:
5 loans occur in at least one speaker; 2 in two, 2 in 3, 2 in all 4
- Kina:
3 loans occur in in at least one speaker; 2 in 2; 2 in 3
- Kiche:
6 loans occur in at least one speaker; 2 in both
69 common loans attested in at least one speaker out of each village
54 in at least two speakers out of each village
---
## Simple counts for Turkic in Rutul
- Cross-section of inventory sets between the villages
Common in: Ikhrek^Kina 30; Kina^Kiche 20; Ikhrek^Kiche 10
Present in Ikhrek+ but absent in Kina- 56; Kina+Ikhrek- 34
Ikhrek+Kiche- 87; Kiche+Ikhrek-99
Kina+Kiche- 47; Kiche+Kina- 87
---
## Similarity sets
One meaning may correspond to one or more lexemes in a language. Each distinct lexeme forms a set - similar lexemes belong to the same set.
**SV style**
```{r}
sim <- read_tsv("similarity2.csv")
kable(sim, format = "html")
```
---
## Similarity sets
One meaning may correspond to one or more lexemes in a language. Each distinct lexeme forms a set - similar lexemes belong to the same set.
**MD style**
```{r}
sim <- read_tsv("similarity.csv")
kable(sim, format = "html")
```
---
## From similarity sets to distance matrices
- Convert shared loans into similarity models (neighbornet): Number of times two languages share a similarity set is converted into a distance matrix
- Simplistic method: Count similarities between lects by counting how many similarity sets two lects share.
- Issue: similarities by cognate are counted (to be excluded?).
- Issue: NOT having the same loan is counted as similarity, even if different items are used (Are 0's indeed counted?)
- Complification: A similarity sets is only counted if it goes beyond a set of closely related languages.
Lines where similarity sets are contained in e.g. Turkic only or Rutul only are excluded from the counts. (MD-style below)
- Complification: One meaning is counted twice. Two languages are deemed similar within a meaning if the share at least one similarity set for this meaning.
---
## Similarity to distance matrix
- The number of times two languages share a similarity set is converted into a distance matrix
- Distance - number of shared sets divided by the number of meanings (minus the meanings for which one of the lects had not data)
--
- Distance matrix forms the input for a neighbournet visualization
--
```{r}
dist <- read.csv("rutul_dist.csv", sep = ";")
kable(dist[1:4,1:3], format = "html")
```
---
class: inverse, center, middle
## Rutul area
---
## Rutul area (SV style)
```{r}
Rutul_SV <- read.csv("Rutul_SV.csv", header = TRUE, sep = ";", na.strings = "NA")
Rutul_SV <- as.dist(Rutul_SV)
nnet <- neighborNet(Rutul_SV)
plot(nnet, "2D")
```
---
## Rutul area (MD style)
```{r}
Rutul_MD <- read.csv("Rutul_MD.csv", header = TRUE, sep = ";", na.strings = "NA")
Rutul_MD <- as.dist(Rutul_MD)
nnet <- neighborNet(Rutul_MD)
plot(nnet, "2D")
```
---
class: inverse, center, middle
## Without "stop words"
If a meaning has a different lexeme in each language, remove it.
The idea is to measure distance based on loans only.
---
## Rutul area (SV style)
```{r}
Rutul_No_Stopwords_SV <- read.csv("Rutul_No_Stopwords_SV.csv", header = TRUE, sep = ";", na.strings = "NA")
Rutul_No_Stopwords_SV <- as.dist(Rutul_No_Stopwords_SV)
nnet <- neighborNet(Rutul_No_Stopwords_SV)
plot(nnet, "2D")
```
---
## Rutul area (MD style)
```{r}
Rutul_No_Stopwords_MD <- read.csv("Rutul_No_Stopwords_MD.csv", header = TRUE, sep = ";", na.strings = "NA")
Rutul_No_Stopwords_MD <- as.dist(Rutul_No_Stopwords_MD)
nnet <- neighborNet(Rutul_No_Stopwords_MD)
plot(nnet, "2D")
```
---
class: inverse, center, middle
## Distance between major languages
Based on a larger subset of the WOLD questionnaire (~600 items).
Data gathered from dictionaries.
Visualizations based on the same variables as the Rutul data.
---
## Major languages (SV style)
```{r}
Big_LGS_SV <- read.csv("Big_LGS_SV.csv", header = TRUE, sep = ";", na.strings = "NA")
Big_LGS_SV <- as.dist(Big_LGS_SV)
nnet <- neighborNet(Big_LGS_SV)
plot(nnet, "2D")
```
---
## Major languages (MD style)
```{r}
Big_LGS_MD <- read.csv("Big_LGS_MD.csv", header = TRUE, sep = ";", na.strings = "NA")
Big_LGS_MD <- as.dist(Big_LGS_MD)
nnet <- neighborNet(Big_LGS_MD)
plot(nnet, "2D")
```
---
class: inverse, center, middle
## Without "stop words"
If a meaning has a different lexeme in each language, remove it.
The idea is to measure distance based on loans only.
---
## Major languages (SV style)
```{r}
Big_LGS_No_Stopwords_SV <- read.csv("Big_LGS_No_Stopwords_SV.csv", header = TRUE, sep = ";", na.strings = "NA")
Big_LGS_SV <- as.dist(Big_LGS_No_Stopwords_SV)
nnet <- neighborNet(Big_LGS_No_Stopwords_SV)
plot(nnet, "2D")
```
---
## Major languages (MD style)
```{r}
Big_LGS_No_Stopwords_MD <- read.csv("Big_LGS_No_Stopwords_MD.csv", header = TRUE, sep = ";", na.strings = "NA")
Big_LGS_MD <- as.dist(Big_LGS_No_Stopwords_MD)
nnet <- neighborNet(Big_LGS_No_Stopwords_MD)
plot(nnet, "2D")
```
---
class: inverse, center, middle
## Avar area
---
## Avar area (MD style)
```{r}
avar <- read.csv("avar_dist.csv", header = TRUE, sep = ";", na.strings = "NA")
avar <- as.dist(avar)
nnet <- neighborNet(avar)
plot(nnet, "2D")
```
Only one Avar word common to all: *ʁalbac'* 'lion'.
---
## Further plans
- Converge the Rutul and the Andi branches of the project
--
- Focus on one type of data analysis / visualization
--
- Learn more about neighbornets
---
## Thank you!
We used
[**lingtypology**](https://ropensci.github.io/lingtypology/), [**phangorn**](https://cran.r-project.org/web/packages/phangorn/index.html) and [**xaringan**](https://github.com/yihui/xaringan) for this presentation.