-
Notifications
You must be signed in to change notification settings - Fork 0
/
MSA-Project.Rmd
550 lines (423 loc) · 30.5 KB
/
MSA-Project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
---
title: "PCA on Finnish Municipalities"
author: "Niko Miller"
date: "`r format(Sys.time(), '%d.%m.%Y')`"
output:
bookdown::pdf_document2:
keep_tex: false
number_sections: true
fig_caption: true
bibliography: references.bib
header-includes:
- \pagenumbering{gobble}
- \usepackage{amsmath}
- \usepackage{amssymb}
- \usepackage{amsfonts}
- \usepackage{mathtools}
- \usepackage{graphicx}
- \usepackage{placeins}
- \usepackage{float}
- \floatplacement{figure}{H}
- \usepackage{framed}
- \usepackage[font={small}]{caption}
- \usepackage[font={small,it}]{subcaption}
- \usepackage{hyperref}
- \usepackage[labelsep=period]{caption}
- \usepackage[labelfont=bf]{caption}
- \captionsetup{justification=raggedright,singlelinecheck=false}
subtitle: "Dimensionality reduction on demographic data of Finnish municipalities with PCA"
geometry: top=0.7in, bottom=0.5in, left=0.5in, right=0.7in
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::knit_hooks$set(optipng = knitr::hook_optipng)
options(ggrepel.max.overlaps = Inf)
library(tidyverse)
library(knitr)
library(kableExtra)
library(summarytools)
library(pxweb)
library(ggplot2)
library(reshape2)
library(ggcorrplot)
library(ggrepel)
library(ggfortify)
# muni.latest <- read.csv("muni.csv")
# muni.latest <- muni.latest[, -1] %>% as_tibble()
muni.latest <- read.csv("muni_2021.csv", sep = ";" , header = F)
colnames(muni.latest) <- c('Area',
'Deg.Urbanisation',
'Popul',
'Popul.Growth',
'Prop.Under15',
'Prop.15to64',
'Prop.Over64',
'Prop.Swedish',
'Prop.Foreign',
'Excess.Births',
'Migr.Gain',
'Families',
'Households',
'Prop.Households.Terr.Det',
'Prop.Households.Rental',
'Prop.Educ.Degree2',
'Prop.Educ.Degree3',
'Labour.Force',
'Empl.Rate',
'Prop.Empl.Muni',
'Prop.Unempl',
'Prop.Pensioners',
'Depend.Ratio',
'Jobs.Muni',
'Prop.Primary.Sector',
'Prop.Secondary.Sector',
'Prop.Services.Sector',
'Jobs.Self.Suff',
'Contr.Margin',
'Loan.Stock',
'Group.Loan.Stock',
'Educ.Cult.Activity',
'Soc.Health.Activity')
muni.latest <- muni.latest %>% filter(!grepl('koko maa|maakunta|seutukunta|mariehamn', Area, ignore.case = T))
muni.latest <- muni.latest %>%
arrange(desc(Popul))
```
\newpage
\pagenumbering{arabic}
# Introduction
## Data Collection and Cleaning
Tilastokeskus provides data on 32 demographic variables Finnish municipalities. The data can be freely obtained from Tilastokeskus' website in csv. format [@data].
The variables provide information on e.g., the municipalities' population and its growth, age structure, economic sector split, and housing type split.
We concentrate on municipalities only, so we filtered out the country (koko maa), counties (maakunta) and subregions (seutukunta) from the data. Moreover, we removed Mariehamn Stad as it contains the same information as Maarianhamina. The final data contains 314 municipalities or equivalently, observations, as we have a cross-sectional data of statistics for one year.
## Research Question and Motivation
The purpose of this study is to reduce dimensionality in the data and assess whether differences between the municipalities can be explained by considerably less than 32 dimensions - namely by the first few principal components. This study's contribution can be seen as exploratory data analysis where the municipalities are clustered in a lower dimension. The results can be applied in a virtue of further work. Examples are more advanced clustering, predictive modelling or gathering contextual knowledge for economic decision making.
Clustering municipalities with demographic factors can be useful in a business context as well, e.g., marketing. Knowing how municipalities differ in demographics can help a company to improve its targeting. For instance, they could launch different marketing campaigns or product lines in different municipalities.
# Univariate Analysis
## Variable Descriptions
Table \@ref(tab:vardesc) describes all the variables used in this study. More detailed descriptions are available at Tilastokeskus' website [@data]. We can see from the Table that most of the variables are measured in 2020 and represent a proportion of a municipality's population. Some variables, e.g., Population, are newer data (2021) while a some data points are from 2019 (e.g., split of economic sectors). Overall, we have a quite wide demographic spectrum as the variables cover many aspects of the municipalities: population and migration trends, education level and housing, employment, ethnicity, economy, and so on.
```{r vardesc, echo=FALSE}
vars.orig <- c('Degree of urbanisation',
'Population',
'Population change from the previous year',
'Share of persons aged under 15 of the population',
'Share of persons aged 15 to 64 of the population',
'Share of persons aged over 64 of the population',
'Share of Swedish-speakers of the population',
'Share of foreign citizens of the population',
'Excess of births',
'Intermunicipial migration gain/loss',
'Number of families',
'Number of household-dwelling units',
'Share of household-dwelling units living in terraced houses and detached houses',
'Share of household-dwelling units living in rental dwellings',
'Share of persons aged 15 or over with at least upper secondary qualifications',
'Share of persons aged 15 or over with tertiary level qualifications',
'Employed labour force resident in the area',
'Employment rate',
'Share of persons working in their municipality of residence',
'Proportion of unemployed among the labour force',
'Proportion of pensioners of the population',
'Economic dependency ratio',
'Number of workplaces in the area',
'Share of workplaces in primary production',
'Share of workplaces in secondary production',
'Share of workplaces in services',
'Workplace self-sufficiency',
'Annual contribution margin',
'Loan stock',
'Group loan stock',
'Educational and cultural activities',
'Social and health care activities')
vars <- c('Deg.Urbanisation',
'Popul',
'Popul.Growth',
'Prop.Under15',
'Prop.15to64',
'Prop.Over64',
'Prop.Swedish',
'Prop.Foreign',
'Excess.Births',
'Migr.Gain',
'Families',
'Households',
'Prop.Households.Terr.Det',
'Prop.Households.Rental',
'Prop.Educ.Degree2',
'Prop.Educ.Degree3',
'Labour.Force',
'Empl.Rate',
'Prop.Empl.Muni',
'Prop.Unempl',
'Prop.Pensioners',
'Depend.Ratio',
'Jobs.Muni',
'Prop.Primary.Sector',
'Prop.Secondary.Sector',
'Prop.Services.Sector',
'Jobs.Self.Suff',
'Contr.Margin',
'Loan.Stock',
'Group.Loan.Stock',
'Educ.Cult.Activity',
'Soc.Health.Activity')
years <- c(2020,
2021,
2021,
2021,
2021,
2021,
2021,
2021,
2020,
2020,
2020,
2020,
2020,
2020,
2020,
2020,
2020,
2020,
2019,
2020,
2020,
2020,
2019,
2019,
2019,
2019,
2019,
2020,
2020,
2020,
2020,
2020)
units <- c("%",
"Number",
"%",
"%",
"%",
"%",
"%",
"%",
"Persons",
"Persons",
"Number",
"Number",
"%",
"%",
"%",
"%",
"Number",
"%",
"%",
"%",
"%",
"Ratio",
"Number",
"%",
"%",
"%",
"Ratio",
"EUR per capita",
"EUR per capita",
"EUR per capita",
"EUR per capita",
"EUR per capita")
var.df <- data.frame(vars, vars.orig, years, units)
kable(var.df, format = "latex", align = "lll", col.names = c("Variable", "Original Variable","Year","Unit"), caption = "Description of all used variables. Variable is the name of the variable that is used in this report. Original Variable is the original variable given in Tilastokeskus' dataset, Year is the timestamp of the data point, and Unit is the unit in which the variable is measured in.") %>%
row_spec(0, bold = T) %>%
kable_styling(latex_options = c("HOLD_position", "scale_down"), position = "left")
```
## Descriptive Statistics
Table \@ref(tab:descstats) shows descriptive statistics for all variables in the data. We can see that many of the variables have quite large variance. This is somewhat expected as the data is a cross-section of 314 Finnish municipalities, so we have very different types of municipalities in the data.
The heterogeneity of the data can be seen from e.g., the population variable, where we have a minimum of 105 inhabitants (Sottunga) and a maximum of 658457 inhabitants (Helsinki). The median population is 6134 and the 3rd quartile is 15145, so it is clear that Finland is a country with mostly small municipalities but a few very large ones.
An interesting observation is that in some municipalities, only 20.1% work locally inside their own municipality (Kauniainen), while in others the share is as large as 91.7% (Kuusamo). Moreover, in some municipalities, over a fifth is unemployed (21.6% in Ilomantsi) while in other places, the unemployment rate is as low as 3.7% (Föglö).
It is also evident that Finland is a country where in some municipalities, people mostly speak Swedish (92.4% in Sottunga) while in many other, not even a permille speak Swedish (from relatively big municipalities, e.g., Iisalmi). Furthermore, in Sottunga, over a forth (25.7%) are foreign citizens while in Merijärvi, foreign citizens only comprise 0.3% of the population. There are most likely many Swedish people living in Sottunga since it is located in Åland.
```{r descstats, echo=FALSE}
stats <- muni.latest[, -1] %>% apply(2, summary) %>% t() %>% as.data.frame()
stats$Std <- muni.latest[, -1] %>% apply(2, sd)
stats.df <- data.frame(vars, stats)
rownames(stats.df) <- NULL
kable(stats.df, format = "latex", align = "lrrrrrr", caption = "Descriptive Statistics of the variables. Variable is the name of the variable, Min is the minimum, 1st Qu. is the 1st quartile, Median is the median, Mean is the arithmetic mean, 3rd Qu. is the 3rd quartile, Max is the maximum and Std. is the sample standard deviation.", digits = 1, col.names = c("Variable", "Min", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max", "Std.")) %>%
row_spec(0, bold = T) %>%
kable_styling(latex_options = "HOLD_position", position = "left")
```
## Distribution Plots
```{r, include=FALSE}
ages <- muni.latest %>%
select(Prop.Under15, Prop.15to64, Prop.Over64)
ages <- melt(ages)
living <- muni.latest %>%
select(Prop.Households.Terr.Det, Prop.Households.Rental)
living <- melt(living)
employment <- muni.latest %>%
select(Empl.Rate, Prop.Unempl, Prop.Pensioners)
employment <- melt(employment)
sectors <- muni.latest %>%
select(Prop.Primary.Sector, Prop.Secondary.Sector, Prop.Services.Sector)
sectors <- melt(sectors)
```
To analyze further how the variables are distributed, we use Kernel density estimation to estimate the probability densities for variables in some common groups and visualize the results.
Figure \@ref(fig:agekde) shows the probability density estimates for age groups. Age groups are shares of population aged below 15, 15 to 64, and over 64. We can see that Prop.Over64 has a wide distribution. The median is close to 30% but there are municipalities in both ends of the distribution. The distribution of Prop.Under15 has a lower variance around its mean value of ~15% but the distribution is clearly positively skewed as there are some municipalities with over 30% of below 15 year-old citizens (Liminka; 30.8%).
```{r agekde, echo=FALSE, fig.cap='Probability density estimates for age groups', fig.align='left', out.width='65%', optipng = '-o7'}
ggplot(ages, aes(x=value, fill=variable)) + geom_density(alpha = .25, adjust = .75) + theme_light() +
scale_fill_manual( values = c("#dde318","#3dbc74","#440154")) +
labs(x = "Percentage (%) of Population", y = "Density", fill = "Variable", title = "Age Groups") +
theme(legend.position = "right", legend.text = element_text(size = 9))
```
Figure \@ref(fig:housingkde) shows the probability density estimates for housing types. Housing types are shares of population living in either terraced or detached houses and in rental apartments. The Figure characterizes Finland well - the distribution of Prop.Households.Terr.Det show that there are lots of municipalities where ~9/10 citizens tend to live in terraced or detached houses. However, the distribution is highly negatively skewed so there are some municipalities where very few people live in those types of houses. On the other hand, Prop.Households.Rental is centered around ~20% but highly positively skewed. This means that while in average 1/5 of people live on rent, there are municipalities where more than half are living on rent. Overall, these results support to hypothesis that Finland is a country with many rural municipalities where people tend to live in terraced and detached houses, and a few big municipalities where the proportion of rental housing increases with city size and the degree of urbanization.
```{r housingkde, echo=FALSE, fig.cap='Probability density estimates for housing types', fig.align='left', out.width='70%', optipng = '-o7'}
ggplot(living, aes(x=value, fill=variable)) + geom_density(alpha = .25, adjust = .75) + theme_light() +
scale_fill_manual( values = c("#dde318","#3dbc74","#440154")) +
labs(x = "Percentage (%) of Population", y = "Density", fill = "Variable", title = "Housing Types") +
theme(legend.position = "right", legend.text = element_text(size = 9))
```
Figure \@ref(fig:employmentkde) shows the probability density estimates for employment measures. Employment measures are the employment rate, the proportion of population that are unemployed, and the proportion of population that are pensioners. The Figure shows that Prop.Unempl has a considerable lower variance than Prop.Pensioners. This means while municipalities may have very different age structures, unemployment does not vary as much. This is somewhat expected in a welfare state like Finland. Empl.Rate is centered around 70% quite symmetrically with a standard deviation of ~5%, which shows that municipalities have some differences in employment rates (e.g., Outokumpu 58.3% vs. Luoto 82.4%).
```{r employmentkde, echo=FALSE, fig.cap='Probability density estimates for employment related measures',fig.align='left', out.width='70%', optipng = '-o7'}
ggplot(employment, aes(x=value, fill=variable)) + geom_density(alpha = .25, adjust = .75) + theme_light() +
scale_fill_manual( values = c("#dde318","#3dbc74","#440154")) +
labs(x = "Percentage (%) of Population", y = "Density", fill = "Variable", title = "Employment Measures") +
theme(legend.position = "right", legend.text = element_text(size = 9))
```
Figure \@ref(fig:sectorkde) shows the probability density estimates for the weights of economic sectors in the municipalities. Sectors are the primary-, secondary-, and services sectors. We can interpret that municipalities have large variance in the split between economic sectors. For example, in Kauniainen, the proportion of services is 93.1% while it is only 23.6% in Pyhäntä. Pyhäntä is the most secondary sector -concentrated municipality in Finland with a weight of 67% in secondary production. Lestijärvi is the most primary sector -focused municipality in Finland with a weight of 36.5% in primary production.
```{r sectorkde, echo=FALSE, fig.cap='Probability density estimates for economic sectors', fig.align='left', out.width='70%', optipng = '-o7'}
ggplot(sectors, aes(x=value, fill=variable)) + geom_density(alpha = .25, adjust = .75) + theme_light() +
scale_fill_manual( values = c("#dde318","#3dbc74","#440154")) +
labs(x = "Percentage (%) of Population", y = "Density", fill = "Variable", title = "Economic Sectors") +
theme(legend.position = "right", legend.text = element_text(size = 9))
```
# Bivariate Analysis
## Linear Dependencies
To analyze bivariate dependencies in the data, we assessed linear relationships between the variables with Pearson's correlation analysis. Figure \@ref(fig:corrplot) shows a heatmap of the correlations between all variables.
The middle of the heatmap shows that a few of the variables are perfectly positively correlated. Those variables are Families, Popul, Households, and Labour.Force. Jobs.Muni is also nearly perfectly positively correlated with the former variables (correlation coeffcient ranges from 0.98 to 0.99). This observation is not a surprise - we can expect that these variables go hand in hand. Excess.Births is relatively strongly positively correlated with all former variables (correlation coefficient ranges from 0.6 to 0.66). It is believable that excess birth concentrates in bigger municipalities with many other families and job opportunities.
Another cluster of strong positive correlations is in the bottom left corner of the heatmap. Prop.Over64 has correlation of 0.99 with Prop.Pensioners, which is expected, as people tend to start their pension at an age around 65-70. Moreover, Soc.Health.Activity is strongly positively correlated with Prop.Pensioners, Prop.Over64, and Depend.Ratio. This means that the more elderly people live in a municipality, the more the municipality spends on social and health activities.
There are also variables with strong negatively correlations. In the bottom right corner, we see that Prop.15to64 is negatively associated with Soc.Health.Activity, Depend.Ratio, Prop.Pensioners, and Prop.Over64. This means that in contrast to municipalities with old people, municipalities with many 15 to 64 year-olds tend to spend less on social and health activities. The same applies to the variable Prop.Below15.
Additional interesting observations include the correlations between Prop.Households.Terr.Det and Prop.Primary.Sector, Education (Prop.Educ.Degree2 & Prop.Educ.Degree3) and Deg.Urbanisation, and Prop.Households.Rental and a handful of variables (Jobs.Muni, Labour.Force, Households, Popul, and Families). First, it seems that people tend to live in terraced or detached houses in municipalities that are driven by the primary sector. Second, education appears to be associated with high levels of urbanisation. This is most likely due to the fact that high schools, universities, and high-profile jobs (where highly educated people work) tend to locate in bigger cities that are urbanising the fastest. Lastly, rental housing tends to be linked with jobs, population size, and households. This could be due to the fact that rental housing is more common in big cities, where there tend to be relatively more young people and apartments are more costly.
```{r corrplot, echo=FALSE, fig.dim=c(18,18), fig.cap="Heatmap of Pearson'n correlation between all variables. Red color indicates stron positive correlation while blue color indicates strong negative correlation. White color indicates no correlation.", optipng = '-o7'}
corr <- cor(muni.latest[, -1])
ggcorrplot(corr, hc.order = TRUE, type = "full", lab = TRUE, title = "Heatmap of Correlations", lab_size = 3.5)
```
# Principal Component Analysis
## Motivation
Principal Component Analysis (PCA) is a technique that is used for reducing dimensionality in numeric data while preserving as much as possible of the information (i.e., variation) contained in the original data. As our variables were numeric, and the main objective was to reduce dimensionality in the data, PCA was seen as the multivariate method. Moreover, we had a prior belief or hypothesis that differences in municipalities could arguably be explained in quite simple terms like center of growth or pensioner municipality, so PCA seemed like an interesting tool to test this hypothesis.
## Scree Plot
Figure \@ref(fig:screeplot) shows what proportion of variance is explained by each principal component (PC) along with the cumulative proportion of variance explained up to that PC. The scree plot shows that we can explain close to 55% of all variation using only the first two PCs. Furthermore, close to 70% of the variation can be explained by using the first four PCs. In this report, we do further analysis using these four PCs and have indicated them in the plot with a green color.
```{r screeplot, echo=FALSE, fig.cap="Variance explained by each Princial Component (PC) up to the 20th PC. The bars represent proportion of variance explained and the line with data points indicates the cumulative proportion of variance explained up to that PC. This type of a plot is also known as a Scree Plot.", fig.align='left', optipng = '-o7'}
muni.latest.pca <- muni.latest %>% column_to_rownames(var = "Area")
muni.pca <- prcomp(muni.latest.pca, center = T, scale. = T)
muni.pca.summary <- summary(muni.pca)
scree.df <- muni.pca.summary$importance[, 1:20] %>% t() %>% as.data.frame()
scree.df$PC <- 1:20
scree.df$Included <- c(rep("Yes",4), rep("No",16))
scree.df <- scree.df[, c(4,2,3,5)]
rownames(scree.df) <- NULL
colnames(scree.df) <- c("PC", "Prop.var", "Cum.prop", "Included")
ggplot(scree.df) +
geom_bar(aes(x = PC, y = Prop.var, fill = Included), stat = "identity") +
scale_fill_manual(values = c("#440154", "#3dbc74")) +
geom_line(aes(x = PC, y = Cum.prop)) +
geom_point(aes(x = PC, y = Cum.prop)) +
geom_hline(yintercept = 0.70, linetype = 2) +
annotate("text", x=10.5, y=0.73, label= "70% of variance explained", size = 3) +
labs(x = "Principal Component", y = "Variance Explained (%)", title = "PCA Scree Plot") +
guides(fill = guide_legend(reverse = TRUE)) +
theme_light()
```
## Principal Components 1 and 2
Figure \@ref(fig:scoreplot1) shows the scores for PC1 and PC2 for all municipalities that have a top 5 absolute loading for any of PC1 or PC2. We can clearly see four clusters of municipalities. The first cluster comprises only of Helsinki - Finland's capital city. The second cluster comprises of Espoo, Vantaa, Tampere, Oulu, and Turku - large urbanized cities around Finland. The third cluster comprises of Jomala, Lemland, Luoto, Lumparland, and Saltvik - small Swedish speaking municipalities in rural areas. The forth cluster comprises of Puolanka, Hyrynsalmi, Posio, Rautavaara, and Rääkkylä - small Finnish speaking pensioner municipalities in rural areas.
```{r scoreplot1, warning=FALSE, echo=FALSE, message=FALSE, fig.cap="Score plot for all municipalities that have a top 5 absolute loading for any of PC1 or PC2", fig.align='left', optipng = '-o7', fig.dim=c(6,4)}
loads <- muni.pca$x[, 1:2] %>% as.data.frame() %>% rownames_to_column() %>% as_tibble() %>% dplyr::rename(Area = rowname)
greatest5.PC1 <- loads %>%
arrange(desc(PC1)) %>%
head(5)
smallest5.PC1 <- loads %>%
arrange(PC1) %>%
head(5)
greatest5.PC2 <- loads %>%
arrange(desc(PC2)) %>%
head(5)
smallest5.PC2 <- loads %>%
arrange(PC2) %>%
head(5)
loads.reduced <- full_join(greatest5.PC1, smallest5.PC1) %>% full_join(greatest5.PC2) %>% full_join(smallest5.PC2)
ggplot(loads.reduced, aes(x = PC1, y = PC2, label = Area)) +
geom_point() +
geom_text_repel(size = 3) +
xlim(c(-25,10)) +
ylim(c(-20,15)) +
ggtitle("PCA Scores") +
theme_light()
```
For a deeper characterisation of the revealed clusters, we should look at the loading directions plot, which shows what loadings the different variables have for PCs 1 and 2. Figure \@ref(fig:loadingsplot1) shows the corresponding plot.
```{r loadingsplot1, warning=FALSE, echo=FALSE, message=FALSE, fig.cap="Loading directions for PC1 and PC2", fig.align='left', optipng = '-o7', fig.dim=c(6,4)}
ggplot2::autoplot(muni.pca,
alpha = 0,
loadings = T,
loadings.colour = "black",
loadings.label = T,
loadings.label.colour = "red",
loadings.label.size = 2.5,
loadings.label.repel = T) +
scale_x_continuous(limits = c(-0.4, 0.4)) +
scale_y_continuous(limits = c(-0.4, 0.4)) +
labs(x = "Standardized PC1 (33.6% explained var.)",
y = "Standardized PC1 (20.9% explained var.)",
title = "PCA Loading Directions") +
theme_light()
```
We can see that variables that have the most negative loading for PC1 are Prop.15to64, Prop.Educ.Degree3, Families, and Popul while Prop.Pensioners, Prop.Over64, Depend.Ratio, Soc.Health.Activity have strongest positive loadings. This means that PC1 could be interpreted as an index of youthfulness of a municipality - youthful municipalities are highly populated with young, educated working people and families while old municipalities are filled with pensioners.
Considering the PC2 dimension, we see that most negatively loaded variables are Prop.Households.Rental, Prop.Empl.Muni, Prop.Unempl, and Jobs.Self.Suff while Empl.Rate, Prop.Households.Terr.Det, Prop.Under15, and Prop.Swedish have strongest positive loadings. PC2 is more difficult to interpret but a clear sign is that Swedish speaking municipalities load positively for PC2. Living on a rent and unemployment indicate lower economic prosperity than living in a terraced or detached house or employment. Perhaps a high number of children is also a sign of economic wellbeing, since having children can be expensive these days. Therefore, PC2 could be an index of economic prosperity. Big cities have a lot of smaller aparments for rent and many younger people like students live there, which could explain why these cities score quite low in PC2 dimension in comparison to smaller cities.
Score and loading direction plots for PC3 and PC4 are shown in the Appendix but the interpretations are left for the interested due to scope of the study.
# Discussion and Conclusions
In this report, we have used PCA to reduce dimensionality in demographic data of Finnish Municipalities provided by Tilastokeskus. The results indicate that although Finland has very heterogeneous municipalities, some conclusions can be made after projecting the data in the principal component space. PC1 explains 31.7% of the variance and can be seen as an index of youthfulness of a municipality. PC2 explains 20.5% of the variance and can be interpreted as an index of economic prosperity of a municipality. Plotting the PC1 and PC2 scores for only those municipalities that have a top 5 absolute loading for any of the two components resulted in four clear clusters of municipalities, which demonstrates that the objective of the study was accomplished.
Limitations of this study can be seen as the following. First, PC1 and PC2 do not explain a very high (above 80%) percentage of the variance. Hence, PCA might not be a suitable method to draw conclusions from with this data. Second, using population, employed and some other variables measured in absolute numbers might have alienated Helsinki from the other cities and biased the analysis a bit regardless of standardization of variables. Perhaps leaving out Helsinki would have been a better option. Third, highly correlated variables could perhaps have been reduced to only one. In other words, perhaps we could have removed unnecessary variables. Lastly, according to public authorities [@munis], Finland should have 309 municipalities. However, I have 314 in my data after cleaning. Hence, there is a possibility that I have included some municipalities that should not have been categorized as municipalities. Any further work on this topic is encouraged to address these potential shortcomings.
\newpage
# References
<div id="refs"></div>
\newpage
# Appendix
## Principal Components 3 and 4
Figure \@ref(fig:scoreplot2) shows the scores for PC3 and PC4.
```{r scoreplot2, warning=FALSE, echo=FALSE, message=FALSE, fig.cap="Score plot for all municipalities that have a top 5 absolute loading for any of PC3 or PC4", fig.align='left', optipng = '-o7', fig.dim=c(5.5,3.5)}
loads <- muni.pca$x[, 3:4] %>% as.data.frame() %>% rownames_to_column() %>% as_tibble() %>% dplyr::rename(Area = rowname)
greatest5.PC3 <- loads %>%
arrange(desc(PC3)) %>%
head(5)
smallest5.PC3 <- loads %>%
arrange(PC3) %>%
head(5)
greatest5.PC4 <- loads %>%
arrange(desc(PC4)) %>%
head(5)
smallest5.PC4 <- loads %>%
arrange(PC4) %>%
head(5)
loads.reduced <- full_join(greatest5.PC3, smallest5.PC3) %>% full_join(greatest5.PC4) %>% full_join(smallest5.PC4)
ggplot(loads.reduced, aes(x = PC3, y = PC4, label = Area)) +
geom_point() +
geom_text_repel(size = 3) +
xlim(c(-4,5)) +
ylim(c(-8,8)) +
ggtitle("PCA Scores") +
theme_light()
```
Figure \@ref(fig:loadingsplot2) shows the loading directions for PC3 and PC4.
```{r loadingsplot2, warning=FALSE, echo=FALSE, message=FALSE, fig.cap="Loading directions for PC3 and PC4", fig.align='left', optipng = '-o7', fig.dim=c(5.5,3.5)}
ggplot2::autoplot(muni.pca,
x = 3,
y = 4,
alpha = 0,
loadings = T,
loadings.colour = "black",
loadings.label = T,
loadings.label.colour = "red",
loadings.label.size = 2.5,
loadings.label.repel = T) +
scale_x_continuous(limits = c(-0.3, 0.3)) +
scale_y_continuous(limits = c(-0.3, 0.3)) +
labs(x = "Standardized PC3 (10.2% explained var.)",
y = "Standardized PC4 (7.2% explained var.)",
title = "PCA Loading Directions") +
theme_light()
```