### Playing with dimensions
Hi there! This post is an experiment combining the result of **t-SNE** with two well-known clustering techniques: **k-means** and **hierarchical** clustering. This is the hands-on part, in **R**.
<img src="http://datascienceheroes.com/img/blog/cluster_analysis.png" width="200px" alt="kmeans and hierarchical clustering on t-SNE">
This post also explores the intersection of concepts like dimension reduction, cluster analysis, data preparation, PCA, HDBSCAN, k-NN, SOM, deep learning... and Carl Sagan!
<br>
### PCA and t-SNE
For those who don't know the **t-SNE** technique (<a href="https://lvdmaaten.github.io/tsne/" target="blank">official site</a>): it's a projection -or dimension reduction- technique, similar in some aspects to Principal Component Analysis (PCA), used to visualize N variables in 2 dimensions (for example).
When the t-SNE output is poor, Laurens van der Maaten (t-SNE's author) says:
> As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not very much nice structure in your data in the first place. If PCA works well but t-SNE doesn’t, I am fairly sure you did something wrong.
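As a quick way to run that sanity check in R, a minimal sketch could look like the following (here `my_data` stands for a hypothetical numeric data frame, not an object used later in this post):
```{r, eval=FALSE}
## Minimal PCA sanity check: project the data onto its first two principal components
## `my_data` is a hypothetical numeric data frame standing in for your own data
pca_model = prcomp(my_data, center=TRUE, scale.=TRUE)
pca_2d = as.data.frame(pca_model$x[, 1:2])
plot(pca_2d$PC1, pca_2d$PC2, pch=19, cex=0.5,
     xlab="PC1", ylab="PC2", main="PCA sanity check")
```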
In my experience, doing PCA with dozens of variables with:
* some extreme values,
* skewed distributions,
* several dummy variables,
doesn't lead to good visualizations.
Check out this example comparing the two methods:
<img src="http://datascienceheroes.com/img/blog/pca_tsne.png" alt="t-sne and pca" width="400px">
Source: <a href="https://www.kaggle.com/puyokw/digit-recognizer/clustering-in-2-dimension-using-tsne/code" target="blank">Clustering in 2-dimension using tsne</a>
Makes sense, doesn’t it?
<br>
### Surfing higher dimensions 🏄
Since one of the **t-SNE** outputs is a two-dimensional matrix, where each dot represents an input case, we can apply a clustering algorithm and group the cases according to their distance in this **2-dimensional map**. Much like a geographic map projects our 3-dimensional world onto two dimensions (paper).
**t-SNE** puts similar cases together, handling non-linearities in the data very well. After using the algorithm on several data sets, I believe that in some cases it creates _circular shapes_, like islands, where the cases inside each one are similar.
However, I didn't see this effect in the live demonstration from the Google Brain team: <a href="http://distill.pub/2016/misread-tsne/" target="blank">How to Use t-SNE Effectively</a>. Perhaps because of the nature of the input data: only 2 variables.
<br>
#### The swiss roll data
According to its FAQ, t-SNE doesn't work very well with the _swiss roll_ -toy- data. However, it's a stunning example of how a 3-dimensional surface (or **manifold**) with a spiral shape **unfolds** like paper thanks to a dimension reduction technique.
The image is taken from <a href="http://axon.cs.byu.edu/papers/gashler2011smc.pdf" target="blank">this paper</a> where they used the <a href="https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Manifold_sculpting">manifold sculpting</a> technique.
<img src="http://datascienceheroes.com/img/blog/swiss_roll_manifold_sculpting.png" alt="swiss roll manifold sculpting" width="400px">
<br>
### Now the practice in R!
**t-SNE** helps make the clustering more accurate because it converts data into a 2-dimensional space where dots lie in roughly circular shapes (which suits k-means, since non-spherical groups are one of its weak points when creating segments. More on this: <a href="http://varianceexplained.org/r/kmeans-free-lunch/" target="blank">K-means clustering is not a free lunch</a>).
Think of it as a sort of **data preparation** step before applying the clustering models.
```{r, eval=FALSE}
library(caret)
library(Rtsne)
library(ggplot2)  ## used for the plots further below
######################################################################
## The WHOLE post is in: https://github.com/pablo14/post_cluster_tsne
######################################################################
## Download data from: https://github.com/pablo14/post_cluster_tsne/blob/master/data_1.txt (url path inside the gitrepo.)
data_tsne=read.delim("data_1.txt", header = T, stringsAsFactors = F, sep = "\t")
## Rtsne function may take some minutes to complete...
set.seed(9)
tsne_model_1 = Rtsne(as.matrix(data_tsne), check_duplicates=FALSE, pca=TRUE, perplexity=30, theta=0.5, dims=2)
## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_model_1$Y)
```
Different runs of `Rtsne` lead to different results, so more than likely you will not see exactly the same map as the one presented here.
According to the official documentation, `perplexity` is related to the importance of neighbors:
* _"It is comparable with the number of nearest neighbors k that is employed in many manifold learners."_
* _"Typical values for the perplexity range between 5 and 50"_
Object `tsne_model_1$Y` contains the X-Y coordinates (`V1` and `V2` variables) for each input case.
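If you're curious about how the choice of perplexity changes the map, a minimal (hypothetical) sketch reusing `data_tsne` from the chunk above could be:
```{r, eval=FALSE}
## Hypothetical sketch: running Rtsne with a few perplexity values for comparison
perplexities = c(5, 30, 50)
tsne_maps = lapply(perplexities, function(p) {
  set.seed(9)
  as.data.frame(Rtsne(as.matrix(data_tsne), perplexity=p, theta=0.5, dims=2,
                      check_duplicates=FALSE, pca=TRUE)$Y)
})
## tsne_maps[[2]] corresponds to perplexity = 30, the value used in this post
```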
<br>
Plotting the t-SNE result:
```{r, eval=F}
## plotting the results without clustering
ggplot(d_tsne_1, aes(x=V1, y=V2)) +
  geom_point(size=0.25) +
  guides(colour=guide_legend(override.aes=list(size=6))) +
  xlab("") + ylab("") +
  ggtitle("t-SNE") +
  theme_light(base_size=20) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank()) +
  scale_colour_brewer(palette = "Set2")
```
<img src="http://datascienceheroes.com/img/blog/tsne_output.png" width="400px" alt="t-SNE output">
<br>
And there are the famous "islands" 🏝️. At this point, we could do some clustering just by looking at the map... but let's try k-means and hierarchical clustering instead 😄. t-SNE's FAQ page suggests decreasing the perplexity parameter to avoid these shapes; nonetheless, I didn't find a problem with this result.
<br>
#### Creating the cluster models
The next piece of code creates the **k-means** and **hierarchical** cluster models, and then assigns to each input case the number of the cluster (1, 2 or 3) it belongs to.
```{r, eval=F}
## keeping original data
d_tsne_1_original=d_tsne_1
## Creating k-means clustering model, and assigning the result to the data used to create the tsne
fit_cluster_kmeans=kmeans(scale(d_tsne_1), 3)
d_tsne_1_original$cl_kmeans = factor(fit_cluster_kmeans$cluster)
## Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_cluster_hierarchical=hclust(dist(scale(d_tsne_1)))
## setting 3 clusters as output
d_tsne_1_original$cl_hierarchical = factor(cutree(fit_cluster_hierarchical, k=3))
```
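The choice of 3 clusters is somewhat arbitrary here. As a quick, hypothetical sanity check on that number, an elbow-style plot of the within-cluster sum of squares over the scaled t-SNE coordinates could be sketched like this:
```{r, eval=FALSE}
## Hypothetical sketch: total within-cluster sum of squares for k = 1..8
## (an "elbow" in this curve suggests a reasonable number of clusters)
set.seed(9)
wss = sapply(1:8, function(k) kmeans(scale(d_tsne_1), centers=k, nstart=10)$tot.withinss)
plot(1:8, wss, type="b",
     xlab="Number of clusters k", ylab="Total within-cluster sum of squares")
```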
<br>
#### Plotting the cluster models onto t-SNE output
Now it's time to plot the result of each cluster model, based on the t-SNE map.
```{r, eval=F}
plot_cluster=function(data, var_cluster, palette)
{
  ggplot(data, aes_string(x="V1", y="V2", color=var_cluster)) +
    geom_point(size=0.25) +
    guides(colour=guide_legend(override.aes=list(size=6))) +
    xlab("") + ylab("") +
    ggtitle("") +
    theme_light(base_size=20) +
    theme(axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          legend.direction = "horizontal",
          legend.position = "bottom",
          legend.box = "horizontal") +
    scale_colour_brewer(palette = palette)
}
plot_k=plot_cluster(d_tsne_1_original, "cl_kmeans", "Accent")
plot_h=plot_cluster(d_tsne_1_original, "cl_hierarchical", "Set1")
## and finally: putting the plots side by side with gridExtra lib...
library(gridExtra)
grid.arrange(plot_k, plot_h, ncol=2)
```
<img src="http://datascienceheroes.com/img/blog/tsne_cluster_kmeans_hierarchical.png" width="600px" alt="kmeans and hierarchical clustering on t-SNE">
<br>
### Visual analysis
In this case, and based only on visual analysis, hierarchical clustering seems to make more _common sense_ than k-means. Take a look at the following image:
<img src="http://datascienceheroes.com/img/blog/cluster_analysis.png" width="600px" alt="kmeans and hierarchical clustering on t-SNE">
_Note: dashed lines separating the clusters were drawn by hand_
<br>
In k-means, the points at the bottom-left corner are quite close to each other compared to the distances between other points inside the same cluster, yet they belong to different clusters. Illustrating it:
<img src="http://datascienceheroes.com/img/blog/kmeans_cluster.png" width="300px" alt="kmeans on t-SNE">
So we've got: the red arrow is shorter than the blue arrow...
_Note: different runs may lead to different groupings; if you don't see this effect in that part of the map, look for it in another._
This effect doesn't happen with hierarchical clustering. The clusters from this model seem more even. But what do you think?
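If you'd like a number to back up the visual impression, one hypothetical option is the average silhouette width from the `cluster` package, computed on the same scaled coordinates used to fit the models:
```{r, eval=FALSE}
## Hypothetical sketch: average silhouette width for both cluster assignments
## (higher values mean cases sit more comfortably inside their own cluster)
library(cluster)
d = dist(scale(d_tsne_1))
mean(silhouette(fit_cluster_kmeans$cluster, d)[, "sil_width"])
mean(silhouette(cutree(fit_cluster_hierarchical, k=3), d)[, "sil_width"])
```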
<br>
#### Biasing the analysis (cheating)
It's not really fair to k-means to be compared like that. This last analysis is based on the idea of **density clustering**, a technique that's really useful for overcoming the pitfalls of simpler methods.
The **HDBSCAN** algorithm bases its process on densities.
Find the essence of each one by looking at this picture:
<img src="http://datascienceheroes.com/img/blog/hdbscan_vs_kmeans.png" alt="HDBSCAN vs k-means" width="600px">
Surely you understood the difference between them...
The last picture comes from <a href="http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb" target="blank">Comparing Python Clustering Algorithms</a>. Yes, Python, but the idea is the same for R. The R package is <a href="https://cran.r-project.org/web/packages/largeVis/vignettes/largeVis.html">largeVis</a>. _(Note: install it with ```devtools::install_github("elbamos/largeVis", ref = "release/0.2")```.)_
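As a rough, hedged illustration (not code from the original post), the CRAN `dbscan` package also provides an `hdbscan()` function; a minimal sketch applying it to the t-SNE coordinates, with an illustrative `minPts` value, might look like this:
```{r, eval=FALSE}
## Hypothetical sketch: density-based clustering of the t-SNE map with dbscan::hdbscan
## minPts = 15 is an illustrative value, not a tuned one
library(dbscan)
fit_hdbscan = hdbscan(d_tsne_1, minPts = 15)
d_tsne_1_original$cl_hdbscan = factor(fit_hdbscan$cluster)  ## cluster 0 means "noise"
plot_cluster(d_tsne_1_original, "cl_hdbscan", "Dark2")
```
The noise label (cluster 0) is exactly what distinguishes density-based methods from k-means in the picture above: not every point is forced into a segment.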
<br>
### Deep learning and t-SNE
Quoting Luke Metz from a great post (<a href="https://indico.io/blog/visualizing-with-t-sne/" target="blank">Visualizing with t-SNE</a>):
<i>Recently there has been a lot of hype around the term “deep learning“. In most applications, these “deep” models can be boiled down to the composition of simple functions that embed from one high dimensional space to another. At first glance, these spaces might seem too large to think about or visualize, but techniques such as t-SNE allow us to start to understand what’s going on inside the black box. Now, instead of treating these models as black boxes, we can start to visualize and understand them.</i>
A deep comment 👏.
<br>
### Final thoughts 🚀
Beyond this post, **t-SNE** has proven to be a really **great** general-purpose tool for reducing dimensionality. It can be used to explore the relationships inside the data by building clusters, or to <a href="https://auth0.com/blog/machine-learning-for-everyone-part-2-abnormal-behavior" target="blank">analyze anomalous cases</a> by inspecting the isolated points on the map.
Playing with dimensions is a key concept in data science and machine learning. The perplexity parameter is really similar to the *k* in the nearest-neighbors algorithm (<a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm" target="blank">k-NN</a>). Mapping data into two dimensions and then clustering? Hmmm, not new, buddy: <a href="http://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/" target="blank">Self-Organising Maps for Customer Segmentation</a>.
When we select the best features to build a model, we're reducing the data's dimensionality. When we build a model, we're creating a function that describes the relationships in the data... and so on.
Did you already know the general concepts behind k-NN and PCA? Well, this is just one more step: plug the cables into the brain and that's it. Learning general concepts gives us the opportunity to make these kinds of associations between all of these techniques. Rather than comparing programming languages, the power -in my opinion- lies in focusing on how data behaves, and on how these techniques are -and can ultimately be- connected.
<br>
Explore your imagination with this **Carl Sagan** video: _Flatland and the 4th Dimension_, a tale about the interaction of 3D objects with a 2D plane...
<iframe width="560" height="315" src="https://www.youtube.com/embed/N0WjV6MmCyM" frameborder="0" allowfullscreen></iframe>
<br>
---
<div>
<a href="http://livebook.datascienceheroes.com" target="blank">
<img src="http://datascienceheroes.com/img/blog/book_gray.png" alt="Data Science Live Book" style="align:left; float:left; width:42px; position:static;" style="border-radius:5px;"> Data Science Live Book
</a> (open source).
</div>
<br>
<div>
<img src="http://datascienceheroes.com/img/blog/twitter_logo.PNG" style=" float:left; width:42px;position:static; transform:none; -webkit-transform:none; " alt="Data Science Heroes Twitter">
<a href="https://twitter.com/DataSciHeroes" target="blank">DSH Twitter</a>
</div>
<br>
<div>
<img src="http://datascienceheroes.com/img/blog/fb_logo.PNG" style=" float:left; width:42px;position:static; transform:none; -webkit-transform:none;vertical-align: baseline;" alt="Data Science Heroes Facebook">
<a href="https://www.facebook.com/datasciheroes" target="blank">
DSH Facebook
</a>
</div>
<br>
<div>
<img src="http://datascienceheroes.com/img/blog/logo_dsh.png" style=" float:left; width:42px;position:static; transform:none; -webkit-transform:none; " alt="more posts">
<a href="http://blog.datascienceheroes.com" target="blank"> More DSH posts!</a>
</div>