<!DOCTYPE html>
<html>
<head>
<title>Progress Report</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
@import url(http://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(http://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(http://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
body { font-family: 'Droid Serif'; }
h1, h2, h3 {
font-family: 'Yanone Kaffeesatz';
font-weight: 400;
}
.remark-slide-content h1 { font-size: 4.5em; }
.remark-slide-content h2 { font-size: 2.5em; }
h3 { font-size: 1.6em; }
li p { line-height : 1.5em; }
li { line-height: 1.5em;
font-size: 1.5em;
}
.red { color: #fa0000}
.orange { color: rgb(218, 134, 6);}
.green { color: rgb(55, 126, 39);}
.blue {color: rgb(39, 86, 151);}
.small-font{
font-size: 0.6em;
line-height: 0.4em;
}
.right-column{
width: 50%;
float: right;
}
.right-column li {
font-size: 1.2em;
line-height: 0.8em;
}
.left-column{
width: 50%;
float: left;
}
.remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
</style>
</head>
<body>
<textarea id="source">
class: center, middle
Progress Report
===============
## Mei-Hua ##
#### 2014.07.30 ####
---
## Aim
* Use previous methods to analyze Chinese data
---
## Corpus
* Kimo blog data (2006/07/01 to 2007/06/30)
![Kimo blog post example](img/kimo_example.png)
---
## Corpus
* different from LJ40K: <br>
sentences contain emoticons
* full of special characters and unnecessary punctuation marks
---
## Completed Work
* store raw data in a database
* tokenize sentences with the CKIP tokenizer
* use the Stanford Parser to get dependency relations
* extract patterns
* build a lexicon
---
## In Progress
* tf-idf: for keyword identification
* SVM
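---
## Keyword Identification - tf-idf
A minimal sketch of how tf-idf weights could be computed over already-tokenized documents. The function name `tf_idf` and the sample tokens are hypothetical and only illustrate the idea; this is not the project's actual implementation, and it assumes the plain `tf * log(N / df)` weighting.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {}
        for term, count in counts.items():
            tf = count / total                        # relative term frequency
            idf = math.log(n_docs / doc_freq[term])   # rarer terms score higher
            scores[term] = tf * idf
        weights.append(scores)
    return weights

# Hypothetical tokenized documents, made up for illustration.
docs = [["我", "很", "開心"], ["我", "吃", "牛"], ["我", "很", "難過"]]
weights = tf_idf(docs)
```

A term that occurs in every document (here 我) gets weight zero, so it would never be picked as a keyword.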
---
## Problems - Tokenization
* The Stanford Parser performs poorly at tokenizing Chinese sentences. <br>
.orange[->] Use CKIP instead
* Some sentences with many punctuation marks cannot be tokenized. <br>
Examples: .small-font[如果你需要時間好好冷靜的思考.......沒關係......我願意等你.......無論多久我都會等你........等你準備好了.........等你願意見我....... 我不會再讓你擔心.......我也會按時吃飯........也會好好照顧自己.......你說的我都答應你........真的.......我說的都是真的.........請你相信 我.......... 小黑豬不會和小白豬分開的......就算有........也只是短暫的]
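---
## Problems - Tokenization
One possible workaround, sketched under the assumption that long runs of dots are what break the tokenizer: collapse each run into a single sentence boundary before tokenizing. The helper names `normalize_punctuation` and `split_sentences` are hypothetical, and this has not been checked against CKIP.

```python
import re

# Runs of two or more ASCII or full-width periods, or ellipsis characters.
DOT_RUN = re.compile(r"(?:[.。．]{2,}|…+)")

def normalize_punctuation(text):
    """Collapse each run of dots into a single full-width period."""
    collapsed = DOT_RUN.sub("。", text)
    # Drop whitespace left around the new sentence boundary.
    return re.sub(r"\s*。\s*", "。", collapsed)

def split_sentences(text):
    """Split normalized text into non-empty sentence-like chunks."""
    return [s for s in normalize_punctuation(text).split("。") if s]
```

Each chunk returned by `split_sentences` could then be passed to the tokenizer separately.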
---
## Problems - Pattern Extraction
* Chinese grammar differs considerably from English grammar. <br>
.orange[->] We need to find new pattern rules instead.
---
## Problems - Pattern Extraction
* Patterns we want to extract: <br>
我(sub) 很(adv) .green[開心(v)] <br>
我(sub) .green[是(v)] 白痴(n) <br>
我(sub) .green[吃(v)] 了 一 隻 牛(obj)
---
## Problems - Pattern Extraction
* Strange things happen: <br>
我(sub) .green[肚子(v)] 痛(obj) <br>
事(sub) .green[做(v)]
* Patterns we do not want: <br>
後來 很(adv) .green[快(v)] <br>
估(adv) .green[耐(v)]
---
## Problems - Pattern Extraction
* The CKIP tokenizer may not segment sentences well
* The Stanford Parser is not well suited to Chinese data
* Need to find a new way to generate good patterns for Chinese
---
## Problems - Feature Extraction
.left-column[ ![alt text](img/feature_extract.png) ]
.right-column[
* difficult to find repeated patterns within a single sentence, because sentences are too short
* many patterns appear in only one emotion category <br>
.orange[->] useless for training models
]
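---
## Problems - Feature Extraction
If patterns confined to a single emotion category are to be dropped before training, the filter could look roughly like the sketch below. The function name, the `min_categories` parameter, and the sample labels and patterns are all hypothetical and are not taken from the actual pipeline.

```python
from collections import defaultdict

def patterns_in_multiple_categories(labeled_patterns, min_categories=2):
    """Keep patterns seen in at least `min_categories` emotion categories."""
    categories_by_pattern = defaultdict(set)
    for emotion, pattern in labeled_patterns:
        categories_by_pattern[pattern].add(emotion)
    return {
        pattern
        for pattern, emotions in categories_by_pattern.items()
        if len(emotions) >= min_categories
    }

# Hypothetical (emotion, pattern) observations, made up for illustration.
observations = [
    ("happy", "我 很 開心"),
    ("sad", "我 是 白痴"),
    ("angry", "我 是 白痴"),
]
kept = patterns_in_multiple_categories(observations)
```

In this sample only 我 是 白痴 is kept, because it is the only pattern seen under more than one emotion label.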
---
## Problems
* Need to redefine "document" <br>
.red[✘] treating each sentence as a document
---
class: center, middle
Questions or Comments :)
=============
</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-latest.min.js" type="text/javascript">
</script>
<script type="text/javascript">
var slideshow = remark.create();
</script>
</body>
</html>