<!DOCTYPE html>
<html>
<head><meta name="generator" content="Hexo 3.8.0">
<meta charset="utf-8">
<title>CSCI1951A Blog</title>
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<meta property="og:type" content="website">
<meta property="og:title" content="CSCI1951A Blog">
<meta property="og:url" content="https://liamju.github.io/csci1951a.github.io/index.html">
<meta property="og:site_name" content="CSCI1951A Blog">
<meta property="og:locale" content="default">
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="CSCI1951A Blog">
<link rel="alternate" href="/csci1951a.github.io/atom.xml" title="CSCI1951A Blog" type="application/atom+xml">
<link rel="icon" href="/favicon.png">
<link href="//fonts.googleapis.com/css?family=Source+Code+Pro" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="/csci1951a.github.io/css/style.css">
</head>
<body>
<div id="container">
<div id="wrap">
<header id="header">
<div id="banner"></div>
<div id="header-outer" class="outer">
<div id="header-title" class="inner">
<h1 id="logo-wrap">
<a href="/csci1951a.github.io/" id="logo">CSCI1951A Blog</a>
</h1>
<h2 id="subtitle-wrap">
<a href="/csci1951a.github.io/" id="subtitle">Group Name: The Avengers</a>
</h2>
</div>
<div id="header-inner" class="inner">
<nav id="main-nav">
<a id="main-nav-toggle" class="nav-icon"></a>
<a class="main-nav-link" href="/csci1951a.github.io/">Home</a>
<a class="main-nav-link" href="/csci1951a.github.io/archives">Archives</a>
</nav>
<nav id="sub-nav">
<a id="nav-rss-link" class="nav-icon" href="/csci1951a.github.io/atom.xml" title="RSS Feed"></a>
<a id="nav-search-btn" class="nav-icon" title="Search"></a>
</nav>
<div id="search-form-wrap">
<form action="//google.com/search" method="get" accept-charset="UTF-8" class="search-form"><input type="search" name="q" class="search-form-input" placeholder="Search"><button type="submit" class="search-form-submit"></button><input type="hidden" name="sitesearch" value="https://liamju.github.io/csci1951a.github.io"></form>
</div>
</div>
</div>
</header>
<div class="outer">
<section id="main">
<article id="post-BlogPost3" class="article article-type-post" itemscope itemprop="blogPost">
<div class="article-meta">
<a href="/csci1951a.github.io/2019/05/04/BlogPost3/" class="article-date">
<time datetime="2019-05-04T06:26:35.000Z" itemprop="datePublished">2019-05-04</time>
</a>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="article-title" href="/csci1951a.github.io/2019/05/04/BlogPost3/">Final Blog</a>
</h1>
</header>
<div class="article-entry" itemprop="articleBody">
<h2 id="Vision"><a href="#Vision" class="headerlink" title="Vision"></a>Vision</h2><p>Is it possible to predict the box office of a movie? With “big data” resources and “machine learning” methods, we were able to generate some reasonable forecasts. Two months ago, we brainstormed about how to make a reasonable prediction for a hugely popular movie that might break the all-time record, Avengers: Endgame, and whether the same method could then be applied to other movies as well. </p>
<p>Since then, we have explored how multiple data sources fit into different models. We used data covering three aspects: historical movies that are similar to the target movie, discussion on social media (Twitter) as a measure of popularity, and news articles from public media (the New York Times) as the opinions of professionals. As a result, we came up with a way to generalize this forecasting process and turn it into a useful prediction tool for all movies.</p>
<h2 id="Achievements"><a href="#Achievements" class="headerlink" title="Achievements"></a>Achievements</h2><p>We achieved the goals we set in the beginning:</p>
<ul>
<li>Predict the total box office of the movie Avengers: Endgame.</li>
<li>Generalize the movie box office forecast model.</li>
</ul>
<h2 id="Data"><a href="#Data" class="headerlink" title="Data"></a>Data</h2><p>We collected data from four sources and used three of them in the final prediction models. Data from <strong>IMDB</strong> provides basic features of each movie, such as genre and budget. IMDB reviews, along with their binarized rating scores, were also used to train the NLP model that analyzes New York Times articles. Data from <strong>Twitter</strong> indicates the popularity of a movie. Data from the <strong>New York Times</strong> shows the attitudes of professional reviewers. We also collected trailer view counts through the YouTube API. However, because of the large number of movies and the restrictions of the YouTube API, trailer view counts were not available for every movie, so we left this feature out of the final prediction model. The number of tweets reflects popularity as well as YouTube trailer view counts do (arguably better), so dropping the trailer feature should not cost the prediction models much information.</p>
<h3 id="Datasets-Introduction"><a href="#Datasets-Introduction" class="headerlink" title="Datasets Introduction"></a>Datasets Introduction</h3><h4 id="IMDB"><a href="#IMDB" class="headerlink" title="IMDB"></a>IMDB</h4><p>Movie features for 45,000 movie records: genres, production companies, release date, IMDB popularity index, IMDB vote average, runtime, budget, and revenue.<br>The training set for the NLP model consisted of 50,000 IMDB reviews covering a variety of movies. Each review has two parts: the review text and the review score. The review text was tokenized into vectors, and the review score was binarized to either 0 or 1. The ratio of positive to negative reviews was roughly 1:1.</p>
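<p>The preprocessing described above can be sketched roughly as follows. This is only an illustration: the score thresholds, the toy vocabulary, and the function names <code>binarize_rating</code> and <code>tokenize</code> are assumptions, not the project's actual code.</p>

```python
def binarize_rating(score, negative_max=4, positive_min=7):
    """Map a 1-10 review score to 0 (negative) or 1 (positive).

    Scores between the two thresholds return None so they can be dropped.
    Both thresholds are assumptions, not values taken from the project.
    """
    if score <= negative_max:
        return 0
    if score >= positive_min:
        return 1
    return None


def tokenize(review, vocabulary):
    """Turn a review into a list of integer ids; unknown words map to 0."""
    return [vocabulary.get(word, 0) for word in review.lower().split()]


vocabulary = {"great": 1, "boring": 2, "movie": 3}  # toy vocabulary
labels = [binarize_rating(score) for score in (2, 9, 5)]
tokens = tokenize("Great movie", vocabulary)
```

<p>Dropping the middle scores is one common way to end up with a roughly balanced set of clearly positive and clearly negative reviews.</p>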
<h4 id="Social-media-number-of-Tweets-with-hashtags-of-the-movie-title"><a href="#Social-media-number-of-Tweets-with-hashtags-of-the-movie-title" class="headerlink" title="Social media: number of Tweets with hashtags of the movie title."></a>Social media: number of Tweets with hashtags of the movie title.</h4><p>We use data from Twitter to represent the social influence of the movie. For simplicity, we consider only the number of tweets. We used the Twitter premium API to estimate the number of tweets related to the target movie each day. </p>
<p>Since the API is very expensive and has very strict rate limits (100 requests per month), we need to store the request results and reuse them. For each movie, we collected and stored all related tweets in a 50-day window. The figure below shows the result.<br><img src="/csci1951a.github.io/2019/05/04/BlogPost3/twitter3.png" alt></p>
<p>In our final implementation, we only used one month of data (from 25 days before the release date to 5 days after it).</p>
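<p>One simple way to store and reuse request results is a small on-disk cache. The sketch below is an assumption about how this could look, not the project's code: <code>cached_daily_counts</code> and <code>fake_fetch</code> are hypothetical names, and the counts are made-up numbers.</p>

```python
import json
import os
import tempfile


def cached_daily_counts(movie, cache_dir, fetch):
    """Return daily tweet counts for `movie`, calling `fetch` only on a cache miss."""
    path = os.path.join(cache_dir, movie.replace(" ", "_") + ".json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    counts = fetch(movie)  # the expensive, rate-limited request
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(counts, handle)
    return counts


calls = []


def fake_fetch(movie):
    """Stand-in for the real API call; records how often it is invoked."""
    calls.append(movie)
    return {"2019-04-02": 1200, "2019-04-03": 800}  # made-up counts


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_daily_counts("Avengers Endgame", cache_dir, fake_fetch)
    second = cached_daily_counts("Avengers Endgame", cache_dir, fake_fetch)
```

<p>The second call is served from disk, so it does not count against the monthly request limit.</p>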
<h4 id="Public-media-articles"><a href="#Public-media-articles" class="headerlink" title="Public media articles"></a>Public media articles</h4><p>Although the data collected through the New York Times API includes the full article content, only the headlines were used in the NLP analysis that captures professional opinion about the movie. The article body contains far more information than the headline alone, but most headlines were already representative of the content, so including the remaining sentences would have added mostly redundant text. Using only headlines also makes the software run faster.</p>
<h2 id="Model-Structure-and-Implementation-Detail"><a href="#Model-Structure-and-Implementation-Detail" class="headerlink" title="Model Structure and Implementation Detail"></a>Model Structure and Implementation Detail</h2><p><img src="/csci1951a.github.io/2019/05/04/BlogPost3/model.png" alt><br>Our predictor is made of two parts, base and variance: </p>
<pre><code>box office = base + variance
</code></pre><p>The base value comes from a single linear regression model with features from the IMDB dataset. The base part captures static features and does not respond to real-time changes. In fact, using the average box office is good enough for this part.</p>
<p>For the variance part, we first use an additional pair of linear regression models, one with the number of tweets as its feature and one with the article ratings. The results from these two variance models are then treated as a new pair of features for a third variance linear regression model.</p>
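<p>The stacked structure of the variance part can be sketched in plain Python. This is a minimal illustration under stated assumptions: the tiny least-squares solver <code>fit_linear</code>, the helper names, and all training numbers are made up for the example, and the training target is assumed to be the part of the box office that the base value does not explain.</p>

```python
def fit_linear(rows, targets):
    """Least squares with an intercept, solved via the normal equations.

    Returns [intercept, w1, w2, ...]. Intended for tiny toy inputs only.
    """
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(d[i] * d[j] for d in design) for j in range(k)]
         + [sum(d[i] * t for d, t in zip(design, targets))]
         for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(weights, features):
    """Apply a fitted linear model to one feature vector."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


# Made-up training data (in millions), one entry per similar movie.
sqrt_tweets = [10.0, 20.0, 30.0, 40.0]
article_score = [0.5, 0.9, 0.6, 0.8]
variance = [50.0, 150.0, 180.0, 300.0]

# Level one: one regression per data source.
tweet_model = fit_linear([[t] for t in sqrt_tweets], variance)
article_model = fit_linear([[s] for s in article_score], variance)

# Level two: the two level-one predictions become the features.
stacked_rows = [[predict(tweet_model, [t]), predict(article_model, [s])]
                for t, s in zip(sqrt_tweets, article_score)]
combiner = fit_linear(stacked_rows, variance)


def predict_box_office(base, tweets_feature, score_feature):
    """box office = base + variance, with variance from the stacked models."""
    level_one = [predict(tweet_model, [tweets_feature]),
                 predict(article_model, [score_feature])]
    return base + predict(combiner, level_one)
```

<p>Stacking lets the third model decide how much weight each data source deserves instead of fixing that split by hand.</p>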
<h3 id="Data-processing"><a href="#Data-processing" class="headerlink" title="Data processing"></a>Data processing</h3><h4 id="IMDB-data"><a href="#IMDB-data" class="headerlink" title="IMDB data"></a>IMDB data</h4><p>We used the IMDB dataset from Kaggle and cleaned it down to the features that are most relevant and most likely to affect the final revenue. To narrow down the dataset, only movies that are similar to the target movie are used for training. Similar movies are those with the same production companies, similar genres, and a similar release time. </p>
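<p>A similarity filter along these lines might look like the sketch below. The field names, the two thresholds, and the function <code>is_similar</code> are hypothetical; the post does not say how "similar" was measured in practice.</p>

```python
def is_similar(candidate, target, min_shared_genres=2, max_year_gap=5):
    """Return True when `candidate` resembles `target` on all three criteria.

    Both thresholds are illustrative assumptions, not the project's settings.
    """
    same_company = bool(set(candidate["companies"]).intersection(target["companies"]))
    shared_genres = len(set(candidate["genres"]).intersection(target["genres"]))
    close_in_time = abs(candidate["year"] - target["year"]) <= max_year_gap
    return same_company and shared_genres >= min_shared_genres and close_in_time


# Toy records; the field names are hypothetical.
target = {"title": "Avengers: Endgame", "companies": ["Marvel Studios"],
          "genres": ["Action", "Adventure", "Science Fiction"], "year": 2019}
movies = [
    {"title": "Avengers: Infinity War", "companies": ["Marvel Studios"],
     "genres": ["Action", "Adventure", "Science Fiction"], "year": 2018},
    {"title": "Hypothetical Romance", "companies": ["Some Studio"],
     "genres": ["Romance", "Drama"], "year": 2018},
]
similar_titles = [m["title"] for m in movies if is_similar(m, target)]
```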
<h4 id="Twitter"><a href="#Twitter" class="headerlink" title="Twitter"></a>Twitter</h4><p>We map the number of tweets in each time interval (7 days by default) to a single feature. The number of related tweets i days before the release date therefore contributes to feature number i/7 (rounded down). Take the movie Avengers: Endgame as an example, whose release date is 04/26/2019. We use the premium search API to get the related tweets in a 30-day window (04/01 ~ 05/01), which are mapped to features X3, X2, X1 and X0.</p>
<p>Twitter has grown very quickly over the past 5 years, and more and more people now talk about movies on Twitter. For older movies, there is a huge gap between the number of tweets posted at that time and the number posted nowadays. To weaken the influence of this problem, we apply the square root to the tweet counts.</p>
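<p>The bucketing and the square-root damping can be sketched as follows. Two points are assumptions here: that counts are summed per 7-day bucket before the square root is taken, and that only days on or before the release date are passed in. The function name <code>weekly_features</code> is made up for the example.</p>

```python
import math


def weekly_features(daily_counts, days_per_bucket=7):
    """Turn daily tweet counts into damped per-bucket features [X0, X1, ...].

    `daily_counts` maps days-before-release (0, 1, 2, ...) to a tweet count.
    """
    buckets = {}
    for days_before, count in daily_counts.items():
        index = days_before // days_per_bucket  # day i goes to feature i // 7
        buckets[index] = buckets.get(index, 0) + count
    # Square root weakens the gap between older and newer movies.
    return [math.sqrt(buckets.get(i, 0)) for i in range(max(buckets) + 1)]


# Toy counts chosen so each bucket total is a perfect square.
features = weekly_features({0: 9, 3: 16, 7: 4, 24: 49})
```

<p>With a window that reaches back 25 days, the largest index is 25 // 7 = 3, which matches the four features X0 to X3.</p>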
<h4 id="NYT-articles"><a href="#NYT-articles" class="headerlink" title="NYT articles"></a>NYT articles</h4><p>New York Times articles were used to provide a professional view of the movie. An LSTM (long short-term memory) natural language processing model was trained on IMDB movie reviews. Each training example consists of the review text and a label: 0 for a negative review or 1 for a positive review. The model outputs a floating-point rating between 0 and 1 that represents how positive the text is. The New York Times articles were fed into this model to evaluate their attitude.</p>
<p>For each movie, all the articles related to it within a time window were fed into the model to produce an average score. Similar movies that share the same keywords were fed into the model as well. The average article rating was 0.91 for Avengers: Endgame and 0.96 for Avengers: Infinity War.<br>The number of articles for each movie depends on the window period and on how many articles exist in that period. The default window was the 30 days before the movie’s release, and all New York Times articles from this period were used. The final output of this model was the average score for the movie and the total number of articles. For movies without any articles, both the score and the count were set to 0.</p>
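<p>The aggregation step, including the fallback for movies without articles, could be sketched like this. The scores are made-up numbers, <code>summarize_articles</code> is a hypothetical name, and whether the window boundaries are inclusive is an assumption.</p>

```python
from datetime import date, timedelta


def summarize_articles(articles, release_date, window_days=30):
    """Average the model scores of articles published in the window before release.

    `articles` is a list of (publication_date, score) pairs. Returns
    (average_score, article_count), with both set to 0 when no article
    falls inside the window.
    """
    window_start = release_date - timedelta(days=window_days)
    scores = [score for published, score in articles
              if window_start <= published < release_date]
    if not scores:
        return 0.0, 0
    return sum(scores) / len(scores), len(scores)


release = date(2019, 4, 26)
articles = [
    (date(2019, 4, 20), 0.5),   # inside the window
    (date(2019, 4, 2), 1.0),    # inside the window
    (date(2019, 1, 15), 0.25),  # too old, ignored
]
average, count = summarize_articles(articles, release)
```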
<h2 id="Visualization"><a href="#Visualization" class="headerlink" title="Visualization"></a>Visualization</h2><p>For <strong>Avengers: Endgame</strong>, if we set the number of similar movies to <strong>20</strong>, the prediction model gives a base value of 663.9 million. In our final code, the number of similar movies is set to 10 or 15 because we wanted to <strong>minimize the usage of the expensive Twitter API</strong>, and the base value became 888.9 million, which is similar to the previous movie, <strong>Avengers: Infinity War</strong>. This is reasonable considering the features we chose. </p>
<p>To visualize the data, we picked two features. The figure below shows the positive correlation between these two features and the box office.<br><img src="/csci1951a.github.io/2019/05/04/BlogPost3/basic.png" alt></p>
<p>In fact, we can tell from social media and news that this movie is actually much more popular than Infinity War. Social media data captures this information, as the figure below shows:<br><img src="/csci1951a.github.io/2019/05/04/BlogPost3/twitter.png" alt></p>
<p>The first peak occurred on April 2nd, the pre-sale date, and after the premiere on April 22nd the amount of discussion kept growing toward its climax. Combining this information with our prediction, we got a variance value of 1.95 billion, for a total of 2.8 billion in global revenue.</p>
<h3 id="Feelings"><a href="#Feelings" class="headerlink" title="Feelings"></a>Feelings</h3><p>We are excited that our project was voted the best overall project at the final presentation (4 votes from TAs). We did not make a very attractive poster for our project, but we did spend a lot of time on coding. One TA told us that our model was more complicated than those of most groups. Data collection, cleaning, and analysis in this project were really challenging. Our project can help moviegoers decide whether to watch a movie, and cinemas can use it to decide how to allocate resources among different movies.</p>
<p>We used many techniques covered in this course, such as data cleaning (hw1), web scraping (hw2), NLP (hw7), machine learning (hw5), MapReduce (hw3), and data visualization (hw6). We have learned how to be novice data scientists. </p>
<p> <img src="/csci1951a.github.io/2019/05/04/BlogPost3/poster.jpg" alt></p>
</div>
<footer class="article-footer">
<a data-url="https://liamju.github.io/csci1951a.github.io/2019/05/04/BlogPost3/" data-id="cjv96pkrb000287ljmfjhn1om" class="article-share-link">Share</a>
</footer>
</div>
</article>
<article id="post-BlogPost2" class="article article-type-post" itemscope itemprop="blogPost">
<div class="article-meta">
<a href="/csci1951a.github.io/2019/04/18/BlogPost2/" class="article-date">
<time datetime="2019-04-18T15:05:06.000Z" itemprop="datePublished">2019-04-18</time>
</a>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="article-title" href="/csci1951a.github.io/2019/04/18/BlogPost2/">Blog Post 2</a>
</h1>
</header>
<div class="article-entry" itemprop="articleBody">
<h1 id="Current-Status"><a href="#Current-Status" class="headerlink" title="Current Status"></a>Current Status</h1><p>We trained a box office prediction model that predicts a rough value for the final box office, and a social media model that correlates the amount of discussion about a movie with its box office.</p>
<h2 id="Box-Office-Prediction-Model"><a href="#Box-Office-Prediction-Model" class="headerlink" title="Box Office Prediction Model"></a>Box Office Prediction Model</h2><p>To simplify the problem, the model is trained only on Marvel superhero movies, and 5 features are considered for training. The selected features are: the number of Twitter followers of the director, actors, and actresses (top three), the time elapsed since the first movie (capturing the influence of release timing), and the number of views of the movie trailers on YouTube (capturing popularity).</p>
<p>Feature example:<br>[top1_number_of_followers, top2_number_of_followers, top3_number_of_followers, years_past, times_watch_movie_trailers]</p>
<p>Since the amount of data is very limited, we chose a linear regression model to make the prediction. We used linear_model from the sklearn package. </p>
<p>The fitted coefficients for each feature and the resulting prediction are:<br>Coefficients:<br> [16.29878943 -9.82518317 19.13476566 14.24214735 1.05286416]<br>Prediction:<br> [614.15516941]</p>
<p>Based on this model, the predicted box office for the movie is 614.2 million, which is much lower than we expected. One possible reason is that the trailer’s view count cannot yet reflect the popularity of Avengers: Endgame, since the trailer was only released on March 13th and there are still two weeks to the premiere. As the view count increases, we can expect a greater value.</p>
<p>For the data visualization, we selected the two most influential features as the x and y axes: the top-1 number of Twitter followers among the crew and the trailer view count. The z axis is the box office of the movie. As the plot shows, movies with more followers and more trailer views tend to have a higher box office. </p>
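<p>For reference, a fitted linear regression turns coefficients into a prediction by taking the intercept plus the weighted sum of the features. The sketch below uses the coefficients reported above, but the intercept and the feature values are hypothetical placeholders, since the post does not list them; it will not reproduce the 614.2 million figure.</p>

```python
def predict_from_coefficients(coefficients, intercept, features):
    """Linear model prediction: intercept + sum(coefficient_i * feature_i)."""
    if len(coefficients) != len(features):
        raise ValueError("one feature value is needed per coefficient")
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Coefficients reported in the post, in the order of the feature example.
coefficients = [16.29878943, -9.82518317, 19.13476566, 14.24214735, 1.05286416]
# Hypothetical intercept and feature values, chosen only to make the sketch run.
intercept = 0.0
features = [10.0, 8.0, 5.0, 11.0, 50.0]
prediction = predict_from_coefficients(coefficients, intercept, features)
```

<p>The negative second coefficient is a reminder that, with so few training movies, individual weights are hard to interpret on their own.</p>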
<p><img src="/csci1951a.github.io/2019/04/18/BlogPost2/basic.png" alt></p>
<p>As a next step, we plan to train on all the movies in the same genres as Avengers: Endgame, namely “Science Fiction”, “Action” or “Adventure”.<br>More features are also being considered for training: cast total Facebook likes and IMDB movie score.</p>
<h2 id="Social-Media-Model"><a href="#Social-Media-Model" class="headerlink" title="Social Media Model"></a>Social Media Model</h2><p>In the midterm report, we planned to use the Twitter standard API to estimate the number of tweets related to the target movie each day. For a specific movie, we planned to treat the number of related tweets i days before the release date as feature Xi. If we are interested in all tweets from the 30 days prior to the release date, we get 30 features: X30, X29, … , X1. </p>
<p>Take the movie Avengers: Endgame as an example, whose release date is 04/26/2019. We first used the standard search API to get the related tweets from the past 9 days (03/27 ~ 04/04). These are features X32, X31, …, X24. Finally, we can plot the following figure. </p>
<p><img src="/csci1951a.github.io/2019/04/18/BlogPost2/popularity.png" alt></p>
<p>The number of tweets related to Avengers: Endgame is very large on April 2 because pre-sales began on that day.</p>
<p>We have completed this function. However, due to the restrictions of the Twitter API, we cannot access very old data. Even though we have upgraded to a Premium account, we can only make 100 requests each month, which is far from enough for our project. So we may have to give up using Twitter to estimate the popularity of a movie and use data from other sources instead. Currently, we want to use the New York Times API to search for articles related to a movie. Since the number of articles is much smaller than the number of tweets, we may not be able to use the article count directly as a feature; we need to analyze the content of the articles.</p>
<h2 id="Next-Steps"><a href="#Next-Steps" class="headerlink" title="Next Steps"></a>Next Steps</h2><h3 id="Goals"><a href="#Goals" class="headerlink" title="Goals"></a>Goals</h3><p>Data processing: enlarge our training data with more movies in the same genres, and search for features for these new movies.<br>Train a model for predicting the base value (first-day box office): linear regression model</p>
<h3 id="Features"><a href="#Features" class="headerlink" title="Features"></a>Features</h3><p>We found ways to automate the data collection process and found an additional dataset, so we are going to extend the set of candidate features for our experiments.<br>The current feature list for the box-office prediction model is as follows:<br>[movie_title, box_office, release_date, genres, Series, actors, actors_popularity, director, director_popularity, production_company, company_total_gross, trailer_viewcount, pre_sale]</p>
<p>Previously, we did not find a good way to get an indicator of actor popularity through web scraping or an existing API. Many fan accounts and fake accounts use the same names as the actors, and we wanted to avoid counting them, so simple web scraping would not work and we collected the actors’ follower counts manually. This time, we found existing data on Facebook likes for top actors and combined multiple datasets. In addition, we use the YouTube API to query the view count of each movie’s trailer.</p>
<h3 id="Switch-from-Twitter-to-NYT-Article"><a href="#Switch-from-Twitter-to-NYT-Article" class="headerlink" title="Switch from Twitter to NYT Article"></a>Switch from Twitter to NYT Article</h3><p>The original data source for this project was Twitter. However, due to the limitations of the Twitter API, data collection turned out to be rather hard. Therefore, New York Times articles were used as a replacement. Due to the nature of articles, the analysis methodology needs to be adjusted. The newly proposed method is to apply a natural language processing technique (semantic analysis) to each article and rate each movie based on the result. This article-based rating can be used as one feature of the prediction model. </p>
<h2 id="Timeline"><a href="#Timeline" class="headerlink" title="Timeline"></a>Timeline</h2><p>Apr/20 - Apr/25: Data preparation for linear regression training; Social media model training<br>Apr/26 - Apr/30: Data update and optimization<br>May/1 - May/2: Poster preparation<br>May/3 - May/10: Model generalization and conclusion; Blog3</p>
</div>
<footer class="article-footer">
<a data-url="https://liamju.github.io/csci1951a.github.io/2019/04/18/BlogPost2/" data-id="cjv96pkr2000187ljb4fzqwhh" class="article-share-link">Share</a>
</footer>
</div>
</article>
<article id="post-BlogPost1" class="article article-type-post" itemscope itemprop="blogPost">
<div class="article-meta">
<a href="/csci1951a.github.io/2019/03/15/BlogPost1/" class="article-date">
<time datetime="2019-03-15T20:26:45.000Z" itemprop="datePublished">2019-03-15</time>
</a>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="article-title" href="/csci1951a.github.io/2019/03/15/BlogPost1/">Blog Post 1</a>
</h1>
</header>
<div class="article-entry" itemprop="articleBody">
<h2 id="About-the-Project"><a href="#About-the-Project" class="headerlink" title="About the Project"></a>About the Project</h2><p>Is it impossible to predict the box office of a movie? Maybe not :) With “big data” resources and “machine learning” methods, we could make some accurate forecasts. That is what we are trying to achieve in this project. Starting by forecasting one popular movie, Avengers: Endgame, which will be released in April 2019, we will see how different features of the movie influence the box office. The features to be considered fall mainly into three categories: the financial influence of the movie crew, social comments, and the economic background of the movie industry. Learning from this prediction model for one specific movie, we hope to generalize the forecasting process into a useful tool for selecting the upcoming movie with the most box office potential.</p>
<h3 id="Goals"><a href="#Goals" class="headerlink" title="Goals"></a>Goals</h3><ul>
<li>Predict the first-day box office of the movie Avengers: Endgame.</li>
<li>Predict the total box office of the movie Avengers: Endgame.</li>
<li>Generalize the movie box office forecast model.</li>
</ul>
<h2 id="Data"><a href="#Data" class="headerlink" title="Data"></a>Data</h2><h3 id="Movie-features-crew-plot-keywords-similar-movies-release-information"><a href="#Movie-features-crew-plot-keywords-similar-movies-release-information" class="headerlink" title="Movie features: crew, plot keywords, similar movies, release information."></a>Movie features: crew, plot keywords, similar movies, release information.</h3><p>We use a dataset downloaded from Kaggle. After data cleaning, part of the information looks like this.<br><img src="/csci1951a.github.io/2019/03/15/BlogPost1/Blog1Kagglecsv.png" alt></p>
<h3 id="Social-influence-reviews-hashtags-on-social-media"><a href="#Social-influence-reviews-hashtags-on-social-media" class="headerlink" title="Social influence: reviews, hashtags on social media"></a>Social influence: reviews, hashtags on social media</h3><p>We use data from Twitter to represent the social influence of the movie. For simplicity, we first consider only the number of tweets. In a later version, we will also use NLP methods to analyze the content of the tweets.</p>
<h2 id="Methods"><a href="#Methods" class="headerlink" title="Methods"></a>Methods</h2><p>We divide the box office of a movie into two parts: base and variance. </p>
<pre><code>box office = base + variance
</code></pre><p>Then we use the dataset from Kaggle to predict the base value and the data from Twitter to predict the variance part.<br>The Kaggle dataset has many useful features that help us predict the base value of a film’s box office. However, it is missing one necessary field: the box office on the first day of release. Box Office Mojo is a website that offers this information: <a href="https://www.boxofficemojo.com/alltime/days/?page=open&p=.htm" target="_blank" rel="noopener">https://www.boxofficemojo.com/alltime/days/?page=open&p=.htm</a>. Therefore, an additional step of web scraping and data cleaning is needed for the movie feature data.</p>
<p><img src="/csci1951a.github.io/2019/03/15/BlogPost1/Blog1boxofficemojo.png" alt><br>For the base part, we run a multiple regression analysis, where Y is the base value of the box office and X is a set of movie features. For the variance, we will adopt a machine learning method for the analysis and also use regression to predict the value.</p>
<h2 id="Timeline"><a href="#Timeline" class="headerlink" title="Timeline"></a>Timeline</h2><p>Feb/2019: Topic discussion & Data preparation<br>Mar/1 - Mar/15: Methodology research & Data - preprocessing I<br>Mar/15 - Mar/31: Data - preprocessing II & Train model for predicting base value (first day box office)<br>Mar/30 - Apr/15: Train model for predicting base value (whole box office) & variance value(first day box office)<br>Apr/15 - Apr/30: Train model for predicting variance value(real time)</p>
<h2 id="Current-Status"><a href="#Current-Status" class="headerlink" title="Current Status"></a>Current Status</h2><h3 id="Progress"><a href="#Progress" class="headerlink" title="Progress"></a>Progress</h3><p>We crawl all tweets related to a movie from the month before its release date and compute the number of tweets on each day. This gives 30 features x0, … , x29. So if we want to predict the box office of <em>Avengers: Endgame</em>, we need to crawl not only all tweets related to it but also those for a set of similar movies, which form our training set. We need to crawl the tweets for these movies one by one (MapReduce).</p>
<p>To use the Twitter API, we first created a Twitter developer account and obtained credentials. Then we created an app on <a href="https://developer.twitter.com/en/apps" target="_blank" rel="noopener">https://developer.twitter.com/en/apps</a>. Note that Twitter is very concerned about users’ privacy: you have to describe in detail the functionality of your app and how you will use the data you get from Twitter. After creating an app, we can get the consumer API keys and access tokens and use the Twitter APIs in our program.</p>
<p>In our project, we chose Tweepy, a Python wrapper around the Twitter API, because it is simple to use yet fully supports the Twitter API. There are many other libraries in various programming languages that let you use the Twitter API.</p>
<p>The first step is to set up Tweepy to authenticate with the Twitter credentials:<br><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> tweepy</span><br><span class="line"><span class="keyword">from</span> tweepy <span class="keyword">import</span> OAuthHandler</span><br><span class="line"><span class="comment"># Credentials come from the app created in the Twitter developer portal.</span></span><br><span class="line">auth = OAuthHandler(consumer_key, consumer_secret)</span><br><span class="line">auth.set_access_token(access_token, access_token_secret)</span><br><span class="line">api = tweepy.API(auth)</span><br></pre></td></tr></table></figure></p>
<p>Then we can use the search API to get tweets with the hashtag “TheAvengers”, using the following code:<br><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># csvWriter is a csv.writer opened beforehand; sinceId and maxId bound the tweet ids.</span></span><br><span class="line"><span class="keyword">for</span> tweet <span class="keyword">in</span> tweepy.Cursor(api.search, q=<span class="string">"#TheAvengers"</span>, count=<span class="number">100</span>,</span><br><span class="line">                           lang=<span class="string">"en"</span>,</span><br><span class="line">                           since_id=sinceId,</span><br><span class="line">                           max_id=maxId).items():</span><br><span class="line">    print(tweet.created_at, tweet.text, tweet.id)</span><br><span class="line">    csvWriter.writerow([tweet.created_at, tweet.id, tweet.text.encode(<span class="string">'utf-8'</span>)])</span><br></pre></td></tr></table></figure></p>
<p>The result looks like:<br><img src="/csci1951a.github.io/2019/03/15/BlogPost1/Blog1Twittercsv.png" alt></p>
<p>Note that in the latest Twitter search API, Twitter replaced the parameters <em>since</em> and <em>until</em> with <em>since_id</em> and <em>max_id</em>; a tweet id is an integer that increases monotonically over time. Another problem is that the search index has a 7-day limit. In other words, to get all tweets from the past 30 days, you have to call this API 5 times.</p>
<h2 id="Next-Steps"><a href="#Next-Steps" class="headerlink" title="Next Steps"></a>Next Steps</h2><h3 id="Goals-1"><a href="#Goals-1" class="headerlink" title="Goals"></a>Goals</h3><p>Data preprocessing II: use MapReduce to process the data. We set the date as the key and count the number of tweets on each day.<br> Collect data on first-day box office for the movies.<br>Train a model for predicting the base value (first-day box office): linear regression model training.</p>
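<p>The count of 5 calls follows from splitting 30 days into windows of at most 7 days. The sketch below only illustrates that arithmetic; <code>search_windows</code> is a hypothetical helper, and in practice each window has to be fetched while it is still inside the 7-day search index.</p>

```python
from datetime import date, timedelta


def search_windows(end, total_days=30, window_days=7):
    """Split the `total_days` before `end` into windows of at most `window_days`."""
    windows = []
    start = end - timedelta(days=total_days)
    while start < end:
        stop = min(start + timedelta(days=window_days), end)
        windows.append((start, stop))
        start = stop
    return windows


windows = search_windows(date(2019, 4, 26))
```

<p>Four full weeks cover 28 days, so a fifth, shorter window is needed for the remaining 2 days.</p>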
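<p>The planned MapReduce step, with the date as the key, can be sketched in plain Python. This is a single-process illustration, not a distributed job: the function names and the toy tweets are made up, and the timestamp format is an assumption.</p>

```python
from collections import defaultdict


def map_tweet(tweet):
    """Map step: emit (date, 1), using the first 10 characters of created_at."""
    yield tweet["created_at"][:10], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each date key."""
    totals = defaultdict(int)
    for day, count in pairs:
        totals[day] += count
    return dict(totals)


# Toy tweets; real records would come from the crawled CSV files.
tweets = [
    {"created_at": "2019-04-02 10:15:00", "text": "toy tweet one"},
    {"created_at": "2019-04-02 18:40:00", "text": "toy tweet two"},
    {"created_at": "2019-04-03 09:05:00", "text": "toy tweet three"},
]
pairs = [pair for tweet in tweets for pair in map_tweet(tweet)]
daily_counts = reduce_counts(pairs)
```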
</div>
<footer class="article-footer">
<a data-url="https://liamju.github.io/csci1951a.github.io/2019/03/15/BlogPost1/" data-id="cjv96pkqy000087ljpzkdt2xf" class="article-share-link">Share</a>
</footer>
</div>
</article>
</section>
<aside id="sidebar">
<div class="widget-wrap">
<h3 class="widget-title">Archives</h3>
<div class="widget">
<ul class="archive-list"><li class="archive-list-item"><a class="archive-list-link" href="/csci1951a.github.io/archives/2019/05/">May 2019</a></li><li class="archive-list-item"><a class="archive-list-link" href="/csci1951a.github.io/archives/2019/04/">April 2019</a></li><li class="archive-list-item"><a class="archive-list-link" href="/csci1951a.github.io/archives/2019/03/">March 2019</a></li></ul>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">Recent Posts</h3>
<div class="widget">
<ul>
<li>
<a href="/csci1951a.github.io/2019/05/04/BlogPost3/">Final Blog</a>
</li>
<li>
<a href="/csci1951a.github.io/2019/04/18/BlogPost2/">Blog Post 2</a>
</li>
<li>
<a href="/csci1951a.github.io/2019/03/15/BlogPost1/">Blog Post 1</a>
</li>
</ul>
</div>
</div>
</aside>
</div>
<footer id="footer">
<div class="outer">
<div id="footer-info" class="inner">
© 2019 Shunjia Zhu<br>
Powered by <a href="http://hexo.io/" target="_blank">Hexo</a>
</div>
</div>
</footer>
</div>
<nav id="mobile-nav">
<a href="/csci1951a.github.io/" class="mobile-nav-link">Home</a>
<a href="/csci1951a.github.io/archives" class="mobile-nav-link">Archives</a>
</nav>
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script>
<link rel="stylesheet" href="/csci1951a.github.io/fancybox/jquery.fancybox.css">
<script src="/csci1951a.github.io/fancybox/jquery.fancybox.pack.js"></script>
<script src="/csci1951a.github.io/js/script.js"></script>
</div>
</body>
</html>