final report beta

BigDataSystemTHU2018 · Dec 23, 2018 · d7b4006
1 parent 79ec49c
commit d7b4006
Showing 15 changed files with 192 additions and 28 deletions.
diff --git a/Code/PostProcesssor/Error/ratioplot.py b/Code/PostProcesssor/Error/ratioplot.py
@@ -0,0 +1,48 @@
+import numpy as np
+import matplotlib.pyplot as plt
+
+x = [24,48,72,96,120,144,168]
+y1 = [45.66,42.49,41.17,40.52,40.07,39.83,39.82]
+y2 = [53.01,46.02,42.34,39.96,38.22,36.76,35.60]
+y3 = [48.80,43.20,40.19,38.37,37.10,36.10,35.35]
+y4 = [51.49,43.94,39.96,37.40,35.46,33.91,32.70]
+y0 = [40.07,40.59,40.80,40.68,40.46,40.27,40.16]
+
+plt.plot(x, y1,label="model1",marker = "o")
+plt.plot(x, y2,label="model2",marker = "*")
+plt.plot(x, y3,label="model3",marker = "+")
+plt.plot(x, y4,label="model4",marker = "_")
+plt.plot(x, y0,label="interpolation",marker = "s")
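+# y-axis: share of grid cells with absolute error < 50 at each forecast horizon.
+plt.xlabel("Time/h")
+plt.ylabel("accuracy(%)")
+plt.legend(loc="best")
+plt.title("Change trend with time")
+
+
+# Save before show(); saving after show() can produce an empty image.
+plt.savefig('accuracy.png',dpi=200)
+plt.show()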
+
+xx = [24,48,72,96,120,144,168]
+yy1 = [100,100,100,97.91,98.33,98.61,98.88]
+yy2 = [100,100,100,97.91,85.83,71.52,61.30]
+yy3 = [100,87.50,91.66,87.50,78.33,65.27,55.95]
+yy4 = [100,95.83,97.22,85.41,68.33,56.94,48.80]
+yy0 = [20.83,27.08,29.16,28.12,22.20,18.75,16.07]
+
+plt.plot(xx, yy1,label="model1",marker = "o")
+plt.plot(xx, yy2,label="model2",marker = "*")
+plt.plot(xx, yy3,label="model3",marker = "+")
+plt.plot(xx, yy4,label="model4",marker = "_")
+plt.plot(xx, yy0,label="interpolation",marker = "s")
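+# y-axis: share of grid cells with RMSE < 500 at each forecast horizon.
+plt.xlabel("Time/h")
+plt.ylabel("accuracy(%)")
+plt.legend(loc="best")
+plt.title("Change trend with time")
+
+
+# Save before show(); saving after show() can produce an empty image.
+plt.savefig('accuracyRSME.png',dpi=200)
+plt.show()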
diff --git a/Docs/Final-report.pdf b/Docs/Final-report.pdf
diff --git a/Report.pdf b/Report.pdf
diff --git a/Reports/Appendices/AppendixA.tex b/Reports/Appendices/AppendixA.tex
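@@ -65,7 +65,7 @@ \section{Project Details}
   \item Phase 2: 11.07-11.28\\
   Build the complete system and validate the model, obtaining the prediction results and the prediction accuracy
   \item Phase 3: 11.29-12.05\\
-  Optimize the system and analyze the features in depth, e.g. how the algorithm reflects the various influencing factors and processing methods, and how these affect the results\\
+  Optimize the system and analyze the features in depth, e.g. how the algorithm reflects the various influencing factors and processing methods, and how these affect the results
   \item Phase 4: 12.06-12.15\\
   Summarize and analyze the work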
 \end{enumerate}

diff --git a/Reports/Appendices/AppendixB.tex b/Reports/Appendices/AppendixB.tex
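@@ -4,7 +4,7 @@ \chapter{System Construction} % Main appendix title
 \label{AppendixB} % For referencing this appendix elsewhere, use \ref{AppendixA}
 \section{Model Functions and System Implementation}
 
-All code and documentation for this project are open-sourced at Project-Unicom \quad \url{https://github.com/BigDataSystemTHU2018/Project-Unicom}
+The code and documentation for this project are open-sourced at Project-Unicom \\ \url{https://github.com/BigDataSystemTHU2018/Project-Unicom}
 
 \subsection{System Architecture}
 As shown in Figure (\ref{fig:pattern}), the overall model architecture has three layers: data preprocessing and pre-analysis, model construction, and result visualization and post-analysis.
@@ -14,3 +14,55 @@ \subsection{System Architecture}
 \caption{Project analysis approach}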
 \label{fig:pattern}
 \end{figure}
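+\subsection{Code Architecture}
+The analysis code is available at \\ \url{https://github.com/BigDataSystemTHU2018/Project-Unicom/tree/master/Code}
+The main modules are broken down as follows:
+\begin{itemize}
+	\item Problem analysis phase
+		\begin{itemize}
+			\item Problem description: use the past three months of signaling data to predict the population distribution over the coming week, i.e. given the past three months of data, predict the next week's data (as accurately as possible)
+			\item Problem type: regression
+			\item Evaluation criterion: the predictions should be as accurate as possible, i.e. the error between predicted and actual values should be as small as possible.
+		\end{itemize}
+	\item Design phase: design a regression system whose input is the past three months of data and whose output is the analysis of the existing data together with the predictions. The system contains the following modules:
+		\begin{itemize}
+			\item Preprocessing module:
+			takes the raw data as input and outputs the cleaned data. It handles data cleaning, missing-value imputation, and the treatment of outliers and noise points.
+			\item Pre-analysis module:
+			extracts prior knowledge from the existing data and performs preliminary analysis and visualization.
+			\item Regression pre-processing module:
+			normalizes the data, among other steps.
+			\item Regressor module:
+			performs the prediction.
+			\item Regression post-processing module:
+			covers visualization, analysis of the predicted data, report generation, and so on.
+		\end{itemize}
+	\item Implementation phase, technical stack:
+		\begin{itemize}
+			\item Platform: tensorflow, sklearn
+			\item Algorithms: clustering (K-means), dimensionality reduction (PCA/NMF), regression (MLP/ResNet)
+			\item Visualization: matlab, matplotlib (seaborn)
+		\end{itemize}
+\end{itemize}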
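+\subsubsection*{Regressor Structure}
+The regressor of the regression system consists of four parts: data processing, ResNet construction, network training, and model prediction. Its code is available at\\ \url{https://github.com/BigDataSystemTHU2018/Project-Unicom/tree/master/Code/Processor/ResNet}\\ The main functions of the module are:
+\begin{enumerate}
+	\item Fetch all cleaned data from the local directory and load it into memory in the format ResNet requires.
+	\item Automatically split the data into training data and prediction data; the training data is fed into ResNet for model training.
+	\item Output the prediction results of the trained model.
+	\item Support resuming training from a checkpoint: the model is saved every 10 epochs during training, and any saved model can be loaded to continue training or to produce output.
+	\item Support transfer learning, i.e. using an already trained model to train on a new scenario, which improves training efficiency and reduces the number of samples required.
+	\item getdataTHU.py: searches the current directory for all .csv and .txt files and uses them as the input data. Place the data to be trained on in the current directory.
+	\item ResnetTHU.py: defines the network structure; the convolution kernel size (CONV\_SIZE) and the activation function can be changed as needed.
+	\item trainTHU.py: the training module, which also performs backpropagation. As needed, the image size can be changed via
+	IMAGE\_WIDTH and IMAGE\_HIGHT, and the number of channels via NUM\_CHANNELS1,\\
+	NUM\_CHANNELS2 and NUM\_CHANNELS3; BATCH\_SIZE2, BATCH\_SIZE3\\ and BATCH\_SIZE4 must then be changed accordingly to stay consistent (currently the most recent 3 hours, 4 hours from one day earlier, and 2 hours from one week earlier).
+	\item predictTHU.py: the prediction module; the number of days to predict must be set (currently 7).
+\end{enumerate}
+The basic logical architecture of the model network is shown in Figure (\ref{fig:B1}):
+\begin{figure}[ht]
+\centering
+\includegraphics[width=0.8\textwidth]{B2.png}
+\caption{Residual network configuration}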
+\label{fig:B1}
+\end{figure}
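
The file-discovery behaviour described for getdataTHU.py (use every .csv and .txt file in the current directory as input) could look roughly like the sketch below. This is not the repository's code: the function name `find_data_files` and its parameters are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def find_data_files(directory=".", suffixes=(".csv", ".txt")):
    """Return the sorted data files found directly inside `directory`.

    Hypothetical helper: only regular files whose extension is listed in
    `suffixes` are kept; subdirectories are not searched.
    """
    root = Path(directory)
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    # List the candidate training files in the current directory.
    for path in find_data_files():
        print(path.name)
```

Sorting the paths keeps the loading order deterministic across runs, which matters if the files are later concatenated into one time series.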
diff --git a/Reports/Chapters/Chapter1.tex b/Reports/Chapters/Chapter1.tex
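@@ -78,7 +78,7 @@ \subsection{Methods for Handling Sparse Data}
 \subsubsection*{Collaborative Filtering}
 Collaborative filtering (CF) is widely used in building recommender systems. Its basic idea is that similar users rate similar items in similar ways\cite{goldberg1992using}. Hence, if the relationship between users and items can be determined, future user ratings can be predicted\cite{nakamura1998collaborative}. In urban computing, the items can be geographic information, passengers, drivers, or service subscribers. Once the matrix has been assembled, this principle can be used to fill in the missing values.
 \subsubsection*{Matrix Factorization}
-Matrix factorization, as the name suggests, decomposes a matrix into the product of two or three matrices. Typical methods include LU decomposition, QR decomposition, and SVD, of which SVD is the most frequently used. A common approach when the matrix is very sparse is to approximate it with three low-rank matrices, for example by keeping only the rows whose corresponding singular values sum to more than \% 90 of the total sum of singular values.
+Matrix factorization, as the name suggests, decomposes a matrix into the product of two or three matrices. Typical methods include LU decomposition, QR decomposition, and SVD, of which SVD is the most frequently used. A common approach when the matrix is very sparse is to approximate it with three low-rank matrices, for example by keeping only the rows whose corresponding singular values sum to more than 90\% of the total sum of singular values.
 \subsubsection*{Tensor Decomposition}
 A tensor usually has three dimensions and can be decomposed into a product of matrices or vectors according to the characteristics of its data values. The objective function constraining the decomposition is to minimize
 the product of the values of the existing entries in . After the decomposition, we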

diff --git a/Reports/Chapters/Chapter3.tex b/Reports/Chapters/Chapter3.tex
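@@ -132,33 +132,55 @@ \subsubsection*{Theoretical Basis of the Model}
 \end{figure}
 \subsubsection*{Residual Network Structure}
 Figure (\ref{fig:3.1}) above shows the architecture of the residual network used in this study. On the temporal side it considers influences at three levels, \textbf{hour, day, and week}; it uses convolution to capture spatial correlations; and finally it adds the influence of external factors. The model takes as input the three most recent hours before the prediction time, four hours from one day earlier, and two hours from one week earlier. These inputs receive different weights in the network, and the residual is then computed to drive backpropagation.\\
-In the actual network design, the convolution kernel size is adjusted and the activation function is defined with the correlations between different urban regions in mind (e.g. flows via surface roads and the subway).
+\indent In the actual network design, the convolution kernel size is adjusted and the activation function is defined with the correlations between different urban regions in mind (e.g. flows via surface roads and the subway).
 \subsubsection*{Data Fusion}
 The data from the three different time periods in the model can be fused using parameter matrices:
 \begin{equation}
 \mathbf{X}_{Res} = \mathbf{W}_{hour} \circ \mathbf{X}_{hour} + \mathbf{W}_{day} \circ \mathbf{X}_{day} + \mathbf{W}_{week} \circ \mathbf{X}_{week}
 \end{equation}
 where $\circ$ denotes element-wise multiplication (the Hadamard product), not the ordinary matrix product, and $W_{hour}$, $W_{day}$ and $W_{week}$ are learnable model parameters
 \subsubsection*{Model Algorithm and Optimization Method}
+The algorithm steps and the optimization method of the model are given in Algorithm (\ref{al})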
 \begin{algorithm}[t]
-\caption{algorithm caption} % name of the algorithm
+\caption{ResNet Training Algorithm} % name of the algorithm
+We first construct the training
+instances from the original sequence data. Then, ResNet is trained via backpropagation and Adadelta.\\
 \hspace*{0.02in} {\bf Input:} % algorithm input; \hspace*{0.02in} controls the position, and \\ starts a new line
-input parameters A, B, C\\
+Historical observations: $\{X_0,\cdots,X_{n-1}\}$\\
+\hspace*{0.02in} Lengths of closeness, period, trend sequences: $l_c$, $l_p$, $l_q$\\
+\hspace*{0.02in} Period, trend span: $p$, $q$\\
 \hspace*{0.02in} {\bf Output:} % algorithm output
-output result
+Learned ResNet model\\
 \begin{algorithmic}[1]
-\State some description % ordinary statements go after \State
-\For{condition} % For statement; must be paired with EndFor
-　　\State ...
-　　\If{condition} % If statement; must be paired with EndIf
-　　　　\State ...
-　　\Else
-　　　　\State ...
+\State \# construct training instances\\
+$D \leftarrow \emptyset$\\ % ordinary statements go after \State
+\For{$t$ in all available time intervals $[1,n-1]$} % For statement; must be paired with EndFor
+　　\State $S_c = [X_{t-l_c},X_{t-(l_c-1)},\cdots,X_{t-1}]$
+	\State $S_p = [X_{t-l_p \cdot p},X_{t-(l_p-1) \cdot p},\cdots,X_{t-p}]$
+	\State $S_q = [X_{t-l_q \cdot q},X_{t-(l_q-1) \cdot q},\cdots,X_{t-q}]$
+\# $X_t$ is the target at time $t$\\
+Put a training instance $(\{S_c,S_p,S_q\},X_t)$ into $D$\\
+\EndFor
+\# define the loss
+\\
+LOSS = MSE($X_t$, output of ResNet)\\
+\# train the model\\
+\# initialize all learnable parameters $\theta$ in ResNet\\
+
+Epoch $\leftarrow$ number of training epochs\\
+
+
+
+\For{epoch in range(Epoch)} % For statement; must be paired with EndFor
+　　\State \textbf{Repeat}
+	\State Select a batch of instances $D_b$ from $D$
+	\State Update $\theta$ by minimizing the LOSS with $D_b$
+\textbf{Until} all elements of $D$ are used\\
+\# save model every 10 epochs
+　　\If{epoch \% 10 == 0} % If statement; must be paired with EndIf
+　　　　\State Save model
 　　\EndIf
 \EndFor
-\While{condition} % While statement; must be paired with EndWhile
-　　\State ...
-\EndWhile
-\State \Return result
 \end{algorithmic}
+\label{al}
 \end{algorithm}
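
The instance-construction loop of the training algorithm in the Chapter3.tex hunk can be sketched in NumPy as follows. This is a minimal illustration, not the project's trainTHU.py: the function name `build_instances`, the `(n, H, W)` array layout, and the choice to start at the first `t` for which all three sequences exist are assumptions.

```python
import numpy as np


def build_instances(X, l_c, l_p, l_q, p, q):
    """Build ((S_c, S_p, S_q), X_t) training pairs from a frame sequence.

    X is assumed to have shape (n, H, W), one frame per time interval.
    S_c holds the l_c most recent frames, S_p holds l_p frames spaced p
    intervals apart, and S_q holds l_q frames spaced q intervals apart.
    """
    n = len(X)
    # Earliest t for which every index t - i*p and t - i*q is non-negative.
    start = max(l_c, l_p * p, l_q * q)
    D = []
    for t in range(start, n):
        S_c = np.stack([X[t - i] for i in range(l_c, 0, -1)])
        S_p = np.stack([X[t - i * p] for i in range(l_p, 0, -1)])
        S_q = np.stack([X[t - i * q] for i in range(l_q, 0, -1)])
        D.append(((S_c, S_p, S_q), X[t]))
    return D


if __name__ == "__main__":
    # Toy data: 400 hourly 1x1 frames whose value equals the time index.
    X = np.arange(400, dtype=float).reshape(400, 1, 1)
    D = build_instances(X, l_c=3, l_p=4, l_q=2, p=24, q=168)
    (S_c, S_p, S_q), target = D[0]
    print(len(D), S_c.ravel(), S_p.ravel(), S_q.ravel(), target.item())
```

With hourly frames, `p=24` and `q=168` correspond to a daily period and a weekly trend. Those values are illustrative and are not taken from the repository.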
diff --git a/Reports/Chapters/Chapter4.tex b/Reports/Chapters/Chapter4.tex
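@@ -71,14 +71,14 @@ \subsection{Prediction Results under Different Models and Parameters}
 	\item Model 3\\ initial learning rate 0.01, learning rate decay 0.999, optimizer AdadeltaOptimizer
 	\item Model 4 \\initial learning rate 0.001, learning rate decay 0.999999999, optimizer AdadeltaOptimizer
 \end{itemize}
-Taking Model 1 as an example, over forecast horizons from one to seven days its prediction accuracy is higher than that of the interpolation model throughout, and the accuracy also varies with the parameters. Figure (\ref{fig:acc}) shows more clearly that the prediction accuracy of every model declines as the forecast time moves further out.\\
-In summary, we recommend analyzing day by day in practical forecasting, because the error grows and accumulates as the number of days increases (a natural outcome), so accuracy measured over the whole week suffers. In addition, day-by-day analysis reveals
+\indent Taking Model 1 as an example, over forecast horizons from one to seven days its prediction accuracy is higher than that of the interpolation model throughout, and the accuracy also varies with the parameters. Figure (\ref{fig:acc}) shows more clearly that the prediction accuracy of every model declines as the forecast time moves further out.\\
+\indent In summary, we recommend analyzing day by day in practical forecasting, because the error grows and accumulates as the number of days increases (a natural outcome), so accuracy measured over the whole week suffers. In addition, day-by-day analysis reveals
 how many forecast days are acceptable within a given error tolerance. For example, the first three days may be predicted much better than interpolation, the accuracy may match interpolation from the fourth day on, and interpolation may outperform the model afterwards; all of these can occur in practice.
 \begin{table}[htbp]
   \centering
   \caption{Effect of different parameter combinations on prediction accuracy}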
     \begin{tabular}{rrrrrrrrr}
-    \hline
+    \hline \\
            &       &       &       &\multicolumn{1}{l}{error<50} &       &       &       &  \\\\
            \hline
           & \multicolumn{1}{l}{24h} & \multicolumn{1}{l}{48h} & \multicolumn{1}{l}{72h} & \multicolumn{1}{l}{96h} & \multicolumn{1}{l}{120h} & \multicolumn{1}{l}{144h} & \multicolumn{1}{l}{168h} &  \\
@@ -115,15 +115,55 @@ \subsection{Prediction Results under Different Models and Parameters}
   \label{tab:diffpara}%
 \end{table}%
 
+
+% Table generated by Excel2LaTeX from sheet 'Sheet1'
+\begin{table}[htbp]
+  \centering
+  \caption{Comparison of RMSE for different parameter combinations}
+    \begin{tabular}{lrrrrrrr}
+    \hline\\
+          & \multicolumn{7}{c}{RMSE<300} 
+          \\\\
+          \hline
+          & \multicolumn{1}{l}{24h} & \multicolumn{1}{l}{48h} & \multicolumn{1}{l}{72h} & \multicolumn{1}{l}{96h} & \multicolumn{1}{l}{120h} & \multicolumn{1}{l}{144h} & \multicolumn{1}{l}{168h} \\
+    Model 1   & 58.30\% & 50.00\% & 47.22\% & 43.75\% & 41.67\% & 40.27\% & 39.88\% \\
+    Model 2   & 100.00\% & 75.00\% & 59.72\% & 44.79\% & 35.83\% & 29.86\% & 25.59\% \\
+    Model 3   & 100.00\% & 68.75\% & 52.77\% & 39.58\% & 31.66\% & 26.38\% & 22.61\% \\
+    Model 4   & 100.00\% & 68.75\% & 47.22\% & 35.41\% & 28.33\% & 23.61\% & 20.23\% \\
+    Interpolation    & 0.00\% & 0.00\% & 0.00\% & 0.00\% & 0.00\% & 0.00\% & 0.00\% \\
+    \hline
+          &       &       &       &       &       &       &  \\
+          & \multicolumn{7}{c}{RMSE<400} \\\\
+          \hline
+          & \multicolumn{1}{l}{24h} & \multicolumn{1}{l}{48h} & \multicolumn{1}{l}{72h} & \multicolumn{1}{l}{96h} & \multicolumn{1}{l}{120h} & \multicolumn{1}{l}{144h} & \multicolumn{1}{l}{168h} \\
+    Model 1   & 95.83\% & 97.90\% & 98.60\% & 87.50\% & 85.83\% & 86.68\% & 88.63\% \\
+    Model 2   & 100.00\% & 81.25\% & 86.11\% & 72.91\% & 58.33\% & 48.61\% & 41.66\% \\
+    Model 3   & 100.00\% & 79.16\% & 73.61\% & 62.50\% & 50.00\% & 41.66\% & 35.71\% \\
+    Model 4   & 100.00\% & 79.16\% & 65.27\% & 48.95\% & 39.16\% & 32.63\% & 27.97\% \\
+    Interpolation    & 0.00\% & 0.00\% & 0.00\% & 0.00\% & 0.00\% & 0.00\% & 0.00\% \\
+    \hline
+          &       &       &       &       &       &       &  \\
+          & \multicolumn{7}{c}{RMSE<500} \\\\
+          \hline
+          & \multicolumn{1}{l}{24h} & \multicolumn{1}{l}{48h} & \multicolumn{1}{l}{72h} & \multicolumn{1}{l}{96h} & \multicolumn{1}{l}{120h} & \multicolumn{1}{l}{144h} & \multicolumn{1}{l}{168h} \\
+    Model 1   & 100.00\% & 100.00\% & 100.00\% & 97.91\% & 98.33\% & 98.61\% & 98.88\% \\
+    Model 2   & 100.00\% & 100.00\% & 100.00\% & 97.91\% & 85.83\% & 71.52\% & 61.30\% \\
+    Model 3   & 100.00\% & 87.50\% & 91.66\% & 87.50\% & 78.33\% & 65.27\% & 55.95\% \\
+    Model 4   & 100.00\% & 95.83\% & 97.22\% & 85.41\% & 68.33\% & 56.94\% & 48.80\% \\
+    Interpolation    & 20.83\% & 27.08\% & 29.16\% & 28.12\% & 22.50\% & 18.75\% & 16.07\% \\
+    \hline
+    \end{tabular}%
+  \label{tab:addlabel}%
+\end{table}%
+
 \begin{figure}[ht]
 \centering
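 \subfloat[Share of grid cells with absolute error below 50]{\label{fig:acc}{\includegraphics[width=0.45\textwidth]{accuracy.png}}}
-\subfloat[Change in mean squared error]{\label{fig:rsmeacc}{\includegraphics[width=0.45\textwidth]{accuracy.png}}}
+\subfloat[Change in mean squared error]{\label{fig:rsmeacc}{\includegraphics[width=0.45\textwidth]{accuracyRSME.png}}}
 \hfill
 \caption{Model prediction accuracy versus forecast horizon}
 %\label{fig:subfigures}
 \end{figure}
-Add the RMSE error analysis section
 \subsection*{Effect of Parameters}
 The analysis above shows that the designed residual network improves prediction accuracy considerably compared with both a simple deep neural network and the mean baseline. Table (\ref{tab:diffpara}) also shows how different parameters affect the training results of the residual network: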
 \begin{itemize}
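@@ -154,6 +194,8 @@ \subsection{Result Visualization and Regional Analysis}
 \label{fig:region}
 \end{figure}
 Figure (\ref{fig:11}) shows the predicted crowd flow in the Sanlitun area compared with the true values. The curves show that the predictions agree well with the true values overall. In the early period the errors show no clear trend, which suggests that random factors may be an important source of prediction error; in the later period the accuracy tends to decline, consistent with the earlier discussion.\\
-Figure (\ref{fig:33}) shows the prediction results for Beijing West Railway Station, an area with heavy but noticeably varying crowd flow. The peaks are predicted relatively accurately, which indicates that the model can be useful for crowd-flow forecasting and early warning at places such as railway stations and subway stations.
+\indent Figure (\ref{fig:33}) shows the prediction results for Beijing West Railway Station, an area with heavy but noticeably varying crowd flow. The peaks are predicted relatively accurately, which indicates that the model can be useful for crowd-flow forecasting and early warning at places such as railway stations and subway stations.
 
-\section{Summary and Outlook}
+\section{Summary and Outlook}
+This work builds a model framework that predicts urban population distribution from mobile phone signaling data and can jointly account for multiple factors affecting population distribution. Under certain conditions, the model outperforms the baseline model by a large margin in prediction accuracy and trend capture, providing a feasible and accurate model for predicting population distribution and movement from urban big data. The model is also extensible: data sources can include mobile phone signaling data, GPS data, taxi data, and even subway trip data, all of which can be integrated into the model for prediction after appropriate preprocessing.\\
+\indent In addition, because of limits on data volume and time span, the training results here have not yet reached the model's best performance. In future applications with richer and more detailed data, the model's predictions are expected to improve.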
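
The accuracy figures in the Chapter4.tex tables are shares of grid cells that fall under an error threshold (absolute error below 50, or RMSE below 300/400/500). The report does not spell out the exact computation, so the sketch below is only one plausible reading: the function names, the `(T, cells)` array layout, and taking the RMSE over the time axis for each cell are all assumptions.

```python
import numpy as np


def share_below_abs_error(pred, true, threshold=50.0):
    """Percentage of entries whose absolute error is below `threshold`."""
    err = np.abs(np.asarray(pred, dtype=float) - np.asarray(true, dtype=float))
    return 100.0 * np.mean(err < threshold)


def share_below_rmse(pred, true, threshold=500.0):
    """Percentage of grid cells whose RMSE over time is below `threshold`.

    Assumes axis 0 is time and the remaining axes index grid cells.
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    rmse_per_cell = np.sqrt(np.mean(diff ** 2, axis=0))
    return 100.0 * np.mean(rmse_per_cell < threshold)


if __name__ == "__main__":
    # Two time steps, four grid cells; the true values are all zero.
    true = np.zeros((2, 4))
    pred = np.array([[10.0, 60.0, 0.0, 0.0],
                     [10.0, 60.0, 400.0, 600.0]])
    print(share_below_abs_error(pred, true, 50.0))
    print(share_below_rmse(pred, true, 300.0))
    print(share_below_rmse(pred, true, 500.0))
```

Evaluating these functions on the first 24, 48, ..., 168 hours of a forecast would give one value per horizon, matching the column layout of the tables.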
diff --git a/Reports/Figures/B1.png b/Reports/Figures/B1.png
diff --git a/Reports/Figures/B2.png b/Reports/Figures/B2.png
diff --git a/Reports/Figures/accuracy.png b/Reports/Figures/accuracy.png
diff --git a/Reports/Figures/accuracyRSME.png b/Reports/Figures/accuracyRSME.png
diff --git a/Reports/cover.pdf b/Reports/cover.pdf