< Dat Air >

Groups

< 李右元, 107753027 >
< 楊晴焱, 107155018 >

Goal

AQI (Air Quality Index) is a measure of air safety and cleanliness. It is derived from a numerical formula defined by the EPA (Environmental Protection Agency): the concentration of each of several gases is converted to a weighted value, and the maximum of these values is taken as the index. Since the index depends on the quantities of these gases, and the gases come from things in our daily lives, we ask: can the quantities of the things that emit these gases be used to predict AQI?
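The "maximum over per-gas values" idea can be sketched as follows. This is an illustration in Python, not the project's R code: the breakpoint table is invented for the example (it is not the official EPA table), and the piecewise-linear interpolation inside each band is an assumption about how the weighted values are obtained.

```python
# Illustrative only: these breakpoints are made up for the sketch,
# they are NOT the official EPA tables.
HYPOTHETICAL_BREAKPOINTS = {
    # pollutant: list of (conc_low, conc_high, index_low, index_high)
    "PM2.5": [(0.0, 15.0, 0, 50), (15.0, 35.0, 50, 100), (35.0, 55.0, 100, 150)],
    "O3": [(0.0, 50.0, 0, 50), (50.0, 100.0, 50, 100), (100.0, 150.0, 100, 150)],
}


def sub_index(pollutant, concentration):
    """Map one concentration to a sub-index by linear interpolation in its band."""
    for c_lo, c_hi, i_lo, i_hi in HYPOTHETICAL_BREAKPOINTS[pollutant]:
        if c_lo <= concentration <= c_hi:
            return i_lo + (i_hi - i_lo) * (concentration - c_lo) / (c_hi - c_lo)
    raise ValueError(f"{pollutant} concentration {concentration} is outside the table")


def aqi(concentrations):
    """Overall index: the largest sub-index among all measured pollutants."""
    return max(sub_index(p, c) for p, c in concentrations.items())


example = aqi({"PM2.5": 25.0, "O3": 25.0})
```

With these invented breakpoints, PM2.5 gives the larger sub-index, so it alone determines the overall value.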

Demo

Code execution
We recommend downloading the entire master branch as a zip file and decompressing it. Then change into the root of the extracted folder (named after the master branch) and run the following command in your terminal to execute "final.R":

Rscript code/final.R

Data Visualization
The code writes the performance of the models directly to PNG pictures, grouped by analysis dimension, so the results can be reviewed for research purposes without extra internet access.

The output pictures are in the "results" folder, with the following directory layout:
results/[The Dimension Folder Name]

For details on the analysis dimensions mentioned above, see the "Results" section below.

Folder organization and its related information

Docs

Slides from the presentation given on Jan 15, 2019

Data

Source

We collected the monthly AQI results for all cities in Taiwan from 2005 to 2017, together with related features of AQI contributors: the monthly numbers of cars and motorbikes, the garbage and waste generated, and air-pollution penalty and audit cases. Links to the data sources are listed under "References" below. The collection contains 3432 records in total.

Input format

Because we collected many features along with the AQI results, combining them into a single CSV file for modeling took considerable time: the open data comes from different bureaus and authorities, each with its own format and "unique aligning preference". The final CSV we used contains the following columns, in this order (English version):
- Label Columns:
  [Year],[Month],[City],[AQI]
- Feature Columns [Traffic]:
  [Car],[Bike]
- Feature Columns [Waste]:
  [TotalGarbageT],[GeneralGarbageT],[HugeGarbageT],[RecycleGarbageT],[KitchenWasteT],[WastePerPersonKG]
- Feature Columns [Penalty]:
  [PenaltyConstruction],[PenaltyPollution],[PenaltyMobilePollution]
- Feature Columns [Auditory]:
  [ExamConstruction],[ExamPollution],[ExamMobliePollution]
** Hint: AQI was converted with the R script "AQICoversion.R" in the "code" folder. Note that our code uses "replaceChinese.csv" instead of "AllFeatures+Labelv4.csv", because the non-English text in "AllFeatures+Labelv4.csv" displays incorrectly after being re-downloaded from GitHub. We kept both CSV files for comparison. To reproduce this model experiment, switch or replace the data in the "data" folder under the master-branch project folder.

Data preprocessing
- Handle missing data
  Fortunately, only about 10% of the data has missing values, all of them in the [Car] and [Bike] features, so we simply removed those records from the data set. However, the missing values are city-oriented and time-bound: the affected cities lack half of their figures over one consecutive interval, 2005-2010, so predictions for these cities may be strongly biased.
- Scale value
  Because the features have different units and their values vary massively in scale, we standardized the values, which brings them to a common scale and removes the differences in units.
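The two preprocessing steps above can be sketched as follows. This is an illustration in Python, not the project's R code: the sample rows and city names are made up, and dividing by the population standard deviation is a choice made for the sketch.

```python
import statistics


def drop_missing(rows, required):
    """Keep only rows where every required column has a non-empty value."""
    return [r for r in rows if all(r.get(k) not in (None, "") for k in required)]


def standardize(rows, features):
    """Return copies of the rows with each feature rescaled to mean 0 and unit spread."""
    out = [dict(r) for r in rows]
    for f in features:
        values = [float(r[f]) for r in rows]
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)  # population standard deviation
        for r in out:
            # A constant column has no spread, so map it to 0.0.
            r[f] = (float(r[f]) - mean) / sd if sd > 0 else 0.0
    return out


# Made-up rows for illustration; city "B" is missing its [Car] value.
rows = [
    {"City": "A", "Car": "100", "Bike": "300"},
    {"City": "B", "Car": "", "Bike": "500"},
    {"City": "C", "Car": "300", "Bike": "700"},
]
clean = drop_missing(rows, ["Car", "Bike"])
scaled = standardize(clean, ["Car", "Bike"])
```

After standardization, [Car] and [Bike] are on the same scale even though their raw counts differ.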

Code

We used three models within our capabilities, [KNN], [Decision Tree] and [Random Forest], because we wanted to compare and optimize their performance. Since our data is city-oriented and time-bound, we also tried to find out whether city and month strongly affect prediction by examining the average performance along three dimensions: [By All-data], [By Cities] and [By Months]. In all three dimensions we applied cross-validation (data split ratio: 70% training, 30% testing) and scored the testing results by Precision, Recall and F1-Score individually.
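A 70%/30% split and the three scores can be sketched as follows. This is an illustration in Python, not the project's R code: the label names are made up, and macro averaging over classes is an assumption, since the averaging scheme is not stated here.

```python
import random


def train_test_split(rows, train_ratio=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def macro_scores(y_true, y_pred):
    """Per-class precision, recall and F1, averaged with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


train, test = train_test_split(range(10))
# Made-up labels standing in for AQI classes.
precision, recall, f1 = macro_scores(
    ["good", "good", "bad", "bad"],
    ["good", "bad", "bad", "bad"],
)
```

Fixing the seed keeps the split reproducible, so different models are compared on the same testing rows.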

While coding, we asked ourselves the following questions:
- Given that our data is city-oriented and time-bound, would models trained separately on each city or each month work better than a model trained on all the data (Specified Model vs. General Model)?
- Which model ([KNN], [Decision Tree], [Random Forest]) would perform better on Precision, Recall and F1-Score?
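The Specified Model vs. General Model comparison amounts to evaluating one model per city or per month and averaging the scores. A minimal sketch in Python (not the project's R code) follows; `evaluate` is a hypothetical callback that would train and score a model on one subset, and the rows are made up.

```python
from collections import defaultdict


def group_by(rows, key):
    """Split the rows into one list per distinct value of the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def average_score(rows, key, evaluate):
    """Run `evaluate` on each group separately and average the resulting scores."""
    groups = group_by(rows, key)
    scores = [evaluate(subset) for subset in groups.values()]
    return sum(scores) / len(scores)


# Made-up rows; a real `evaluate` would fit a model and return e.g. its F1-Score.
rows = [
    {"City": "A", "Month": 1},
    {"City": "A", "Month": 2},
    {"City": "B", "Month": 1},
]
by_city = group_by(rows, "City")
dummy_average = average_score(rows, "City", evaluate=len)  # len stands in for a real scorer
```

The same helper covers the [By Months] dimension by passing "Month" as the key, while the General Model corresponds to calling `evaluate` once on all rows.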

Results

With the different evaluation metrics (Precision, Recall, F1-Score) and the three-dimension analysis ([By All-data], [By Cities], [By Months]), we may conclude:
- Although the General Model trained on all data outperformed almost 90% of the other models trained on dimension-specific data, the original data set has a seriously unbalanced distribution, and the imbalance persists after splitting along the three dimensions. Further statistical tests may therefore be needed to clarify this result.
- The challenging parts of our project lay in: [Poor Feature Diversity], [Unbalanced Data], [Open Data Integration], [Background Knowledge Limitation].
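A quick way to see the kind of imbalance mentioned above is to look at the share of each class among the labels. The sketch below is in Python (not the project's R code), and the labels are made up for illustration; they are not the project's data.

```python
from collections import Counter


def class_shares(labels):
    """Fraction of the records that falls into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels: nine records of one class and one of another.
shares = class_shares(["good"] * 9 + ["bad"])
```

When one class dominates like this, a model can score well by always predicting it, which is one reason a high score on unbalanced data needs further checking.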

References

Car&Motorbike Statistic in Taiwan: [https://stat.thb.gov.tw/hb01/webMain.aspx?sys=100&funid=11100]
Garbage Statistic in Taiwan: [https://erdb.epa.gov.tw/DataRepository/Statistics/TrashClearExecutiveProduce.aspx?topic1=%E5%9C%B0&topic2=%E6%B1%A1%E6%9F%93%E9%98%B2%E6%B2%BB&subject=%E5%BB%A2%E6%A3%84%E7%89%A9]
Air-Pollution Penalty Cases Statistics in Taiwan: [https://erdb.epa.gov.tw/DataRepository/Statistics/StatEmsEemFineCnt.aspx?topic1=%E5%85%B6%E4%BB%96&topic2=%E7%92%B0%E4%BF%9D%E7%B5%B1%E8%A8%88&subject=%E6%B1%A1%E6%9F%93%E7%AE%A1%E5%88%B6]
Air-Pollution Auditory Cases Statistics in Taiwan: [https://erdb.epa.gov.tw/DataRepository/Statistics/StatEmsEemCnt.aspx?topic1=%u5176%u4ed6&topic2=%u74b0%u4fdd%u7d71%u8a08&subject=%u6c61%u67d3%u7ba1%u5236]
AQI Formula: [https://taqm.epa.gov.tw/taqm/tw/b0203.aspx]
AQI Statistic in Taiwan: [https://erdb.epa.gov.tw/DataRepository/EnvMonitor/AirQualityMonitorMonData.aspx?topic1=%u5927%u6c23&topic2=%u74b0%u5883%u53ca%u751f%u614b%u76e3%u6e2c&subject=%u7a7a%u6c23%u54c1%u8cea]
