1071-DataScience/finalproject-RickyLeeeee

< Dat Air >

Groups

  • < 李右元, 107753027 >
  • < 楊晴焱, 107155018 >

Goal

AQI (Air Quality Index) is a measure of air safety and cleanliness. It is derived from a numerical formula defined by the EPA (Environmental Protection Agency), which computes a weighted value from the concentration of each of several gases and selects the maximum. Since the index depends on the quantity of these gases, and the gases come from things in our daily lives, we ask: can we use the quantity of the things that emit these gases to predict the AQI?
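As a rough illustration of the "weighted value per gas, then take the maximum" idea, the sketch below uses piecewise-linear interpolation between breakpoints, a common way to compute such sub-indices. The breakpoint numbers and the function names `sub_index` and `aqi_from_sub_indices` are hypothetical and are not taken from the EPA tables or from "AQICoversion.R".

```r
# Hypothetical breakpoint table for ONE pollutant (illustrative numbers only,
# not official EPA values): concentration range -> index range.
breakpoints <- data.frame(
  c_lo = c(0, 50, 100),
  c_hi = c(50, 100, 200),
  i_lo = c(0, 50, 100),
  i_hi = c(50, 100, 150)
)

# Linear interpolation of a concentration inside its breakpoint interval.
# Returns NA when the concentration falls outside the table.
sub_index <- function(conc, bp) {
  row <- which(conc >= bp$c_lo & conc <= bp$c_hi)[1]
  if (is.na(row)) return(NA_real_)
  with(bp[row, ], (i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
}

# The overall index is the largest sub-index among the pollutants.
aqi_from_sub_indices <- function(sub_indices) max(sub_indices, na.rm = TRUE)
```

With real data, each pollutant would need its own breakpoint table before the maximum is taken.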

Demo

  • Code execution
    We recommend downloading the entire master branch as a zip file and decompressing it. Then change into the root of the extracted folder (the one named after the master branch) and run the following command in your terminal to execute "final.R":
Rscript code/final.R
  • Data Visualization
    The code writes the models' performance directly to PNG images, grouped by analysis dimension, so the results can be reviewed for research purposes without internet access.

    The output images are saved in the "results" folder, with the following directory layout:
    results/[The Dimension Folder Name]

    For details about the analysis dimensions mentioned above, see the "Results" section below.
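A minimal sketch of how a score plot could be written into that layout with base R graphics. The helper name `save_score_plot`, the dimension folder name "ByAllData", the file name, and the scores are all hypothetical; the actual names used by "final.R" may differ.

```r
# Hypothetical helper: write one bar chart of scores into <root>/<dimension>/.
save_score_plot <- function(scores, dimension, file_name, root = "results") {
  out_dir <- file.path(root, dimension)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  out_file <- file.path(out_dir, file_name)
  png(out_file, width = 800, height = 600)
  on.exit(dev.off())  # close the PNG device even if plotting fails
  barplot(scores, ylim = c(0, 1), ylab = "Score", main = dimension)
  invisible(out_file)
}

# Made-up scores, written to a temporary folder instead of the real "results".
scores <- c(Precision = 0.80, Recall = 0.70, F1 = 0.75)
path <- save_score_plot(scores, "ByAllData", "knn.png", root = tempdir())
```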

Folder organization and its related information

Docs

  • Presentation slides demonstrated on Jan 15, 2019

Data

  • Source

    We collected the monthly AQI results for all cities in Taiwan from 2005 to 2017, together with features related to AQI contributors: the monthly numbers of motorbikes and cars, the garbage and waste generated, and the air pollution penalty and audit cases. The data sources are listed in the "References" section below. The collected data set contains 3432 records in total.

  • Input format

    Because we collected many features and AQI results, combining them into one CSV file for modeling took considerable time: each Open Data set comes from a different bureau or authority and has its own format and layout conventions. The final CSV (English version) contains the following columns, in this order:

    • Label Columns:
      [Year],[Month],[City],[AQI]

    • Feature Columns [Traffic]:
      [Car],[Bike]

    • Feature Columns [Waste]:
      [TotalGarbageT],[GeneralGarbageT],[HugeGarbageT],[RecycleGarbageT],[KitchenWasteT],[WastePerPersonKG]

    • Feature Columns [Penalty]:
      [PenaltyConstruction],[PenaltyPollution],[PenaltyMobilePollution]

    • Feature Columns [Audit]:
      [ExamConstruction],[ExamPollution],[ExamMobliePollution]

    ** Hint: AQI was converted with the R script "AQICoversion.R" in the "code" folder. Note that our code uses "replaceChinese.csv" instead of "AllFeatures+Labelv4.csv", because the non-English text in "AllFeatures+Labelv4.csv" becomes garbled after the file is re-downloaded from GitHub. We kept both CSV files for comparison. To reproduce this experiment, swap the data files in the "data" folder under the project folder named after the master branch.
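Before modeling, it can help to confirm that a loaded file matches the column order listed above. The sketch below assumes the CSV header uses these names without the square brackets and that the file sits at "data/replaceChinese.csv"; both are assumptions, and `has_expected_layout` is a hypothetical helper.

```r
# Column names in the order listed above (header spelling is an assumption).
expected_cols <- c(
  "Year", "Month", "City", "AQI",
  "Car", "Bike",
  "TotalGarbageT", "GeneralGarbageT", "HugeGarbageT",
  "RecycleGarbageT", "KitchenWasteT", "WastePerPersonKG",
  "PenaltyConstruction", "PenaltyPollution", "PenaltyMobilePollution",
  "ExamConstruction", "ExamPollution", "ExamMobliePollution"
)

# TRUE when the data frame has exactly these columns in this order.
has_expected_layout <- function(df, expected = expected_cols) {
  identical(names(df), expected)
}

# Typical use (the path is an assumption based on the folder names above):
# aqi_data <- read.csv("data/replaceChinese.csv", stringsAsFactors = FALSE)
# stopifnot(has_expected_layout(aqi_data))
```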

  • Data preprocessing

    • Handle missing data
      Fortunately, only about 10% of the values are missing, all of them in the two traffic features ([Car] and [Bike]), so we simply remove the affected records from the data set. However, the missing values are city-oriented and time-bound: the affected cities are missing about half of their figures, over the consecutive interval 2005-2010, so predictions for these cities may be strongly biased.

    • Scale value
      Because the features have different units and their values vary massively in scale, we standardize the values, which brings them to the same level and removes the unit differences.
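The two preprocessing steps above can be sketched in base R as follows. The toy values are made up, and standardizing to mean 0 and standard deviation 1 (z-scores) is an assumption about what "value-standardization" means here.

```r
# Toy data with the [Car] and [Bike] columns plus one waste column
# (made-up values, not the project data).
toy <- data.frame(
  City = c("A", "A", "B", "B", "C"),
  Car  = c(100, 120, NA, 90, 110),
  Bike = c(300, 310, NA, NA, 280),
  TotalGarbageT = c(5000, 5200, 4100, 4300, 6000)
)

# 1) Drop records with a missing value in either traffic feature.
kept <- toy[complete.cases(toy[, c("Car", "Bike")]), ]

# 2) Standardize each numeric feature to mean 0 and standard deviation 1.
feature_cols <- c("Car", "Bike", "TotalGarbageT")
scaled <- kept
scaled[feature_cols] <- lapply(
  kept[feature_cols],
  function(x) as.numeric(scale(x))
)
```

In this toy example city "B" disappears entirely after step 1, which mirrors the bias concern described above.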

Code

  • We used three models within our capabilities, [KNN], [Decision Tree], and [Random Forest], so that we could compare and optimize their performance. Since our data is city-oriented and time-bound, we also examined whether city and month have a strong effect on prediction by comparing the average performance across three dimensions: [By All-data], [By Cities], [By Months]. We applied cross-validation in all three dimensions (data split ratio: 70% training, 30% testing) and scored the testing results by Precision, Recall, and F1-Score individually.

    While coding, we asked ourselves the following questions:

    • Given that our data is city-oriented and time-bound, would models trained separately under each of these two sub-conditions work better than a model trained on all data (Specified Model vs. General Model)?

    • Which model ([KNN], [Decision Tree], [Random Forest]) would perform better on Precision, Recall, and F1-Score?
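The evaluation setup described above can be sketched in base R: a random 70%/30% split of the rows, followed by per-class Precision, Recall, and F1-Score computed from true and predicted labels. The helper names `split_indices` and `class_scores` are hypothetical, the sketch assumes AQI is treated as a class label, and it leaves out the model fitting itself.

```r
# Random 70% / 30% split of row indices (seed fixed only for repeatability).
split_indices <- function(n, train_ratio = 0.7, seed = 1) {
  set.seed(seed)
  train <- sample(n, size = floor(train_ratio * n))
  list(train = train, test = setdiff(seq_len(n), train))
}

# Per-class precision, recall, and F1 from true and predicted labels.
class_scores <- function(truth, pred) {
  classes <- sort(unique(c(as.character(truth), as.character(pred))))
  truth <- factor(truth, levels = classes)
  pred  <- factor(pred, levels = classes)
  cm <- table(truth, pred)        # rows: truth, columns: prediction
  tp <- diag(cm)
  precision <- tp / colSums(cm)   # NaN when a class is never predicted
  recall    <- tp / rowSums(cm)
  f1 <- 2 * precision * recall / (precision + recall)
  data.frame(class = classes, precision, recall, f1, row.names = NULL)
}

# Usage with made-up labels:
idx <- split_indices(10)
res <- class_scores(
  truth = c("good", "good", "bad", "bad"),
  pred  = c("good", "bad", "bad", "bad")
)
```

Averaging the per-class rows of `res` gives one macro score per metric, which is one possible way to compare models across the three dimensions.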

Results

  • With different performance evaluation methods (Precision, Recall, F1-Score) and a three-dimension analysis ([By All-data], [By Cities], [By Months]), we may conclude:

    • Although the General Model trained on all data outperformed almost 90% of the models trained on data split by the three dimensions, the original data set has a seriously imbalanced distribution, and the imbalance persists within the per-dimension splits as well. We may therefore need further statistical tests to clarify this result.

    • The challenging parts of our project lay in: [Poor Feature Diversity], [Imbalanced Data], [Open Data Integration], [Background Knowledge Limitation].
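A quick way to inspect the imbalance mentioned above is to tabulate the share of each AQI class, overall and within each city or month. This is a minimal base R sketch with made-up labels; `class_share` is a hypothetical helper, not part of "final.R".

```r
# Share of each class, overall or within each level of a grouping column.
class_share <- function(labels, group = NULL) {
  if (is.null(group)) return(prop.table(table(labels)))
  prop.table(table(group, labels), margin = 1)  # each row sums to 1
}

# Made-up example: three "good" records and one "bad" record in two cities.
labels <- c("good", "good", "good", "bad")
city   <- c("A", "A", "B", "B")
overall <- class_share(labels)
by_city <- class_share(labels, city)
```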

References

About

finalproject-RickyLeeeee created by GitHub Classroom
