Data Quality Control

appatton edited this page Mar 24, 2014 · 4 revisions

Flag A: Zip Code (veh_zip) in Registration and Estimates (rae_public) table not in Massachusetts.

Amount of data affected: 3.2%

Recommended short-term solution: Set error value of veh_zip = "99999".

Recommended long-term solution: Check these files against the initial records to see if they can be corrected in the database.

Other notes: I used the zip code dataset at http://www.unitedstateszipcodes.org/ma/ to determine whether zip codes are in Massachusetts. The original file uses veh_zip = "00000" for suppressed data. Records with out-of-state zip codes are also more likely to have unreliable mileage estimates (see Flag B), so scrutinize these records when running QC on other fields. Many of these values look like transpositions of MA zip codes, e.g. 06250 might be 02650, and 01234 might be 02134.
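The short-term fix and the transposition check could be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the function names, the `ma_zips` set (assumed to be loaded from the zip code dataset above), and the zero-padding of short values are all assumptions.

```python
SUPPRESSED_ZIP = "00000"  # suppressed-data value used by the original file
ERROR_ZIP = "99999"       # recommended short-term error value


def normalize_zip(veh_zip):
    # Assumption: values may have lost leading zeros, so pad to 5 characters.
    return str(veh_zip).strip().zfill(5)


def clean_veh_zip(veh_zip, ma_zips):
    """Return the zip unchanged if suppressed or in MA, else the error value."""
    z = normalize_zip(veh_zip)
    if z == SUPPRESSED_ZIP or z in ma_zips:
        return z
    return ERROR_ZIP


def transposition_candidates(veh_zip, ma_zips):
    """List MA zips reachable by swapping one pair of adjacent digits."""
    z = normalize_zip(veh_zip)
    candidates = []
    for i in range(len(z) - 1):
        if z[i] != z[i + 1]:
            swapped = z[:i] + z[i + 1] + z[i] + z[i + 2:]
            if swapped in ma_zips:
                candidates.append(swapped)
    return candidates
```

A candidate returned by `transposition_candidates` is only a hint for the long-term check against the initial records; it should not be applied as an automatic correction.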

Flag B: Unreasonable value of miles per day (mi_per_day) in Registration and Estimates (rae_public) table.

Amount of data affected: depends on cut-off point, but probably <1%. [need to insert plots and stats here later].

Recommended short-term solution: For a conservative approach, use a cut-point of 60 mph × 12 hr = 720 miles/day. For a more realistic approach, use 60 mph × 4 hr = 240 miles/day. The cut-off of 200 miles/day that was used in the aggregation in the grid layer is also reasonable, although it will exclude vehicles that are used to commute exceptionally long distances.
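To compare how much data each cut-point removes, something like the sketch below may help. The three cut-points come from the text above; the names `CUTOFFS`, `exceeds_cutoff`, and `share_flagged` are hypothetical, and treating only values strictly above the cut-point as unreasonable is an assumption.

```python
# Cut-points in miles/day, taken from the recommendation above.
CUTOFFS = {
    "conservative": 60 * 12,  # 60 mph for 12 hours = 720
    "realistic": 60 * 4,      # 60 mph for 4 hours = 240
    "grid_layer": 200,        # cut-off used in the grid-layer aggregation
}


def exceeds_cutoff(mi_per_day, cutoff):
    """True if a mi_per_day value is above the chosen cut-point."""
    return mi_per_day > cutoff


def share_flagged(values, cutoff):
    """Fraction of mi_per_day values above the cut-point (0.0 if empty)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(exceeds_cutoff(v, cutoff) for v in values) / len(values)
```

Calling `share_flagged` once per entry in `CUTOFFS` on the real `mi_per_day` column would give the "amount of data affected" figure for each option.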

Recommended long-term solution: Cross-reference high-mileage vehicles (>200 miles/day) with written records, if available.

Other notes: I have tried various cut-points based on quantiles and reasonableness. There was no association between owner type (non-commercial or commercial) and mi_per_day. Records with zip codes outside Massachusetts were also likely to have unreasonable mileage (see Flag A).
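One way to check the overlap between the two flags is to compare the rate of unreasonable mileage among in-state and out-of-state records. The sketch below is an assumption about how that could be done, not the analysis behind the note above: the function name, the `(veh_zip, mi_per_day)` record layout, the default cut-point of 240, and the choice to skip suppressed records are all illustrative.

```python
def flag_b_rate_by_flag_a(records, ma_zips, cutoff=240):
    """Rate of mi_per_day > cutoff, split by whether veh_zip is outside MA.

    records: iterable of (veh_zip, mi_per_day) pairs (assumed layout).
    Returns {True: rate for non-MA zips, False: rate for MA zips};
    a rate is None when its group has no records.
    """
    counts = {True: [0, 0], False: [0, 0]}  # flag_a -> [n_over_cutoff, n_total]
    for veh_zip, mi_per_day in records:
        if veh_zip == "00000":
            continue  # suppressed zip: cannot be classified, so skipped here
        flag_a = veh_zip not in ma_zips
        counts[flag_a][1] += 1
        if mi_per_day > cutoff:
            counts[flag_a][0] += 1
    return {
        flag_a: (over / total if total else None)
        for flag_a, (over, total) in counts.items()
    }
```

A noticeably higher rate under the `True` key than under `False` would be consistent with the note that out-of-state zip codes tend to coincide with unreasonable mileage.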