Update format.R #220
There are a few variables that the format_stats19() function doesn't uncode correctly. These are variables that contain both categorical and non-categorical values. At present, the formatting process generates NAs for the following variables: Casualty dataset: "age_of_casualty"; Vehicle dataset: "age_of_driver", "engine_capacity_cc" and "generic_make_model"; Accident dataset: "age_of_driver", "engine_capacity_cc" and "generic_make_model". I've proposed an amendment to the code that excludes these variables from being uncoded alongside the wider set of variables, so that NAs are not incorrectly introduced, and then hard-codes the uncoding of the excluded variables.
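To illustrate how NAs can arise for a mixed variable, here is a minimal sketch in base R. It assumes the uncoding is a plain code-to-label lookup; the `lookup` data frame and the sample values are hypothetical and are not taken from the package or the real schema file.

```r
# Hypothetical schema entries for age_of_driver: only the special code -1 has a label
lookup <- data.frame(
  code = -1,
  label = "Data missing or out of range",
  stringsAsFactors = FALSE
)

# Illustrative raw values: two real ages and one special code
age_of_driver <- c(34, -1, 52)

# Naive uncoding: values with no schema entry find no match and become NA
naive <- lookup$label[match(age_of_driver, lookup$code)]
print(naive)
```

Under that assumption, the two genuine ages have no entry in the lookup, so they are replaced by NA while only the -1 code receives a label.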
Thanks for the PR! Please can you provide a bit more info, e.g. the output before the change and after?
No problem! Apologies, I'm new to using GitHub. I've attached the heads of the outputs from the original function and from the amended version of the function that I've proposed. Do let me know if you want anything else. Also (I don't think this affects the suggested change), I wasn't able to use the dl_stats19() or read_vehicles() functions successfully (I'm not sure why yet, so I will look into that further), so I instead downloaded the relevant CSV files outside of R, read them in using read.csv(), and then applied the formatting function. accidents_2021_formatted_original_head.csv
Thank you for the contribution here. I have not had a proper look, but I have been teaching using stats19 so I know a little more about the data. Is the issue that NAs appear in what should be a character column, whereas NA should only be used for "missing" data? My thinking is that cases like this are best handled via the underlying schema files, as hard-coding values such as -1, 99 etc. for individual variables would be hard to manage. The DfT might change those codes. So, my suggestion/question is: is there a way to generalise this to all variables?
Just to be clear, which values in the original .csv files generate incorrectly introduced NAs? -99 is a value that is correctly converted to NA I believe.
In the case of the variables mentioned, they have the code -1 for "Unknown" or "Data missing or out of range" (and the "first_road_number" and "second_road_number" variables have the additional code 0 for "first_road_class is C or Unclassified. These roads do not have official numbers so recorded as zero"), but otherwise there are no -99 codes for those variables. I appreciate that coding the uncoding instances individually is not practical, and I agree that it would be better to handle this using the schema, without having to hard-code the uncoding of individual variables. In the schema file, the variables for which NAs are being introduced inappropriately share a pattern: they have a case for the uncoding of -1 (and 0 in the case of "first_road_number" and "second_road_number"), but no cases for the uncoding of other values (see image for "age_of_driver" example). I think there are a couple of options.
for(i in vars_to_change) {
I hope that has helped to clear up some of the queries. Let me know if you have any further queries or thoughts on the options I've outlined.
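One way the schema-driven idea discussed above might be generalised is to replace only the values that have a schema entry and keep every other value as it is. The sketch below is an assumption for illustration, not the code proposed in this PR: `uncode_partial()` and its arguments are hypothetical names, and the codes and labels are passed in directly rather than read from the real schema.

```r
# Hypothetical helper: relabel only values that have a schema entry,
# leaving unmatched values (e.g. real ages) untouched as text
uncode_partial <- function(x, codes, labels) {
  idx <- match(x, codes)          # position of each value in the schema codes
  out <- as.character(x)          # start from the raw values
  matched <- !is.na(idx)          # TRUE where a schema label exists
  out[matched] <- labels[idx[matched]]
  out
}

result <- uncode_partial(
  c(34, -1, 52),
  codes = -1,
  labels = "Data missing or out of range"
)
print(result)
```

A side effect of this design is that the column becomes character, because labels and numbers are mixed in one vector; whether that is acceptable depends on how the column is used downstream.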
This certainly looks good to me. Will aim to take a proper look next week and test with a reproducible example to demonstrate the fix. Happy for anyone else to check in the meantime. Many thanks for your contribution @R-M-J-P, this is greatly appreciated.
#245 is heading towards
FYI @Robinlovelace and @R-M-J-P
@R-M-J-P and @Robinlovelace #245 addresses this, but not in the way described above. It now uses NA to code -1 etc. when the column type is integer, in the case of the age_of_* columns. So this PR still stands, but it should be considered in the context of #245.
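The behaviour described for the age_of_* columns could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from #245: `recode_missing_age()` is a hypothetical name, and treating only -1 as the missing code is an assumption.

```r
# Hypothetical helper: in integer age_of_* columns, turn the special code into NA
recode_missing_age <- function(df, missing_code = -1L) {
  age_cols <- grep("^age_of_", names(df), value = TRUE)
  for (col in age_cols) {
    if (is.integer(df[[col]])) {
      # which() skips existing NAs, so they are left unchanged
      hits <- which(df[[col]] == missing_code)
      df[[col]][hits] <- NA_integer_
    }
  }
  df
}

raw <- data.frame(
  age_of_driver = c(34L, -1L),
  engine_capacity_cc = c(-1L, 1600L)
)
cleaned <- recode_missing_age(raw)
print(cleaned)
```

Columns outside the age_of_* pattern, such as engine_capacity_cc here, are left untouched, which keeps the integer type for ages while still marking unknown values as missing.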
👍 to that Layik. Will finally take a look with a view to pushing the latest version to CRAN.