Update format.R #220

Open
wants to merge 1 commit into
base: master

@R-M-J-P R-M-J-P commented Jan 24, 2023

There are a few variables that the format_stats19() function doesn't uncode correctly. These are variables that contain both categorical and non-categorical values. At present, the formatting process generates NAs for the following variables: Casualty dataset: "age_of_casualty"; Vehicle dataset: "age_of_driver", "engine_capacity_cc" and "generic_make_model"; Accident dataset: "age_of_driver", "engine_capacity_cc" and "generic_make_model". I've proposed an amendment to the code that excludes these variables from being uncoded alongside the wider set of variables, to avoid NAs being incorrectly introduced, and then hard-codes the uncoding of those excluded variables.

Member

@Robinlovelace Robinlovelace left a comment

Thanks for the PR! Please can you provide a bit more info, e.g. the output before the change and after?

@R-M-J-P
Author

R-M-J-P commented Jan 25, 2023

No problem! Apologies, I'm new to using GitHub. I've attached the heads of the outputs from the original function and from the amended version of the function that I've proposed.
As mentioned initially, the differences in the outputs relate to the following variables: Casualty dataset: "age_of_casualty"; Vehicle dataset: "age_of_driver", "engine_capacity_cc" and "generic_make_model"; Accident dataset: "age_of_driver", "engine_capacity_cc" and "generic_make_model".

Do let me know if you want anything else.

Also (I don't think this affects the suggested change), I wasn't able to use the dl_stats19() or read_vehicles() functions successfully. I'm not sure why yet, so I will look into that further. Instead, I downloaded the relevant csv files outside of R, read them in using read.csv(), and then applied the formatting function.

accidents_2021_formatted_original_head.csv
accidents_2021_formatted_amended_head.csv
casualties_2021_formatted_original_head.csv
casualties_2021_formatted_amended_head.csv
vehicles_2021_formatted_original_head.csv
vehicles_2021_formatted_amended_head.csv

@layik
Member

layik commented Jan 25, 2023

Thank you for the contribution here. I have not had a proper look, but I have been teaching using stats19, so I know a little more about the data.

Is the issue that NAs appear in what should be a character column, when NA should only be used for "missing" data?

I am just thinking that things like this would be best handled through the underlying schema files, as adding values in the code for variables with -1, 99 etc. would be hard to manage, and the DfT might change those. So my suggestion/question is: is there a way to generalise this to all variables?

@Robinlovelace
Member

to avoid NAs being incorrectly introduced

Just to be clear, which values in the original .csv files generate incorrectly introduced NAs? I believe -99 is a value that is correctly converted to NA.

@R-M-J-P
Author

R-M-J-P commented Jan 27, 2023

The variables mentioned use the code -1 for "Unknown" or "Data missing or out of range" (and the "first_road_number" and "second_road_number" variables have the additional code 0 for "first_road_class is C or Unclassified. These roads do not have official numbers so recorded as zero"). There are no -99 codes for those variables.

I appreciate that coding each uncoding case individually is not practical, and agree that it would be better to handle this through the schema, without hard-coding the uncoding of individual variables.

In the schema file, the variables for which NAs are being introduced inappropriately share a pattern: they have an entry for uncoding -1 (and 0 in the case of "first_road_number" and "second_road_number"), but no entries for uncoding other values (see image for "age_of_driver" example).
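The pattern described above can be reproduced with a small sketch. It uses a toy lookup table rather than the real stats19::stats19_schema; the column names `code` and `label` follow the ones used in format_stats19(), and the data values are made up for illustration.

```r
# Toy stand-in for the schema rows of "age_of_driver": only -1 has a label.
# These values are illustrative, not copied from the real schema.
lookup = data.frame(
  code = -1,
  label = "Data missing or out of range",
  stringsAsFactors = FALSE
)

# Raw column mixing a coded value (-1) with genuine ages.
age_of_driver = c(34, -1, 57)

# Match-based lookup: any value without a schema entry comes back as NA,
# so the genuine ages 34 and 57 are lost.
formatted = lookup$label[match(age_of_driver, lookup$code)]
print(formatted)
```

Only the -1 entry finds a label; every other value is unmatched, which is how the NAs arise.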

I think there are a couple of options.

  1. Remove those variables from the schema, so that the -1s remain as -1s (and 0s as 0s), but the other values within those variables are preserved. This would essentially be the same as what happens with the 'lsoa_of_driver' variable in the 'vehicle' dataset which, at present, retains the -1 values after the formatting process.

  2. Amend the uncoding process in the script, so that in cases where there are inappropriately introduced NAs, the value is replaced with the original coded value. I haven't developed code for that process yet, but I know that it would need to be implemented within the following existing block of code in the 'format_stats19' function:

for(i in vars_to_change) {
  lkp_name = lkp$column_name[lkp$column_name == new_names[i]]
  lookup = stats19::stats19_schema[
    stats19::stats19_schema$variable_formatted == lkp_name,
    c("code", "label")
  ]
  x[[i]] = lookup$label[match(x[[i]], lookup$code)]
}
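
One possible shape for option 2 is sketched here, purely as an illustration. The helper name `uncode_keep_unmatched` is hypothetical and not part of the stats19 package, and the lookup table is a toy stand-in for the schema rows of a single variable.

```r
# Hypothetical helper (not part of the stats19 package): look up labels,
# but keep the original value wherever the schema has no matching code.
uncode_keep_unmatched = function(values, lookup) {
  labels = lookup$label[match(values, lookup$code)]
  # Unmatched means: no label was found, yet the raw value was not missing.
  unmatched = is.na(labels) & !is.na(values)
  # Labels and raw values are mixed, so the result is a character vector.
  labels[unmatched] = as.character(values[unmatched])
  labels
}

# Toy lookup: only -1 has a label (illustrative values).
lookup = data.frame(
  code = -1,
  label = "Data missing or out of range",
  stringsAsFactors = FALSE
)

res = uncode_keep_unmatched(c(34, -1, NA, 57), lookup)
print(res)
```

A trade-off of this approach is that a numeric column such as "age_of_driver" becomes character once labels and raw values share a vector, so downstream code would need to convert it back before doing arithmetic.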

I hope that has helped to clear up some of the queries. Let me know if you have any further queries or thoughts on the options I've outlined.

image

@Robinlovelace
Member

This certainly looks good to me. Will aim to take a proper look next week and test with a reproducible example to demonstrate the fix. Happy for anyone else to check in the meantime. Many thanks for your contribution @R-M-J-P, this is greatly appreciated.

This was referenced Jul 27, 2024
@layik
Member

layik commented Jul 28, 2024

#245 is heading towards option 2 from the comment above:

  2. Amend the uncoding process in the script, so that in cases where there are inappropriately introduced NAs, the value is replaced with the original coded value. I haven't developed code for that process yet, but I know that it would need to be implemented within the following existing block of code in the 'format_stats19' function:

FYI @Robinlovelace and @R-M-J-P

@layik
Member

layik commented Jul 29, 2024

@R-M-J-P and @Robinlovelace #245 addresses this, but not as described above. It now uses NA to code the -1 etc. when the column type is integer, in the case of the age_of_* columns. So this PR still stands but should consider #245 in this context.
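
A rough sketch of that idea, recoding -1 to NA in integer age_of_* columns, might look like the following. This is not the code from #245; the helper name `recode_age_missing` and its `missing_codes` argument are hypothetical, and the data frame is made up.

```r
# Hypothetical helper illustrating the idea only; not the code from #245.
recode_age_missing = function(x, missing_codes = -1L) {
  # Only columns whose names start with "age_of_" are considered.
  age_cols = grep("^age_of_", names(x), value = TRUE)
  for (col in age_cols) {
    # Leave non-integer columns untouched.
    if (is.integer(x[[col]])) {
      x[[col]][x[[col]] %in% missing_codes] = NA_integer_
    }
  }
  x
}

# Made-up example data.
vehicles = data.frame(
  age_of_driver = c(34L, -1L, 57L),
  engine_capacity_cc = c(1598L, -1L, 1998L)
)

out = recode_age_missing(vehicles)
print(out)
```

In this sketch the column stays integer, and columns outside the age_of_* pattern, such as "engine_capacity_cc", keep their -1 values.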

@Robinlovelace
Member

👍 to that Layik. Will finally take a look with a view to pushing the latest version to CRAN.
