Update format.R #220

R-M-J-P · 2023-01-24T16:09:03Z

The format_stats19() function doesn't uncode a few variables correctly. These variables contain both categorical codes and non-categorical values. At present, the formatting process generates NAs for the following variables: Casualty dataset- "age_of_casualty"; Vehicle dataset- "age_of_driver", "engine_capacity_cc" and "generic_make_model"; Accident dataset- "age_of_driver", "engine_capacity_cc" and "generic_make_model". I've proposed an amendment to the code that excludes these variables from the general uncoding step, so that NAs are not incorrectly introduced, and then hard-codes the uncoding of those excluded variables.

Robinlovelace

Thanks for the PR! Please can you provide a bit more info, e.g. the output before the change and after?

R-M-J-P · 2023-01-25T09:18:40Z

No problem! Apologies, I'm new to using GitHub. I've attached the heads of the outputs from the original function and from the amended version of the function that I've proposed.
As mentioned initially, the differences in the outputs relate to the following variables: Casualty dataset- "age_of_casualty"; Vehicle dataset- "age_of_driver", "engine_capacity_cc" and "generic_make_model"; Accident dataset- "age_of_driver", "engine_capacity_cc" and "generic_make_model".

Do let me know if you want anything else.

Also (I don't think this affects the suggested change), I wasn't able to use the dl_stats19() or read_vehicles() functions successfully (not sure why yet, so I will look into that further). Instead, I downloaded the relevant csv files outside of R, read them in using read.csv(), and then applied the formatting function.

accidents_2021_formatted_original_head.csv
accidents_2021_formatted_amended_head.csv
casualties_2021_formatted_original_head.csv
casualties_2021_formatted_amended_head.csv
vehicles_2021_formatted_original_head.csv
vehicles_2021_formatted_amended_head.csv

layik · 2023-01-25T10:02:21Z

Thank you for the contribution here. I have not had a proper look but have been teaching using stats19 so know a little more about the data.

Is the issue that NAs appear in what should be a character column, when NA should only be used for "missing" data?

I am just thinking things like this would be best to work with the underlying schema files as adding values in the code for variables with -1, 99 etc would be hard to manage. The DfT might change those etc. So, my suggestion/question is: is there a way to generalise this to all variables?
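One possible way to generalise from the schema, as a rough sketch rather than the package's method: flag every variable whose schema entries cover only sentinel codes. The toy `schema` below is made up for illustration (its column names follow `stats19::stats19_schema` as used in format_stats19()), and the rule "no positive code is defined" is an assumption about what marks a partially coded variable.

```r
# Toy schema: made-up rows, with column names as in stats19::stats19_schema.
schema = data.frame(
  variable_formatted = c("age_of_driver", "first_road_number",
                         "first_road_number", "sex_of_driver", "sex_of_driver"),
  code = c(-1, -1, 0, 1, 2),
  label = c("Data missing or out of range", "Unknown",
            "Road has no official number", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Group the codes by variable.
codes_by_var = split(schema$code, schema$variable_formatted)

# Assumed rule: a variable is "partially coded" if the schema defines
# no positive code for it, i.e. only sentinel values such as -1 or 0.
only_sentinels = vapply(codes_by_var, function(codes) all(codes <= 0), logical(1))
partially_coded = names(codes_by_var)[only_sentinels]
```

Variables flagged this way could then be skipped or handled separately during uncoding, without naming them individually in the code.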

Robinlovelace · 2023-01-26T06:59:44Z

> to avoid NAs being incorrectly introduced

Just to be clear, which values in the original .csv files generate incorrectly introduced NAs? -99 is a value that is correctly converted to NA I believe.

R-M-J-P · 2023-01-27T09:45:59Z

For the variables mentioned, the code -1 means "Unknown" or "Data missing or out of range" (and the "first_road_number" and "second_road_number" variables have the additional code 0 for "first_road_class is C or Unclassified. These roads do not have official numbers so recorded as zero"), but there are no -99 codes for those variables.

I appreciate that having to code the uncoding instances individually is not practical, and agree that it would be better to handle this using the schema/without having to hard code the uncoding of individual variables.

In the schema file, the variables for which NAs are being introduced inappropriately share a pattern: they have an entry for uncoding -1 (and 0 in the case of "first_road_number" and "second_road_number"), but no entries for any other values (see image for "age_of_driver" example).
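A minimal reproduction of that pattern, assuming a lookup table that defines only the -1 code. The `lookup` values are made up for illustration; the `match()` line mirrors the one used in format_stats19().

```r
# Toy lookup mimicking a schema that defines only the -1 code.
lookup = data.frame(
  code = -1,
  label = "Data missing or out of range",
  stringsAsFactors = FALSE
)

age_of_driver = c(34, -1, 52, 17)

# match() returns NA for every value without a schema entry,
# so indexing the labels with it turns the real ages into NA.
uncoded = lookup$label[match(age_of_driver, lookup$code)]
```

Only the -1 is uncoded to its label; the ages 34, 52 and 17 all come back as NA.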

I think there are a couple of options.

1. Remove those variables from the schema, so that the -1s remain as -1s (and 0s as 0s), but the other values within those variables are preserved. This would essentially be the same as what happens with the 'lsoa_of_driver' variable in the 'vehicle' dataset which, at present, retains the -1 values after the formatting process.
2. Amend the uncoding process in the script, so that where NAs are introduced inappropriately, the value is replaced with the original coded value. I haven't developed code for that yet, but I know that it would need to be implemented within the following existing block of code in the 'format_stats19' function:

for(i in vars_to_change) {
  lkp_name = lkp$column_name[lkp$column_name == new_names[i]]
  lookup = stats19::stats19_schema[
    stats19::stats19_schema$variable_formatted == lkp_name,
    c("code", "label")
  ]
  x[[i]] = lookup$label[match(x[[i]], lookup$code)]
}
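The second option could look roughly like the sketch below. It is not the package's code: `uncode_with_fallback` is a hypothetical helper, and the toy `lookup` stands in for the schema subset built in the loop above.

```r
# Hypothetical helper: uncode x with a lookup table, but keep the original
# value wherever the lookup has no matching code.
uncode_with_fallback = function(x, lookup) {
  matched = lookup$label[match(x, lookup$code)]
  # NAs created by the lookup, as opposed to values already missing in x
  no_match = is.na(matched) & !is.na(x)
  matched[no_match] = as.character(x[no_match])
  matched
}

# Toy lookup with only the -1 code defined (values made up for illustration).
lookup = data.frame(
  code = -1,
  label = "Data missing or out of range",
  stringsAsFactors = FALSE
)

result = uncode_with_fallback(c(34, -1, 52, NA), lookup)
```

One trade-off of this approach is that the column becomes character, because labels and original values are mixed in one vector.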

I hope that has helped to clear up some of the queries. Let me know if you have any further queries or thoughts on the options I've outlined.

Robinlovelace · 2023-01-27T17:07:46Z

This certainly looks good to me. Will aim to take a proper look next week and test with a reproducible example to demonstrate the fix. Happy for anyone else to check in the meantime. Many thanks for your contribution @R-M-J-P, this is greatly appreciated.

layik · 2024-07-28T07:30:50Z

#245 is heading towards

> Amend the uncoding process in the script, so that in cases where there are inappropriately introduced NAs, the value is replaced with the original coded value. I haven't developed code to conduct that process as of yet, but I know that it would want to be implemented within the following existing block of code in the 'format_stats19' function:

FYI @Robinlovelace and @R-M-J-P

layik · 2024-07-29T08:33:37Z

@R-M-J-P and @Robinlovelace #245 addresses this but not as above. It now uses NA to code the -1 etc when column type is an integer in cases of age_of_* columns. So this PR still stands but should consider #245 in this context.
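As an illustration of the behaviour described, and not the actual code from #245, a sketch that recodes -1 to NA only in integer age_of_* columns might look like this. The data frame is made up.

```r
# Made-up data: two age columns and one non-age column that also uses -1.
x = data.frame(
  age_of_driver = c(34L, -1L, 52L),
  age_of_casualty = c(-1L, 8L, 70L),
  engine_capacity_cc = c(1598L, -1L, 998L)
)

# Select the age_of_* columns by name.
age_cols = grep("^age_of_", names(x), value = TRUE)

for (col in age_cols) {
  # Only integer columns are recoded; -1 becomes a missing value.
  if (is.integer(x[[col]])) {
    x[[col]][x[[col]] == -1L] = NA_integer_
  }
}
```

Here engine_capacity_cc keeps its -1, since only the age columns are touched.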

Robinlovelace · 2024-07-31T10:47:27Z

👍 to that Layik. Will finally take a look with a view to pushing the latest version to CRAN.

Robinlovelace reviewed Jan 24, 2023

This was referenced Jul 27, 2024:

Fix 235 #245 (Merged)
format_vehicle and format_casualty #235 (Closed)
