-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
format_vehicle and format_casualty #235
Comments
Does sound like a bug.. |
Either a bug, or the coding makes an assumption of about CSV formatting that doesn't apply in the case of the very biggest three CSVs. I'd look more closely now, but I've got a specific job I need to crack on with... |
Hi @wengraf. Thanks for raising this and sorry for lack of response. @Robinlovelace I think this might be the cause. # issue 235 checks
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
colnames(stats19_variables)
#> [1] "table" "variable" "note" "column_name" "type"
stats19_variables[grep("age_", stats19_variables$variable), ]
#> table variable note column_name type
#> 44 Casualty age_band_of_casualty <NA> age_band_of_casualty character
#> 45 Casualty age_of_casualty <NA> age_of_casualty character
#> 74 Vehicle age_band_of_driver <NA> age_band_of_driver character
#> 75 Vehicle age_of_driver <NA> age_of_driver character
#> 76 Vehicle age_of_vehicle <NA> age_of_vehicle character Created on 2024-07-18 with reprex v2.1.1 See if I can fix the issue quickly. |
Should be numeric for sure. Thanks Layik, hopefully a quick fix. |
It isn't just the one's with age...I think engine capacity is another one... |
It is not a quick one I am afraid and my time is needed elsewhere. This is actually across years. library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
cas = get_stats19(year = 2022, type = "cas")
#> Files identified: dft-road-casualty-statistics-casualty-2022.csv
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
#> Attempt downloading from: https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
#> Data saved at /tmp/RtmpFcCkYU/dft-road-casualty-statistics-casualty-2022.csv
#> Rows: 135480 Columns: 19
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): accident_index, accident_reference, lsoa_of_casualty
#> dbl (16): accident_year, vehicle_reference, casualty_reference, casualty_cla...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sapply(grep("age_", colnames(cas)), function(x) class(x))
#> [1] "integer" "integer"
###### yet!
table(cas$age_of_casualty)
#>
#> Data missing or out of range
#> 3129
###### fine here
table(cas$age_band_of_casualty)
#>
#> 0 - 5 11 - 15
#> 2211 6209
#.... Created on 2024-07-18 with reprex v2.1.1 @Robinlovelace and other contributors, I believe Proof? library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
pkg_env = as.environment("package:stats19")
unlockBinding("stats19_variables", pkg_env)
stats19_variables = get("stats19_variables", envir = pkg_env)
# just test age_of_casualty
stats19_variables[grep("age_of_casualty",
stats19_variables$variable), "type"] = "numeric"
assign("stats19_variables", stats19_variables, envir = pkg_env)
lockBinding("stats19_variables", pkg_env)
# read
cas = get_stats19(year = 2022, type = "cas")
#> Files identified: dft-road-casualty-statistics-casualty-2022.csv
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
#> Attempt downloading from: https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
#> Data saved at /tmp/Rtmp4E06yZ/dft-road-casualty-statistics-casualty-2022.csv
#> Rows: 135480 Columns: 19
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): accident_index, accident_reference, lsoa_of_casualty
#> dbl (16): accident_year, vehicle_reference, casualty_reference, casualty_cla...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sapply(grep("age_", colnames(cas)), function(x) class(x))
#> [1] "integer" "integer"
table(cas$age_of_casualty)
#>
#> Data missing or out of range
#> 3129
table(cas$age_band_of_casualty)
#>
#> 0 - 5 11 - 15
#> 2211 6209
#... Created on 2024-07-18 with reprex v2.1.1 I must leave it there. |
cc @wengraf library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
# Function to check for columns with lots of NAs
find_cols_with_many_NAs = function(df, na_threshold = 0.5) {
# Calculate the proportion of NAs in each column
na_proportions = sapply(df, function(col) sum(is.na(col)) / length(col))
# Find columns where the proportion of NAs is greater than the threshold
cols_with_many_NAs = names(na_proportions[na_proportions > na_threshold])
return(cols_with_many_NAs)
}
col = get_stats19(year = 2022, type = "col")
#> Files identified: dft-road-casualty-statistics-collision-2022.csv
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-2022.csv
cas = get_stats19(year = 2022, type = "cas")
#> Files identified: dft-road-casualty-statistics-casualty-2022.csv
#>
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
veh = get_stats19(year = 2022, type = "veh")
#> Files identified: dft-road-casualty-statistics-vehicle-2022.csv
#>
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-vehicle-2022.csv
find_cols_with_many_NAs(col)
#> [1] "local_authority_district" "first_road_number"
find_cols_with_many_NAs(cas)
#> [1] "age_of_casualty"
find_cols_with_many_NAs(veh)
#> [1] "age_of_driver" "engine_capacity_cc" "generic_make_model" Created on 2024-07-27 with reprex v2.1.1 EDIT: What a subtle bug this is. Found it I think: Line 84 in 3242e36
As it can be seen above, it has nothing to do with setting the class of the columns (the draft PR). Above is an attempt at converting say 1, 2, 3 for sexes male, female and uknown to replace all the values as such using the lookup object which works for say 99.9% of the values. In our integer values case, for instance, it fails on 99% as only the unknown are coded as -1 for instance. Because that is what |
This is the cure and now engine capacity, and for that matter every column, should not have crazy amount of NAs injected in them whilst formatting any of the three tables. |
Once #245 is in this will close, we will have much better get started vignette than current which shows almost useless tables. |
Heads-up @layik I'm still hitting the issue with
|
Correction: it IS fixed, it was just using old version of data. Reprex: incoming. |
Hi:
If I download and read in the full CSV (e.g., https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-1979-latest-published-year.csv) using read.csv, and then I use format_casualty or format_vehicle on that df, it turns age of casualty and age of vehicle to NA, and leaves only descriptions of those results that are actually NA.
Ivo
The text was updated successfully, but these errors were encountered: