format_vehicle and format_casualty #235

wengraf · 2024-04-19T12:34:00Z

Hi:

If I download the full CSV (e.g., https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-1979-latest-published-year.csv), read it in with read.csv, and then use format_casualty or format_vehicle on that df, it turns age of casualty and age of vehicle to NA. The only values that keep a description are the ones that really are missing.

Ivo

Robinlovelace · 2024-04-20T06:44:01Z

Does sound like a bug.

wengraf · 2024-04-20T07:54:33Z

Either a bug, or the code makes an assumption about CSV formatting that doesn't apply in the case of the three very biggest CSVs. I'd look more closely now, but I've got a specific job I need to crack on with...

layik · 2024-07-18T07:56:03Z

Hi @wengraf. Thanks for raising this and sorry for the lack of response. @Robinlovelace I think this might be the cause.

# issue 235 checks
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
colnames(stats19_variables)
#> [1] "table"       "variable"    "note"        "column_name" "type"
stats19_variables[grep("age_", stats19_variables$variable), ]
#>       table             variable note          column_name      type
#> 44 Casualty age_band_of_casualty <NA> age_band_of_casualty character
#> 45 Casualty      age_of_casualty <NA>      age_of_casualty character
#> 74  Vehicle   age_band_of_driver <NA>   age_band_of_driver character
#> 75  Vehicle        age_of_driver <NA>        age_of_driver character
#> 76  Vehicle       age_of_vehicle <NA>       age_of_vehicle character

Created on 2024-07-18 with reprex v2.1.1

I'll see if I can fix the issue quickly.

Robinlovelace · 2024-07-18T08:37:22Z

Should be numeric for sure. Thanks Layik, hopefully a quick fix.

wengraf · 2024-07-18T09:17:29Z

It isn't just the ones with age... I think engine capacity is another one.

layik · 2024-07-18T09:21:57Z

It is not a quick one, I am afraid, and my time is needed elsewhere. This actually happens across years.

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
cas = get_stats19(year = 2022, type = "cas")
#> Files identified: dft-road-casualty-statistics-casualty-2022.csv
#>    https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
#> Attempt downloading from: https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
#> Data saved at /tmp/RtmpFcCkYU/dft-road-casualty-statistics-casualty-2022.csv
#> Rows: 135480 Columns: 19
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (3): accident_index, accident_reference, lsoa_of_casualty
#> dbl (16): accident_year, vehicle_reference, casualty_reference, casualty_cla...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sapply(grep("age_", colnames(cas)), function(x) class(x))
#> [1] "integer" "integer"
###### yet!
table(cas$age_of_casualty)
#> 
#> Data missing or out of range 
#>                         3129
###### fine here
table(cas$age_band_of_casualty)
#> 
#>                        0 - 5                      11 - 15 
#>                         2211                         6209 
#....

Created on 2024-07-18 with reprex v2.1.1

@Robinlovelace and other contributors, I believe stats19::stats19_variables$type is not used to read the data in.

Proof?

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

pkg_env = as.environment("package:stats19")
unlockBinding("stats19_variables", pkg_env)
stats19_variables = get("stats19_variables", envir = pkg_env)
# just test age_of_casualty
stats19_variables[grep("age_of_casualty", 
                       stats19_variables$variable), "type"] = "numeric"
assign("stats19_variables", stats19_variables, envir = pkg_env)
lockBinding("stats19_variables", pkg_env)
# read
cas = get_stats19(year = 2022, type = "cas")
#> Files identified: dft-road-casualty-statistics-casualty-2022.csv
#>    https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
#> Attempt downloading from: https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
#> Data saved at /tmp/Rtmp4E06yZ/dft-road-casualty-statistics-casualty-2022.csv
#> Rows: 135480 Columns: 19
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (3): accident_index, accident_reference, lsoa_of_casualty
#> dbl (16): accident_year, vehicle_reference, casualty_reference, casualty_cla...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sapply(grep("age_", colnames(cas)), function(x) class(x))
#> [1] "integer" "integer"
table(cas$age_of_casualty)
#> 
#> Data missing or out of range 
#>                         3129
table(cas$age_band_of_casualty)
#> 
#>                        0 - 5                      11 - 15 
#>                         2211                         6209 
#...

Created on 2024-07-18 with reprex v2.1.1

I must leave it there.
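
A note on the class check used in both reprexes above: `grep()` without `value = TRUE` returns column positions, so `sapply(grep("age_", colnames(cas)), function(x) class(x))` reports the class of those positions (always "integer") rather than the class of the columns. Below is a minimal sketch of a check on the columns themselves. The `toy` data frame and its values are made up for illustration and are not stats19 data.

```r
# Hypothetical stand-in for a formatted casualty table (not real stats19 data)
toy = data.frame(
  age_of_casualty = c("Data missing or out of range", NA, NA),
  age_band_of_casualty = c("0 - 5", "11 - 15", "0 - 5"),
  casualty_reference = 1:3,
  stringsAsFactors = FALSE
)

# Pattern from the reprex: iterates over integer positions, not columns
index_classes = sapply(grep("age_", colnames(toy)), function(x) class(x))

# Select the columns by name first, then ask for their classes
age_cols = grep("age_", colnames(toy), value = TRUE)
classes = sapply(toy[age_cols], class)

index_classes
classes
```

Here `index_classes` is "integer" for both entries whatever the columns contain, while `classes` reports "character" for both toy columns.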


layik · 2024-07-27T21:23:38Z

cc @wengraf

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
# Function to check for columns with lots of NAs
find_cols_with_many_NAs = function(df, na_threshold = 0.5) {
  # Calculate the proportion of NAs in each column
  na_proportions = sapply(df, function(col) sum(is.na(col)) / length(col))
  
  # Find columns where the proportion of NAs is greater than the threshold
  cols_with_many_NAs = names(na_proportions[na_proportions > na_threshold])
  
  return(cols_with_many_NAs)
}
col = get_stats19(year = 2022, type = "col")
#> Files identified: dft-road-casualty-statistics-collision-2022.csv
#>    https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-2022.csv
cas = get_stats19(year = 2022, type = "cas")
#> Files identified: dft-road-casualty-statistics-casualty-2022.csv
#> 
#>    https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2022.csv
veh = get_stats19(year = 2022, type = "veh")
#> Files identified: dft-road-casualty-statistics-vehicle-2022.csv
#> 
#>    https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-vehicle-2022.csv
find_cols_with_many_NAs(col)
#> [1] "local_authority_district" "first_road_number"
find_cols_with_many_NAs(cas)
#> [1] "age_of_casualty"
find_cols_with_many_NAs(veh)
#> [1] "age_of_driver"      "engine_capacity_cc" "generic_make_model"

Created on 2024-07-27 with reprex v2.1.1

EDIT:
certainly related to #220

What a subtle bug this is. Found it I think:

stats19/R/format.R

Line 84 in 3242e36

x[[i]] = lookup$label[match(x[[i]], lookup$code)]

As can be seen above, it has nothing to do with setting the class of the columns (the draft PR). The line above converts codes to labels using the lookup object, say 1, 2, 3 to male, female and unknown for sex, and that works for say 99.9% of the values in a coded column. In our integer-valued columns it fails on say 99% of the values: only the unknown values are coded (as -1, for instance), and match returns NA for every value that has no code in the lookup. match, by the way, is as complicated as R gets :)
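
To make the failure mode concrete, here is a minimal sketch with a made-up lookup and a made-up vector of ages. The `lookup` values, the `ages` vector and the `relabel_matched` helper are all hypothetical, and the guard shown is only one possible approach, not necessarily what the fix in the package does.

```r
# Hypothetical lookup: only the "missing" code has a label
lookup = data.frame(
  code = -1,
  label = "Data missing or out of range",
  stringsAsFactors = FALSE
)
ages = c(34, -1, 7, 61)

# Same pattern as the line in format.R: match() returns NA for every
# value that is not a code, so real ages are replaced by NA
naive = lookup$label[match(ages, lookup$code)]

# One possible guard: relabel only the values that have a match
relabel_matched = function(x, lookup) {
  idx = match(x, lookup$code)
  out = as.character(x)
  hit = !is.na(idx)
  out[hit] = lookup$label[idx[hit]]
  out
}
guarded = relabel_matched(ages, lookup)

naive
guarded
```

With the naive pattern only the -1 entry survives as a label and the three real ages become NA. The guarded version keeps the ages but returns a character vector; for a column such as age it may be preferable to set the coded values to NA and keep the column numeric.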

layik · 2024-07-28T07:36:45Z

This is the cure: engine capacity, and for that matter every column, should no longer have a crazy number of NAs injected into it whilst formatting any of the three tables.
839c656

layik · 2024-07-29T08:30:49Z

Once #245 is in, this will close. We will also have a much better get started vignette than the current one, which shows almost useless tables.

Robinlovelace · 2024-07-31T11:00:26Z

Heads-up @layik, I'm still hitting the issue with get_stats19() as per the reprex below. Can you test? It may be using the wrong version, but I'm re-opening to double check:

#| eval: false
install.packages("stats19")
packageVersion("stats19")
# [1] ‘3.0.3’
cas = stats19::get_stats19(type = "cas", year = 1979)
names(cas)
table(cas$age_of_casualty)
# Data missing or out of range 
#                       211849

remotes::install_dev("stats19")
packageVersion("stats19")
# [1] ‘3.1.0’
cas = stats19::get_stats19(type = "cas", year = 1979)
names(cas)
table(cas$age_of_casualty)
# Data missing or out of range 
#                       211849

Robinlovelace · 2024-07-31T11:02:10Z

Correction: it IS fixed, it was just using an old version of the data. Reprex: incoming.

wengraf mentioned this issue Jul 17, 2024

Reproducing DfT pedestrian stats sheet #240

Open

layik added a commit that referenced this issue Jul 27, 2024

Attempt addressing #235

0c9a0d9

layik added a commit that referenced this issue Jul 27, 2024

Avoid column name transform changing column classes #235 + tests

90cf166

layik mentioned this issue Jul 27, 2024

Fix 235 #245

Merged

layik added a commit that referenced this issue Jul 27, 2024

Avoid column name transform changing column classes #235 + tests

fbcc832

layik added a commit that referenced this issue Jul 27, 2024

vig/stats19.Rmd fails due to changes in ggplot perhaps: nothing to do…

8c82ba9

… with #235

layik added the bug label Jul 27, 2024

layik closed this as completed in #245 Jul 29, 2024

Robinlovelace added a commit that referenced this issue Jul 31, 2024

Add reprex for #235

2639739

Robinlovelace reopened this Jul 31, 2024

Robinlovelace closed this as completed Jul 31, 2024

Robinlovelace added a commit that referenced this issue Jul 31, 2024

Confirm fix for #235

4dd239b
