Add support for loading missing values in resources as additional cols #161

khusmann · 2023-11-08T00:43:28Z

Presently, all missing values loaded by read_resource become NAs in the resulting tibble. This means that when missing values encode reasons for missingness (e.g. "Participant refused item", "Participant absent"), these reasons are lost. In a lot of applications, we want access to these missing reasons because of the important contextual info they provide.

This pull request adds the ability to include missing reasons as separate columns when loading resources by adding an argument in read_resource to select which data "channel" the user wants to load: values, missing, or both. Resulting columns can be named via values_channel_suffix and missing_channel_suffix.

I'm using the term channel here because I think it's a powerful way of conceptualizing missing value data & metadata that can generalize to other types of data & formats. In the same way a color image has multiple channels for red, green, and blue pixel values, we can think of tabular data with missing reasons as having a channel for "values" and a channel for "missing reasons". What's nice about thinking of values and missingness as separate channels is that it enables us to work with them as separate types: when we interlace them as most formats do, everything becomes a string and we lose that useful type info.
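As a rough illustration of the channel idea (a base-R sketch with made-up data, not code from this PR), compare one interlaced vector with the same information held as two channels. The reason strings "REFUSED" and "ABSENT" are placeholders chosen for the example.

```r
# Interlaced: one vector has to hold both numbers and missing reasons,
# so R coerces everything to character.
interlaced <- c(12, "REFUSED", 7, "ABSENT", 31)
class(interlaced)  # "character"

# Two channels: each keeps a type suited to its content.
values  <- c(12, NA, 7, NA, 31)                         # numeric values
missing <- factor(c(NA, "REFUSED", NA, "ABSENT", NA))   # missing reasons

class(values)  # "numeric"

# Numeric operations work on the values channel without string handling.
mean(values, na.rm = TRUE)

# Exactly one of the two channels is populated for every row.
all(xor(is.na(values), is.na(missing)))
```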

Unfortunately, no packages in R exist yet (that I'm aware of) to work with values & missingness as multichannel structures, and frictionless doesn't have support for multichannel tabular formats (yet). So this is why I add them as separate columns here. The closest we have to this ability in R is the tagged_na and labelled classes in haven, but these are limited to the particular ways SPSS / Stata / SAS encode their missing values, rather than enabling support for arbitrary missing reasons, as frictionless is able to represent.

In the long run it'd be nice to have an R package that could provide full support for multichannel missingness, but I think extra columns are the best we can do for now. In the meantime, we might also consider adding support for converting to the tagged_na and haven_labelled types, in the special cases when missing reasons conform to the peculiarities of SPSS / Stata / SAS formats.


khusmann · 2023-11-08T00:54:13Z

I just thought of another, possibly slicker way to handle the API here: accept unnamed and named vectors for channel selection & renaming.

Then we could select channels via:

channels = c("values")
channels = c("values", "missing")

and add column suffixes via:

channels = c(values = "", missing = "__missing")

This would generalize better to other multichannel formats down the road, e.g.:

channels = c("r", "g", "b")
channels = c(r = "__red", g = "__green", b = "__blue")

Thoughts? Other ideas?

khusmann · 2023-11-08T02:13:25Z

Just implemented the new API idea above, so read_resource now uses only the newly added channels arg.

channels = c("values") -> load values (normal, default behavior)
channels = c("missing") -> load missing reasons

channels = c("values", "missing") -> load values AND missing reasons. Append "__values" and "__missing" to columns respectively
channels = c(values="__value_suffix", missing="__missing_suffix") -> same as above, but with custom suffixes.

I think what's nice about appending both __values and __missing to columns when loading both values and missingness is that it makes pattern-based pivoting & other data wrangling a little easier by default (e.g. via ends_with("__values") and ends_with("__missing") dplyr selectors).
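The rules above could be sketched roughly as follows. `standardize_channels()` is a hypothetical helper written for this illustration, not the function in this branch, and the empty suffix for a single unnamed channel is an assumption based on the "normal, default behavior" note. Base R's `endsWith()` stands in for the dplyr `ends_with()` selector so the sketch has no dependencies.

```r
# Hypothetical helper: turn a `channels` argument into a named vector of
# column suffixes, following the rules described in this comment.
standardize_channels <- function(channels) {
  allowed <- c("values", "missing")
  if (is.null(names(channels))) {
    # Unnamed form: c("values"), c("missing"), or c("values", "missing").
    stopifnot(all(channels %in% allowed))
    # Assumption: a single channel keeps the original column names (empty
    # suffix); several channels get "__<channel>" appended.
    suffixes <- if (length(channels) == 1) "" else paste0("__", channels)
    names(suffixes) <- channels
    return(suffixes)
  }
  # Named form: names select the channels, values are the custom suffixes.
  stopifnot(all(names(channels) %in% allowed))
  channels
}

standardize_channels(c("values"))
standardize_channels(c("values", "missing"))
standardize_channels(c(values = "", missing = "__missing"))

# With both channels loaded, a suffix match picks out one channel.
cols <- c("age__values", "age__missing", "score__values", "score__missing")
missing_cols <- cols[endsWith(cols, "__missing")]
```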

peterdesmet · 2023-11-13T09:49:46Z

@khusmann interesting use case

Can you provide a small example dataset that can be used to test different approaches?
Can you provide a reproducible example using the example above? I learn a lot from just seeing what the output looks like. :-)
I use readr as a barometer for what functionality to consider when reading data. This use case gets us quite far from that. So I'm reluctant to implement (and maintain) something complicated that isn't adopted in other packages. 😅
That said, I do understand that improved missing value interpretation would be very useful. I haven't thought about this as long as you have, but what about the following approach:

Have an argument in read_resource() to include (rather than convert) missing values
Missing values are included with a prefix: missing:NA, missing:Participant refused item
Columns get converted to string
User can manipulate data further

khusmann · 2023-11-13T20:01:33Z

Thanks for the feedback!

Sure! I think this approach is deserving of a full vignette. I'll put one together... :)
Same answer as above: the vignette will include a reproducible example.
For this feature, I'm thinking more along the lines of an analogy of this package to the haven package, in how it captures SPSS/Stata/SAS missing values with custom types / attributes. Until now I've been thinking of readr as more of a lower-level lib in this context, but you make a good point that this functionality may be a candidate for inclusion into readr proper instead of here... I'll have to think about that.
My hesitation with that approach is how it loses type information (by converting everything to string). So subsequent manipulations end up relying on a lot of string operations with "magic" tags (like "missing:") and type conversions rather than working with the pure data, which adds a lot of brittle boilerplate to common manipulation tasks. I can show some of the pros / cons in the aforementioned vignette, once I get it together...
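To show the kind of boilerplate meant here, this is a small base-R sketch of working with a column loaded under the prefix approach. The data are invented for the example; only the "missing:" tag comes from the suggestion above.

```r
# Hypothetical column loaded with the "missing:" prefix approach:
# every entry arrives as a character string.
age <- c("12", "missing:Participant refused item", "7", "missing:NA")

# Computing a mean first requires detecting the tag and converting types.
is_missing <- startsWith(age, "missing:")
age_values <- as.numeric(ifelse(is_missing, NA, age))
mean(age_values, na.rm = TRUE)

# Recovering the reasons requires stripping the tag again.
age_reasons <- ifelse(is_missing, sub("^missing:", "", age), NA)
```

Every downstream step that touches the column has to repeat some version of the tag check, which is the brittleness being described.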

khusmann · 2023-11-14T02:32:54Z

Hi again! I've put together a vignette outlining my thoughts / justifications for this approach, framed as a proposal for addition to read_delim in the tidyverse. (I think you're right, it would be most ideal for it to be supported there, if possible). Any thoughts / feedback / other perspectives on this would be greatly appreciated! :)

peterdesmet · 2023-11-21T11:13:05Z

Hi @khusmann, nice work on the vignette! I would add a chunk at the beginning to load the packages you use (I think readr, stringr, dplyr), so it becomes repeatable for others.

Since we both agree this is a better fit for readr, I suggest you propose and clarify it as a feature there: https://github.com/tidyverse/readr/issues. The vignette will be useful.

khusmann · 2023-11-21T19:03:20Z

Thanks! Just updated my vignette with your suggestion. I'll make a post to readr with my vignette after the (USA) holiday to hopefully get more eyes on it.

Also updated this branch to move the channel-selection logic into read_delim_ext in utils.R. This way, if readr eventually does implement this feature, it'll be a drop-in replacement here. I also plan to use this as the basis for my implementation of value / missing labels (#148). A key feature for me in that implementation is the ability to keep the value and missing labels separate. Otherwise, you get factor levels polluted with a bunch of missing reasons, and again rely on brittle string manipulations (& type conversions) to distinguish them. Keeping them separate gives the user much more flexibility -- in general, combining/interlacing channels is always trivial, but separating already interlaced channels requires context & gymnastics.
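A small base-R sketch of that asymmetry, using invented labels (this is not the #148 implementation): interlacing two label channels is a one-liner, while splitting them again needs outside knowledge of which levels are missing reasons.

```r
# Hypothetical labelled item with value labels and missing reasons kept apart.
values  <- factor(c("Agree", NA, "Disagree", NA),
                  levels = c("Disagree", "Agree"))
missing <- factor(c(NA, "Refused", NA, "Absent"),
                  levels = c("Refused", "Absent"))

# Combining (interlacing) the channels is straightforward...
interlaced <- factor(ifelse(is.na(values),
                            as.character(missing),
                            as.character(values)))
levels(interlaced)  # value labels and missing reasons now share one level set

# ...but splitting them again needs context the interlaced factor lacks:
# which levels are reasons, and the original order of the value levels.
reason_levels <- levels(missing)
recovered <- factor(ifelse(interlaced %in% reason_levels,
                           NA,
                           as.character(interlaced)),
                    levels = levels(values))
```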

khusmann · 2024-03-25T16:08:23Z

I mentioned this on slack, but putting here for reference: I've created a package for reading interlaced values & missing reasons that might be useful here: https://kylehusmann.com/interlacer/

Instead of appending "__values" and "__missing" like I did above, value columns retain their original names, and missing columns are surrounded by dots (e.g. .name.)

It wraps & extends readr's read_* functions and col_* types, so it'd be really easy to incorporate into read_resource(). I'm imagining we could add a flag deinterlace = TRUE that would load the missing values in a deinterlaced data frame, whereas deinterlace = FALSE (the default) would keep the original behavior.

It also handles field-level missing values via the extended icol_* collector types.

Anyway, the package is still in its infancy so it's not ready to be dropped in just yet -- but would appreciate any thoughts & feedback on the approach!

Add support for loading missing values in resources as additional col…

486ab75

…umns

slightly different read_resource channel api

0cfa555

peterdesmet added the enhancement and function:read_resource labels Nov 13, 2023

khusmann added 2 commits November 21, 2023 10:14

break out channel select logic into separate function in utils

413dadf

fix warning thrown by col_select

8d70d38

khusmann added 4 commits November 21, 2023 12:40

move read_delim_ext to utils_ext

556639a

break out function channel_opt_standardize

af466c3

fix failed tests

e4cf268

fix R cmd check errors

4d4c1b0

peterdesmet added this to the 1.2.0 milestone Mar 27, 2024

khusmann closed this Jun 26, 2024

khusmann deleted the load_missing branch June 26, 2024 18:20
