Add support for loading missing values in resources as additional cols #161
Another way I just thought of to handle the API here, which could be pretty slick: take unnamed and named vectors for channel selection & renaming. Then we could select channels via `channels = c("values")` and add column suffixes via `channels = c(values = "", missing = "__missing")`. This would generalize better to other multichannel formats down the road, e.g. `channels = c("r", "g", "b")`. Thoughts? Other ideas?
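
To make the proposal concrete, here is a minimal sketch of how a `channels` vector could be normalized into a channel-to-suffix mapping. The helper name `normalize_channels` is hypothetical and is not part of frictionless; the sketch only assumes that unnamed entries name a channel with no suffix, and that named entries map a channel name to its column suffix.

```r
# Hypothetical helper (not part of frictionless): normalize a `channels`
# argument into a named character vector of channel -> column suffix.
normalize_channels <- function(channels) {
  stopifnot(is.character(channels), length(channels) > 0)
  nms <- names(channels)
  if (is.null(nms)) {
    # Fully unnamed vector: every entry is a channel name with no suffix.
    return(stats::setNames(rep("", length(channels)), channels))
  }
  # Partially named vector: unnamed entries are channel names with an
  # empty suffix; named entries keep their value as the suffix.
  unnamed <- nms == ""
  suffixes <- unname(channels)
  suffixes[unnamed] <- ""
  nms[unnamed] <- channels[unnamed]
  stats::setNames(suffixes, nms)
}

normalize_channels(c("values"))
normalize_channels(c(values = "", missing = "__missing"))
normalize_channels(c("r", "g", "b"))
```

Under these assumptions, one argument covers both selection and renaming, so a reader function would only need to loop over the names of the normalized vector.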
Just implemented the new API idea above, so read_resource now only uses added
I think what's nice about appending both
@khusmann interesting use case
Thanks for the feedback!
Hi again! I've put together a vignette outlining my thoughts / justifications for this approach, framed as a proposal for addition to
Hi @khusmann, nice work on the vignette! I would add a chunk at the beginning to load the packages you use (I think Since we both agree this is a better feature for readr, I suggest you propose and clarify it as a feature there: https://github.com/tidyverse/readr/issues. The vignette will be useful.
Thanks! Just updated my vignette with your suggestion. I'll make a post to readr with my vignette after the (USA) holiday to hopefully get more eyes on it. Also updated this branch to relegate the channel select logic into utils in
I mentioned this on slack, but putting here for reference: I've created a package for reading interlaced values & missing reasons that might be useful here: https://kylehusmann.com/interlacer/ Instead of appending "_values" and "_missing" like I did above, value columns retain their original names, and missing columns are surrounded by dots (e.g. It wraps & extends readr's It also handles field-level missing values via the extended Anyway, the package is still in its infancy so it's not ready to be dropped in just yet -- but would appreciate any thoughts & feedback on the approach!
Presently, all missing values loaded by read_resource become NAs in the resulting tibble. This means that when missing values encode reasons for missingness (e.g. "Participant refused item", "Participant absent"), these reasons are lost. In a lot of applications, we want access to these missing reasons because of the important contextual info they provide.
This pull request adds the ability to include missing reasons as separate columns when loading resources by adding an argument in `read_resource` to select which data "channel" the user wants to load: `values`, `missing`, or `both`. Resulting columns can be named via `values_channel_suffix` and `missing_channel_suffix`.
I'm using the term `channel` here because I think it's a powerful way of conceptualizing missing value data & metadata that can generalize to other types of data & formats. In the same way a color image has multiple channels for red, green, and blue pixel values, we can think of tabular data with missing reasons as having a channel for "values" and a channel for "missing reasons". What's nice about thinking of values and missingness as separate channels is that it enables us to work with them as separate types: when we interlace them as most formats do, everything becomes a `string` and we lose that useful type info.
Unfortunately, no packages in R exist yet (that I'm aware of) to work with values & missingness as multichannel structures, and frictionless doesn't have support for multichannel tabular formats (yet). So this is why I add them as separate columns here. The closest we have to this ability in R are the `tagged_na` and `labelled` classes in `haven`, but these are limited to the particular ways SPSS / Stata / SAS encode their missing values, rather than enabling support for arbitrary missing reasons, as frictionless is able to represent.
In the long run it'd be nice to have an R package that could provide full support for multichannel missingness, but I think extra columns are the best we can do for now. In the meantime, we might also consider at some point adding support for converting to the `tagged_na` and `haven_labelled` types, in the special cases when missing reasons conform to the peculiarities of SPSS / Stata / SAS formats.
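
As a rough illustration of the "separate columns" idea described above, the following base R sketch splits an interlaced column into a values channel and a missing-reasons channel. It is not the pull request's implementation: `split_channels` and its arguments are hypothetical names, the example data are made up, and the `_values` / `_missing` suffixes are used only as illustrative defaults.

```r
# Hypothetical helper, base R only; not the code from this pull request.
# Splits each interlaced column into a values column and a missing-reason
# column, so the values channel can regain its own type.
split_channels <- function(df, missing_values,
                           values_suffix = "_values",
                           missing_suffix = "_missing") {
  out <- list()
  for (col in names(df)) {
    cells <- as.character(df[[col]])
    is_missing <- cells %in% missing_values
    # Values channel: blank out missing reasons, then re-infer the type.
    values <- ifelse(is_missing, NA_character_, cells)
    out[[paste0(col, values_suffix)]] <- utils::type.convert(values, as.is = TRUE)
    # Missing channel: keep only the missing reasons.
    out[[paste0(col, missing_suffix)]] <- ifelse(is_missing, cells, NA_character_)
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Made-up example: interlaced, so the whole column is character.
interlaced <- data.frame(
  age = c("34", "REFUSED", "51", "ABSENT"),
  stringsAsFactors = FALSE
)
separated <- split_channels(interlaced, missing_values = c("REFUSED", "ABSENT"))
str(separated)
```

In this sketch the values column comes back as an integer vector while the reasons stay available in their own character column, which is the type information that is lost when both channels are interlaced in a single string column.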