Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow data.frame row-binding comparison (Fix #50) #284

Merged
merged 6 commits into from
Apr 20, 2019
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@ S3method(clean_names,default)
S3method(clean_names,sf)
S3method(crosstab,data.frame)
S3method(crosstab,default)
S3method(describe_class,default)
S3method(describe_class,factor)
S3method(fisher.test,default)
S3method(fisher.test,tabyl)
S3method(print,tabyl)
Expand All @@ -24,8 +26,11 @@ export(adorn_totals)
export(as_tabyl)
export(chisq.test)
export(clean_names)
export(compare_df_cols)
export(compare_df_cols_same)
export(convert_to_NA)
export(crosstab)
export(describe_class)
export(excel_numeric_to_date)
export(fisher.test)
export(get_dupes)
Expand Down
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@ This new function can be supplied as a value for the `.name_repair` argument of

Two new function `janitor::chisq.test()` and `janitor::fisher.test()` allow to apply their `stats` equivalent to two-way tabyl objects.

The new function `compare_df_cols()` allows checking if a combination of data.frames, tibbles, or lists of data.frames/tibbles have columns with the same classes, and reports on specific columns that are or are not similar. A companion function `compare_df_cols_same()` gives a TRUE/FALSE result indicating if the columns are the same (and therefore bindable, though FALSE is not definitive that binding will fail), and `describe_class()` describes the class to make differences between data.frames clear at a glance (#50, thanks to **@billdenney** for the feature.)

billdenney marked this conversation as resolved.
Show resolved Hide resolved
## Minor features

* `excel_numeric_to_date()` now returns a POSIXct object and includes a time zone. (#225, thanks to **@billdenney** for the feature.)
Expand Down
266 changes: 266 additions & 0 deletions R/compare_df_cols.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,266 @@
#' Generate a comparison of data.frames (or similar objects) that indicates if
#' they will successfully bind together by rows.
#'
#' @details Due to the returned "column_name" column, no input data.frame may be
#' named "column_name".
#'
#' The \code{strict_description} argument is most typically used to understand
#' if factor levels match or are bindable. Factors are typically bindable,
#' but the behavior of what happens when they bind differs based on the
#' binding method ("rbind" or "bind_rows"). Even when
#' \code{strict_description} is \code{FALSE}, data.frames may still bind
#' because some classes (like factors and characters) can bind even if they
#' appear to differ.
#'
#' @param ... A combination of data.frames, tibbles, and lists of
#' data.frames/tibbles. The values may optionally be named arguments; if
#' named, the output column will be the name; if not named, the output column
#' will be the data.frame name (see examples section).
#' @param return Should a summary of "all" columns be returned, only return
#' "match"ing columns, or only "mismatch"ing columns?
#' @param bind_method What method of binding should be used to determine
#' matches? With "rbind", columns missing from a data.frame would be
#' considered a mismatch (as in \code{base::rbind()}; with "bind_rows",
#' columns missing from a data.frame would be considered a match (as in
#' \code{dplyr::bind_rows()}.
#' @param strict_description Passed to \code{describe_class}. Also, see the
#' Details section.
#' @return A data.frame with a column named "column_name" with a value named
#' after the input data.frames' column names, and then one column per
#' data.frame (named after the input data.frame). If more than one input has
#' the same column name, the column naming will have suffixes defined by
#' sequential use of \code{base::merge()} and may differ from expected naming.
#' The rows within the data.frame-named columns are descriptions of the
#' classes of the data within the columns (generated by
#' \code{describe_class}).
#' @examples
#' compare_df_cols(data.frame(A=1), data.frame(B=2))
#' # user-defined names
#' compare_df_cols(dfA=data.frame(A=1), dfB=data.frame(B=2))
#' # a combination of list and data.frame input
#' compare_df_cols(listA=list(dfA=data.frame(A=1), dfB=data.frame(B=2)), data.frame(A=3))
#' @family Data frame type comparison
#' @export
compare_df_cols <- function(..., return=c("all", "match", "mismatch"), bind_method=c("rbind", "bind_rows"), strict_description=FALSE) {
# Input checking
return <- match.arg(return)
bind_method <- match.arg(bind_method)
args <- list(...)
mask_input_data_frame <- sapply(X=args, FUN=is.data.frame)
mask_input_list <- sapply(X=args, FUN=is.list) & !mask_input_data_frame
mask_input_other <- !(mask_input_data_frame | mask_input_list)
if (any(mask_input_other)) {
stop("Input given with `...` must be either a data.frame or a list of data.frames.")
}
bad_list_inputs <- numeric(0)
for (idx in which(mask_input_list)) {
bad_list_inputs <-
c(
bad_list_inputs,
if (!all(sapply(X=args[[idx]], FUN=is.data.frame))) {
idx
} else {
numeric(0)
}
)
}
if (length(bad_list_inputs)) {
stop(
"List inputs must be lists of data.frames. List input ",
if (length(bad_list_inputs) == 1) {
paste("number", bad_list_inputs, "is not a list of data.frames.")
} else if (length(bad_list_inputs) < 6) {
paste("numbers", paste(bad_list_inputs, collapse=", "), "are not lists of data.frames.")
} else {
paste("numbers", paste(c(bad_list_inputs[1:5], "..."), collapse=", "), "are not lists of data.frames.")
}
)
}

# Generate and check column names
direct_names <- names(args)
indirect_names <- as.character(match.call(expand.dots=TRUE))
indirect_names <- indirect_names[!(indirect_names %in% as.character(match.call(expand.dots=FALSE)))]
if (is.null(direct_names)) {
final_names <- indirect_names
} else {
final_names <- direct_names
mask_replace <- final_names %in% ""
final_names[mask_replace] <- indirect_names[mask_replace]
}
final_names <- as.list(final_names)
for (idx in which(mask_input_list)) {
current_list_names <- names(args[[idx]])
final_names[[idx]] <-
if (is.null(current_list_names)) {
paste(final_names[[idx]], seq_along(args[[idx]]), sep="_")
} else if (any(mask_unnamed_list <- current_list_names %in% "")) {
current_list_names[mask_unnamed_list] <-
paste(
final_names[[idx]][mask_unnamed_list],
seq_len(sum(mask_unnamed_list)),
sep="_"
)
current_list_names
} else {
current_list_names
}
}
if (any(unlist(final_names) %in% "column_name")) {
stop("None of the input ... argument names or list names may be `column_name`.")
}
ret <- compare_df_cols_df_maker(args, class_colname=final_names, strict_description=strict_description)
if (return == "all" | ncol(ret) == 2) {
if (return != "all") {
warning("Only one data.frame provided, so all its classes are provided.")
}
rownames(ret) <- NULL
ret
} else {
# Choose which way to test if the rows are bindable (NA matches or not).
bind_method_fun <-
list(
rbind=function(idx) {
all(unlist(ret[idx,3:ncol(ret)]) %in% ret[idx,2])
},
bind_rows=function(idx) {
all(
unlist(ret[idx,3:ncol(ret)]) %in%
c(NA_character_,
na.omit(unlist(ret[idx,2:ncol(ret)]))[1])
)
}
)
mask_match <-
sapply(
X=seq_len(nrow(ret)),
FUN=bind_method_fun[[bind_method]]
)
ret <-
if (return == "match") {
ret[mask_match,]
} else if (return == "mismatch") {
ret[!mask_match,]
}
rownames(ret) <- NULL
ret
}
}

#' This is the workhorse for making a data.frame description used by
#' compare_df_cols
#' @param x The data.frame or list of data.frames
#' @param class_colname The name for the column-name-defining column
#' @param strict_description Passed to \code{describe_class}
#' @return A 2-column data.frame with the first column naming all the columns of
#' \code{x} and the second column (named after the value in
#' \code{class_colname}) defining the classes using
#' \code{describe_class()}.
#' @noRd
compare_df_cols_df_maker <- function(x, class_colname="class", strict_description)
UseMethod("compare_df_cols_df_maker")

compare_df_cols_df_maker.data.frame <- function(x, class_colname="class", strict_description) {
if (class_colname == "column_name") {
stop('`class_colname` cannot be "column_name"')
}
if (ncol(x) == 0) {
warning(class_colname, " has zero columns and will not appear in output.")
ret <- data.frame(column_name=character(0), stringsAsFactors=FALSE)
} else {
ret <-
data.frame(
column_name=names(x),
X=sapply(X=x, FUN=describe_class, strict_description=strict_description),
stringsAsFactors=FALSE
)
names(ret)[2] <- class_colname
}
ret
}

compare_df_cols_df_maker.list <- function(x, class_colname="class", strict_description=strict_description) {
if (length(class_colname) != length(x)) {
stop("`x` and `class_colname` must be the same length.")
} else if (any(class_colname == "column_name")) {
stop('`class_colname` cannot be "column_name"')
}
ret <-
lapply(
X=seq_along(x),
FUN=function(idx) {
compare_df_cols_df_maker(x=x[[idx]], class_colname=class_colname[[idx]], strict_description=strict_description)
}
)
Reduce(f=function(x, y) {merge(x, y, by="column_name", all=TRUE)}, x=ret)
}

#' Are the data.frames the same?
#'
#' @inheritParams compare_df_cols
#' @param verbose Print the mismatching columns if binding will fail.
#' @return \code{TRUE} if row binding will succeed or \code{FALSE} if it will
#' fail.
#' @family Data frame type comparison
#' @examples
#' compare_df_cols_same(data.frame(A=1), data.frame(A=2))
#' compare_df_cols_same(data.frame(A=1), data.frame(B=2))
#' compare_df_cols_same(data.frame(A=1), data.frame(B=2), verbose=FALSE)
#' compare_df_cols_same(data.frame(A=1), data.frame(B=2), bind_method="bind_rows")
#' @export
compare_df_cols_same <- function(..., return="mismatch", bind_method=c("rbind", "bind_rows"), verbose=TRUE) {
return <- match.arg(return)
bind_method <- match.arg(bind_method)
ret <- compare_df_cols(..., return=return, bind_method=bind_method)
if (nrow(ret) & verbose) {
print(ret)
}
nrow(ret) == 0
}

#' Describe the class(es) of an object
#'
#' @details For package developers, an S3 generic method can be written for
#' \code{describe_class()} for custom classes that may need more definition
#' than the default method.
#'
#' @param x The object to describe
#' @param strict_description Should the
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Should the" - then this trails off

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what to put here. I think I'll merge it - it's the last thing holding it up! - then play with the function; it'll be a test of my understanding it, to be able to figure out what you were going to write here 😆

#' @return A character scalar describing the class(es) of an object where if the
#' scalar will match, columns in a data.frame (or similar object) should bind
#' together without issue.
sfirke marked this conversation as resolved.
Show resolved Hide resolved
#' @family Data frame type comparison
#' @examples
#' describe_class(1)
#' describe_class(factor("A"))
#' describe_class(ordered(c("A", "B")))
#' describe_class(ordered(c("A", "B")), strict_description=FALSE)
#' @export
describe_class <- function(x, strict_description=TRUE) {
UseMethod("describe_class")
}

#' @describeIn describe_class Describe factors with their levels
#' and if they are ordered.
#' @export
describe_class.factor <- function(x, strict_description=TRUE) {
if (strict_description) {
all_classes <- class(x)
all_levels <- levels(x)
level_text <- sprintf("levels=c(%s)", paste('"', levels(x), '"', sep="", collapse=", "))
factor_text <- sprintf("factor(%s)", level_text)
mask_factor <- class(x) == "factor"
all_classes[mask_factor] <- factor_text
paste(all_classes, collapse=", ")
} else {
all_classes <- setdiff(class(x), "ordered")
paste(all_classes, collapse=", ")
}
}

#' @describeIn describe_class List all classes of an object.
#' @export
describe_class.default <- function(x, strict_description=TRUE) {
all_classes <- class(x)
paste(all_classes, collapse=", ")
}
66 changes: 66 additions & 0 deletions man/compare_df_cols.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading