Merge changes in Rc/v2.1.0 to main branch #7
R/utils.R
Outdated
}
attr(out, "meta") <- meta
# Set names of the list elements to the basenames of the file paths
names(data_list) <- basename(file_paths)
Whole paths would be better, because file_paths can point to different folders with files that share the same name.
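The collision described above can be reproduced with a minimal sketch. The paths and the stand-in list contents below are hypothetical and only illustrate what happens when two folders hold a file with the same name:

```r
# Hypothetical paths: two folders, each holding a file with the same name
file_paths <- c("study_a/adsl.rds", "study_b/adsl.rds")

# basename() drops the directory, so both elements end up with the same name
base_names <- basename(file_paths)
has_collision <- anyDuplicated(base_names) > 0

# Stand-ins for the loaded data frames
data_list <- list(1, 2)
names(data_list) <- base_names

# `[[` returns only the first match, so the second element
# can no longer be reached by name
first_hit <- data_list[["adsl.rds"]]
```

With full paths as names, both elements would stay addressable, at the cost of longer names.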
The use of base file names as list names is maintained to ensure backward compatibility with existing code that relies on the legacy load_data() function. For users who need the full file paths, this information is stored in the metadata attributes of each data frame in the returned list.
Users of legacy load_data() retain the old behavior because of this statement that happens at the end of that function:
names(data_list) <- file_names
Users of load_data_files() would benefit from seeing the exact path they provided as names of the output list.
See the discussion in internal chat about the possibility of removing both the paths and the extensions, as well as preventing repeat entries in the resulting list. The names(data_list) <- file_names statement at the end of load_data() should make that function immune to this change.
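One way the naming scheme mentioned here could look is sketched below. The helper names_from_paths() is hypothetical and not part of dv.loader; it assumes that "removing the paths and the extensions" means keeping only the file stem and that repeats should be reported as an error:

```r
# Hypothetical helper (not part of dv.loader): derive list names from paths
# by dropping folder and extension, and refuse to continue on repeats
names_from_paths <- function(file_paths) {
  stems <- tools::file_path_sans_ext(basename(file_paths))
  repeated <- unique(stems[duplicated(stems)])
  if (length(repeated) > 0) {
    stop("Repeated entries in the resulting list: ", paste(repeated, collapse = ", "))
  }
  stems
}

# Distinct stems pass through unchanged
ok_names <- names_from_paths(c("study_a/adsl.rds", "study_a/adae.sas7bdat"))

# Same stem from two folders (or two extensions) is reported as an error
clash <- tryCatch(
  names_from_paths(c("study_a/adsl.rds", "study_b/adsl.sas7bdat")),
  error = function(e) conditionMessage(e)
)
```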
R/utils.R
Outdated
# Get file extension
extension <- tools::file_ext(file_path)

# Read file based on its extension
I don't get the rewrite from toupper to tolower, the change in the error message, the moving of code around and the obvious comments. They only open the doors for a bug to creep in and make code review a longer process for no discernible benefit.
The change to lowercase file extensions (.rds and .sas7bdat) was made to align with common R package conventions, as seen in packages like {pins} and {haven}.
The old toupper is only used to do a case-insensitive check of the file extension. It's a non-user-facing implementation detail. Changing the logic is unnecessary.
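As an illustration of the point above, a case-insensitive extension check needs nothing more than normalising the extension before comparing it. The helper file_kind() below is a hypothetical sketch, not the package's actual code:

```r
# Hypothetical helper: classify a file by its extension, ignoring case.
# toupper() is applied only for the comparison; the path itself is untouched.
file_kind <- function(file_path) {
  extension <- toupper(tools::file_ext(file_path))
  switch(extension,
    RDS = "rds",
    SAS7BDAT = "sas7bdat",
    stop("Unsupported file extension: ", extension)
  )
}

kind_upper <- file_kind("adsl.RDS")
kind_mixed <- file_kind("data/adsl.Sas7Bdat")
```

Whether the comparison uses toupper or tolower makes no difference to callers, which is why the normalisation can be treated as an internal detail.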
checkmate::assert_character(dir_path, len = 1)
checkmate::assert_character(file_names, min.len = 1)
checkmate::assert_logical(prefer_sas, len = 1)
There's a lot of new code here and I haven't reviewed it closely. The difference that strikes me the most is that the old toupper case-insensitive match behavior is gone. I imagine this can have an impact under Windows. Since load_data has been rewritten to list files through this function, we need a good reason to deviate from the old behavior.
The current code follows the case-sensitive behavior of readRDS() and haven::read_sas() to avoid ambiguity and risk of matching the wrong file.
This piece of code belongs to the create_data_list currently on the main branch:
# Case insensitive file name match
uppercase_file_name <- toupper(paste0(x, ext))
match_count <- sum(uppercase_candidates == uppercase_file_name)
if (match_count > 1) {
  stop(paste("create_data_list(): More than one case-insensitive file name match for", file_path, x))
}
It is there to warn against an edge-case scenario in which a folder contains two files that share the same name but differ in case. That is not a problem under Linux, but we still want to warn users against that situation, because running the same code with the same data files under case-insensitive Windows file systems could lead to the loading of different files.
This check is no longer in the rewritten dv.loader and it should be, unless the team decides otherwise.
My suggestion here would be to take the original logic of create_data_list and adapt it minimally to follow the old filename-matching logic, so that we don't throw away useful behavior on a rewrite.
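For reference, the check quoted above can be exercised in isolation. The function below is a hypothetical standalone adaptation, not the code on the main branch; `candidates` stands in for the file names found in a folder (for instance the result of list.files()):

```r
# Hypothetical standalone version of the case-insensitive match.
# `x` is the file name without extension, `ext` the extension with its dot.
match_file_name <- function(x, ext, candidates) {
  uppercase_file_name <- toupper(paste0(x, ext))
  hits <- candidates[toupper(candidates) == uppercase_file_name]
  if (length(hits) > 1) {
    # Two files that differ only in case: ambiguous on case-insensitive
    # file systems, so stop instead of silently picking one
    stop("More than one case-insensitive file name match for ", x, ext)
  }
  hits
}

# A single match is returned with its on-disk spelling
single <- match_file_name("adsl", ".rds", c("ADSL.RDS", "adae.rds"))

# Two spellings of the same name trigger the error
ambiguous <- tryCatch(
  match_file_name("adsl", ".rds", c("adsl.rds", "ADSL.rds")),
  error = function(e) conditionMessage(e)
)
```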
Thank you for raising this important point about case-sensitivity. I agree we should discuss with the team to determine if we want to make any changes to the current behavior.
R/dvloader.R
Outdated
} else {
  study_path <- file.path(get_cre_path(), sub_dir)
}
file_ext <- if (prefer_sas) "sas7bdat" else "rds"
This variable is not used anywhere. I think it can be removed.
Good catch. I've removed the unused file_ext variable since it's now handled in the get_file_paths() function.
Here are my thoughts on the last changes to this PR.
Context
The "partial matching" issue can't be fixed because it's documented behavior of dv.loader v2. The way we want to address that issue is to break the
Comments
On my last review of the code, I complained that an important part of the original logic of the I suggested to
Now I see that
Recommendation
Alternatively, we could get fresher eyes than mine on this PR. I think we're going in circles and I'm not helping.
Thank you for your detailed and thoughtful review. I agree with your assessment and recommendations. As suggested, I've reverted the two commits and adapted get_file_paths to use the legacy matching logic from the original create_data_list function. You can find these changes in my latest commit here: 6f65e5c
This should address the concerns about maintaining the original file matching behavior while keeping the cleaner separation between path matching and file loading functionality.
I see you really haven't reverted b971bb9. That means that the
This task has taken so much of my time that I'm struggling to fulfill my commitments to this and to other projects. I'm starting to use my personal time to take care of my assigned workload. This is not a sustainable situation, so I won't be conducting any more reviews of this codebase unless I'm explicitly asked to do it during a sprint planning meeting.
Instead, I want to offer what I believe is the main change that this task requires, built on top of the last released version of this package. This commit adds a data load function that is devoid of the "partial matching" behavior originally reported as a problem. It follows the design the team recently outlined in our internal chat. The rest of the package is preserved as it is, so no extra code review is necessary.
If the new function is ever tested, reviewed and deemed appropriate, then someone should modify the package documentation to take it into account, bump version numbers, etc. But, as stated above, that someone won't be me unless formally assigned to the task.
As discussed internally, a new pull request will be created to introduce a
Improved code quality through refactoring for better readability and maintainability, resolved partial matching issues for file names without extensions, and enhanced the load_data() function with new env_var and print_file_paths arguments for greater flexibility.