Merge changes in Rc/v2.1.0 to main branch #7

mingstat · 2024-10-23T07:02:36Z

Improved code quality through refactoring for better readability and maintainability, resolved partial matching issues for file names without extensions, and enhanced the load_data() function with new env_var and print_file_paths arguments for greater flexibility.

R/dvloader.R

ml-ebs-ext · 2024-11-03T17:00:58Z

R/utils.R

-  }
-  attr(out, "meta") <- meta
+  # Set names of the list elements to the basenames of the file paths
+  names(data_list) <- basename(file_paths)


Whole paths would be better, because file_paths can point to different folders with files that share the same name.

The use of base file names as list names is maintained to ensure backward compatibility with existing code that relies on the legacy load_data() function. For users who need the full file paths, this information is stored in the metadata attributes of each data frame in the returned list.

Users of legacy load_data() retain the old behavior because of this statement that happens at the end of that function:

names(data_list) <- file_names

Users of load_data_files() would benefit from seeing the exact path they provided as names of the output list.

See discussion in internal chat about possibility of removing both the paths and the extensions as well as preventing repeat entries in the resulting list. The

names(data_list) <- file_names

at the end of load_data() should make that function immune to implementing this change.

ml-ebs-ext · 2024-11-03T17:07:49Z

R/utils.R

+    # Get file extension
+    extension <- tools::file_ext(file_path)
+
+    # Read file based on its extension


I don't get the rewrite from toupper to tolower, the change in the error message, the moving of code around and the obvious comments. They only open the doors for a bug to creep in and make code review a longer process for no discernible benefit.

The change to lowercase file extensions (.rds and .sas7bdat) was made to align with common R package conventions, as seen in packages like {pins} and {haven}.

The old toupper is only used to do a case-insensitive check of the file extension. It's a non-user-facing implementation detail. Changing the logic is unnecessary.

README.md

ml-ebs-ext · 2024-11-03T17:28:46Z

R/utils.R

+  checkmate::assert_character(dir_path, len = 1)
+  checkmate::assert_character(file_names, min.len = 1)
+  checkmate::assert_logical(prefer_sas, len = 1)
+


There's a lot of new code here and I haven't reviewed it closely. The difference that strikes me the most is that the old toupper case-insensitive match behavior is gone. I imagine this can have an impact under Windows. Since load_data has been rewritten to list files through this function, we need a good reason to deviate from the old behavior.

The current code follows the case-sensitive behavior of readRDS() and haven::read_sas() to avoid ambiguity and risk of matching the wrong file.

This piece of code belongs to the create_data_list currently on the main branch:

# Case insensitive file name match uppercase_file_name <- toupper(paste0(x, ext)) match_count <- sum(uppercase_candidates == uppercase_file_name) if (match_count > 1) { stop(paste("create_data_list(): More than one case-insensitive file name match for", file_path, x)) }

It is there to warn against an edge-case scenario in which a folder contains two files that share the same name but differ in case. That is not a problem under linux, but we still want to warn users against that situation, because running the same code with the same data files under case-insensitive windows file systems could lead to the loading of different files.

This check is no longer in the rewritten dv.loader and it should be, unless the team decides otherwise.

My suggestion here would be to take the original logic of create_data_list and adapt it minimally to follow the old filename-matching logic, so that we don't throw away useful behavior on a rewrite.

Thank you for raising this important point about case-sensitivity. I agree we should discuss with the team to determine if we want to to make any changes to the current behavior.

NEWS.md

mattkorb · 2024-11-12T12:53:04Z

R/dvloader.R

-    } else {
-      study_path <- file.path(get_cre_path(), sub_dir)
-    }
+  file_ext <- if (prefer_sas) "sas7bdat" else "rds"


This variable is not used anywhere. I think it can be removed.

Good catch. I've removed the unused file_ext variable since it's now handled in the get_file_paths() function.

ml-ebs-ext · 2024-11-19T10:22:04Z

Here are my thoughts on the last changes to this PR.

Context

The "partial matching" issue can't be fixed because it's documented behavior of dv.loader v2.

The way we want to that address that issue is to break the load_data function into two:

The one that does the partial matching, which up to this PR was called get_file_paths.
The one that does the loading, which was and is called load_data_files.
We can recreate the load_data function by composing it as two call to these two new functions.

Comments

On my last review of the code, I complained that an important part of the original logic of the
legacy load_data had been dropped and that we needed to recover it.

I suggested to

take the original logic of create_data_list and adapt it minimally to follow the old filename-matching
logic, so that we don't throw away useful behavior on a rewrite.

Now I see that get_file_paths is gone and create_data_list is back, which not only lists paths,
but also loads their contents through load_data_files.
For me this is one step forward and one step back. Because indeed, we recover the old file matching
logic, but now load_data calls create_data_list which in turn calls load_data_files once per file
(inside a lapply) even though load_data_files already allows to load multiple files at the same time.
I find that unnecessarily confusing and it's the first thing I'd address if I was contributing to this codebase.

Recommendation

Revert these two commits:
350f4ed
b971bb9
Adapt get_file_paths to use the legacy matching logic in the old create_data_list.

Alternatively, we could get fresher eyes than mine on this PR. I think we're going in circles and I'm not helping.
None of my comments are set to block the approval of this PR and I'm more than willing to step aside.

mingstat · 2024-11-19T11:42:15Z

Here are my thoughts on the last changes to this PR.

Context

The "partial matching" issue can't be fixed because it's documented behavior of dv.loader v2.

The way we want to that address that issue is to break the load_data function into two:

The one that does the partial matching, which up to this PR was called get_file_paths.

The one that does the loading, which was and is called load_data_files.
We can recreate the load_data function by composing it as two call to these two new functions.

Comments

On my last review of the code, I complained that an important part of the original logic of the legacy load_data had been dropped and that we needed to recover it.

I suggested to

take the original logic of create_data_list and adapt it minimally to follow the old filename-matching
logic, so that we don't throw away useful behavior on a rewrite.

Now I see that get_file_paths is gone and create_data_list is back, which not only lists paths, but also loads their contents through load_data_files. For me this is one step forward and one step back. Because indeed, we recover the old file matching logic, but now load_data calls create_data_list which in turn calls load_data_files once per file (inside a lapply) even though load_data_files already allows to load multiple files at the same time. I find that unnecessarily confusing and it's the first thing I'd address if I was contributing to this codebase.

Recommendation

Revert these two commits:

350f4ed

b971bb9

Adapt get_file_paths to use the legacy matching logic in the old create_data_list.

Alternatively, we could get fresher eyes than mine on this PR. I think we're going in circles and I'm not helping. None of my comments are set to block the approval of this PR and I'm more than willing to step aside.

Thank you for your detailed and thoughtful review. I agree with your assessment and recommendations. As suggested, I've reverted the two commits and adapted get_file_paths to use the legacy matching logic from the original create_data_list function. You can find these changes in my latest commit here: 6f65e5c

This should address the concerns about maintaining the original file matching behavior while keeping the cleaner separation between path matching and file loading functionality.

ml-ebs-ext · 2024-11-19T15:26:18Z

I see you really haven't reverted b971bb9. That means that the load_data in this branch behaves differently than that in main. I've stopped reviewing the changes after finding this problem.

This task has taken so much of my time that I'm struggling to fulfill my commitments to this and to other projects. I'm starting to use my personal time to take care of my assigned workload. This is not a sustainable situation and so,

I won't be conducting any more reviews of this codebase unless I'm explicitly asked to do it during a sprint planning meeting.

Instead, I want to offer what I believe is the main change that this task requires, built on top of the last released version of this package.

This commit adds a data load function that is devoid of the "partial matching" behavior originally reported as a problem. It follows the design the team recently outlined in our internal chat. The rest of the package is preserved as it is, so no extra code review is necessary.

If the new function is ever tested, reviewed and deemed appropriate, then someone should modify the package documentation to take it into account, bump version numbers, etc. But, as stated above, that someone won't be me unless formally assigned to the task.

…ention

…ist()

mingstat · 2024-11-25T09:37:50Z

As discussed internally, a new pull request will be created to introduce a load_files() function without making any changes to the legacy load_data() function.

mingstat added 19 commits October 17, 2024 23:20

Refactor code in the R/ folder

532aaf5

Update examples

ff5d202

Print the directory path and file names

432e835

Add an arugment to print file paths if requested

9ec6f69

Do not export get_base_dir()

bda40b4

Add get_nfs_path() function back

b155b1a

Keep the metadata format unchanged

f7601bf

Update README

6570461

Update package vignettes

488d679

Add sample data for docs and tests

1f256bc

Update Rd files

088feb5

Update tests

30619ab

Update test setup file

8298ec2

Update integration guide

9cb51ac

Make get_base_dir internal function

01de803

Export get_file_paths and load_data_files

b111f9a

Update DESCRIPTION file

a480124

Update changelog

8722c94

Update roxygen examples

8bcefd3

mingstat requested a review from a team as a code owner October 23, 2024 07:02

mingstat added 5 commits October 23, 2024 15:08

Fix lintr issues

5b719dd

Fix styler issues

62545e9

Try to identify issue from tests

2bbf397

Use pattern matching to find files when no file extension is provided

a02950e

Fix styler issues

51ba3ce

mingstat requested review from mattkorb and ml-ebs-ext October 23, 2024 12:14

mingstat added 2 commits November 1, 2024 15:41

Save demo data to a temp dir

6e4a18a

Remove pharmaverseadam data from package

89413a7

ml-ebs-ext reviewed Nov 3, 2024

View reviewed changes

mingstat added 7 commits November 6, 2024 16:19

Remove comments inside functions

4665a23

Remvoe get_base_dir() and keep get_nfs_path()

19ccc3d

Update function docs

d164c10

Update examples in README

912957f

Update example in vignettes

52d058b

Update R document files

c06a26a

Fix styler issues

b8d3bc3

mattkorb reviewed Nov 12, 2024

View reviewed changes

mingstat added 8 commits November 13, 2024 14:09

Check argument print_file_paths

abe10d6

Remove unused code

cf48a94

Updated changelog

6dc6e0e

Update Rd file

a14c500

Reuse create_data_list() and keep legacy code

350f4ed

Add data loading vignette

e91f221

Remove file extensions and check duplicated file names

96fa7de

Remove redundant code

b971bb9

Move create_data_list() logic to get_file_paths()

6f65e5c

mingstat added 3 commits November 20, 2024 14:04

Reassign names to match the original load_data() function naming conv…

2b9719f

…ention

Update error message to use get_file_paths() instead of create_data_l…

966780f

…ist()

Add support for named file paths

b6244f6

mingstat closed this Nov 25, 2024

mingstat deleted the rc/v2.1.0 branch November 25, 2024 09:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge changes in Rc/v2.1.0 to main branch #7

Merge changes in Rc/v2.1.0 to main branch #7

mingstat commented Oct 23, 2024

ml-ebs-ext Nov 3, 2024

mingstat Nov 13, 2024

ml-ebs-ext Nov 13, 2024 •

edited

Loading

ml-ebs-ext Nov 14, 2024

ml-ebs-ext Nov 3, 2024

mingstat Nov 6, 2024

ml-ebs-ext Nov 12, 2024

ml-ebs-ext Nov 3, 2024

mingstat Nov 6, 2024

ml-ebs-ext Nov 12, 2024

mingstat Nov 13, 2024

mattkorb Nov 12, 2024

mingstat Nov 13, 2024

ml-ebs-ext commented Nov 19, 2024

mingstat commented Nov 19, 2024

Context

Comments

Recommendation

ml-ebs-ext commented Nov 19, 2024

mingstat commented Nov 25, 2024

Merge changes in Rc/v2.1.0 to main branch #7

Merge changes in Rc/v2.1.0 to main branch #7

Conversation

mingstat commented Oct 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ml-ebs-ext Nov 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ml-ebs-ext commented Nov 19, 2024

Context

Comments

Recommendation

mingstat commented Nov 19, 2024

Context

Comments

Recommendation

ml-ebs-ext commented Nov 19, 2024

mingstat commented Nov 25, 2024

ml-ebs-ext Nov 13, 2024 •

edited

Loading