Organization of files inside the research project

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
khodosevichlab/dataorganizer

dataorganizer

Installation

devtools::install_github("khodosevichlab/dataorganizer")

Motivation

When accessing data from analysis notebooks or scripts, you can use either absolute paths or paths relative to the script location. Absolute paths require each user to change the corresponding part of the scripts. Relative paths require every user to have the same folder structure, which is usually acceptable. Trouble comes when you want to move a notebook to a different folder or copy part of the code into another vignette. This problem is also described in the "Stop the working directory insanity" proposal. This package provides (i) shorter aliases for the most frequently used paths and (ii) support for a different folder structure for each user.

Usage

Basic structure

The package provides basic functionality for accessing your data. By default, it assumes the following folder structure:

project
|- data_mapping.yml     # file with mapping of data folders (see below)
|
|- data/                # raw data, not changed once created. Examples: expression matrices, images for analysis.
|
|- metadata/            # data, relevant for the research, but not produced by it. It's much smaller than "data" 
|                       # and assumed to be stored in the git repo. Examples: list of gene markers or info about patients.
|
|- output/              # output of the project relevant by itself. Examples: cell annotation, paper figures.
|
|- cache/               # cache files, created to optimize long computations (mostly in .rds format).
...                     # the rest is assumed to be a normal R package and/or workflowr package

However, it is often undesirable to store all data inside the package folder. To let each user specify their own folder structure, dataorganizer reads the data_mapping.yml file, where the paths can be changed. Note: changing the paths to the "metadata" and "output" folders is strongly discouraged, as they are supposed to be part of the project.

Example:

folders:
  data: ~/data/Epilepsy/
  cache: ~/cache/Epilepsy

To access files in the folders one of the following functions should be used:

  • DataPath(path)
  • MetadataPath(path)
  • OutputPath(path)
  • CachePath(path)

Example:

DataPath("dir1", "file.mtx")

[1] "/home/user/data/Epilepsy/dir1/file.mtx"
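
As a rough illustration of what such a lookup might involve, here is a minimal base-R sketch. It is not the package's implementation: the `resolve_path` helper and the hard-coded `mapping` list (standing in for a parsed data_mapping.yml) are hypothetical, and the fallback to a subfolder of the project root for unmapped folders is an assumption based on the default structure described above.

```r
# Hypothetical mapping, as it might look after parsing data_mapping.yml
mapping <- list(folders = list(data = "~/data/Epilepsy/", cache = "~/cache/Epilepsy"))

# Hypothetical helper, not part of dataorganizer
resolve_path <- function(folder, ..., mapping, project_root = ".") {
  base <- mapping$folders[[folder]]
  if (is.null(base)) {
    # Assumption: folders missing from the mapping live inside the project root
    base <- file.path(project_root, folder)
  }
  # Expand "~" and drop trailing slashes so that joining does not produce "//"
  base <- sub("/+$", "", path.expand(base))
  file.path(base, ...)
}

data_file <- resolve_path("data", "dir1", "file.mtx", mapping = mapping)
meta_file <- resolve_path("metadata", "markers.csv", mapping = mapping)
```

In this sketch `data_file` points into the user-specific data folder, while `meta_file` stays inside the project because "metadata" has no entry in the mapping.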

Initialize project

To create the directories, run CreateFolders from inside your project package.
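
For readers who want to see what this initialization step amounts to, the following base-R sketch creates the four default folders. It is only an approximation under stated assumptions: `create_default_folders` is a hypothetical helper, the folder names are taken from the default structure above, and the sketch ignores any paths set in data_mapping.yml, which the real function may well take into account.

```r
# Hypothetical helper: creates the default folders under a project root
create_default_folders <- function(project_root = ".",
                                   folders = c("data", "metadata", "output", "cache")) {
  paths <- file.path(project_root, folders)
  for (p in paths) {
    # recursive = TRUE also creates missing parents; existing folders are left alone
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Run it in a temporary directory so that no real project is touched
demo_root <- file.path(tempdir(), "demo_project")
created <- create_default_folders(demo_root)
```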

Manual paths to datasets

Some long paths are often used more frequently than the rest. Let's say we focus on the expression matrix from a single patient, with the path "~/data/Epilepsy/patients/control/patient1/outs/count_matrix/cm.mtx". Typing this path each time is annoying. To avoid this, the path can be saved in data_mapping.yml:

folders:
  data: ~/data/Epilepsy/

datasets:
  c_p1_mtx: patients/control/patient1/outs/count_matrix/cm.mtx
  c_p2: patients/control/patient2/

Now we can call the DatasetPath function:

DatasetPath("c_p1_mtx")
##                                                                      c_p1_mtx
## "/home/user/data/Epilepsy/patients/control/patient1/outs/count_matrix/cm.mtx"

It also works for folder paths, with additional path components appended:

DatasetPath("c_p2", "outs/alignment.bam")
##                                                                     c_p2
## "/home/user/data/Epilepsy/patients/control/patient2//outs/alignment.bam"
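
The double slash in the output above comes from the trailing slash in the `c_p2` entry. POSIX systems treat repeated slashes inside a path as a single one, so the path still resolves, but it can be avoided by stripping trailing slashes before joining. The sketch below shows one way an alias lookup could do this. It is not the package's DatasetPath: `resolve_dataset` and the hard-coded `mapping` list are hypothetical, and the sketch assumes dataset entries are relative to the "data" folder, as the outputs above suggest.

```r
# Hypothetical mapping, as it might look after parsing data_mapping.yml
mapping <- list(
  folders = list(data = "~/data/Epilepsy/"),
  datasets = list(
    c_p1_mtx = "patients/control/patient1/outs/count_matrix/cm.mtx",
    c_p2 = "patients/control/patient2/"
  )
)

# Hypothetical helper, not part of dataorganizer
resolve_dataset <- function(alias, ..., mapping) {
  rel <- mapping$datasets[[alias]]
  if (is.null(rel)) stop("Unknown dataset alias: ", alias)
  # Strip trailing slashes from both parts before joining to avoid "//"
  parts <- sub("/+$", "", c(path.expand(mapping$folders$data), rel))
  file.path(parts[1], parts[2], ...)
}

bam <- resolve_dataset("c_p2", "outs/alignment.bam", mapping = mapping)
```

Failing loudly on an unknown alias is a design choice of this sketch; it surfaces typos in alias names early instead of silently building a wrong path.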
