controls | progress | enableMenu | enableChalkboard | enableTitleFooter | enableSearch | transition | theme | customTheme |
---|---|---|---|---|---|---|---|---|
false | false | false | false | false | false | slide | night | custom |
The homework assignment is available today (October 6) and is due October 13 at 3:20pm.
- Identify minimum requirements for a reproducible computational project
- Apply good practices for file organization
- Use `tidy` principles for tabular (spreadsheet-style) data
- Elements of reproducibility
- File organization
- Tidy data
This class requires Microsoft Excel or LibreOffice Calc (for opening `.xlsx` or `.csv` files).
- Annotated data from experiments or simulations
- Documented code for data analyses
- Defined software environments
- Standardized organization of above 3 elements
- Code
- Data*
- Lab notebook
- Presentations
- Manuscripts
- Grants & fellowships
- Discussion
- Raw data: can be very large; store in cloud storage (e.g. AWS S3) or in public repositories (e.g. Zenodo, SRA)
- Intermediate data: can be large or small, store in temporary scratch space
- Tables underlying figures or samples: small, store in GitHub
Raw data without annotations cannot be analyzed
data/sample1.fastq
data/sample2.fastq
data/sample3.fastq
data/sample_annotations.tsv
sample1.fastq
@SRR21277963.1
GGAGTAACAGAAGTGAGAACCAGCTTATCAGAAAAAAAGTTTGAATTATG
+SRR21277963.1
AAGAGGGGAGGGAGGGGIAGGGGGGA.GGGGAGGGGIGIGGII<A<GGAA
sample_annotations.tsv
sample | srr_id | sample_id | sample_name |
---|---|---|---|
sample1 | SRR21277963 | 104p8 | dicodon_facs_off_low_2 |
sample2 | SRR21277964 | 104p7 | dicodon_facs_off_high_2 |
sample3 | SRR21277965 | 104p6 | dicodon_facs_on_low_2 |
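To make the point concrete, here is a minimal Python sketch of how the annotation table gives meaning to the raw files. It assumes the layout shown above: a tab-delimited `sample_annotations.tsv` with a `sample` column, and one `<sample>.fastq` file per row in the same folder. The function name `load_annotations` is hypothetical and not part of any course code.

```python
import csv
from pathlib import Path


def load_annotations(tsv_path):
    """Map each sample to its annotation row and its expected FASTQ path."""
    tsv_path = Path(tsv_path)
    samples = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Assumption: the FASTQ file sits next to the annotation table
            # and is named after the value in the `sample` column.
            row["fastq"] = tsv_path.parent / f"{row['sample']}.fastq"
            samples[row["sample"]] = row
    return samples
```

With the table shown above, `load_annotations("data/sample_annotations.tsv")["sample1"]["sample_name"]` would give `dicodon_facs_off_low_2`, which is information the FASTQ file alone does not carry.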
- Version control (file history, track changes)
- Collaboration (branches, merging, issue comments, discussion)
- Project management tools (project board, issues, milestones, labels)
- Link to Slack channel for notifications
Example GitHub repositories:
https://github.com/rasilab/ribosome_collisions_yeast (public)
https://github.com/rasilab/bottorff_2022 (public)
https://github.com/rasilab/micropeptide_immunity (private)
project_name
|-- analysis/
|-- experiments/
|-- grants/
|-- presentations/
|-- manuscripts/
|-- .devcontainer/
|-- .gitignore
|-- README.md
- Use `README.md` to give an overview of the project and file organization
- Use `.gitignore` to ignore files that should not be tracked by git
- Use `.devcontainer/` to define the software environment for analysis (specific to VSCode), also called `.install/`
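The top-level layout above can be created by hand, but a short script keeps it identical across projects. The following is a sketch only: the folder and file names come from the layout above, while the function name `scaffold_project` is a hypothetical helper.

```python
from pathlib import Path

# Top-level folders and files taken from the project layout above.
FOLDERS = [
    "analysis",
    "experiments",
    "grants",
    "presentations",
    "manuscripts",
    ".devcontainer",
]
FILES = [".gitignore", "README.md"]


def scaffold_project(root):
    """Create the empty project skeleton under `root` without overwriting anything."""
    root = Path(root)
    for folder in FOLDERS:
        # exist_ok=True makes the script safe to re-run on an existing project.
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates an empty file and leaves existing content untouched.
        (root / name).touch(exist_ok=True)
    return root
```

Note that git does not track empty directories, so each folder only appears in the repository once it contains at least one file.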
analysis
|--USER
   |--ANALYSIS_TYPE (e.g. riboseq)
      |--YYYY-MM-DD_short_desc
         |--README.md
         |--data
            |--gencode
               |--gencode.v26.gtf.gz
            |--fastq
               |--SRRnnnnnn.fastq
         |--scripts
            |--analyze_riboseq.ipynb
            |--download_from_sra.ipynb
            |--run_analysis_pipeline.smk
         |--annotations
            |--sample_annotations.csv
         |--tables
            |--summary_table_1.csv
            |--summary_table_2.csv
         |--figures
            |--summary_figure_1.pdf
            |--summary_figure_2.pdf
- `.gz` and `.fastq` files are usually in `.gitignore`
- `README.md` should give an overview of the analysis, data source etc. and how to reproduce it
- Ideally, every data file should be downloaded programmatically from permalinks
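One way to follow the last point is a small download helper that a script or notebook calls before the analysis starts. This is a minimal sketch using only the Python standard library; the function name `download_if_missing` is hypothetical, and the permalink itself has to come from the data provider.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest):
    """Download `url` to `dest` unless the file is already present."""
    dest = Path(dest)
    if dest.exists():
        return dest  # keep the existing copy; do not download again
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so that an interrupted download
    # never leaves a truncated file at the final path.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest
```

A call such as `download_if_missing(PERMALINK, "data/gencode/gencode.v26.gtf.gz")`, where `PERMALINK` stands in for the real permanent URL, records in code where the file came from, so the large file itself can stay in `.gitignore`.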
- Project repo: Short, descriptive, understandable
- File names
  - No caps, no spaces, no special characters other than `_` and `-`
  - Date format: `YYYY-MM-DD`
  - No version numbers or names such as `rasi_v20` (GitHub does this automatically)
- Experiment labels `exp001`, `exp002` etc.
  - Use in filenames, sample annotations, issues
- Sample labels `16p1`, `16p2` ... `20t1`, `20t2` etc.
  - Include experiment number (16, 20), type of sample (p, t), and sample number (1, 2)
  - Use on Eppendorf tubes, lab notebook etc.
  - Create a table of sample annotations in your lab notebook record
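Naming rules like these are easy to check automatically. The sketch below expresses the file-name rule and the sample-label scheme as regular expressions. The function names are hypothetical, and allowing `.` for file extensions is an assumption, since the rules above do not mention it.

```python
import re

# Lowercase letters, digits, `_` and `-`. The `.` is additionally allowed
# for file extensions (an assumption; the rules above do not mention it).
VALID_NAME = re.compile(r"[a-z0-9_.-]+")

# Sample label = experiment number + sample type letter + sample number.
SAMPLE_LABEL = re.compile(r"(?P<experiment>\d+)(?P<type>[a-z])(?P<number>\d+)")


def is_valid_name(name):
    """Return True if `name` has no caps, spaces, or other special characters."""
    return VALID_NAME.fullmatch(name) is not None


def parse_sample_label(label):
    """Split a label such as '16p1' into experiment, sample type, and sample number."""
    match = SAMPLE_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a valid sample label: {label!r}")
    return int(match["experiment"]), match["type"], int(match["number"])
```

Here `parse_sample_label("16p1")` returns `(16, "p", 1)`, and `is_valid_name("My File.csv")` returns `False`. The pattern does not catch version suffixes such as `rasi_v20`; that rule still needs a human eye.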
A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
- Examples from Park & Subramaniam 2019: data, annotations
- Example from Table 2 in Bedford et al. 2014, available as an Excel table in the course repo
- Follow same naming principles for columns as for files: No caps, no spaces, no special characters, use only `_` and `-` if necessary.
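The tidy-data rules above (each variable a column, each observation a row) can be sketched in a few lines of Python. The example reshapes a small wide table, with one column per sample, into tidy form. All names and values are invented for illustration and are not taken from the cited datasets; `wide_to_tidy` is a hypothetical helper.

```python
def wide_to_tidy(rows, id_column, variable_name, value_name):
    """Turn one row per id with many value columns into one row per observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is repeated on every tidy row, not melted
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


# Hypothetical wide table: one row per gene, one column per sample.
wide = [
    {"gene": "gene_a", "16p1": 10, "16p2": 12},
    {"gene": "gene_b", "16p1": 3, "16p2": 5},
]

tidy = wide_to_tidy(wide, id_column="gene", variable_name="sample", value_name="read_count")
for row in tidy:
    print(row)
```

The two wide rows become four tidy rows, each holding one gene, one sample, and one read count. In pandas, `DataFrame.melt` with `id_vars`, `var_name`, and `value_name` performs the same reshaping.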
Saving data as plain text files is necessary to process this data with either R or Python. You can export from Excel to `.tsv` (tab-delimited, preferred format) or `.csv` (comma-delimited). A few things to note when exporting data files in these formats:
- Beware that line endings differ between Windows and Unix (including Mac), though the text editors we recommend for this class can deal with this
- Exporting from Excel only works for the currently displayed spreadsheet. If you have multiple sheets, you'll need to export multiple times.
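On the Python side, the line-ending difference is handled by the standard `csv` module as long as the file is opened with `newline=""`. The sketch below reads an exported `.tsv` file; the function name `read_tsv` is hypothetical, and UTF-8 encoding is an assumption about how the file was exported.

```python
import csv


def read_tsv(path):
    """Read a tab-delimited file into a list of dicts, one per row."""
    # newline="" lets the csv module interpret both Windows (\r\n) and
    # Unix (\n) line endings itself, as the csv documentation recommends.
    # Assumption: the file was exported with UTF-8 encoding.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

The same function returns identical rows whether the file was saved on Windows or on a Unix-like system.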
Split into small groups of 3-4 people to work from an HI (haemagglutination-inhibition) table and convert to tidy data. Data available as an Excel table in the course repo.