diff --git a/404.html b/404.html index 79ae7c5..ca71f5e 100644 --- a/404.html +++ b/404.html @@ -16,7 +16,7 @@

404

PAGE NOT FOUND

But if you don't change your direction, and if you keep looking, you may end up where you are heading.

Released under the MIT License.

- + \ No newline at end of file diff --git a/README.html b/README.html index 505c006..e3713c8 100644 --- a/README.html +++ b/README.html @@ -12,13 +12,13 @@ - + -

Bench to Byte: A biologist's guide to crafting quality code

This guide is still under active development.

I made this guide for biologists with moderate programming experience who want to improve the way they work. You'll learn how to organize your projects, use remote resources, create reproducible software environments, and build computational pipelines.

This guide was made with members of the Bloom lab in mind. In the future, I might expand this resource to apply to a more general audience.

Developing

To develop Bench to Byte, you'll need to set up a few things. Specifically, you'll need two pieces of software: Node.js and npm. Node.js is an environment that allows you to run JavaScript code on your computer. npm is a package manager that handles the JavaScript libraries necessary to create this website.

The instructions for installing Node and npm depend on your operating system and personal preference. For detailed instructions check out this website.

However, if you have conda installed, you can use conda to install node and npm using the environment.yml file in the repo.

bash
conda env create -f environment.yml
+    

Bench to Byte: A biologist's guide to crafting quality code

I made this guide for biologists with moderate programming experience who want to improve the way they work. You'll learn how to organize your projects, use remote resources, create reproducible software environments, and build computational pipelines.

This guide was made with members of the Bloom lab in mind. In the future, I might expand this resource to apply to a more general audience.

Developing

To develop Bench to Byte, you'll need to set up a few things. Specifically, you'll need two pieces of software: Node.js and npm. Node.js is an environment that allows you to run JavaScript code on your computer. npm is a package manager that handles the JavaScript libraries necessary to create this website.

The instructions for installing Node and npm depend on your operating system and personal preference. For detailed instructions check out this website.

However, if you have conda installed, you can use conda to install node and npm using the environment.yml file in the repo.

bash
conda env create -f environment.yml
 conda activate bench-to-byte

Once Node and npm are installed, verify by running:

bash
node -v
 npm -v

You should see the node and npm versions that you installed on your system.

If this is the first time you're developing bench-to-byte, you'll need to install the packages in package.json. To do this, run the following command from within the repository:

bash
npm install

With everything installed, you can now boot up a live preview of the website that automatically reflects your changes upon saving.

bash
npm run docs:dev

Now, there will be a local version of the website running at http://localhost:5173. Visit this URL in your browser to see a preview of the site.

Note: if you are developing bench-to-byte from a remote server (not your local machine), use

bash
npm run remote:docs:dev

and access the preview at the URL provided under Network.

Contributing

Bench to Byte is based on the static site generator VitePress. This allows you to create fully functional web pages using Markdown files. To add a new page to the site, you simply create a markdown document in the correct directory and make a pull request.

For more details on writing markdown for a VitePress website, see the markdown reference here.

Project Structure

bash
.
 ├── README.md
@@ -49,7 +49,7 @@
 ├── package.json
 └── public
     └── logo.png

To add details to a specific chapter of the course, simply edit the index.md file under the corresponding chapters/<chapter-you-are-editing>/ directory. If you want to include images in your markdown, you can either use a link to a remotely hosted image or add the image to the public/ directory and reference it like so:

bash
![Alt Text](/my-image.png)

If you've contributed in some way to this guide, add your name and a link to a photo in public/contributors.json.

Suggestions

If you have any suggestions for topics that aren't covered in this guide, please suggest them in the corresponding Discussions thread.


- + \ No newline at end of file diff --git a/assets/README.md.hhc_Bc4e.js b/assets/README.md.WOiBft4A.js similarity index 84% rename from assets/README.md.hhc_Bc4e.js rename to assets/README.md.WOiBft4A.js index f70419d..ad65a58 100644 --- a/assets/README.md.hhc_Bc4e.js +++ b/assets/README.md.WOiBft4A.js @@ -1,4 +1,4 @@ -import{_ as s,c as i,o as e,V as a}from"./chunks/framework.Cw5Ferw_.js";const F=JSON.parse(`{"title":"Bench to Byte: A biologist's guide to crafting quality code","description":"","frontmatter":{},"headers":[],"relativePath":"README.md","filePath":"README.md"}`),n={name:"README.md"},t=a(`

Bench to Byte: A biologist's guide to crafting quality code

This guide is still under active development.

I made this guide for biologists with moderate programming experience who want to improve the way they work. You'll learn how to organize your projects, use remote resources, create reproducible software environments, and build computational pipelines.

This guide was made with members of the Bloom lab in mind. In the future, I might expand this resource to apply to a more general audience.

Developing

To develop Bench to Byte, you'll need to set up a few things. Specifically, you'll need two pieces of software: Node.js and npm. Node.js is an environment that allows you to run JavaScript code on your computer. npm is a package manager that handles the JavaScript libraries necessary to create this website.

The instructions for installing Node and npm depend on your operating system and personal preference. For detailed instructions check out this website.

However, if you have conda installed, you can use conda to install node and npm using the environment.yml file in the repo.

bash
conda env create -f environment.yml
+import{_ as s,c as i,o as a,V as e}from"./chunks/framework.Cw5Ferw_.js";const F=JSON.parse(`{"title":"Bench to Byte: A biologist's guide to crafting quality code","description":"","frontmatter":{},"headers":[],"relativePath":"README.md","filePath":"README.md"}`),n={name:"README.md"},t=e(`

Bench to Byte: A biologist's guide to crafting quality code

I made this guide for biologists with moderate programming experience who want to improve the way they work. You'll learn how to organize your projects, use remote resources, create reproducible software environments, and build computational pipelines.

This guide was made with members of the Bloom lab in mind. In the future, I might expand this resource to apply to a more general audience.

Developing

To develop Bench to Byte, you'll need to set up a few things. Specifically, you'll need two pieces of software: Node.js and npm. Node.js is an environment that allows you to run JavaScript code on your computer. npm is a package manager that handles the JavaScript libraries necessary to create this website.

The instructions for installing Node and npm depend on your operating system and personal preference. For detailed instructions check out this website.

However, if you have conda installed, you can use conda to install node and npm using the environment.yml file in the repo.

bash
conda env create -f environment.yml
 conda activate bench-to-byte

Once Node and npm are installed, verify by running:

bash
node -v
 npm -v

You should see the node and npm versions that you installed on your system.

If this is the first time you're developing bench-to-byte, you'll need to install the packages in package.json. To do this, run the following command from within the repository:

bash
npm install

With everything installed, you can now boot up a live preview of the website that automatically reflects your changes upon saving.

bash
npm run docs:dev

Now, there will be a local version of the website running at http://localhost:5173. Visit this URL in your browser to see a preview of the site.

Note: if you are developing bench-to-byte from a remote server (not your local machine), use

bash
npm run remote:docs:dev

and access the preview at the URL provided under Network.

Contributing

Bench to Byte is based on the static site generator VitePress. This allows you to create fully functional web pages using Markdown files. To add a new page to the site, you simply create a markdown document in the correct directory and make a pull request.

For more details on writing markdown for a VitePress website, see the markdown reference here.

Project Structure

bash
.
 ├── README.md
@@ -28,4 +28,4 @@ import{_ as s,c as i,o as e,V as a}from"./chunks/framework.Cw5Ferw_.js";const F=
 ├── package-lock.json
 ├── package.json
 └── public
-    └── logo.png

To add details to a specific chapter of the course, simply edit the index.md file under the corresponding chapters/<chapter-you-are-editing>/ directory. If you want to include images in your markdown, you can either use a link to a remotely hosted image or add the image to the public/ directory and reference it like so:

bash
![Alt Text](/my-image.png)

If you've contributed in some way to this guide, add your name and a link to a photo in public/contributors.json.

Suggestions

If you have any suggestions for topics that aren't covered in this guide, please suggest them in the corresponding Discussions thread.

`,30),l=[t];function p(o,h,r,d,c,k){return e(),i("div",null,l)}const u=s(n,[["render",p]]);export{F as __pageData,u as default}; + └── logo.png

To add details to a specific chapter of the course, simply edit the index.md file under the corresponding chapters/<chapter-you-are-editing>/ directory. If you want to include images in your markdown, you can either use a link to a remotely hosted image or add the image to the public/ directory and reference it like so:

bash
![Alt Text](/my-image.png)

If you've contributed in some way to this guide, add your name and a link to a photo in public/contributors.json.

Suggestions

If you have any suggestions for topics that aren't covered in this guide, please suggest them in the corresponding Discussions thread.

`,29),l=[t];function p(o,h,r,d,c,k){return a(),i("div",null,l)}const u=s(n,[["render",p]]);export{F as __pageData,u as default}; diff --git a/assets/README.md.hhc_Bc4e.lean.js b/assets/README.md.WOiBft4A.lean.js similarity index 54% rename from assets/README.md.hhc_Bc4e.lean.js rename to assets/README.md.WOiBft4A.lean.js index a66e2c6..b12da85 100644 --- a/assets/README.md.hhc_Bc4e.lean.js +++ b/assets/README.md.WOiBft4A.lean.js @@ -1 +1 @@ -import{_ as s,c as i,o as e,V as a}from"./chunks/framework.Cw5Ferw_.js";const F=JSON.parse(`{"title":"Bench to Byte: A biologist's guide to crafting quality code","description":"","frontmatter":{},"headers":[],"relativePath":"README.md","filePath":"README.md"}`),n={name:"README.md"},t=a("",30),l=[t];function p(o,h,r,d,c,k){return e(),i("div",null,l)}const u=s(n,[["render",p]]);export{F as __pageData,u as default}; +import{_ as s,c as i,o as a,V as e}from"./chunks/framework.Cw5Ferw_.js";const F=JSON.parse(`{"title":"Bench to Byte: A biologist's guide to crafting quality code","description":"","frontmatter":{},"headers":[],"relativePath":"README.md","filePath":"README.md"}`),n={name:"README.md"},t=e("",29),l=[t];function p(o,h,r,d,c,k){return a(),i("div",null,l)}const u=s(n,[["render",p]]);export{F as __pageData,u as default}; diff --git a/hashmap.json b/hashmap.json index 18673d7..bd54610 100644 --- a/hashmap.json +++ b/hashmap.json @@ -1 +1 @@ -{"readme.md":"hhc_Bc4e","sections_coding-best-practices_index.md":"Dta9vFSp","sections_chatgpt-and-llms_index.md":"r93Rcp6X","sections_introduction_index.md":"CQpxuAdz","index.md":"CklGwWbF","sections_working-collaboratively_index.md":"C5p8aCvH","sections_managing-software-environments_index.md":"BYauHb59","sections_setting-up-your-ide_index.md":"gmvqJbs5","sections_organizing-your-projects_index.md":"Rc_x_RJO","sections_creating-workflows-and-pipelines_index.md":"CKk2bu3I","sections_tracking-your-code_index.md":"CAO3av_f","sections_using-remote-resources_index.md":"C5lMMmr2"} 
+{"sections_creating-workflows-and-pipelines_index.md":"CKk2bu3I","index.md":"CklGwWbF","sections_coding-best-practices_index.md":"Dta9vFSp","sections_organizing-your-projects_index.md":"Rc_x_RJO","sections_chatgpt-and-llms_index.md":"r93Rcp6X","sections_setting-up-your-ide_index.md":"gmvqJbs5","sections_managing-software-environments_index.md":"BYauHb59","sections_working-collaboratively_index.md":"C5p8aCvH","sections_tracking-your-code_index.md":"CAO3av_f","sections_using-remote-resources_index.md":"C5lMMmr2","readme.md":"WOiBft4A","sections_introduction_index.md":"CQpxuAdz"} diff --git a/index.html b/index.html index abe389c..bdb843d 100644 --- a/index.html +++ b/index.html @@ -19,7 +19,7 @@

Bench to Byte

A biologist's guide to crafting quality code

Bench to Byte

Contributors


- + \ No newline at end of file diff --git a/sections/chatgpt-and-llms/index.html b/sections/chatgpt-and-llms/index.html index 33b81aa..6f0c664 100644 --- a/sections/chatgpt-and-llms/index.html +++ b/sections/chatgpt-and-llms/index.html @@ -19,7 +19,7 @@

Section 0: Coding Smarter with Online Tools

I've spent a lot of time helping people troubleshoot programming issues, do data analysis, and get set up computationally. If you interviewed people about this experience, a common theme would emerge: I spend a lot of time asking the internet for help. Whether it's a forum like Stack Overflow or an AI chatbot like ChatGPT, you can figure out almost anything with the internet.

When it comes to programming, my philosophy is that it's much more important to grasp a concept than commit specifics to memory. If you understand the concept behind what you're trying to accomplish, the internet will always be able to fill in the specifics. That's why this is the first section.

While you're following this guide, you're bound to run into the occasional issue or topic that's lacking detail. The internet will help, but how do you decide which resources are appropriate for your questions? Below, I'll address the main online resources that I use.

Note

Almost no resource is free of factual inaccuracies: LLMs 'hallucinate', people writing posts on Stack Overflow might have no idea what they're talking about, and even programmers writing documentation make mistakes.

I have a specific goal and lack any background knowledge

If you know vaguely what you want to accomplish but nothing about how to accomplish it, I'd recommend googling a recent tutorial. For example, if you're trying to figure out how to host a website with GitHub Pages, read their current documentation or watch a video tutorial rather than going to a forum or using an LLM. This is for two reasons: (1) Forums and LLMs will not always be up-to-date, and (2) they make mistakes that you won't be able to catch without a little background know-how.

I have a specific goal and have a little background knowledge

In this case, LLMs are almost always the best approach. LLM stands for Large Language Model. These are machine learning models that are trained on vast amounts of text, usually scraped from the internet, to understand, generate, and predict new text. They learn patterns, enabling them to perform a variety of tasks including proof-reading, generating, and answering questions about code.

Because LLMs are trained on data scraped from the internet, they tend to be better at answering questions with lots of examples to train on. For this reason, LLMs are great at answering common coding questions and writing code in popular languages like Python and JavaScript. However, if your question is about something obscure, like a tool that's only used by the Bloom Lab, you're unlikely to get a useful answer.

Also, LLMs can 'hallucinate' and respond confidently with factually inaccurate answers. For this reason, I only use LLMs when I feel that I can identify these mistakes or I have a way of validating the response.

Helpful Resources

There are several different LLMs that people use for coding. Personally, I like OpenAI's ChatGPT. It's a good generalist model that can answer coding questions as well as perform other tasks. I'd recommend the paid model; at the time of writing, it makes a significant difference.

Another popular LLM that theoretically performs coding tasks better than ChatGPT is Anthropic's Claude. At the time of writing, it's well liked by several members of the lab.

Additionally, there are coding-specific models like GitHub's Copilot. Copilot is nice because it's integrated directly into the development environment that you write code in, giving you suggestions as you go. I use it, but mostly as intelligent autocomplete rather than a resource for asking questions. It's nice, but not as valuable as ChatGPT or Claude.


- + \ No newline at end of file diff --git a/sections/coding-best-practices/index.html b/sections/coding-best-practices/index.html index 51553af..9a86e8c 100644 --- a/sections/coding-best-practices/index.html +++ b/sections/coding-best-practices/index.html @@ -72,7 +72,7 @@ raise ValueError(f"{corr_range[0]} > {corr_df[corr_col].min()=}") if corr_range[1] < corr_df[corr_col].max(): raise ValueError(f"{corr_range[1]} < {corr_df[corr_col].max()=}")

Notice two things: (1) the function has a 'Docstring' that describes the function's purpose, inputs, and outputs. (2) The code is commented judiciously, and uses comments to explain aspects of the code that aren't self-explanatory.

Modularizing code

Breaking your code into smaller, reusable pieces makes it easier to manage and understand. Those pieces can be functions, classes, or modules depending on the conventions of the programming language. Generally, you don't want to have repetitive code. If you're copying and pasting code from one part of your script into another, that's a sign that you should be modularizing your code with functions or classes.
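As a small illustration of this idea, here is a sketch in Python. The function name `gc_content` and the example sequences are made up for this example and are not taken from the guide; the point is only that the counting logic lives in one place instead of being pasted wherever it's needed.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


# Instead of copying the counting logic for every sequence,
# call the same function wherever it is needed.
sequences = {"seq_a": "ATGCGC", "seq_b": "ATATAT"}
gc_by_name = {name: gc_content(seq) for name, seq in sequences.items()}
print(gc_by_name)
```

If the definition of GC content ever needs to change (say, to ignore ambiguous bases), there is only one function to update.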

Logging, Errors, and Warnings

It's important to use logs, warnings, and errors to monitor what's going on with your code. You should have expectations about how your code is supposed to work. These features help ensure that those expectations are met. They prevent silent bugs and help make your code user-friendly.
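Here is a minimal sketch of how the three fit together in Python, using the standard `logging` and `warnings` modules. The function `mean_read_depth` and its `min_depth` threshold are hypothetical, chosen only to show an error for impossible input, a warning for suspicious input, and a log message for normal progress.

```python
import logging
import warnings

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def mean_read_depth(depths: list[int], min_depth: int = 10) -> float:
    """Return the mean of per-site read depths.

    Raises ValueError on impossible input and warns on suspicious input.
    """
    # Errors: stop immediately when an expectation is violated.
    if not depths:
        raise ValueError("depths is empty; expected at least one site")
    if any(depth < 0 for depth in depths):
        raise ValueError("read depths cannot be negative")

    mean_depth = sum(depths) / len(depths)

    # Logs: record what happened so the run can be inspected later.
    logger.info("Computed mean depth %.1f across %d sites", mean_depth, len(depths))

    # Warnings: the result is valid, but the user should take a look.
    if mean_depth < min_depth:
        warnings.warn(f"mean depth {mean_depth:.1f} is below {min_depth}")
    return mean_depth


print(mean_read_depth([30, 42, 18]))
```

The design choice is about severity: raise an error when continuing would produce a wrong answer, and warn when the answer is valid but unexpected.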

Notebooks vs Scripts

Programming languages that are commonly used for data analysis, like Python and R, have interactive environments called 'Notebooks' that combine code execution, text, and visualizations in a single document. Notebooks are awesome tools for data analysis, but sometimes scripts are better suited to the task. How do you decide whether to use a Notebook or a script?

When to use a Notebook (e.g. Jupyter Notebooks):

Notebooks are awesome because they combine code, visualizations, and documentation in a single location. You can also run code interactively. I typically use Notebooks for:

  • Prototyping code: It's hard to beat the convenience of running code interactively.
  • Documenting an analysis: If I'm working on an analysis that I want to document for other people, it's nice to have code, writing, and plots in a single location.

When to use a script:

Scripts are a great way of modularizing code that you're going to run often. I typically use scripts over Notebooks in the following situations:

  • Frequently used code: Scripts are smaller, faster, and simpler to run than Notebooks. I use scripts to modularize code that functions as a tool I plan to run frequently.
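For a sense of what such a tool can look like, here is a sketch of a small command-line script in Python. The task (reverse-complementing DNA sequences) and all names in it are illustrative assumptions, not something from the Bloom lab's code. It uses the standard `argparse` module and falls back to a demo sequence when no arguments are given.

```python
import argparse

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Reverse-complement DNA sequences.")
    parser.add_argument(
        "sequences",
        nargs="*",
        default=["ATGC"],  # demo input used when no sequence is given
        help="one or more DNA sequences",
    )
    args = parser.parse_args(argv)
    for sequence in args.sequences:
        print(reverse_complement(sequence))


# Only run the command-line interface when executed as a script,
# so the function can still be imported from a Notebook or another script.
if __name__ == "__main__":
    main()
```

Saved under a hypothetical filename such as `revcomp.py`, it could be run as `python revcomp.py ATGC GGCC`, and `reverse_complement` could still be imported elsewhere.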


- + \ No newline at end of file diff --git a/sections/creating-workflows-and-pipelines/index.html b/sections/creating-workflows-and-pipelines/index.html index 5cb1c5d..f1cecb8 100644 --- a/sections/creating-workflows-and-pipelines/index.html +++ b/sections/creating-workflows-and-pipelines/index.html @@ -19,7 +19,7 @@

Section 7: Creating reproducible workflows and pipelines

In biology, we often string together bits of code, scripts, and tools to complete a single analysis. Each step runs sequentially to produce some desired output. Instead of remembering the order and manually running each step, you should take advantage of programming tools to codify these steps in a pipeline. In this section, I'll describe what a pipeline is, when you should write one, and what tools are available to make it easier.

What's a pipeline?

Pipelines group together different computational steps one after the other. People often accomplish this with bash scripts or by manually running commands in sequence. But there's a better way. You can use a workflow language. These languages are designed to handle complex workflows, manage dependencies, and automate the execution of tasks. They're smart because they can:

  • Automatically detect dependencies: Determine which steps depend on others.
  • Optimize execution: Run independent tasks in parallel.
  • Resume from failures: Restart workflows from the point of failure without redoing completed tasks.
  • Ensure reproducibility: Keep a record of all steps for consistent results.

Imagine being able to run a large complex analysis with a single command.
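To make the "resume from failures" idea a little more concrete, here is a simplified sketch in Python of one way a tool can decide whether a step needs to run again: compare file modification times, in the spirit of Make. This is an assumption-laden illustration, not the implementation of any particular workflow language, and the function name `needs_rerun` and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_rerun(inputs: list[Path], output: Path) -> bool:
    """Return True if `output` is missing or older than any of its inputs."""
    if not output.exists():
        return True
    output_time = output.stat().st_mtime
    return any(path.stat().st_mtime > output_time for path in inputs)


# Demonstrate the three situations in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "reads.txt"
    result = Path(tmp) / "counts.txt"
    raw.write_text("ACGT\n")

    missing_output = needs_rerun([raw], result)  # output not created yet

    result.write_text("4\n")
    os.utime(raw, (1_000, 1_000))      # pretend the input is old
    os.utime(result, (2_000, 2_000))   # and the output is newer
    up_to_date = needs_rerun([raw], result)

    os.utime(raw, (3_000, 3_000))      # input modified after the output
    stale_output = needs_rerun([raw], result)

print(missing_output, up_to_date, stale_output)
```

A step whose output is up to date can be skipped, which is what lets a workflow pick up where it left off.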

When should you write an analysis as a pipeline?

If your analysis is more complex than a single script or command, a pipeline can significantly improve efficiency and reliability. This is especially true when certain steps depend on the outputs of previous ones. You should use pipelines to manage these relationships efficiently.

What tools are available for writing pipelines?

There are several tools designed to help you write pipelines, including Snakemake, Make, Nextflow, and CWL. Each tool has its strengths, but Snakemake and Nextflow are the most commonly used in bioinformatics.

Snakemake

We primarily use Snakemake in the Bloom lab. Snakemake is a workflow management system that uses a Python-based language to define rules, inputs, outputs, and commands. In Snakemake, you define a series of rules in a file called a Snakefile. Each rule specifies:

  • Targets (Outputs): The files or results you want to produce.
  • Dependencies (Inputs): The files required to produce the targets.
  • Actions (Shell Commands or Scripts): The commands to execute.

Snakemake automatically builds a workflow based on these rules, figuring out the order of execution by analyzing the dependencies. The best way to learn Snakemake is by following the tutorial in its documentation.
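The idea of deriving an execution order from dependencies can be sketched in plain Python with the standard-library `graphlib` module. This is not Snakemake syntax and not how Snakemake's scheduler is implemented; the rule table and file names below are hypothetical and only illustrate the principle.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: each output file maps to the input files it needs.
rules = {
    "aligned.bam": ["reads.fastq", "reference.fasta"],
    "variants.vcf": ["aligned.bam", "reference.fasta"],
    "report.html": ["variants.vcf"],
}


def execution_order(rules: dict[str, list[str]]) -> list[str]:
    """Return the outputs in an order that respects their dependencies."""
    sorter = TopologicalSorter(rules)  # maps each node to its predecessors
    # static_order() also yields raw inputs (e.g. reads.fastq), which no
    # rule produces, so keep only the files that a rule has to create.
    return [target for target in sorter.static_order() if target in rules]


order = execution_order(rules)
print(order)
```

Nothing in the rule table states the order explicitly; it falls out of which files each step consumes and produces.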


- + \ No newline at end of file diff --git a/sections/introduction/index.html b/sections/introduction/index.html index a536543..2afac80 100644 --- a/sections/introduction/index.html +++ b/sections/introduction/index.html @@ -19,7 +19,7 @@

Introduction

I made this guide for biologists with moderate programming experience who want to improve the way they work. This guide is not intended to teach you a particular coding language (here's a quick aside about that). Instead, you'll learn how to organize your projects, use remote resources, create reproducible software environments, and build computational pipelines. My goal is that you not only learn the how but also the why behind each practice.

A Quick Note

This guide was made with members of the Bloom lab in mind. In the future, I might expand this resource to apply to a more general audience.

Outline

Although each section is useful in isolation, you'll get the most out of this guide if you follow it in order.

  1. Coding Smarter with Online Tools

    Taking advantage of online resources

  2. Setting up your IDE

    Using VSCode to streamline your coding workflow

  3. Organizing your projects

    Creating clearly named, well organized projects

  4. Using remote resources

    Increasing your computational capabilities with remote resources

  5. Managing software environments

    Using conda to isolate software environments

  6. Tracking your code

    Versioning your code with git

  7. Working collaboratively

    Collaborating with GitHub

  8. Reproducible workflows and pipelines

    Building reproducible pipelines with Snakemake

  9. Coding best practices

    Writing clear code

Learning to Program

Some basic programming knowledge is necessary to get the most out of this guide. Below, I'll give a quick overview of the programming languages that are widely used by biologists so you can pick the language that best suits your needs.

Generally, programming languages are either compiled (you write a script and an intermediate program called a compiler transforms it into an executable program) or interpreted (you write a script that can be executed at any point without a compilation step). Compiled languages are fast and easy to run on different operating systems; however, they tend to be more difficult to learn, run, and debug. Interpreted languages are usually not as fast as compiled languages, but they're much easier to learn. Both types of language have a place in biological programming. Compiled languages are great for writing programs that run quickly on huge amounts of data (like an aligner). Interpreted languages are great for data analysis and plotting. Most biologists will want to learn an interpreted language.

Common compiled programming languages used by biologists are C/C++, Java, and Rust. They're used to write short-read aligners like BWA (C/C++), genome browsers like IGV (Java), and tools for single-cell genomics like cellranger (Rust). You're likely to interact with tools written in a compiled language, but you probably won't need to write your own.

Common interpreted languages are Python, R, JavaScript, and Perl. Python and R are the most widely used by biologists. Perl was once common among biologists but has fallen out of favor. JavaScript is the principal programming language of the web and is occasionally used by biologists to build websites and interactive dashboards. If you're a biologist who wants to learn programming, stick with Python or R.

When choosing whether to learn Python or R, consider the following factors:

1. What are my research needs?

Python and R share most of the same core features, but they have different strengths. Python is a better general programming language with many useful libraries for biological problems. R is geared towards statistics and data analysis, so programming for these scenarios is more natural. Additionally, R has more common packages for analyzing gene expression and single-cell data. Learn the language with the best set of tools for the type of research you're doing.

2. What are people around me using?

If possible, learn the language that people around you (in your lab or institute) are using. It makes it easier to collaborate and they'll be a valuable resource while you're learning.

TIP

Learn Python if you're a member of the Bloom lab.

Suggestions

If you have any suggestions for topics that aren't covered in this guide, please suggest them in the corresponding Discussions thread.


- + \ No newline at end of file diff --git a/sections/managing-software-environments/index.html b/sections/managing-software-environments/index.html index f0d058a..cbd0859 100644 --- a/sections/managing-software-environments/index.html +++ b/sections/managing-software-environments/index.html @@ -30,7 +30,7 @@ - altair=5.3 - biopython=1.83 - mafft=7.525

It specifies the name of the environment, the places (channels) to search for each piece of software, and the software (dependencies) and their versions. You activate the environment the same way described above.

Common issues

While Conda is a helpful tool, you're bound to run into issues occasionally. Below are some common issues you might encounter and how to troubleshoot them.

My environment is taking a very long time to solve

This is a common issue when Conda struggles to resolve dependencies. Here are a few ways to address this:

bash
mamba create --name myenv python=3.9

If you're using conda from an installation of miniconda, you can use mamba by installing it like any other dependency into your base environment.

I'm getting permissions issues

This is a specific issue that comes up in the Bloom lab. If you've configured Conda so that there are multiple versions available at the same time––a local version in /home/username and a Bloom lab version in /fh/fast/bloom_j/software––you will eventually run into this problem.

I can't solve my environment

Sometimes Conda won't be able to resolve package dependencies in your environment's specification. If someone else has successfully created an environment from this specification, it's likely an issue with how Conda is searching channels.

Channels are repositories that Conda searches for package versions. Conda searches these in a particular order depending on how you've configured it. In most cases, the solution is to make sure you've set up 'strict channel priorities'.

bash
conda config --set channel_priority strict

This means that Conda will always search through channels in the order they're specified. If that doesn't fix the issue, change the order of the channels until the environment can solve. If that doesn't work, check that the dependencies actually exist in the channels you're searching.

- + \ No newline at end of file diff --git a/sections/organizing-your-projects/index.html b/sections/organizing-your-projects/index.html index b9112a1..e37eb94 100644 --- a/sections/organizing-your-projects/index.html +++ b/sections/organizing-your-projects/index.html @@ -19,7 +19,7 @@

Section 2: Organizing your projects

Organization matters in programming. Your goal should be that someone who isn't you can open your project directory and understand where things are, what things are, and why things are. Ideally, you want to front-load the organizational work for a new project. If you've ever done any significant reorganization of a mature project, you'll know how much of a headache that can be.

Below, I'll go through some high-level organizational principles that you should follow. There isn't one right way to do things, but there sure are a lot of wrong ways.

Creating a project

All of the code for a single project belongs in one directory. You should not split code for a single project among multiple distinct directories. Everything should be housed under one roof. But what is a project?

Imagine that you're researching the function of your favorite protein. You're probably taking advantage of a variety of assays and tools, each of which is analyzed in a particular fashion. It could be tempting to split each of these analyses into separate directories housed in different locations in your file system, but this is not good practice. Instead, organize them under a single directory.

Housing all of the analysis files for a project in a single directory has a variety of benefits:

  • It simplifies the process of backing up your code and analysis.
  • It makes it easy to share the relevant code when it comes time to publish.
  • It lets you track/version all of the analyses in your project with tools like git.

Generally, a good rule of thumb is that a project corresponds to all of the analysis for a single paper. However, there can be exceptions for large papers that involve many distinct and involved computational pipelines. At the end of the day, you have to use your best judgment.

I'll use this project from the Bloom Lab as an example in the following sections.

Naming a project

Names should be simple and descriptive. The name should communicate the purpose of the project, but it shouldn't be so long that it's annoying to type out. Generally, you should ask yourself if the name you've chosen is:

  • Descriptive
  • Readable
  • Succinct
  • Consistent

Take this project as an example. Bernadeta Dadonaite from the Bloom Lab performed deep mutational scanning on the SARS-CoV-2 Delta variant Spike protein to identify mutations that escape neutralization by the therapeutic antibody REGN10933. All of the code for this analysis is located in a directory called 'SARS-CoV-2_Delta_spike_DMS_REGN10933'. The name clearly describes the project without being overly wordy. It also distinguishes this project from similar ones while adhering to the same naming scheme.

Organizing a project

After you've decided what your project is and how to name it, you should think about how to organize the files within. There are many correct ways to do this, but there are some general principles that you should try to follow. Below, we'll cover some of these general principles.

Project organization should be consistent. If you're working with someone opinionated about project organization, you should do your best to use their approach (or they'll be upset). Another thing to consider is whether there is an established set of best practices for the type of project you're working on. For example, a Snakemake pipeline typically adheres to a specific organizational structure. By sticking to this, you'll make it easier to develop your project and easier for others to follow along.

If you're working alone and there isn't a clear template for organizing your project, here is a simple set of rules to follow:

  1. ALWAYS write a README: A 'README' is a plain text file––typically written in markdown––that describes the contents of your project. It should explain the purpose of the project, give credit to the creators, outline the basic organization of the directory, and give details on how to run the analysis. Here's the README for SARS-CoV-2_Delta_spike_DMS_REGN10933.

  2. Separate data from code: Separate the code that's responsible for the analysis from the data that you'll be analyzing. I always put data inside a directory called data/. The organization of data/'s contents is up to you.

  3. Separate results (i.e. derived data) from data: Your analysis will generate results (like plots and tables) and intermediate files (like alignments). Any files that are derived by your code should be kept separately from your input data/. I'd highly recommend storing your derived files in a directory called results/. The organization of results/ is up to you.

  4. Choose clear, explicit, and consistent names: For files and subdirectories, follow the naming advice described above.

  5. Avoid over-nesting: You can organize files and directories to a point where you have to dig through dozens of layers before you even find what you're looking for. When in doubt, flatten it out.
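The rules above can be sketched as a few shell commands. This is only a minimal starting point: the project name `my_protein_DMS` and the README headings are hypothetical placeholders, while `data/` and `results/` are the directory names recommended above.

```bash
# Hypothetical project name; choose one that follows the naming advice above.
# The project is created in a temporary directory so this sketch is safe to run anywhere.
project_dir="$(mktemp -d)/my_protein_DMS"

# data/ holds the inputs; results/ holds anything derived by your code.
mkdir -p "$project_dir/data" "$project_dir/results"

# Start the README with the basics: purpose, authors, layout, and how to run.
cat > "$project_dir/README.md" <<'EOF'
# my_protein_DMS

Purpose: one or two sentences on what this project analyzes.

Authors: who did the work.

Organization:
- data/: input data
- results/: plots, tables, and intermediate files derived by the code

Running the analysis: the commands needed to regenerate results/.
EOF

# List everything that was created.
find "$project_dir"
```

The layout is deliberately flat: two directories and a README at the top level, with any further structure inside `data/` and `results/` left up to you.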

Summary

There are many reasonable ways to organize an analysis or project. However, the principles described above hopefully provide a general road map.


- + \ No newline at end of file diff --git a/sections/setting-up-your-ide/index.html b/sections/setting-up-your-ide/index.html index dae1711..1cd8e74 100644 --- a/sections/setting-up-your-ide/index.html +++ b/sections/setting-up-your-ide/index.html @@ -26,7 +26,7 @@ # see help options code --help

Summary

VS Code is a must for projects where you're editing many files of different formats. However, VS Code is not a singular solution to all code editing needs. Although you can run code interactively in a Jupyter Notebook from within VS Code, it can often be faster to open up Jupyter from the command line for quick interactive scripting. Additionally, programming languages like R have specific IDEs like RStudio that make them the default choice for projects built around R code.

Helpful Resources

Check out this page on the VS Code documentation for a thorough walkthrough of the basics.

- + \ No newline at end of file diff --git a/sections/tracking-your-code/index.html b/sections/tracking-your-code/index.html index 6aee416..63dd8a9 100644 --- a/sections/tracking-your-code/index.html +++ b/sections/tracking-your-code/index.html @@ -22,7 +22,7 @@ cd my-project git init

Or, you can clone an existing Git repository from a remote server to your local machine.

bash
git clone <repository URL>
cd my_project

TIP

A quick note about cloning a remote repository. There are several methods for doing this, some of which require a little set up, and those will be covered in the next section.

Check Repository Status:

After you've made some changes in your directory, you can check what these are using the following command.

bash
git status

This command displays the status of your working directory and staging area. It shows which files have been edited, which new files aren't being tracked, and more.

Add Changes to Staging Area:

bash
git add [file(s)]

Stages specific files for the next commit. Use git add . to stage all changes.

Commit Changes:

bash
git commit -m "Your commit message"

Records a snapshot of the staged changes. The commit message should briefly describe what you've done. It's important to write a good commit message.

View Commit History:

You can see a log of all commits that have been made to a project.

bash
git log

Compare Changes:

This can be useful, but you're unlikely to do it much.

bash
git diff

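With no arguments, git diff shows the changes in your working directory that you haven't staged yet. Given arguments, it can also compare commits or branches (for example, git diff main my-branch).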

Push Changes to Remote Repository:

bash
git push

Uploads your local commits to a remote repository.

Pull Updates from Remote Repository:

bash
git pull

Fetches and merges updates from a remote repository into your local repository.

These commands form the backbone of Git operations. Regular use of git status and git log will help you stay informed about your project's state and history.
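To see how these commands fit together, here's a minimal sketch of a full cycle in a throwaway repository. The directory name, file contents, commit messages, and the `Example User` identity are all placeholders I made up for illustration; the identity is stored only in this repository's config.

```bash
# Work in a temporary directory so nothing on your machine is touched.
repo_dir="$(mktemp -d)/my-project"
mkdir -p "$repo_dir" && cd "$repo_dir"
git init --quiet

# Placeholder identity, saved only in this repository's config.
git config user.name "Example User"
git config user.email "user@example.com"

# Create a file and check the status: README.md shows up as untracked.
echo "# my-project" > README.md
git status --short

# Stage the file and record the first commit.
git add README.md
git commit --quiet -m "Add README"

# Edit the file and look at the unstaged change before committing it.
echo "Analysis notes go here." >> README.md
git diff
git add README.md
git commit --quiet -m "Describe the project in the README"

# Show the history, one line per commit.
git log --oneline
```

Running `git status` again at the end should report a clean working tree, since every change has been committed.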

Branching and Merging

Branches are a powerful feature in Git that allow you to diverge from the main codebase and work independently on a set of changes. The primary branch is usually called main or sometimes master, and it's where the stable version of your project resides.

Why Use Branches?

Branching your code with Git allows for parallel development of multiple features or fixes simultaneously without interference. While you're editing the code on a branch, the main branch remains stable, and new code is only merged after it's ready. Here are some things you might use branches for:

  1. Feature Development: When adding new features, you can create a branch to isolate your work. This way, the main codebase remains unaffected until the feature is complete and tested.

  2. Bug Fixes: For fixing bugs, especially in a production environment, branches let you address issues without disrupting ongoing development work.

  3. Experimentation: If you're trying out new ideas or approaches, branches provide a safe space to experiment without the risk of breaking existing code.

See all active branches:

If you want to see what branches you have active locally, run:

bash
git branch

This will print out the branches you have in your local repo and tell you which branch you're currently on.

Switch to a Branch:

If you want to switch to (or check out) a different branch, run:

bash
git checkout [branch-name]

This switches your working directory to the specified branch.

Create and Switch to a New Branch:

To make a new branch, run:

bash
git checkout -b [branch-name]

This is a shortcut that creates a new branch and switches to it immediately.

Merge a Branch:

After you've completed work on a branch and ensured it's stable, you'll want to merge it back into the main branch. This process integrates your changes and updates the main codebase. First, check out the branch you want to merge into:

bash
git checkout [target-branch]

Once you're in the target branch (for example, main), merge the branch with your changes by running:

bash
git merge [source-branch]

This merges the changes in the [source-branch] into [target-branch].
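Here's a minimal sketch of the whole branch-and-merge cycle in a throwaway repository. The branch name `add-plots`, the file names, and the `Example User` identity are hypothetical; `git branch -M main` is only there to make sure the primary branch is called main regardless of your Git defaults.

```bash
# Set up a temporary repository with one commit on main.
repo_dir="$(mktemp -d)/branch-demo"
mkdir -p "$repo_dir" && cd "$repo_dir"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "first draft" > analysis.txt
git add analysis.txt
git commit --quiet -m "Add analysis notes"
git branch -M main                         # make sure the primary branch is named main

# Create a branch, switch to it, and commit some work there.
git checkout --quiet -b add-plots
echo "plot ideas" > plots.txt
git add plots.txt
git commit --quiet -m "Add plotting notes"

# Switch back to the target branch and merge the source branch into it.
git checkout --quiet main
git merge --quiet add-plots

# List the local branches; the asterisk marks the one you're on.
git branch
```

Because main didn't change while the work happened on `add-plots`, Git can simply move main forward to include the new commit.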

Note

While this approach works well for merging branches locally, in collaborative projects, you'll typically merge branches using pull requests. I'll cover this in more detail in the next section.

- + \ No newline at end of file diff --git a/sections/using-remote-resources/index.html b/sections/using-remote-resources/index.html index 3bb7104..cfbfb03 100644 --- a/sections/using-remote-resources/index.html +++ b/sections/using-remote-resources/index.html @@ -38,7 +38,7 @@ #SBATCH -c 16 snakemake -j 16 --software-deployment-method conda -s dms-vep-pipeline-3/Snakefile

The comments starting with #SBATCH configure Slurm by providing arguments like how many CPUs you need (#SBATCH -c 16). You submit the job script to Slurm using the sbatch command as follows:

bash
sbatch run_job.bash

You can check on the status of your jobs running on Gizmo with the squeue command. This command shows you the jobs you have running and how long they've been running for.

bash
squeue -u username

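If you want to cancel a specific job, pass its job ID (shown in the squeue output) to the scancel command. Be careful with scancel -u username, which cancels all of your jobs at once.

bash
scancel <job_id>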

There's a lot more to Slurm, but that's most of what you'll need to know.
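As a rough sketch of the submit, check, and cancel cycle, the snippet below writes a tiny job script and only talks to Slurm if `sbatch` is actually installed. The job body (a single `echo`) and the one-CPU request are placeholders, not the pipeline from above. I'm assuming the `--parsable` flag of `sbatch` here, which prints just the job ID so it can be stored in a variable.

```bash
# Write a minimal job script; the command inside is only a placeholder.
job_script="$(mktemp -d)/run_job.bash"
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH -c 1
echo "Hello from $(hostname)"
EOF

if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints only the job ID (plus ";cluster" on multi-cluster setups).
    job_id="$(sbatch --parsable "$job_script")"
    job_id="${job_id%%;*}"

    # Check on this one job, then cancel it by ID.
    squeue -j "$job_id"
    scancel "$job_id"
else
    echo "Slurm isn't available on this machine; nothing was submitted."
fi
```

Cancelling by job ID leaves any other jobs you have running untouched.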

Using tmux

As I mentioned above, you may run into a scenario where you want to run something interactively and you don't want the process to terminate when you log off the server. That's where tmux comes in handy. tmux (Terminal Multiplexer) is software that's pre-loaded on Rhino and Gizmo. It allows you to create multiple terminal sessions within a single window, and more importantly, lets you detach from a session while keeping it running in the background. This way, you can log off and later reconnect to the session to check on your processes or pick up where you left off.

1. Start a new session

To start a new tmux session, use the command:

bash
tmux

This opens a new terminal window with a little green bar at the bottom. You'd use this terminal as you would normally.

2. Detach from a session

To detach from a session and leave it running, press Ctrl + B, followed by D. This will bring you back to your original terminal prompt, but your tmux session will continue to run in the background.

3. List active sessions

If you want to see a list of all the running tmux sessions, use:

bash
tmux ls

4. Reattach to a session

To reattach to a session you previously detached from, use:

bash
tmux attach -t <session_id>

Replace <session_id> with the number or name (if you named it) of the session you want to reconnect to, which can be found using the tmux ls command.

5. Kill a session

When you're done, you can kill a tmux session by typing exit within the session or by pressing Ctrl + D.
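If you juggle several sessions, naming them makes reattaching easier. Here's a minimal sketch of the same lifecycle with a named session; the name `demo-analysis` is hypothetical. The session is started detached (`-d`) so the snippet also works from a script, and the interactive attach step is left as a comment.

```bash
session_name="demo-analysis"   # hypothetical session name

if command -v tmux >/dev/null 2>&1; then
    # Start a named session in the background (detached).
    tmux new-session -d -s "$session_name"

    # List running sessions; the new one appears by name.
    tmux ls

    # To work inside it, you would reattach interactively:
    #   tmux attach -t demo-analysis

    # Kill the session from the outside once you're finished with it.
    tmux kill-session -t "$session_name"
else
    echo "tmux isn't installed on this machine."
fi
```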

- + \ No newline at end of file diff --git a/sections/working-collaboratively/index.html b/sections/working-collaboratively/index.html index a723a41..041607a 100644 --- a/sections/working-collaboratively/index.html +++ b/sections/working-collaboratively/index.html @@ -19,7 +19,7 @@

Section 6: Working collaboratively

In the last section, I talked about tracking your code with Git. In principle, this can all happen locally in your repository, but in many cases, you want to make a remote copy of a repository and its history. That's where services like GitHub come in. GitHub is a website that acts as a remote repository for your code, allowing you to store it online, collaborate with others, and take advantage of various tools for version control and project management.

Using GitHub as a 'Remote'

What's a Remote?

A remote in Git is simply a version of your project that's hosted on the internet or another network. It's the counterpart to your local repository on your computer. By pushing your local changes to the remote repository, you ensure that your code is backed up and accessible from other locations or by other collaborators.

Why Use GitHub?

GitHub is our preferred website for hosting Git repositories remotely. It provides a platform for:

  • Collaborative Coding: Multiple people can work on the same project simultaneously without overwriting each other's changes.
  • Version Control: Keeps track of every change made to the codebase, allowing you to revert to previous versions if needed.
  • Documentation Hosting: With GitHub Pages, you can host project documentation or even entire websites directly from your repository.

In our lab, we use GitHub to coordinate coding efforts, especially for projects that will be part of publications. By connecting your local Git repository to GitHub, you make it easier to share your work and collaborate effectively.

Bloom Lab GitHub Organizations

In the Bloom lab, we use specific GitHub organizations to manage our projects:

But how do you decide when to host your local project remotely? Generally, I host a project remotely on GitHub for two reasons:

  1. If I'm working on code that's likely to be connected to a paper: Hosting your code on GitHub early makes it easy when you need links to data and code for a manuscript.
  2. If I'm working on code with someone else: If your code is on GitHub, it's easy to share with collaborators or to work on the code alongside someone else.

Connecting a local repository to a remote

As I mentioned in the previous section, Git is the version control system you use locally on your machine, while GitHub is the remote hosting service where you can store your repositories online. To connect a local repository to a GitHub remote you:

  1. Initialize a Local Repository: You can start from scratch by creating a local Git repository with git init. However, you can also connect an existing repository to a GitHub remote repository at any time.

  2. Create a Remote Repository: On GitHub, you create a new repository to host your project in the appropriate organization or account.

  3. Link Local and Remote Repositories: When you create a new GitHub repository, you'll get instructions for connecting your local repository to the GitHub repository.

  4. Push and Pull Changes: Now, you can use git push to upload your local commits to GitHub and git pull to update your local repository with changes from GitHub.
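The steps above can be sketched end to end without touching GitHub by using a local bare repository as a stand-in for the remote. That stand-in is my assumption for the sake of a runnable example; with a real GitHub repository, you'd pass its URL to `git remote add` instead of a local path. The project name and identity are placeholders too.

```bash
work_dir="$(mktemp -d)"

# Stand-in for the empty repository you would create on GitHub.
git init --quiet --bare "$work_dir/remote.git"

# 1. Initialize a local repository with one commit.
mkdir "$work_dir/my-project" && cd "$work_dir/my-project"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "# my-project" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M main

# 3. Link the local repository to the remote, calling it "origin".
git remote add origin "$work_dir/remote.git"

# 4. Push main and remember origin/main as its upstream (-u),
#    so that a plain "git push" or "git pull" works from now on.
git push --quiet -u origin main

# Show which remote the repository is connected to.
git remote -v
```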

Cloning a remote repository locally

You can also clone an existing repository on GitHub onto your local machine. The local clone will be automatically connected to the remote repository it was cloned from, allowing you to pull updates or push changes.

To clone a repository, you can use the following command in your terminal, replacing [repository-url] with the URL of the repository you want to clone:

bash
git clone [repository-url]

There are two main protocols for cloning repositories: HTTPS and SSH.

  • HTTPS: No extra configuration, but you’ll need to enter your username and a personal access token when you push changes (unless a credential helper caches them for you).
  • SSH: A more secure and convenient method, which allows you to authenticate using SSH keys (no passwords!).

TIP

Only fools use HTTPS.
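The protocol is determined by the form of the repository URL. As a small sketch, here's how you'd check which one a repository uses and switch it to SSH. The organization and repository names (`example-org/example-repo`) are made up, and nothing here contacts GitHub, since the commands only edit the local configuration.

```bash
# Throwaway repository, just to have somewhere to configure a remote.
repo_dir="$(mktemp -d)/url-demo"
mkdir -p "$repo_dir" && cd "$repo_dir"
git init --quiet

# An HTTPS-style GitHub URL (hypothetical organization and repository).
git remote add origin https://github.com/example-org/example-repo.git
git remote get-url origin

# Switch the same remote to the SSH-style URL.
git remote set-url origin git@github.com:example-org/example-repo.git
git remote get-url origin
```

This is handy when you've already cloned a repository over HTTPS and want to move to SSH without cloning it again.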

Setting up SSH for GitHub

Using SSH is preferred because it’s more secure and doesn’t require you to input credentials each time you push or pull changes. Once you’ve set up your SSH keys, GitHub will trust your machine, allowing you to interact with your repositories seamlessly.

Setting up SSH for GitHub is similar to setting up SSH for Rhino. The main difference is that you'll add your public key to your account on GitHub. You should set this up for any computer you use GitHub with.

Here are detailed and up-to-date instructions for connecting to GitHub with SSH.

Collaborative Workflows

How do you work on code with others productively? It requires some structure to prevent conflicts and ensure code quality. Note, the following workflow mainly applies to projects where you're actively coding with other people. Small projects or projects in their early stages don't require this level of organization.

Best Practices

  1. Protect the Main Branch: The main branch (sometimes called master) should represent the stable, release-ready version of your code––what you'd link to in a paper. It's important to protect it by restricting direct commits. You can do this in the repository settings on GitHub.

  2. Use Feature Branches: When you're working on a new feature or addressing an issue, create a new branch from main. This keeps your changes isolated until they're ready to be merged. Use descriptive names for your branches, like sequencing-analysis-improvement or typo-correction-in-readme.

  3. Merge Changes with Pull Requests: When a feature branch is ready to become a part of main, initiate a pull request on GitHub to merge the changes on the feature branch into main. Typically, these changes will be reviewed by at least one other person.

Pull Requests

A pull request (PR) is a way to propose changes you've made on a branch to be merged into another branch (usually main). It allows others to review your code, discuss potential issues, and approve the changes before they become part of the main codebase.

How to Make a Pull Request

  1. Push Your Branch to GitHub: Use git push to upload your branch.

  2. Navigate to the Repository on GitHub: You'll see a prompt to create a pull request for your recently pushed branch.

  3. Create the Pull Request: Provide a descriptive title and include any relevant information or context in the description. You can also link your pull request to specific issues so that they automatically close when the branch is merged.

  4. Review and Merge: Team members can review the pull request, leave comments, and approve it. Once approved, it can be merged into the main branch.
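The command-line half of this workflow (step 1) can be sketched as follows. The first time you push a new branch, Git needs to be told where it goes, which is what `-u origin <branch>` does. A local bare repository stands in for GitHub here, which is my assumption to keep the example runnable; the branch name is borrowed from the examples above, and the rest (file contents, identity) is made up.

```bash
work_dir="$(mktemp -d)"

# Stand-in for the GitHub repository, plus a local repository connected to it.
git init --quiet --bare "$work_dir/remote.git"
mkdir "$work_dir/my-project" && cd "$work_dir/my-project"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "# my-projcet" > README.md            # typo on purpose, fixed on the branch below
git add README.md
git commit --quiet -m "Add README"
git branch -M main
git remote add origin "$work_dir/remote.git"
git push --quiet -u origin main

# Create a feature branch from main and commit the fix there.
git checkout --quiet -b typo-correction-in-readme
echo "# my-project" > README.md
git commit --quiet -am "Fix typo in README title"

# Push the branch; -u sets its upstream so later pushes are just "git push".
git push --quiet -u origin typo-correction-in-readme

# Confirm that the branch now exists on the remote.
git ls-remote --heads origin
```

With a real GitHub remote, the pull request itself is then opened from the repository's page, as described in steps 2 through 4.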

Merge Conflicts

When multiple collaborators edit the same parts of a file, Git may not be able to automatically merge the changes. This results in a 'merge conflict', which you'll need to resolve manually. A typical workflow for resolving these conflicts is as follows:

  1. Identify the Conflict: Git will mark the conflicting areas in the file.
  2. Choose the Correct Code: Decide which changes to keep—yours, theirs, or a combination.
  3. Edit the File: Remove the conflict markers and make the necessary edits.
  4. Commit the Resolution: After resolving the conflicts, commit the changes to complete the merge.
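If you've never seen a merge conflict, it helps to cause one on purpose. The sketch below does that in a throwaway repository and then walks through the four steps above. The file `params.txt`, its contents, the branch name, and the identity are all invented for the example.

```bash
# Temporary repository with one commit on main.
repo_dir="$(mktemp -d)/conflict-demo"
mkdir -p "$repo_dir" && cd "$repo_dir"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "threshold = 0.5" > params.txt
git add params.txt
git commit --quiet -m "Add parameters"
git branch -M main

# Two branches change the same line in different ways.
git checkout --quiet -b stricter-threshold
echo "threshold = 0.9" > params.txt
git commit --quiet -am "Raise threshold"
git checkout --quiet main
echo "threshold = 0.1" > params.txt
git commit --quiet -am "Lower threshold"

# 1. Identify the conflict: the merge stops and Git marks the file.
git merge stricter-threshold || echo "Merge stopped because of a conflict."
cat params.txt     # shows the <<<<<<<, =======, and >>>>>>> conflict markers

# 2-3. Choose the code to keep and edit the file so no markers remain.
echo "threshold = 0.9" > params.txt

# 4. Stage the resolved file and commit to complete the merge.
git add params.txt
git commit --quiet -m "Merge stricter-threshold, keeping the higher threshold"
git log --oneline --graph
```

In practice you'd resolve the conflict in your editor rather than overwriting the file with `echo`, but the stage-and-commit step at the end is the same.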

Generally, you're unlikely to run into many merge conflicts while coding in the Bloom Lab.

Issues and Discussions

Another valuable feature of GitHub is the ability to open 'Issues' on a repository. Issues should be used for the following things:

  1. Suggesting Features: Outline a new feature you want to add to a project.
  2. Reporting Bugs: Highlight an issue with a project's code.
  3. Asking Questions: Seek clarification on how to use or run code. This is something you might have to do if you're using someone else's code.

The best practices for opening Issues are to be as descriptive as possible, use labels to categorize issues, and reference related Issues or pull requests. Another important thing to note is that you should not use Slack for code-related discussions. Unlike Slack messages, Issues on GitHub are persistent, organized, and easy to keep track of.


- + \ No newline at end of file