Hierarchical clustering is widely used in bioinformatics. In the CB2442 course, you have encountered it both in the Sequence feature module and in the Phylogenetics module. In this lab, your task is to write a function that, given a pairwise distance matrix and a list of names of the corresponding objects (in this case, sequences), performs hierarchical clustering using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) method. The function should output the result as a tree in the Newick format. Before starting, please read this thoroughly
Begin with downloading the project to your local computer by using this link.
Unzip the files into a directory and open the directory in VScode.
$ unzip 'kth-gt cb2442 main prog-p1.zip'
$ code .
In the labp4.py
file, modify the function
def upgma(dist_matr, names_list):
that takes a pairwise distance matrix (as a 2-dimensional numpy array) and a list of sequence names (as a list) as input and should return a tree in Newick format (as a string). The output tree should not include branch lengths. Also, set the list authors
in the beginning of the file to contain the group members' names.
-
Initialization:
nwk_list
: A list containing the names of the sequences (e.g.,['S1', 'S2', 'S3', ...]
).cluster_list
: A list of lists, each with a single sequence index (e.g.,[[0], [1], [2], ...]
). This is the same as thenwk_list
but it has indices instead of names.updated_dist_matr
: A copy of the original distance matrix (dist_matr
) that will be modified during clustering.
-
Iterative Clustering:
- While there is more than one cluster in
cluster_list
:- Find the pair of clusters with the smallest distance in
updated_dist_matr
. Remember that each row/column represents a cluster! - Merge Clusters:
- Create a new cluster by combining the indices of the two closest clusters.
[0,1]
- Remove the clusters that you merged from
cluster_list
.[[0], [1], [2], ...]
-->[[2], ...]
- Append the new merged cluster to
cluster_list
.[[2], [0,1], ...]
- Create a new cluster by combining the indices of the two closest clusters.
- Update
nwk_list
:- Combine the Newick strings of the two clusters (e.g.,
'(S1,S2)'
). - Remove the old entries from
nwk_list
. - Append the new Newick string to
nwk_list
.
- Combine the Newick strings of the two clusters (e.g.,
- Update
updated_dist_matr
:- Remove rows and columns of the merged clusters.
- Add a new row and column for the new cluster initialized with zeros (this step is already done for you).
- Compute the average distances between the new cluster and the remaining clusters. This way you can fill the new row and column added previously. If you find this tricky refer to the "updating the distance matrix" section.
- Find the pair of clusters with the smallest distance in
- While there is more than one cluster in
- Printing statements for the
cluster_list
, thenwk_list
, and theupdated_dist_matr
to check they are changing according to your plan. - Think about what the smallest cluster distance means. Does it make sense to take the smallest distance for
i=j
? - Think about the distance matrix being symmetric. Does it mean there is redundancy in the data?
- Make sure you know how to get array and list indices. Stack overflow can be a good source
- Make sure you know how to add and delete elements from a list/array in Python. When you delete an element, be careful with the index order!
- Be careful with the iterators in the
for
loops, and make sure they are behaving as you want. Python starts at0
and finishes atlen-1
by default, but it can be changed.
The last step is to update the distance matrix by filling the row and column initialized at values of 0
.
For each other cluster:
- Look at every sequence in the new cluster.
- Look at every sequence in the other cluster.
- Find the distance between each pair (one from the new cluster and one from the other cluster) from the original distance matrix (
dist_matr
). - Add all these distances together.
- Divide the total by the number of comparisons (i.e., the number of pairs).
This gives you the average distance between the new cluster and the other cluster.
To be more clear: If the new cluster consists of the original indices [0,1]
, you start by computing the average distance between 0
and 2
, and then between 1
and 2
(where 2
is other_cluster
). This average distance is a value that goes in the updated_dist_matr
at the index [new_cluster, other_cluster]
... Remember that the matrix is symmetric though! Repeat for every other cluster.
You can make an initial execution of your upgma
function by running the main function of the python file itself by executing the line
$ python3 labp4.py
However, the final test of the code is done by executing the runner4.py
executable, which can be executed from command line as
$ python3 runner4.py
or just
$ ./runner4.py
This executes the upgma
function in labp4.py
using a couple of test inputs and validates the results.
If you have implemented the function correctly, you will see your names appear.
Good luck with the lab!