# About Me

## Basic Information

Name: Kartik Raj
University: Indian Institute of Technology, Kanpur
Major: Computer Science and Engineering
Email: [email protected]
IRC: karrtikr
Github: karrtikr
Timezone: IST (UTC +5:30)

## Background work and Programming Skills

I am a second-year student at IIT Kanpur, pursuing a degree in Computer Science and Engineering. Mathematics has always been my favourite subject, alongside which I keep learning from a diverse range of others. I use Ubuntu 15.04 as my primary work machine and Sublime Text for development. Besides being proficient in C and Python, I am skilled in Bash, Verilog, Octave, SQLite and some other scripting languages.

I am familiar with the concept of version control and know how to use Git and GitHub.

Python has always been my first choice because of its flexibility and design: it can shrink many lines of code into just one or two statements. My experience with C has always helped me take Python further. What captures my interest in this project is that I can contribute a good amount of my expertise in Python and data structures here, as well as learn through the challenges and add to my knowledge bank.

# The Project

## The Aim and Motivation

ETE, being a pure Python library, is easy for people to adopt: there is no separate compilation step, so everything runs directly on the Python interpreter. The basic motivation of this project is to develop the searching capabilities of ETE tree structures. In particular, the aim is to develop a new ETE module that allows searching large collections of trees using regular-expression-like queries. This would extend the applications of the framework and enable users to perform complex queries, especially in the phylogenomics field.

## Ideas and Plan

My work can be divided into 4 phases:

#### Phase 1: Developing a vocabulary of patterns that permits regular-expression-like queries

The syntax of the patterns should be a reasonable compromise between usability and functionality. Moreover, tree patterns should allow common operators and must have a basic language to permit user-defined functions and filters. Keeping these things in mind, I am currently proposing the following vocabulary for my patterns:

```
@ = target node
OR    ||
AND   &&
NOT   !
OPERATORS: >= > < <= != == ~= IN
custom functions: @leaf, @size, @contains
function arguments = {}
```
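
As a rough illustration of how this vocabulary might combine in practice (the exact syntax is precisely what this phase will settle on; the attribute names and the commented-out search call below are assumptions of mine, not existing ETE API):

```python
# Hypothetical pattern: match an internal node covering at least 4 leaves
# that contains human sequences and is not itself a leaf.
# '@' refers to the candidate node; @size, @contains and @leaf are the
# custom functions proposed above.
pattern = "(@size >= 4) && (@contains == {'Homo sapiens'}) && !@leaf"

# Assumed search entry point, to be implemented in Phase 2:
# for match in search(pattern, tree):
#     print(match)
```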

#### Phase 2: Implementing the search engine

The aim in this phase will be to implement the most efficient way to find matches. As per my current work plan, the matcher will be recursive, i.e. matching a particular node and its children against the pattern will be done recursively. A few improvements that I plan to make are:

  1. Reducing the number of recursive calls: The main idea is to introduce heuristic improvements so that the matcher visits only the nodes necessary to find a solution. One idea could be to scan the search pattern before checking it against the nodes; if we know that certain conditions can never be met, we can skip such nodes. Another could be to assign some sort of priority to each node using learning techniques and visit the nodes in that order.

  2. Parallelisation: Put simply, by parallelisation I mean matching our pattern against several trees simultaneously. We would convert a set of sequential instructions into parallel ones that utilise multiple processors at once, enhancing performance many-fold. This will greatly speed up scanning large collections of trees, as current databases host millions of them. A minimal sketch of both ideas follows this list.
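
The following is a minimal sketch of both ideas, assuming a `Pattern` object that exposes a pre-scanned bound (`min_leaves`) and a per-node test (`node_matches`); these names are illustrative placeholders, not existing ETE API:

```python
from multiprocessing import Pool

def matches(pattern, node):
    # Heuristic pruning: skip subtrees that can never satisfy a
    # pre-scanned requirement of the pattern (e.g. a minimum leaf count).
    if count_leaves(node) < pattern.min_leaves:
        return False
    # Test the pattern at this node, then recurse into the children.
    if pattern.node_matches(node):
        return True
    return any(matches(pattern, child) for child in node.children)

def count_leaves(node):
    # Number of leaves under an ETE-style node with a .children list.
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)

def _match_one(args):
    pattern, tree = args
    return matches(pattern, tree)

def parallel_search(pattern, trees, workers=4):
    # Match the same pattern against many trees in parallel, using a
    # process pool as one common way to parallelise in Python.
    with Pool(workers) as pool:
        return pool.map(_match_one, [(pattern, t) for t in trees])
```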

#### Phase 3: Developing a way to auto-generate patterns from a bunch of real trees

Once we have developed our vocabulary of patterns, it would be nice to be able to automatically generate patterns from a group of trees, which would let users find more trees of the same kind. I have elaborated my ideas in the following points:

  • First, we need to identify the attributes that make a bunch of trees distinguishable from the others. I plan to do this by encoding tree and node properties properly and then using simple machine learning techniques to detect the attributes relevant to the purpose.

  • One trivial algorithm is to consider each subset of our attributes. Each subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set gives the error rate of the model, and we can then choose the model with the minimum error rate (a sketch follows this list). Although this is computationally very intensive, it will provide the best-performing feature set for that particular type of model. I could use mathematical techniques such as principal component analysis, but as per my current knowledge, the features resulting from such feature extraction are of a different sort than the original features and may not be easily interpretable, so we will have to discuss and see.

  • One thing to note here is that the topology attribute is going to play a major role in deciding our pattern. We would want the matched topology to be as large as possible, so that the tree family for the pattern is more distinguishable. So, we could give our learning technique a head start by giving priority to the largest possible topologies initially.

  • Once those properties are detected (for instance, for a given collection of 20k trees and a subclass of 100 trees), we would divide the collection of 20k trees into groups by assigning a label to each tree (using clustering in unsupervised learning), and also keep a rough pattern for each group.

  • When we encounter a subclass (e.g. of 100 trees), we would see which groups it lies in, so that we can get the distinctive attributes of those 100 trees. If the trees lie in multiple groups, say 10 groups, the next step in the algorithm is to detect the largest appropriate skeleton (topology) for our output pattern. We can use the rough pattern of each group to extract the final pattern. If our attributes don't fit the skeleton appropriately, we will try the next most likely smaller skeleton for our bunch of trees.
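
A minimal sketch of the hold-out subset-selection idea mentioned above; `train` and `error_rate` are hypothetical placeholders for whatever model is eventually chosen:

```python
from itertools import combinations

def best_feature_subset(attributes, train, error_rate, train_set, holdout_set):
    # Try every non-empty subset of attributes and keep the one whose
    # trained model makes the fewest mistakes on the hold-out set.
    # Exponential in len(attributes), as noted above.
    best_subset, best_error = None, float("inf")
    for k in range(1, len(attributes) + 1):
        for subset in combinations(attributes, k):
            model = train(train_set, features=subset)   # hypothetical helper
            err = error_rate(model, holdout_set)        # hypothetical helper
            if err < best_error:
                best_subset, best_error = subset, err
    return best_subset
```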

The following prototype, in combination with learning, may help in traversing to smaller skeletons, although I still need to construct the functionalities required for the purpose:

```python
class PatternExtractor(Tree):
    ...
    # usage: topology = self.get_topology([tree1, tree2, tree3])
    def get_topology(self, trees):
        skeletons = []
        for tree in trees:
            skeleton = get_skeleton(tree)
            # get_skeleton returns the topology of the tree,
            # e.g. "((,),)" for "((A,B),C)"
            skeletons.append(skeleton)
        topology = None
        while not fits_appropriately(topology):  # placeholder condition
            topology = find_next_largest_possible_common_substring_ensuring_topological_structure(skeletons)
        return topology
```
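
For completeness, here is a minimal sketch of what `get_skeleton` could look like, assuming an ETE-style node API with `is_leaf()` and `children` (my assumption, not part of the prototype above):

```python
def get_skeleton(node):
    # Return the bare topology of a tree as a string,
    # e.g. "((,),)" for the tree "((A,B),C)".
    if node.is_leaf():
        return ""
    return "(" + ",".join(get_skeleton(child) for child in node.children) + ")"
```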

This would serve as our default set of attributes unless the user forces other conditions. We would want to make this technique more flexible by letting the user decide which set of attributes to generate patterns for, for instance: topology only, topology + node distances, etc. The declarative prototype would look like:

```python
custom_attributes = [species, named_lineage]
pattern = extract_pattern([tree1, tree2, tree3], default=True,
                          custom_attributes=custom_attributes)
# default=True if the commonalities detected should include the default
# attributes found by our learning techniques, else False
```

#### Phase 4: Developing a visualization framework to highlight tree matches and differences

Regarding the visualization, I was thinking of a face-to-face representation of two trees, where matching partitions can be highlighted, or, similarly, a layout that permits highlighting which parts of a tree pattern are found in a given tree. This is an optional task, though.

#### Common work

I plan to develop the Python API based usage of my ETE search feature alongside all of the phases. Additionally, I'll be writing tests side by side; fixing all the bugs at the end is a far worse idea than not letting them accumulate in the first place. However, if any hiccups still persist, I'll solve them before merging the final work.

Like other ETE features, once the API is ready, creating a command line tool is simple too: it just calls a function and the process proceeds from there. In setup.py one can see something like entry_points, which defines the command line tools; in our case it would be ete_search.
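
A minimal sketch of how such an entry point could be declared; the module path `ete_search:main` is an assumption for illustration, since ETE's actual setup.py may organise this differently:

```python
# setup.py (sketch)
from setuptools import setup

setup(
    name="ete-search-sketch",
    version="0.1",
    py_modules=["ete_search"],
    entry_points={
        "console_scripts": [
            # installs an `ete_search` command that runs ete_search.main()
            "ete_search = ete_search:main",
        ],
    },
)
```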

## Timeline (Tentative)

Community Bonding Period (April 22 - May 22)

Goal: Community Bonding

  • The principal focus in this period will be studying the ETE framework in detail and making notes on where I can further extend my search engine.

  • I'll seek guidance from my mentor and fix the exact syntax of my patterns, so that I don't have much hesitation implementing it later.

  • If possible, I will also start coding during this phase itself, so that I get a head start.

Week 1 - Week 2 (23 May - 5 June)

Goal: Developing the full vocabulary for my patterns

  • I'll start creating the documentation of my pattern vocabulary.

Week 3 - Week 4 (6 June - 19 June)

Goal: Implementing the basic search engine

  • I'll start implementing the recursive matcher that checks a pattern against a node and its children, building on the vocabulary finalised earlier.

  • In this period, I'll also write the parsing code that translates the pattern syntax into conditions the matcher can evaluate.

  • I'll also be writing tests along with it.

Mid term Evaluation

  • Having a fully functioning pattern vocabulary and a basic search engine working properly.

  • Fix bugs, if any.

Week 5 - Week 7 (27 June - 17 July)

Goal: Optimizing the search engine

  • This period will be dedicated to the heuristic improvements described in Phase 2, so that the matcher visits only the nodes necessary to find matches.

  • As soon as I finish those, I'll head forward to implementing parallel search over large collections of trees.

Week 8 - Week 10 (18 July - 7 August)

Goal: Auto-generating patterns from collections of trees, writing tests

  • In this period I'll be implementing Phase 3: encoding tree and node attributes, detecting the distinguishing ones, and extracting patterns from groups of trees.

  • Finishing the module after writing tests.

Week 11 - Week 12 (8 August - 21 August)

Goal: Working on the visualization framework

  • Implement the layouts described in Phase 4 to highlight tree matches and differences, and set up the ete_search command line tool.

  • Finishing/fixing bugs in the work done so far.

Future Work: Continue working on the search module after the program ends. I'd love to see future extensions built on it.

I'll be writing my notes throughout the progress of the project. Later on, I'll write IPython notebook tutorials for the work I have done.

I would be able to devote 40-50 hours a week during the project, since I have no other big project planned for the summer. My summer vacation starts by 29 April, and I have no other engagements during it. My academic year begins by July 17, but for the first month I would still be able to devote around 40 hours a week, since there would be no tests or exams.


# Notes

  • I have no commitments during the summer, which means I'm free to work completely on my project.

  • I am very enthusiastic about my work being beneficial to other people in the open source community. I'll keep monitoring my work and fixing issues if they emerge in the future.

