Tutorial: explain how to represent ARGs #43

jeromekelleher · 2021-01-29T11:32:05Z

We can simulate some ARGs in msprime, explain how we can losslessly represent ARGs and get all the information you could want.

jeromekelleher · 2021-02-12T18:12:08Z

Reminder to myself: I wrote a presentation about this for the PopSim workshop which could be adapted for a first pass on this. Some good examples in there which would be a good starting point.

hyanwong · 2021-02-17T20:11:14Z

Could you possibly post the presentation, or a link to it, in this issue @jeromekelleher ?

a-ignatieva · 2021-02-17T21:46:24Z

Hello, just adding a suggestion as discussed with @hyanwong . It would be handy to see how to extract the number of recombination events from the ARG representation, and if possible split these into those that are undetectable / change only the branch lengths / are detectable with the right mutations. This would be useful for assessing the performance of ARG inference methods.

jeromekelleher · 2021-02-18T09:15:51Z

Here it is: arg-ts-presentation.pdf

This was the presentation I gave at the PopSim consortium meeting in Aussois, Oct 2019, so some of the information will be a little out of date.

jeromekelleher · 2021-02-18T09:30:28Z

Hello, just adding a suggestion as discussed with @hyanwong . It would be handy to see how to extract the number of recombination events from the ARG representation, and if possible split these into those that are undetectable / change only the branch lengths / are detectable with the right mutations. This would be useful for assessing the performance of ARG inference methods.

Hi @a-ignatieva, welcome! 👋 I'm sure we can figure out how to compute the things you're interested in. Here's an example:

ts = msprime.sim_ancestry(
    2, recombination_rate=0.1, sequence_length=8,
    record_full_arg=True, random_seed=1234)
print(ts.draw_text())

num_re_nodes = 0
for node in ts.nodes():
    if node.flags == msprime.NODE_IS_RE_EVENT:
        num_re_nodes += 1
# Divide by 2 because each RE event has *two* nodes
print("num RE events = ", num_re_nodes / 2)
print(ts.tables.nodes)

giving

3.45┊         ┊         ┊    18   ┊         ┊  
    ┊         ┊         ┊   ┏━┻━┓ ┊         ┊  
2.76┊    17   ┊         ┊   ┃  17 ┊    17   ┊  
    ┊   ┏━┻━┓ ┊         ┊   ┃   ┃ ┊   ┏━┻━┓ ┊  
2.37┊  16   ┃ ┊         ┊   ┃   ┃ ┊  16   ┃ ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
2.34┊   ┃   ┃ ┊         ┊  14   ┃ ┊  15   ┃ ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
1.93┊  12   ┃ ┊         ┊  13   ┃ ┊  13   ┃ ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
1.44┊   ┃  11 ┊         ┊   ┃  11 ┊   ┃  11 ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
1.35┊  10   ┃ ┊    10   ┊  10   ┃ ┊  10   ┃ ┊  
    ┊   ┃   ┃ ┊   ┏━┻━┓ ┊  ┏┻━┓ ┃ ┊  ┏┻━┓ ┃ ┊  
0.79┊   ┃   8 ┊   ┃   9 ┊  ┃  9 ┃ ┊  ┃  9 ┃ ┊  
    ┊   ┃   ┃ ┊   ┃   ┃ ┊  ┃  ┃ ┃ ┊  ┃  ┃ ┃ ┊  
0.70┊   7   ┃ ┊   7   ┃ ┊  7  ┃ ┃ ┊  7  ┃ ┃ ┊  
    ┊  ┏┻━┓ ┃ ┊  ┏┻━┓ ┃ ┊ ┏┻┓ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┊  
0.19┊  6  ┃ ┃ ┊  6  ┃ ┃ ┊ 6 ┃ ┃ ┃ ┊ 6 ┃ ┃ ┃ ┊  
    ┊ ┏┻┓ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊  
0.08┊ ┃ 4 ┃ ┃ ┊ ┃ 4 ┃ ┃ ┊ ┃ ┃ ┃ 5 ┊ ┃ ┃ ┃ 5 ┊  
    ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊  
0.00┊ 0 1 2 3 ┊ 0 1 2 3 ┊ 0 2 3 1 ┊ 0 2 3 1 ┊  
  0.00      1.00      4.00      6.00      8.00 

num RE events = 4.0
id      flags   population      individual      time    metadata
0       1       0       0       0.00000000000000
1       1       0       0       0.00000000000000
2       1       0       1       0.00000000000000
3       1       0       1       0.00000000000000
4       131072  0       -1      0.07592809191144
5       131072  0       -1      0.07592809191144
6       0       0       -1      0.19108191052930
7       0       0       -1      0.69575449796515
8       131072  0       -1      0.79179484017695
9       131072  0       -1      0.79179484017695
10      0       0       -1      1.35455502875617
11      262144  0       -1      1.43685779200617
12      131072  0       -1      1.93338594556679
13      131072  0       -1      1.93338594556679
14      131072  0       -1      2.33538022397736
15      131072  0       -1      2.33538022397736
16      262144  0       -1      2.36985359356004
17      0       0       -1      2.75957664444117
18      0       0       -1      3.44661056408756

So I guess the RE event in nodes 14/15 is undetectable because the trees below both they just change the branch length, but the event in 8/9 is detectable because the trees change "above" these nodes? (I'm just trying to get a precise idea of what the condition is so we can write down an efficient way to detecting the events in question)

a-ignatieva · 2021-02-18T13:00:01Z

I think here the only recombination changing the unrooted topology is 4/5, while 8/9, 12/13 and 14/15 keep the unrooted topologies the same but change the branch lengths (and the third type would be where neither the topology nor the branch lengths change for local trees around the breakpoint).

Is there anything ready-made in tskit for "gluing" these tree sequences into an ARG? Just wondering, as sometimes it's nice to see what's going on by looking at the graph (though easy to do by hand for small examples).

hyanwong · 2021-02-18T13:18:18Z

It would be nice to have a way to "convert" (or perhaps simply display) the tree-sequence-with-recombination-nodes as a graph. This is not dissimilar to the animations I did with record_full_arg simulations (below)

Incorporating this into tskit itself would mean solving tskit-dev/tskit#1209 so that tskit had an idea of what a recombination node meant, rather than it being only a specific extension provided by msprime. We might, however, be able to throw together some code that would work on an msprime-produced TS.

For larger args, to avoid lots of crossing lines, this might require some sort of graph layout package. Not only is that rather a burdensome dependency, but we have had problems locating a suitable one (see discussions at tskit-dev/tskit#621). @a-ignatieva - do you use any specific viz packages for looking at ARG graphs?

SPR.mov

hyanwong · 2021-02-18T13:24:50Z

Also, have you tried just feeding the edges into e.g. graphviz? I'm not sure if that would work: perhaps you can experiment? You can get the parent and child node IDs for each edge very easily:

for e in ts.edges():
    print(e.parent, e.child)

a-ignatieva · 2021-02-18T13:53:53Z

I just use graphviz (dot), as KwARG doesn't care about the y-positions of the nodes.

jeromekelleher · 2021-02-19T09:11:48Z

Hmmm. So, one way we can get at topology changes is to look at the tree ranks in the simplified tree sequence

sts = ts.simplify()
for tree in sts.trees():
    print(tree.draw_text())
    print(tree.interval, tree.rank())

giving

    7  
  ┏━┻━┓
  5   ┃
 ┏┻━┓ ┃
 4  ┃ ┃
┏┻┓ ┃ ┃
0 1 2 3

Interval(left=0.0, right=1.0) (3, 11)
    6  
  ┏━┻━┓
  5   ┃
 ┏┻━┓ ┃
 4  ┃ ┃
┏┻┓ ┃ ┃
0 1 2 3

Interval(left=1.0, right=4.0) (3, 11)
    8  
  ┏━┻━┓
  6   ┃
 ┏┻━┓ ┃
 5  ┃ ┃
┏┻┓ ┃ ┃
0 2 3 1

Interval(left=4.0, right=6.0) (3, 5)
    7  
  ┏━┻━┓
  6   ┃
 ┏┻━┓ ┃
 5  ┃ ┃
┏┻┓ ┃ ┃
0 2 3 1

Interval(left=6.0, right=8.0) (3, 5)

So, we have two distinct topologies, which change at position 4, and this corresponds to the RE node 4/5. Would this definition work for you @a-ignatieva? If so, I'm sure we can work backwards from knowing where the RE event occured along the sequence to pin-pointing it to the event itself.

a-ignatieva · 2021-02-22T16:15:58Z

Yes, that works I think! Though I guess is it complicated when we have two recombination events with the same breakpoint? Eg it looks like 12/13 could also have a breakpoint at position 4 here.
If one of the recombination events was completely "invisible" (no branch length or topology change), then am I right in thinking we would see just one tree instead of two to either side of the breakpoint in the simplified tree sequence?

jeromekelleher · 2021-02-23T09:40:21Z

If one of the recombination events was completely "invisible" (no branch length or topology change), then am I right in thinking we would see just one tree instead of two to either side of the breakpoint in the simplified tree sequence?

Yes, exactly. You're right that things get more complicated if there are multiple recombinations at the same genome position, but I think that's something we assume is negligible in most models. It depends on exactly what you're looking for - if we just want a count of the detectable recombinations, then I think the number of distinct topologies is as good a definition as you'll get (?). Pinpointing the actual events that give detectable topology changes will make use do a bit more work, but I'm sure we could figure it out.

a-ignatieva · 2021-02-27T10:06:23Z

I'm happy with just having a count, so this works for me! Thanks.

hyanwong · 2021-02-27T10:21:52Z

When you are plotting out using Graphviz, or whatever, do you somehow merge the 2 corresponding recombination nodes into a single graphviz node, @a-ignatieva ? Or do you have them as 2 separate ones, and have the recombination node below? Have you got it working, with an image you could post here?

a-ignatieva · 2021-02-27T12:25:43Z

This is what I got by throwing in the nodes and edges (I've edited this to add node ranks to keep the correct time ordering):

from graphviz import Digraph

dot = Digraph(strict=False)

with dot.subgraph() as s:
    s.attr(rank = 'same')
    s.attr('node', shape='doublecircle')
    for n in ts.samples(): 
        s.node(str(n))

dot.attr('node', shape='circle')
itr = iter(range(ts.num_samples, ts.num_nodes))
for i in itr:
    n = ts.node(i) 
    if n.flags == msprime.NODE_IS_RE_EVENT:
        # Assign same rank to recombination nodes
        with dot.subgraph() as s:
            s.attr(rank = 'same')
            s.node(str(n.id))
            s.node(str(ts.node(i+1).id))
            next(itr, None)
    else:
        dot.node(str(n.id))
    # Add invisible edges to fix ranks
    dot.edge(str(i), str(i-1), _attributes={'style': 'invis'})

check = [0,0]
for e in ts.edges():
    # Preventing duplicate edges
    if check == [e.parent, e.child]:
        continue
    check = [e.parent, e.child]
    if ts.node(e.parent).flags==msprime.NODE_IS_RE_EVENT:
        dot.edge(str(e.parent), str(e.child), label="["+str(int(e.left))+","+str(int(e.right))+")")
    else:
        dot.edge(str(e.parent), str(e.child))
        
dot.render('pic', format='png', view=True)

It might be nice to add separate recombination nodes and label them by the recombination breakpoint, like the below (which is what we get with kwarg), but I wasn't sure how to extract the breakpoint positions (eg for the 12/13 recombination into node 10) - though it probably doesn't matter, as long as the label is between the two new intervals.

a-ignatieva · 2021-02-27T19:37:33Z

This is a bit tidier, merging pairs of recombination nodes together:

from graphviz import Digraph

dot = Digraph(strict=False)

# Add sample nodes
with dot.subgraph() as s:
    s.attr(rank = 'same')
    for n in ts.samples(): 
        s.node(str(n), **{'shape': 'doublecircle'})

# Internal nodes
itr = iter(range(ts.num_samples, ts.num_nodes))
for i in itr:
    if ts.node(i).flags == msprime.NODE_IS_RE_EVENT:
        # Only add one of the recombination nodes and make the other one invisible
        with dot.subgraph() as s:
            s.attr(rank = 'same')
            s.node(str(i), **{'shape': 'rect'}, label = str(i) + "/" + str(i+1))
            s.node(str(i + 1),  **{'style': 'invis'})
            next(itr, None)
    else:
        dot.node(str(i), **{'shape': 'circle'})
    # Add invisible edges to fix the ranks
    dot.edge(str(i), str(i-1), **{'style': 'invis'})

# Add edges
check = [0,0]
for e in ts.edges():
    ch = e.child
    pa = e.parent
    
    # Preventing duplicate edges
    if check == [pa, ch]:
        continue
    check = [pa, ch]
    
    # Check which edge to draw if there is a recombination
    if e.parent >= 1 and ts.node(pa).time == ts.node(pa-1).time and ts.node(pa).flags == ts.node(pa-1).flags == msprime.NODE_IS_RE_EVENT:
        ch = pa = -1
    elif e.child >= 1 and ts.node(ch).time == ts.node(ch-1).time and ts.node(ch).flags == ts.node(ch-1).flags == msprime.NODE_IS_RE_EVENT:
        ch -= 1
        
    lab = ''
    if ch >= 0 and ts.node(ch).flags == msprime.NODE_IS_RE_EVENT:
        lab = "[" + str(int(e.left)) + "," + str(int(e.right)) + ")"
    
    if(ch >= 0 and pa >= 0):
        dot.edge(str(pa), str(ch), label=lab)
        
dot.render('pic', format='png', view=True)

hyanwong · 2021-02-27T20:05:10Z

This is a bit tidier, merging pairs of recombination nodes together:

Oh yes, really nice. Thanks @a-ignatieva . We should have something like that in the ARG tut I think.

jeromekelleher · 2021-02-28T08:41:35Z

Yes, this is really nice, thanks @a-ignatieva.

hyanwong · 2021-04-02T20:03:44Z

@a-ignatieva: you might be interested in my recent PR: tskit-dev/tskit#1296. This should allow us to (easily) import a tree sequence into the Python graph software NetworkX. Not sure if this actually helps with viz, but I guess it might be useful for other things? I'd be interested to see how it copes with msprime's record_full_arg option.

See tskit-dev#43

hyanwong · 2023-11-09T13:38:30Z

A simple example from graphviz and a more complex one of reading it back into networkx are now at https://tskit.dev/tutorials/viz.html#graph-representations. The ARG tutorial contains the code for networkx conversion (https://tskit.dev/tutorials/args.html#sec-args-other-analysis) so I think we can close this issue.

hyanwong mentioned this issue Feb 17, 2021

Move msprime.NODE_IS_RE_EVENT into tskit? tskit-dev/tskit#1209

Closed

hyanwong added a commit to hyanwong/tutorials that referenced this issue Jun 22, 2021

Add a stub for the ARG tut

f29bfc3

See tskit-dev#43

hyanwong mentioned this issue Jun 22, 2021

Add a stub for the ARG tut #89

Merged

hyanwong mentioned this issue Nov 10, 2022

Fill out most of ARG tute #199

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial: explain how to represent ARGs #43

Tutorial: explain how to represent ARGs #43

jeromekelleher commented Jan 29, 2021

jeromekelleher commented Feb 12, 2021

hyanwong commented Feb 17, 2021

a-ignatieva commented Feb 17, 2021

jeromekelleher commented Feb 18, 2021 •

edited

Loading

jeromekelleher commented Feb 18, 2021 •

edited

Loading

a-ignatieva commented Feb 18, 2021

hyanwong commented Feb 18, 2021 •

edited

Loading

hyanwong commented Feb 18, 2021

a-ignatieva commented Feb 18, 2021

jeromekelleher commented Feb 19, 2021

a-ignatieva commented Feb 22, 2021

jeromekelleher commented Feb 23, 2021

a-ignatieva commented Feb 27, 2021

hyanwong commented Feb 27, 2021

a-ignatieva commented Feb 27, 2021 •

edited

Loading

a-ignatieva commented Feb 27, 2021 •

edited

Loading

hyanwong commented Feb 27, 2021

jeromekelleher commented Feb 28, 2021

hyanwong commented Apr 2, 2021

hyanwong commented Nov 9, 2023 •

edited

Loading

Tutorial: explain how to represent ARGs #43

Tutorial: explain how to represent ARGs #43

Comments

jeromekelleher commented Jan 29, 2021

jeromekelleher commented Feb 12, 2021

hyanwong commented Feb 17, 2021

a-ignatieva commented Feb 17, 2021

jeromekelleher commented Feb 18, 2021 • edited Loading

jeromekelleher commented Feb 18, 2021 • edited Loading

a-ignatieva commented Feb 18, 2021

hyanwong commented Feb 18, 2021 • edited Loading

hyanwong commented Feb 18, 2021

a-ignatieva commented Feb 18, 2021

jeromekelleher commented Feb 19, 2021

a-ignatieva commented Feb 22, 2021

jeromekelleher commented Feb 23, 2021

a-ignatieva commented Feb 27, 2021

hyanwong commented Feb 27, 2021

a-ignatieva commented Feb 27, 2021 • edited Loading

a-ignatieva commented Feb 27, 2021 • edited Loading

hyanwong commented Feb 27, 2021

jeromekelleher commented Feb 28, 2021

hyanwong commented Apr 2, 2021

hyanwong commented Nov 9, 2023 • edited Loading

jeromekelleher commented Feb 18, 2021 •

edited

Loading

jeromekelleher commented Feb 18, 2021 •

edited

Loading

hyanwong commented Feb 18, 2021 •

edited

Loading

a-ignatieva commented Feb 27, 2021 •

edited

Loading

a-ignatieva commented Feb 27, 2021 •

edited

Loading

hyanwong commented Nov 9, 2023 •

edited

Loading