Add samples to a distance matrix #45

Open
wasade opened this issue Sep 20, 2017 · 0 comments

wasade commented Sep 20, 2017

Take as input two BIOM tables (the background and the samples to add) along with the distance matrix of the background. Represent the background distance matrix in condensed form (better yet, load it from a condensed-form representation on disk). The distances themselves are computed as partial values from stripes, which is discussed below. Computed distances are added as follows:

  • a new sample, foo, is to be added
  • distances between foo and all background samples are computed in background distance matrix order
  • the foo sample id is pushed on to the 0th position of the sample id array of the condensed form matrix
  • foo sample distance values are pushed into the front of the condensed form vector of values
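
The steps above can be sketched in plain Python. This is a minimal illustration, not an implementation in the project's framework: the function name `prepend_sample` and the use of plain lists for the id array and condensed vector are assumptions made for the example.

```python
def prepend_sample(ids, condensed, new_id, new_distances):
    """Return (ids, condensed) with new_id added as the first sample.

    Hypothetical helper: `ids` is the sample id list, `condensed` is the
    row-major upper triangle of the background distance matrix, and
    `new_distances` holds d(new_id, s) for every s in `ids`, in that order.
    """
    n = len(ids)
    if len(condensed) != n * (n - 1) // 2:
        raise ValueError("condensed vector length does not match ids")
    if len(new_distances) != n:
        raise ValueError("need exactly one distance per background sample")
    # Push the id onto position 0 and the distances onto the front.
    return [new_id] + list(ids), list(new_distances) + list(condensed)


ids = ["a", "b", "c", "d"]
condensed = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0]  # A A A B B C
new_ids, new_condensed = prepend_sample(ids, condensed, "foo", [7.0, 8.0, 9.0, 10.0])
```

Nothing in the existing vector is reordered. The cost is a single copy (or none, if the on-disk layout allows writing the new values ahead of the old ones).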

This works because:

# the distance matrix
# 0 A A A
# A 0 B B
# A B 0 C
# A B C 0

# is in CF
# A A A B B C

A new sample can be expressed as a new row (and matching column) in the distance matrix. If that new row is the first row of the distance matrix, then we are in effect pushing its values onto the front of the CF representation:

# 0 x x x x
# x 0 A A A
# x A 0 B B
# x A B 0 C
# x A B C 0

# is

# x x x x A A A B B C
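
As a quick check of the claim above, the following sketch builds the square matrix with the new sample as row 0 and confirms that its condensed form is the new distances followed by the old condensed vector. The helper names `to_condensed` and `add_first_row` are made up for this example, and A, B, C are stood in for by 1, 2, 3.

```python
def to_condensed(square):
    """Row-major upper triangle (excluding the diagonal) of a square matrix."""
    n = len(square)
    return [square[i][j] for i in range(n) for j in range(i + 1, n)]


def add_first_row(square, new_distances):
    """Return a new square matrix with the new sample as row and column 0."""
    top = [0] + list(new_distances)
    rest = [[d] + list(row) for d, row in zip(new_distances, square)]
    return [top] + rest


background = [
    [0, 1, 1, 1],
    [1, 0, 2, 2],
    [1, 2, 0, 3],
    [1, 2, 3, 0],
]
x = [7, 8, 9, 10]  # distances from the new sample, in background order

expanded = add_first_row(background, x)
assert to_condensed(expanded) == x + to_condensed(background)
```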

Computing the distances for a set of new samples corresponds to a partial stripe computation. The indexing gets janky, but half of it is easy. The first half of the x distances above are the 0th value of each stripe. The remaining distances are in effect a negative diagonal along the stripes, starting at the rightmost position of the 0th stripe, then 2nd from right in the next stripe, etc. That is annoying to determine but feasible. The hard part is computing only those values, doing so efficiently, and doing it reasonably within the present framework. Still thinking about that.
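
The indexing described above can be sketched as follows. This assumes a particular stripe layout that the issue does not spell out: with n samples there are n // 2 stripes, and position k of stripe s holds d(k, (k + s + 1) mod n). The function names are hypothetical. The sketch covers only the lookup; `build_stripes` computes every stripe in full purely to check the indices, so it does not address the efficient partial computation.

```python
def build_stripes(square):
    """Full stripes under the assumed layout: stripe s, position k -> d(k, (k+s+1) % n)."""
    n = len(square)
    return [[square[k][(k + s + 1) % n] for k in range(n)] for s in range(n // 2)]


def stripe_index_for_first_sample(j, n):
    """Return (stripe, position) holding d(0, j) for 1 <= j < n."""
    if not 1 <= j < n:
        raise ValueError("j must be in [1, n)")
    if j <= n // 2:
        # Easy half: the 0th value of each stripe.
        return j - 1, 0
    # Other half: a negative diagonal, rightmost position of stripe 0,
    # then 2nd from right in stripe 1, and so on.
    return n - j - 1, j


n = 5
# Symmetric toy matrix with a distinct value for every pair.
square = [[0 if i == j else 10 * min(i, j) + max(i, j) for j in range(n)] for i in range(n)]
stripes = build_stripes(square)
row0 = [stripes[s][k] for s, k in (stripe_index_for_first_sample(j, n) for j in range(1, n))]
assert row0 == square[0][1:]
```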
